The Risk of Bias in AI Sourcing/Matching Solutions

Over the years there’s been quite a bit written about bias in the hiring process, but not much attention has been paid to bias specifically at the top of the funnel when people are sourcing for talent.

As such, I decided to write an article about solutions such as Eightfold.ai, Entelo, Spire and SeekOut that offer configurable blind review and selection, so that users cannot view names, pictures, years of experience past 10, education and/or graduation date, and many other attributes. The goal is to keep unconscious bias from creeping into the sourcing and initial selection process. As a result of that article, I learned about the Unbiasify Chrome extension, which you may want to check out in addition to the other solutions I listed if you’re interested in blind review.

Using AI in search and match solutions has increasingly been sold as a way to reduce bias in the hiring process – after all, if you remove the human from the equation, you should eliminate bias, right?

Not necessarily.

In that vein, back in July of 2016, I read an article on LinkedIn that suggested a combination of data science and algorithms could be used to remove bias from the matching (searching/sourcing) process, and over the past 2 years there has been an increasing amount of content in support of that claim. However, as I commented on that article from 2016, while matching algorithms cannot, of course, be biased based on uniquely human elements such as background, cultural environment or personal experiences (all of which fuel unconscious bias), that does not mean algorithms are immune from bias. In fact, algorithms can and do unintentionally “favor” certain factors or people at the expense of others. You’ve probably already seen at least an article or two on algorithms doing an abysmal job with gender and racial bias, but most of the examples cited don’t have anything to do with sourcing or hiring.

With matching algorithms, biased results can come from the data set on which they operate (which is arguably never complete or genuinely representative of the total target talent population), and bias can also unintentionally result from whatever algorithms “learn” or come to “know” about the data set. This is especially scary because we’re talking about judging how well a person matches based only on the overlap between search terms or a job description and the text that person happened to share when creating their resume, social media profile, application, etc. – and the data is typically limited on both sides of the equation.

I think bias is an especially fascinating area to explore because while humans can be trained to be aware of, recognize, and mitigate unconscious bias (although the results are not guaranteed), algorithms cannot be “aware” of any unintended biases they may have, particularly if they are not operating with demographic data on any level, and humans are highly unlikely to be able to identify algorithmic bias within results.

As mentioned in this Wired article on AI and neural networks, “With machine learning, the engineer never knows precisely how the computer accomplishes its tasks. The neural network’s operations are largely opaque and inscrutable. It is, in other words, a black box.” If the engineer doesn’t know how the algorithm accomplishes its task, how can the user? It is indeed a challenge for humans to anticipate and factor for algorithmic bias when there is no real way to know what an algorithm will “learn” and base “decisions” on.

Perhaps unsurprisingly, some studies have already found that machine learning can amplify bias.

So where can bias come from?

Algorithmic systems can develop biases as a result of the pre-existing expectations of the engineers developing them, technical limitations of their design, user feedback, and the data they are trained on, whether because of what is missing from that data or what is overrepresented in it.

Bias can be embedded into nearly any data set, private (e.g., your ATS) or public. You may be surprised to learn that “Wikipedia is incredibly biased and the underrepresentation of women in science is particularly bad” – this comes from Jessica Wade, a physicist who has personally written Wikipedia entries for nearly 300 women scientists over the past year. Wikipedia is the 5th most visited website in the world, and yet only 18% of Wikipedia biographies are of women.

Similarly, if you look to Stack Overflow to source IT professionals, you need to be aware that out of a survey of over 57,000 Stack Overflow users, only 6.9% were women. If you’re using a matching solution that leverages data from Stack Overflow or even searching it manually yourself, you need to be aware that its data set is skewed. In comparison, a recent HackerRank study surveyed 14,616 software developers, and nearly 2,000 respondents were women, nearly double the percentage of Stack Overflow.
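As a quick sanity check on the "nearly double" comparison, the arithmetic can be sketched as below. It uses only the figures quoted above; because "nearly 2,000" is a rounded number, 2,000 is treated as an upper bound and the results are approximate.

```python
# Figures quoted in the article. "Nearly 2,000" is rounded, so 2,000 is an
# upper bound and the computed share is approximate.
stack_overflow_share = 0.069   # 6.9% of surveyed Stack Overflow users were women
hackerrank_women = 2_000       # "nearly 2,000" respondents
hackerrank_total = 14_616      # total HackerRank respondents

hackerrank_share = hackerrank_women / hackerrank_total
ratio = hackerrank_share / stack_overflow_share

print(f"HackerRank share of women: {hackerrank_share:.1%}")    # 13.7%
print(f"Ratio to Stack Overflow share: {ratio:.2f}x")          # 1.98x
```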

Any solution that leverages and learns from user feedback on results faces a bias challenge because the algorithms can and will learn from unconscious biases of users. User feedback loops can produce unintentional algorithmic bias if they make use of implicit or explicit signals sent to the algorithm when users indicate results are “good” or “bad,” or that they want results “more like this” and “less like this.”

For example, what if users end up consciously or unconsciously ranking white males as good results and women and other races or ethnicities poorly? I have to say this would be almost impossible to achieve if you were using blind review and selection – which is why I believe it should be a standard feature of any solution used to find and select talent! Of course, matching algorithms don’t “know” the race or gender of results (unless, of course, the data is labeled on the back end), but that does not mean that there are not unintended consequences of the patterns of positive and negative signals being sent, essentially training the matching algorithm on what constitutes “good results.”

Yet another potential source of algorithmic bias comes from solutions that use historical data as training inputs (interviews, hires, etc.). If the historical data itself happens to have an embedded bias (e.g., 90% of hires for a specific role were white males between 25-30 years old), there is a risk of the algorithms learning from these biased signals of “success.”

Cathy O’Neil, the author of “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy,” believes “When we blithely train algorithms on historical data, to a large extent, we are setting ourselves up to merely repeat the past. We’ll need to do more, which means examining the bias embedded in the data.”

In the Financial Times article, AI Risks Replicating Tech’s Ethnic Minority Bias Across Business, Michael Sippitt, a director of Forbury People, a UK-based HR consultancy, says “Tech races ahead of people working out how to use it.” He predicts there will be lawsuits citing discrimination in the future because of bias in automated hiring. This is because AI algorithms learn from historic data sets, he adds, so they are more likely to hire in the image of previous staff instead of helping to tackle unfair under-representation. For example, a survey of the existing educational background, age or experience of staff in a particular industry could encourage machine learning technology to exclude candidates that did not fit a particular profile.

“A lot of the CVs and historic profiles will be of one kind of candidate,” says Kriti Sharma, vice-president of artificial intelligence at Sage, the UK’s largest listed technology company. “If you were hiring a chief technology officer for a company and the algorithm was learning from historic data sets then what would you expect?”

But we don’t use demographic data!

Responsible AI matching solutions will tell you they don’t use any names, race, gender, pictures, age, etc., data in their matching processes (and they shouldn’t!), and on the surface that might make you (and them) feel comfortable about bias. However, unless you run comparative studies with a data set labeled with demographic data, there is actually no way to know if the matching algorithms do or do not produce any adverse/disparate impact – which in employment refers to selection practices that appear to be neutral but actually have a discriminatory effect on one or more protected groups.
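To make the idea of a comparative study concrete, here is a minimal sketch of one common way to screen for adverse impact: compare each group's selection rate to that of the group with the highest rate. The 0.8 threshold is the "four-fifths" rule of thumb from US employment guidelines, which I am bringing in as an assumption; it is not something the solutions above are known to use. The group names and counts are hypothetical, and a real study would also need adequate sample sizes and statistical testing.

```python
from collections import Counter

def selection_rates(records):
    """records: iterable of (group, was_selected) pairs -> {group: selection rate}."""
    totals, selected = Counter(), Counter()
    for group, was_selected in records:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {g: selected[g] / totals[g] for g in totals}

def adverse_impact_ratios(rates):
    """Each group's selection rate divided by the highest group's rate."""
    best = max(rates.values())
    if best == 0:
        return {g: float("nan") for g in rates}
    return {g: r / best for g, r in rates.items()}

# Hypothetical labeled audit set: 100 candidates per group, with the
# matching tool surfacing 30 from group_a and 18 from group_b.
records = (
    [("group_a", True)] * 30 + [("group_a", False)] * 70
    + [("group_b", True)] * 18 + [("group_b", False)] * 82
)

rates = selection_rates(records)
ratios = adverse_impact_ratios(rates)

# Assumed screening threshold (four-fifths rule of thumb), not a legal test.
flagged = [g for g, r in ratios.items() if r < 0.8]
print(rates, ratios, flagged)
```

In this made-up example group_b is surfaced at 60% of group_a's rate, so it would be flagged for a closer look. The point is that none of this can be computed without demographic labels on the audit set.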

Some talent-focused AI companies are aware of this. Pymetrics CEO Frida Polli believes, “We have to be really cautious when applying AI to hiring because untested and unchecked it can actually make diversity a lot worse.” That’s why they test their algorithm against a candidate pool of about 50,000 people for whom it has demographic data.

However, Pymetrics isn’t in the talent sourcing game, which is what I’m talking about here. Even so, it appears they have set the bar, at least publicly. If you are aware of any sourcing/matching solutions that test their algorithms against a statistically relevant set of people with demographic data to measure for adverse impact, please let me know.

If you really want to get in the weeds on this subject, you could begin to wonder (as I do) if there is any difference in the way that women write their resumes and social media profiles compared to men, or how members of a specific race or ethnicity represent themselves in their resumes and online profiles vs. other races or ethnicities. As we can see from Textio’s excellent work in the job posting space, there is a difference in how women read and respond to the text in job descriptions, and using Textio can result in a 23% increase in female job applicants. Conversely, if you’re not using Textio, it could mean that your ATS is skewed towards male applicants and not very representative of the total target talent pool.

What if we were to discover that women or people of a particular race or ethnicity tend to use fewer words and synonymous terms to describe their qualifications vs. men or another race or ethnicity? I think it would be dangerous to assume that there are no differences. If a search and match solution uses keyword/concept frequency at all, even if only as one of many factors in scoring/ranking results, this opens the door to not only algorithmic bias for AI-powered matching solutions, but also bias in a basic keyword search. Again, if you’re not evaluating results against a candidate pool with demographic data, you have no way of knowing if using your manual or automated search and match solution produces a disparate impact.
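A toy sketch shows why keyword frequency alone can reward writing style rather than qualifications. The scoring function and both resume snippets are hypothetical and far simpler than anything a real product would use; both snippets describe similar experience, but one simply repeats the terms more often.

```python
import re
from collections import Counter

def keyword_frequency_score(text, keywords):
    """Count total occurrences of the keywords in the text (case-insensitive)."""
    tokens = Counter(re.findall(r"[a-z0-9+#]+", text.lower()))
    return sum(tokens[k.lower()] for k in keywords)

keywords = ["python", "sql", "etl"]

# Hypothetical snippets describing comparable experience.
terse = "Built ETL pipelines in Python and SQL."
verbose = (
    "Python developer. Built Python ETL pipelines, tuned SQL queries, "
    "and wrote Python tooling for ETL monitoring and SQL reporting."
)

terse_score = keyword_frequency_score(terse, keywords)      # 3
verbose_score = keyword_frequency_score(verbose, keywords)  # 7
print(terse_score, verbose_score)
```

If any group tends to write the way the terse candidate does, a frequency-based ranking would push that group down the results without ever seeing a demographic field.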

What can we do?

Don’t worry – the sky is not falling. I didn’t write this article to be alarmist – I merely want to call awareness to the potential risks of bias in AI-powered sourcing and matching solutions, as we will no doubt continue to see artificial intelligence applied to sourcing and matching and many other layers of the hiring funnel.

The use of AI in recruitment is still relatively new, so it’s not surprising that we are still figuring out the fair, responsible and ethical application of AI to employment and hiring, which of course starts with finding people in the first place, either through attracting applicants or through proactive sourcing.

To solve for algorithmic bias, some computer scientists believe that AI solutions can be modified to correct for bias embedded in data sets and algorithms. For example, you could attempt to correct for bias by introducing constraints so that an algorithm selects the same number of people from each gender or ethnicity, or the same fraction of applicants in each subgroup. However, this remedy can be controversial and may be unlawful in some jurisdictions when taken to the extreme.
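For illustration only, the "same fraction of applicants in each subgroup" constraint could look something like the sketch below. The function name, tuple layout and sample data are all hypothetical, and this is not a recommendation: as noted above, whether such a constraint is appropriate or lawful depends on the jurisdiction and on how it is applied.

```python
import math
from collections import defaultdict

def select_same_fraction(candidates, fraction):
    """Select the top `fraction` of candidates by score within each group.

    candidates: list of (candidate_id, group, score) tuples.
    Rounds up, so every non-empty group contributes at least one candidate
    when fraction > 0.
    """
    by_group = defaultdict(list)
    for candidate in candidates:
        by_group[candidate[1]].append(candidate)

    selected = []
    for members in by_group.values():
        k = math.ceil(fraction * len(members))
        ranked = sorted(members, key=lambda c: c[2], reverse=True)
        selected.extend(ranked[:k])
    return selected

# Hypothetical scored candidates in two groups of different sizes.
candidates = [
    ("c1", "group_a", 0.91), ("c2", "group_a", 0.85),
    ("c3", "group_a", 0.60), ("c4", "group_a", 0.40),
    ("c5", "group_b", 0.70), ("c6", "group_b", 0.55),
]

shortlist = select_same_fraction(candidates, fraction=0.5)
print([c[0] for c in shortlist])  # ['c1', 'c2', 'c5']
```

Note the trade-off this makes visible: c5 is shortlisted with a lower score than would be needed in group_a, which is exactly the kind of intervention that makes this approach contentious.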

Wendy Hall, a professor of computer science at Southampton University and the author of a review into artificial intelligence commissioned by the UK government, mentions that “There’s a huge problem of bias in the [technology] workforce, but if you correct for it, you are manipulating things. Dealing with this is a big issue for how artificial intelligence is designed.”

Earlier this year, some of the world’s leading researchers of algorithmic bias convened in Halifax to discuss fairness, accountability, and transparency in machine learning (FAT/ML), seeking to develop and promote algorithmic transparency and accountability to counter the possibility of discrimination. I believe governments, solutions providers, our industry, and the broader community must do more to encourage this kind of work, whether in exploratory stages or advanced implementations.

I commend Pymetrics for open-sourcing their Audit AI tool for detecting bias in algorithms, now available on GitHub, which can be used to mitigate discriminatory patterns that exist within training data sets which influence or improve the probability of a population being selected by a machine learning algorithm.

More broadly, Microsoft and IBM appear to be developing algorithms for detecting, rating, and correcting bias and discrimination, although not necessarily specific to hiring.

When it comes to the data algorithms are using, think about “completeness,” not just regarding whether specific fields are populated, but whether the data set is complete relative to the target population. Wherever practically possible, seek to reduce the bias embedded in the data you gather and use in sourcing, recruiting, hiring and internal mobility. As Wendy Hall notes, “Now, with AI, we talk about bias in, bias out.” With the spread of artificial intelligence to employment functions such as recruitment, she says, bad inputs can mean biased outputs, which can lead to repercussions for women, the disabled and ethnic minorities.

Rachel Thomas, who works at fast.ai, a non-profit research lab that partners with the University of San Francisco’s Data Institute to provide training in deep learning to the developer community, raised a number of insightful questions to ask about AI-powered solutions during her “Analyzing & Preventing Unconscious Bias in Machine Learning” keynote presentation at QCon.ai 2018:

What bias is in the data? There’s some bias in all data, and we need to understand what it is and how the data was created.
Can the code and data be audited? Are they open source? There’s a risk when closed-source proprietary algorithms are used to decide things in healthcare and criminal justice and who gets hired or fired.
What are the error rates for different subgroups? If we don’t have a representative data set, we may not notice that our algorithm is performing poorly on some subgroup. Are our sample sizes large enough for all subgroups in our data set? It’s important to check this, just like ProPublica did with the recidivism algorithm that looked at race.
What is the accuracy of a simple rule-based alternative? It’s imperative to have a good baseline, and that should be the first step whenever we’re working on a problem because if someone asks if 95% accuracy is good, we need to have an answer. The correct answer depends on the context. This came up with the recidivism algorithm, which was no more effective than a linear classifier of two variables. It’s good to know what that simple alternative is.
What processes are in place to handle appeals or mistakes? We need a human appeals process for things that affect people’s lives. We, as engineers, have relatively more power in asking these questions within our companies.
How diverse is the team that built it? The teams building our technology should be representative of the people that are going to be affected by it, which increasingly is all of us.
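Two of the questions above, error rates by subgroup and the accuracy of a simple alternative, are easy to sketch in code. The rows below are made-up evaluation data, and the "always predict the most common label" baseline is my stand-in for a simple rule-based alternative, not the two-variable classifier mentioned in the talk.

```python
from collections import Counter, defaultdict

def error_rate_by_group(rows):
    """rows: iterable of (group, y_true, y_pred) -> {group: (error_rate, n)}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [errors, total]
    for group, y_true, y_pred in rows:
        counts[group][0] += int(y_true != y_pred)
        counts[group][1] += 1
    return {g: (errors / total, total) for g, (errors, total) in counts.items()}

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    labels = list(labels)
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Hypothetical evaluation rows: (group, true label, predicted label).
rows = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 0, 0), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 0, 0),
]

per_group = error_rate_by_group(rows)
overall_accuracy = sum(t == p for _, t, p in rows) / len(rows)
baseline_accuracy = majority_baseline_accuracy(t for _, t, _ in rows)

print(per_group)          # group_a: 12.5% errors on 8 rows; group_b: 50% on 2 rows
print(overall_accuracy)   # 0.8
print(baseline_accuracy)  # 0.6
```

An overall accuracy of 80% hides a 50% error rate on group_b, and with only two rows for that group the estimate is too noisy to trust in either direction. That is the sample-size question in miniature.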

I would recommend asking the above questions when evaluating and selecting any AI-powered HR technology. You should also ask vendors whether they can explain precisely why and how their solution ranks results in the order they are presented, how they specifically address the risk of algorithmic bias, and whether they can verify that the use of their solution produces no adverse impact. Of course, if you’re already using such solutions, and you haven’t already asked those questions, now might be a good time.

Although there are risks of algorithmic bias when using AI for sourcing/matching, the reality is the manual search and non-AI powered methods we’ve been using for the past two decades may also be prone to unintended bias. The same applies to other applications of AI to the hiring process, such as AI video interview analysis vs. standard video or in-person interviewing. If you don’t have demographic data, there is no way to measure and say with any certainty that your baseline processes, including basic keyword search, do not, under certain circumstances, produce biased results that end up driving candidate selection and thus hiring outcomes.

At this stage of development, it is likely to be unrealistic to expect the perfection of 0% probability of adverse impact from AI-powered talent acquisition solutions, especially if we don’t inspect and expect it from all of our legacy systems and processes. However, I believe combining awareness with efforts designed to specifically address the key areas that can give rise to bias across the entire hiring funnel can easily improve diversity and inclusion above what we are achieving today, while we continue to work towards solutions that can eliminate bias altogether.