Many modern software systems in finance, e-commerce, manufacturing, transportation and other industries include elements of AI technology. Applications that use AI -- or, specifically, machine learning technology -- present a new type of problem to testers: systematic bias.
Algorithm bias, much like human bias, stems from knowledge acquired through past experience. In the case of AI-based software, past experience comes in the form of data, known as training sets, that is used to train these systems. But developers and architects can inadvertently select or influence the input data so that it doesn't accurately represent the problem domain -- and, consequently, can create algorithm bias that taints an AI system's output.
For example, many websites and search engines try to identify user characteristics to offer users more appropriate products and services. But, in doing so, data analysts often make assumptions that skew results.
If males are more likely to visit and view code on software development websites, for instance, the system could miscategorize female users as male because they visited these sites. This inaccuracy could then influence the decisions an AI-based application makes about those users.
Additionally, it's all too easy to conflate correlation with causation. For example, AI training data could suggest a strong correlation between IQ score and income level. But a high IQ score could also indicate that a person has a higher education level -- a factor that could be a more important causative variable for income than IQ score is.
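One way to probe this kind of confusion is a partial correlation, which measures how strongly two variables move together once a third is held fixed. The sketch below runs on synthetic data generated under the assumption that education drives both IQ score and income. The coefficients, noise levels and variable names are illustrative and are not drawn from any real data set.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic data (an assumption for illustration): education influences both
# IQ score and income, and IQ has no direct effect on income.
rng = random.Random(0)
education = [rng.randint(10, 20) for _ in range(2000)]  # years of schooling
iq = [100 + 4 * (e - 15) + rng.gauss(0, 8) for e in education]
income = [30 + 6 * (e - 10) + rng.gauss(0, 10) for e in education]  # thousands

raw = pearson(iq, income)
controlled = partial_correlation(iq, income, education)
print(f"IQ vs. income: {raw:.2f}")
print(f"IQ vs. income, controlling for education: {controlled:.2f}")
```

With data built this way, the raw correlation is strong while the controlled one sits close to zero. That is the pattern a tester would look for before treating IQ score as a cause of income.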
Incomplete data for a problem domain can create inaccuracies that lead to algorithm bias in the form of pernicious feedback loops. The real-world implications of bias in the software's training data are profound. When no one questions their results, predictive AI-based applications can quietly amplify their own biases over time.
Pernicious feedback loops
A prime example of the sort of pernicious feedback loop that can lead to algorithm bias is laid out in mathematician Cathy O'Neil's book Weapons of Math Destruction. She points to systems that score convicted defendants' risks of recidivism. The humans involved, in this case the judges who determine prisoners' sentences, accept the software's predictions as true -- or at least true enough to be useful -- and rarely question its decisions. When those decisions shape the outcomes that are later recorded as data, an unquestioned prediction can end up reinforcing itself.
Use the right data and tests
Testers must ferret out algorithm and related biases in AI-based applications to prevent the legal and public relations ramifications caused by inaccurate results. They look for bugs, logic errors and even fundamental flaws within the software or data.
However, it is nearly impossible to reverse-engineer a prediction and step through an algorithm to see why it generated that particular result. Machine learning models are often too complex for that sort of analysis, which is one reason people tend to assume their results are accurate.
To determine the accuracy of an AI-based application, collect data about its errors -- what the algorithm got wrong or ignored -- and then use that information to analyze and improve it. Don't only collect data from accurate predictions. Accept and expect the likelihood of error.
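As a minimal sketch of that kind of error collection, the code below keeps every wrong prediction and compares false positive rates across groups. The record layout (group, predicted, actual) and the sample rows are hypothetical. A real test would pull them from the application's prediction logs.

```python
from collections import defaultdict

# Hypothetical log rows: (group, predicted_positive, actual_positive)
records = [
    ("A", True, True), ("A", True, False), ("A", False, False),
    ("A", False, False), ("A", False, False), ("A", False, True),
    ("B", True, True), ("B", True, False), ("B", True, False),
    ("B", False, False), ("B", False, False), ("B", True, True),
]

def false_positive_rates(rows):
    """Return {group: false positives / actual negatives}."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted, actual in rows:
        if not actual:
            negatives[group] += 1
            if predicted:
                false_pos[group] += 1
    return {group: false_pos[group] / count for group, count in negatives.items()}

# Keep the misses, not only the hits, so they can be analyzed later.
errors = [row for row in records if row[1] != row[2]]
rates = false_positive_rates(records)
print(rates)  # {'A': 0.25, 'B': 0.5}
```

In this made-up sample, group B is wrongly flagged twice as often as group A. A gap like that is worth investigating even when overall accuracy looks acceptable.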
Also, carefully examine the training data and validation sets used to build up an AI-based application's body of knowledge. Scrutinize what the training data actually measures, as well as whether it has both a logical and practical relationship to the end result.
Sort fact vs. fiction with vendor pitches
Software products in diverse industries highlight their AI capabilities as a means to make work easier for the end user. Of course, some vendors simply try to capitalize on the buzz of AI, while others can truly give their customers a competitive advantage via AI.
Don't start a proxy war
In mathematical models, data scientists sometimes elect to use proxy measures, or stand-in metrics, when fresher and more relevant data isn't available. Using one kind of data as a substitute for another has limits, though. Often, the proxy data isn't fully applicable to what the model is trying to predict.
Many people don't adequately check whether there is enough of a relationship between proxy metrics and the prediction itself. And, if they do, they often assume that a correlation is sufficient.
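One simple check, sketched below, is to measure the proxy against the real target not only across the whole data set but also within each subgroup. The rows, group labels and function names are hypothetical and chosen so the arithmetic is easy to follow. Note that even a strong correlation here shows only that the proxy tracks the target in this sample, not that the relationship is sound.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rows: (group, proxy_value, target_value)
rows = [
    ("A", 1, 1), ("A", 2, 2), ("A", 3, 3), ("A", 4, 4),
    ("B", 6, 8), ("B", 7, 7), ("B", 8, 8), ("B", 9, 7),
]

def proxy_correlations(rows):
    """Correlation between proxy and target, overall and per group."""
    groups = {}
    for group, proxy, target in rows:
        groups.setdefault(group, []).append((proxy, target))
    overall = pearson([r[1] for r in rows], [r[2] for r in rows])
    by_group = {g: pearson(*zip(*pairs)) for g, pairs in groups.items()}
    return overall, by_group

overall, by_group = proxy_correlations(rows)
print(f"overall: {overall:.2f}")
for group, r in by_group.items():
    print(f"group {group}: {r:.2f}")
```

In this sample, the proxy looks strong in aggregate (about 0.93) yet runs the wrong way inside group B (about -0.45), so a model that relies on it would mislead for that group.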
QA professionals must also examine common test cases and edge cases to ensure that the right decisions are being made. These test cases can help them to determine whether the results of an AI algorithm are correct and unbiased.
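One form such a test case can take is a counterfactual check: change only an attribute that should not matter and confirm the decision stays the same. The sketch below assumes a hypothetical loan-approval model. Both model functions, the field names and the thresholds are stand-ins invented for illustration, not any real system's logic.

```python
def counterfactual_failures(model, cases, attribute, values):
    """Return cases whose decision changes when only `attribute` changes."""
    failures = []
    for case in cases:
        decisions = {value: model({**case, attribute: value}) for value in values}
        if len(set(decisions.values())) > 1:
            failures.append((case, decisions))
    return failures

# Hypothetical stand-ins for the system under test.
def fair_model(applicant):
    return applicant["income"] >= 3 * applicant["payment"]

def biased_model(applicant):
    multiple = 3 if applicant["gender"] == "male" else 4
    return applicant["income"] >= multiple * applicant["payment"]

cases = [
    {"income": 5000, "payment": 1000},  # common case, clearly approved
    {"income": 3000, "payment": 1000},  # edge case, exactly on the threshold
    {"income": 2999, "payment": 1000},  # edge case, just below the threshold
    {"income": 0, "payment": 0},        # edge case, empty values
]

genders = ["female", "male"]
fair_failures = counterfactual_failures(fair_model, cases, "gender", genders)
biased_failures = counterfactual_failures(biased_model, cases, "gender", genders)
print(len(fair_failures), len(biased_failures))  # 0 1
```

The biased stand-in passes the common case and fails only on the threshold case, which is why edge cases belong in the suite alongside typical ones.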
It can be difficult to rehabilitate or scrap an entire algorithm or the training data fed into it, as that means the AI model has to be entirely rethought -- an expensive proposition. But that outcome is better than a software product that returns incorrect results without explanation.
Testers must expand their role from traditional functionality and user story requirements checks to include algorithm soundness, as well. Whoever tests for algorithm bias must examine what went wrong, the effects of newly input data and the relationships between separate systems. In this context, testers might have a lot of discretion around the determination of software quality and whether the AI algorithms make the best possible decisions across a range of cases.