New Age of Testing: The Necessity of a New Approach in the AI Era
We are living in a time of dynamic AI development. It is already clear that this technology will, in some time, transform our lives. Consequently, the approach to testing must also evolve.
What will change for sure? First of all, our expectations regarding AI go much further than in the case of traditional algorithmic software.
EXAMPLES SHOWING SHORTCOMINGS OF TRADITIONAL TESTING
Let’s consider a chatbot based on LLM. If we want to prepare a set of questions and check answers, we will face the following problems. The first problem is that for a general-purpose chatbot, it is difficult to choose the universally optimal set of questions. The second problem is that any ‘defect’ can be quickly masked by new fine-tuning with our test data. The problem with the quality of reasoning will not be fixed, but the answers to our testing questions will be correct. When we continuously train the model, we are actually engaged in an endless chase to find new testing questions and teach the right answers without a deeper understanding of the problems.
Let’s delve further and consider AI-based generators of graphics or movies. Here, an additional problem arises: aesthetics. Our expectation is that the generated output will be visually appealing and leave an impression. Traditional testing theory states that requirements should be testable. Is an impression testable? How do we test it? How do we measure impression? How can we determine, for example, if a generated picture is creepier or less creepy?
The more human-like AI becomes, the more human-like behaviors we expect and want from them to ensure. Also, the more difficult it becomes for traditional testing.
FUNDAMENTALS OF TESTING IN MODERN TIMES: EXPECTATIONS AND LIMITATIONS
Every software product is considered based on the needs it fulfills. This forms the foundation of all considerations regarding testing. An AI-based solution is also a type of product and needs to be treated as such.
What are our expectations for AI-based solutions? We tend to expect that AI should behave similarly to humans. This means that we anticipate AI perceiving things as humans do, reasoning like humans do, and ultimately producing output similar to what a human could produce.
Furthermore, we anticipate that AI-based solutions will be creative, meaning they will be capable of addressing completely new situations and suggesting innovative solutions.
NEW AREAS OF SECURITY TESTING
LLMs are designed to be safe. They should refuse to answer certain questions and withhold certain information. However, there have been cases where specially crafted prompts led to bypassing such security measures. For instance, activation keys were generated in one instance, and a napalm recipe was generated as if it were a grandmother’s bedtime story in another. Such issues necessitate new approaches in security testing.
Some ‘jailbreaking’ prompts are already widely available on the internet. Certain researchers suggest that these vulnerabilities might even be an inherent aspect of all LLMs.
Moreover, the problem can become even more complex when extending beyond textual conversation into richer domains like the visual layer. We must consider the consequences, such as deceiving AI-based autonomous cars. There have already been experiments in this area. Certainly, this requires further investigation, development, and, of course, testing.
CLASSICAL VS. NEW APPROACH TO SOFTWARE TESTING
Classical software must execute predefined tasks in a predetermined manner. In contrast, generative AI must function like an engineer, showcasing creativity in solving given tasks that aren’t predefined.
Applying the classical approach to AI is akin to giving a student a list of exam questions along with their corresponding answers. The student may pass the exam, but when faced with a new real-world problem, they could feel lost. Evaluating the quality of generative models must resemble assessing students using a diverse set of previously unknown practical tasks.
This new approach necessitates drawing upon knowledge from exam administration within the QA field. Furthermore, evaluating model quality will involve a more probabilistic approach compared to the traditional binary one.
Another analogy illustrating this issue is the examination of future drivers. When we think about autonomous AI-controlled vehicles, we expect them to navigate roads with a level of safety at least comparable to that of humans. Therefore, it’s worth examining how we test humans.
Before an individual becomes a driver, they must pass both a theoretical and practical exam. The theoretical exam comprises a finite number of questions, akin to unit tests (or more precisely, a subset of such tests). The practical exam involves observing the individual as they handle real challenges on the road, which are influenced by random events. In one instance, the examinee might encounter more favorable conditions, while in another, less favorable conditions. Is this fair? Is this type of exam sufficient? This is where the principles of mathematical statistics and probability theory come into play.
In mathematical statistics, each outcome stems from a random experience. Thus, any conclusion drawn from empirical data carries a probability of being incorrect. Consider a question with two possible answers, “true” or “false.” A randomly given response will be correct with a probability of 1/2. If we repeat this experiment 10 times, the probability of acing a 10-question exam flawlessly is (1/2)^10, which is less than 1/1000. This concept is the Bernoulli’s scheme known well from school. As evident, increasing the sample size serves as a means of reducing the likelihood of obtaining correct results purely by chance. When applied to humans, a significant increase in attempts can lead to fatigue. Fortunately, machines don’t tire, and they can perform just as effectively after 10 hours of continuous operation.
In the Bernoulli scheme, the scope of variability is extremely narrow (true/false). If we consider multiple-choice tests with more than two possible answers, the probability of randomly getting correct results diminishes.
Multiple-choice tests primarily assess knowledge. But what about assessing skills? Let’s look at recruitment tasks for programmers. These tasks often manifest as challenges requiring the creation of a small program. Such a program can be analyzed functionally (unit tests) as well as non-functionally (performance, code length, code complexity).
THE FUTURE OF QA: WHAT LIES AHEAD?
To envision the future of QA in the realm of generative AI testing, we must scrutinize how model evaluation is conducted and how human knowledge is assessed. Numerous articles already compare LLM models across specific task types, with rankings that incorporate various parameters. This field is in its nascent stage, but some trends are already discernible.
WHAT SHOULD BE EVALUATED?
As we know, there are numerous models trained using different datasets and for different purposes. These models can exhibit varied performance across tasks. For instance, a model primarily trained on Git repositories might excel in coding tasks but fall short in social interactions. Just as humans possess diversified skills, not everyone can be suited for every job.
Modern-era testers’ crucial competency will lie not only in understanding evaluation methods but also in identifying the appropriate evaluation domain for a given application.
Currently, research papers encompass the following evaluation areas (see, for example, [1]):
- Natural language processing
- Robustness / Ethics / Biases / Trustworthiness
- Social science
- Natural science & engineering
- Medical applications
EXISTING EVALUATION TOOLS
We can divide existing tools into quantitative and qualitative. The following quantitative tools are available:
- OpenAI evals library,
lm-evaluation-harness
python package by EleutherAI- HELM — Holistic Evaluation of Language Models
Quantitative evaluation is still the area of future development. It is also more dependent on the specific needs of given application.
FUNDAMENTAL CHALLENGES IN MODEL SELECTION
As a graduate in applied mathematics, I find statistics and knowledge of statistical hypothesis testing to be fundamental in natural sciences. Let me briefly share some elementary insights in this area.
Two types of errors exist:
- Type I error
- Type II error
We cannot simultaneously control both types of errors by altering test parameters. We must decide which type of error concerns us more. Sometimes this is self-evident, while other times it is not.
Similarly, when we expect answers from models, we cannot control both types of errors in those answers. Our model might be more averse to one type of error over the other, and both approaches can prove useful in specific situations.
Consider a testing scenario: testing software for a computer-controlled radiation therapy machine. If the model disregards the risk of a defect’s presence due to fearing false positive alarms, someone could be exposed to excessive radiation, resulting in grave consequences (cf. the infamousTherac-25 case).
Conversely, in non-safety-critical applications, a certain level of defects might be tolerable, and only glaring critical defects are scrutinized and resolved. In such cases, the problem of software delivery delays outweighs that of minor defects. Excessive defect reporting could prove entirely ineffective.
Choosing which side is more crucial is again not obvious, and such a decision (even if supported by AI) must lead to selecting a specific model with a stricter or more lenient approach.
META-LEVEL CHALLENGES: ENSURING DATA QUALITY IN TRAINING
Once again, let’s compare AI to humans. Even an intelligent individual, when starting from incorrect assumptions, can arrive at erroneous conclusions. The same applies to AI. If a model is trained on poor-quality data, its output can be biased and harmful. The new approach to software testing must account for this.
Quality assurance needs to devise methods and tools for selecting and rectifying high-quality training data. This concept aligns with the idea of shift-left testing. The higher the quality of training data, the better the model’s output will be.
SOCIAL CHALLENGES PERTAINING TO MODEL EVALUATION
Consider medical applications of LLM. Clearly, a responsible approach in this domain demands extreme caution, especially when providing medical advice to individuals lacking medical education.
Simultaneously, people’s expectations might lean towards LLMs serving as cost-effective alternatives to expensive medical assistance. This could lead to the emergence of conspiracy theories suggesting that the medical establishment opposes AI development to safeguard their business interests.
The story was originally created by me, but it may contain parts that were created with AI assistance. My original text has been corrected and partially rephrased by Chat Generative Pre-trained Transformer to improve the language.
If you like this article and would be interested to read the next ones, the best way to support my work is to become a Medium member using this link:
If you are already a member and want to support this work, just follow me on Medium.