
The rise of AI-infused applications, particularly those leveraging Large Language Models (LLMs), has introduced a major challenge to traditional software testing: non-determinism. Unlike conventional applications that produce fixed, predictable outputs, AI-based systems can generate varied, yet equally correct, responses for the same input. This unpredictability makes ensuring test reliability and stability a daunting task.
A recent SD Times Live! Supercast, featuring Parasoft evangelist Arthur Hicken and Senior Director of Development Nathan Jakubiak, shed light on practical solutions to stabilize the testing environment for these dynamic applications. Their approach centers on a combination of service virtualization and next-generation AI-based validation techniques.
Stabilizing the LLM’s Chaos with Virtualization
The core problem stems from what Hicken called the LLM’s capriciousness: slight variations in descriptive language or phrasing make tests noisy and cause them to fail unpredictably. The proposed solution is to isolate the non-deterministic LLM behavior behind a proxy and service virtualization.
“One of the things that we like to recommend for people is first to stabilize the testing environment by virtualizing the non-deterministic behaviors of services in it,” Hicken explained. “So the way that we do that, we have an application under test, and obviously because it’s an AI-infused application, we get variations in the responses. We don’t necessarily know what answer we’re going to get, or if it’s right. So what we do is we take your application, and we stick in the Parasoft virtualized proxy between you and the LLM. And then we can capture the real-time traffic that’s going between you and the LLM, and we can automatically create virtual services this way, so we can cut you off from the system. And the cool thing is that we also learn from this so that if your responses start changing or your questions start changing, we can adapt the virtual services in what we call our learning mode.”
Hicken said that Parasoft’s technique involves placing a virtualized proxy between the application under test and the LLM. This proxy can capture a request-response pair. Once learned, the proxy provides that fixed response every time the exact request is made. By cutting the live LLM out of the loop and substituting it with a virtual service, the testing environment is instantly stabilized.
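The record-and-replay pattern described here can be sketched in a few lines. This is an illustrative stand-in, not Parasoft's implementation; the class and method names are hypothetical:

```python
import hashlib
import json


class RecordReplayProxy:
    """Minimal record/replay proxy sketch: captures request/response pairs
    from a live backend (e.g. an LLM), then replays the recorded response
    for every identical request. Illustrative only."""

    def __init__(self, live_backend=None):
        self.live_backend = live_backend  # callable standing in for the real LLM
        self.recordings = {}              # request fingerprint -> recorded response
        self.learning = True              # "learning mode": record unseen requests

    def _key(self, request: dict) -> str:
        # Fingerprint the request so identical requests map to one recording.
        payload = json.dumps(request, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def handle(self, request: dict) -> str:
        key = self._key(request)
        if key not in self.recordings:
            if not self.learning or self.live_backend is None:
                raise KeyError("no recording for this request")
            # Record once from the live backend; later calls replay this.
            self.recordings[key] = self.live_backend(request)
        return self.recordings[key]
```

Once a pair is recorded, the live LLM is out of the loop: the same request always yields the same response, so fixed assertions become reliable.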
This stabilization is crucial because it allows testers to revert to using traditional, fixed assertions, he said. If the LLM’s text output is reliably the same, testers can confidently validate that a secondary component, such as a Model Context Protocol (MCP) server, displays its data in the correct location and with the proper styling. This isolation ensures a fixed assertion on the display is reliable and fast.
Controlling Agentic Workflows with MCP Virtualization
Beyond the LLM itself, modern AI applications often rely on intermediary components like MCP servers for agent interactions and workflows—handling tasks like inventory checks or purchases in a demo application. The challenge here is two-fold: testing the application’s interaction with the MCP server, and testing the MCP server itself.
Service virtualization extends to this layer as well. By stubbing out the live MCP server with a virtual service, testers can control the exact outputs, including error conditions, edge cases, or even a simulated unavailable environment. This ability to precisely control back-end behavior allows for comprehensive, isolated testing of the main application’s logic. “We have a lot more control over what’s going on, so we can make sure that the whole system is acting in a way that we can anticipate and test in a rational manner, enabling full stabilization of your testing environment, even when you’re using MCPs,” Hicken said.
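A stubbed MCP-style back end of this kind might look like the following sketch. The names are hypothetical and this is not a real MCP implementation; it only illustrates how a test can force canned results, errors, or downtime:

```python
class VirtualMCPServer:
    """Sketch of a virtualized MCP-style tool server whose behavior the
    test fully controls: canned responses, forced errors, or a simulated
    unavailable environment. Hypothetical names; illustrative only."""

    def __init__(self):
        self.responses = {}    # tool name -> canned result
        self.failures = set()  # tool names forced to raise an error
        self.available = True  # flip to False to simulate downtime

    def stub(self, tool: str, result):
        self.responses[tool] = result

    def fail(self, tool: str):
        self.failures.add(tool)

    def call_tool(self, tool: str, **params):
        if not self.available:
            raise ConnectionError("virtual MCP server: environment unavailable")
        if tool in self.failures:
            raise RuntimeError(f"virtual MCP server: forced error in '{tool}'")
        return self.responses[tool]
```

A test can then exercise the application's error handling deterministically, e.g. stubbing an inventory lookup while forcing the purchase path to fail.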
In the Supercast, Jakubiak demoed booking camping equipment through a camp store application.
This application depends on two external components: an LLM for processing natural language queries and responding, and an MCP server, which is responsible for things like providing available inventory and product information, and actually performing the purchase.
“Let’s say that I want to go on a backpacking trip, and so I need a backpacking tent. And so I’m asking the store, please evaluate the available options, and suggest one for me,” Jakubiak said. The MCP server finds available tents for purchase and the LLM provides suggestions, such as a two-person lightweight tent for this trip. But, he said, “since this is an LLM-based application, if I were to run this query again, I’m going to get slightly different output.”
He noted that because the LLM is non-deterministic, a traditional fixed-assertion validation approach won’t work, and this is where service virtualization comes in. “Because if I can use service virtualization to mock out the LLM and provide a fixed response for this query, I can validate that that fixed response appears properly, is formatted properly, is in the right location. And I can now use my fixed assertions to validate that the application displays that properly.”
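The point about fixed assertions can be made concrete with a small sketch. Everything here is hypothetical (the fixed reply text, the `render_suggestion` display stand-in); it only shows that once the LLM is mocked to one canned answer, plain equality checks on location, content, and styling become reliable:

```python
# Canned reply supplied by the virtual service in place of the live LLM.
FIXED_REPLY = ("For a backpacking trip I suggest the Trailline 2P: "
               "a lightweight two-person tent.")  # hypothetical text


def mock_llm(prompt: str) -> str:
    # Virtual service: the same answer every run, regardless of prompt.
    return FIXED_REPLY


def render_suggestion(llm, prompt: str) -> dict:
    # Hypothetical stand-in for the application's display layer.
    return {"panel": "suggestions", "text": llm(prompt), "style": "highlight"}


view = render_suggestion(mock_llm, "Suggest a backpacking tent")
assert view["text"] == FIXED_REPLY      # exact-match assertion is now stable
assert view["panel"] == "suggestions"   # appears in the right location
assert view["style"] == "highlight"     # with the proper styling
```

With the live LLM back in the loop, the first assertion would fail on almost every run; against the virtual service it passes every time.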
Having shown how these complex AI-infused applications can be tested, Hicken assured viewers that humans will continue to have a role. “Maybe you’re not creating test scripts and spending a whole lot of time creating these test cases. But the validation of it, making sure everything is acting as it should, and of course, with all the complexity that’s built into all these things, constantly monitoring to make sure that the tests are keeping up when there are changes to the application or scenarios change.”
At some level, he asserted, testers will always be involved because someone needs to look at the application to see that it meets the business case and satisfies the user. “What we’re saying is, embrace AI as a pair, a partner, and keep your eye on it and set up guardrails that let you get a good assessment that things are doing what they should be. And this should help you do much better development and better applications for people that are easier to use.”