
AI is transforming the software landscape, with many organizations integrating AI-driven workflows directly into their applications or exposing their functionality to external, AI-powered processes. This evolution brings new and unique challenges for automated testing. Large language models (LLMs), for example, inherently produce nondeterministic outputs, which complicates traditional testing methods that rely on matching results against specific expectations. Verifying LLM-based systems in automated test runs also means calling these models again and again; if the LLM is provided by a third party, costs can quickly escalate. Additionally, new protocols such as the Model Context Protocol (MCP) and Agent2Agent (A2A) are being adopted, enabling LLMs to gain richer context and execute actions, while agentic systems coordinate between different agents in the environment. What strategies can teams adopt to ensure reliable and effective testing of these new, AI-infused applications in the face of such complexity and unpredictability?
Real-World Examples and Core Challenges
Let me share some real-world examples from our work at Parasoft that highlight the challenges of testing AI-infused applications. For instance, we integrated an AI Assistant into SOAtest and Virtualize, allowing users to ask questions about product functionality or create test scenarios and virtual services using natural language. The AI Assistant relies on external LLMs accessed via OpenAI-compatible REST APIs to generate responses and build scenarios, all within a chat-based interface that supports follow-up instructions from users.
When developing automated tests for this feature, we encountered a significant challenge: the LLM’s output was nondeterministic. The responses presented in the chat interface varied each time, even when the underlying meaning was similar. For example, when asked how to use a particular product feature, the AI Assistant would provide slightly different answers on each occasion, making exact-match verification in automated tests impractical.
Another example is the CVE Match feature in Parasoft DTP, which helps users prioritize which static analysis violations to address by comparing code with reported violations to code with known CVE vulnerabilities. This functionality uses LLM embeddings to score similarity. Automated testing for this feature can become expensive when using a third-party external LLM, as each test run triggers repeated calls to the embeddings endpoint.
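To make that concrete, here is a minimal sketch of what embedding-based similarity scoring looks like against an OpenAI-compatible embeddings endpoint. The endpoint URL, model name, and code snippets are placeholders, and the sketch illustrates the general technique rather than the actual CVE Match implementation; the point is that every test run pays for fresh embeddings calls.

```python
import os

import requests

EMBEDDINGS_URL = "https://llm.example.com/v1/embeddings"  # placeholder endpoint
API_KEY = os.environ["LLM_API_KEY"]


def embed(text: str) -> list[float]:
    """Request an embedding vector for the given text from the LLM provider."""
    resp = requests.post(
        EMBEDDINGS_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "text-embedding-3-small", "input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score how similar two embedding vectors are (closer to 1.0 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


# Compare a flagged snippet with code tied to a known CVE. Each automated test
# run that exercises this path makes two billable embeddings calls.
flagged_snippet = "strcpy(buffer, user_input);"
cve_snippet = "char buf[16]; strcpy(buf, argv[1]);"
print(f"similarity: {cosine_similarity(embed(flagged_snippet), embed(cve_snippet)):.3f}")
```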
Designing Automated Tests for LLM-Based Applications
These challenges can be addressed by creating two distinct types of test scenarios:
- Test Scenarios Focused on Core Application Logic
The primary test scenarios should concentrate on the application’s core functionality and behavior rather than on the unpredictable output of LLMs. Service virtualization is invaluable in this context: service mocks can simulate the behavior of the LLM, allowing the application to connect to the mock LLM service instead of the live model. These mocks can be configured with expected responses for a variety of requests, ensuring that test executions remain stable and repeatable even as a wide range of scenarios is covered.
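As an illustration of the idea, here is a minimal mock of an OpenAI-compatible chat completions endpoint written in plain Python with Flask. The route, port, and canned prompts are assumptions for the example; a tool like Parasoft Virtualize provides the same capability without code and with far richer request matching.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Canned completions keyed on the last user message; extend this per test scenario.
CANNED_RESPONSES = {
    "How do I create a virtual service?": "Open Virtualize, choose New Virtual Asset, ...",
}
DEFAULT_RESPONSE = "This request is not mocked yet."


@app.post("/v1/chat/completions")
def chat_completions():
    body = request.get_json()
    last_user_message = body["messages"][-1]["content"]
    content = CANNED_RESPONSES.get(last_user_message, DEFAULT_RESPONSE)
    # Return only the minimal response shape the application under test reads.
    return jsonify({
        "id": "mock-1",
        "object": "chat.completion",
        "model": body.get("model", "mock"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
    })


if __name__ == "__main__":
    # Point the application under test at http://localhost:8081/v1 instead of the live LLM.
    app.run(port=8081)
```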
However, a new challenge arises with this approach: maintaining LLM mocks can become labor-intensive as the application and test scenarios evolve. For example, prompts sent to the LLM may change when the application is updated, or new prompts may need to be handled for additional test scenarios. A service virtualization learning-mode proxy offers an effective solution. The proxy routes each request to either the mock service or the live LLM, depending on whether it has encountered that request before. Known requests go directly to the mock service, avoiding unnecessary LLM calls; new requests are forwarded to the LLM, and the resulting output is captured and added to the mock service for future use. Parasoft development teams have been using this strategy to stabilize tests with mocked responses, keep the mocks up to date as the application changes or new test scenarios are added, and reduce LLM usage and its associated costs.
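In code, the learning-mode idea reduces to a small recording proxy, sketched below with Flask. The live endpoint URL, recording directory, and port are assumptions, and the sketch is a simplified stand-in for what Parasoft Virtualize does out of the box.

```python
import hashlib
import os

import requests
from flask import Flask, Response, request

LIVE_LLM_URL = "https://llm.example.com/v1/chat/completions"  # placeholder live endpoint
API_KEY = os.environ["LLM_API_KEY"]
RECORDING_DIR = "llm_recordings"
os.makedirs(RECORDING_DIR, exist_ok=True)

app = Flask(__name__)


def recording_path(body: bytes) -> str:
    """Key each recording on a hash of the request body (i.e., the prompt)."""
    return os.path.join(RECORDING_DIR, hashlib.sha256(body).hexdigest() + ".json")


@app.post("/v1/chat/completions")
def proxy():
    path = recording_path(request.get_data())
    if os.path.exists(path):
        # Known request: serve the recorded (mocked) response, no LLM call needed.
        with open(path) as f:
            return Response(f.read(), mimetype="application/json")
    # New request: forward to the live LLM and record its answer for future runs.
    live = requests.post(
        LIVE_LLM_URL,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        data=request.get_data(),
        timeout=60,
    )
    live.raise_for_status()
    with open(path, "w") as f:
        f.write(live.text)
    return Response(live.text, mimetype="application/json")


if __name__ == "__main__":
    app.run(port=8082)
```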
- End-to-End Tests that Include the LLM
While mock services are valuable for isolating business logic, achieving full confidence in AI-infused applications requires end-to-end tests that exercise the actual LLM. The main challenge here is the nondeterministic nature of LLM outputs. To address this, teams can use an “LLM judge”: an LLM-based check that evaluates whether the application’s output semantically matches the expected result. The judging LLM is given both the application’s output and a natural language description of the expected behavior, and it determines whether the content is correct even when the wording varies. Validation scenarios can implement this by sending prompts to an LLM via its REST API, or by using specialized testing tools like SOAtest’s AI Assertor.
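A minimal sketch of such a judge, assuming an OpenAI-compatible chat completions endpoint, is shown below. The endpoint URL and model name are placeholders; SOAtest’s AI Assertor performs this kind of semantic check as a codeless test step, and the sketch only shows the underlying idea.

```python
import os

import requests

JUDGE_URL = "https://llm.example.com/v1/chat/completions"  # placeholder endpoint
API_KEY = os.environ["LLM_API_KEY"]


def llm_judge(actual_output: str, expectation: str) -> bool:
    """Ask the judge LLM whether the actual output satisfies the expected behavior."""
    prompt = (
        "You are verifying a test result.\n"
        f"Expected behavior:\n{expectation}\n\n"
        f"Actual output:\n{actual_output}\n\n"
        "Answer with exactly PASS if the output satisfies the expectation, otherwise FAIL."
    )
    resp = requests.post(
        JUDGE_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-4o-mini",  # placeholder model name
            "temperature": 0,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    verdict = resp.json()["choices"][0]["message"]["content"].strip().upper()
    return verdict.startswith("PASS")


# The wording of the assistant's answer varies between runs, but its meaning should not.
assert llm_judge(
    actual_output="You can build a virtual service from a recorded traffic file.",
    expectation="The assistant explains that virtual services can be created from recorded traffic.",
)
```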
End-to-end test scenarios also face difficulties when extracting data from nondeterministic outputs for use in subsequent test steps. Traditional extractors, such as XPath or attribute-based locators, may struggle with changing output structures. LLMs can be used within test scenarios here as well: by sending prompts to an LLM’s REST API or using UI-based tools like SOAtest’s AI Data Bank, test scenarios can reliably identify and store the correct values, even as outputs change.
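The same kind of LLM call can drive data extraction. The sketch below asks the LLM to pull one value out of free-form output and return it as JSON so that a later test step can reuse it; the endpoint, model name, and example text are placeholders, and SOAtest’s AI Data Bank covers this use case without code.

```python
import json
import os

import requests

LLM_URL = "https://llm.example.com/v1/chat/completions"  # placeholder endpoint
API_KEY = os.environ["LLM_API_KEY"]


def extract_value(output_text: str, description: str) -> str:
    """Extract a single value, described in natural language, from free-form output."""
    prompt = (
        f"From the text below, extract {description}. "
        'Respond with JSON of the form {"value": "..."} and nothing else.\n\n'
        + output_text
    )
    resp = requests.post(
        LLM_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-4o-mini",  # placeholder model name
            "temperature": 0,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    # A production extractor would handle malformed JSON; kept simple here.
    return json.loads(resp.json()["choices"][0]["message"]["content"])["value"]


# The assistant's phrasing changes between runs, but the scenario name it reports
# can still be captured and reused in the next test step.
scenario_name = extract_value(
    "I've created a test scenario called 'LoginSmokeTest' with three steps.",
    "the name of the created test scenario",
)
print(scenario_name)
```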
Testing in the Evolving AI Landscape: MCP and Agent2Agent
As AI evolves, new protocols like MCP are emerging. MCP enables applications to provide additional data and functionality to LLMs, supporting richer workflows, whether user-driven via interfaces like GitHub Copilot or autonomous via AI agents. Applications may offer MCP tools for external workflows to leverage, or they may rely on LLM-based systems that require MCP tools. MCP servers function like APIs: they accept arguments and return outputs, and they must be validated to ensure reliability. Automated testing tools, such as Parasoft SOAtest, help verify MCP servers as applications evolve.
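Because an MCP server is ultimately a JSON-RPC API, a basic smoke test can drive it directly. The sketch below assumes a server reachable over the Streamable HTTP transport at a hypothetical local URL, a made-up lookup_customer tool, and plain JSON responses (real servers may stream results over SSE and differ in handshake details); a tool like SOAtest removes this plumbing, but the shape of the check is the same.

```python
import requests

MCP_URL = "http://localhost:9090/mcp"  # hypothetical MCP endpoint
HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def post(session, payload, headers):
    """Send one JSON-RPC message to the MCP endpoint and fail fast on HTTP errors."""
    resp = session.post(MCP_URL, json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp


with requests.Session() as s:
    # Perform the MCP initialization handshake before calling any tools.
    init = post(s, {"jsonrpc": "2.0", "id": 1, "method": "initialize",
                    "params": {"protocolVersion": "2025-03-26",
                               "capabilities": {},
                               "clientInfo": {"name": "mcp-smoke-test", "version": "0.1"}}},
                HEADERS)
    headers = dict(HEADERS)
    if "Mcp-Session-Id" in init.headers:  # some servers issue a session id to reuse
        headers["Mcp-Session-Id"] = init.headers["Mcp-Session-Id"]
    post(s, {"jsonrpc": "2.0", "method": "notifications/initialized"}, headers)

    # Call a (hypothetical) tool and assert on the structured JSON-RPC result.
    reply = post(s, {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                     "params": {"name": "lookup_customer",
                                "arguments": {"customer_id": "C-1001"}}},
                 headers).json()
    assert "error" not in reply, f"tool call failed: {reply}"
    assert not reply["result"].get("isError", False), reply
    print(reply["result"]["content"])
```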
When applications and test scenarios depend on external MCP servers, those servers may be unavailable, under development, or costly to access. Service virtualization is valuable for mocking MCP servers, providing reliable and cost-effective test environments. Tools like Parasoft Virtualize support creating these mocks, enabling testing of LLM-based workflows that rely on external MCP servers.
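For illustration, here is a hand-rolled mock of an MCP server that answers the few JSON-RPC methods a client under test actually uses with canned responses. The Flask route, port, and lookup_customer tool are assumptions, and a virtualization tool such as Parasoft Virtualize provides the same behavior, plus recording and request matching, without code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Canned JSON-RPC results keyed by method; extend as test scenarios grow.
CANNED = {
    "initialize": {
        "protocolVersion": "2025-03-26",
        "capabilities": {"tools": {}},
        "serverInfo": {"name": "mock-mcp", "version": "0.1"},
    },
    "tools/list": {
        "tools": [{
            "name": "lookup_customer",
            "description": "Return canned customer data",
            "inputSchema": {"type": "object",
                            "properties": {"customer_id": {"type": "string"}}},
        }],
    },
    "tools/call": {
        "content": [{"type": "text",
                     "text": '{"customer_id": "C-1001", "status": "active"}'}],
        "isError": False,
    },
}


@app.post("/mcp")
def mcp():
    req = request.get_json()
    if "id" not in req:
        return "", 202  # notifications (e.g., notifications/initialized) need no body
    method = req.get("method")
    if method not in CANNED:
        return jsonify({"jsonrpc": "2.0", "id": req["id"],
                        "error": {"code": -32601, "message": f"Method not mocked: {method}"}})
    return jsonify({"jsonrpc": "2.0", "id": req["id"], "result": CANNED[method]})


if __name__ == "__main__":
    app.run(port=9090)
```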
For teams building AI agents that interact with other agents, the A2A protocol offers a standardized way for agents to communicate and collaborate. A2A supports multiple protocol bindings (JSON-RPC, gRPC, and HTTP+JSON/REST) and operates like a traditional API with inputs and outputs. Applications may provide A2A endpoints or interact with other agents over A2A, and all of these workflows require thorough testing. As with MCP, Parasoft SOAtest can test agent behaviors against various inputs, while Parasoft Virtualize can mock third-party agents, ensuring control and stability in automated tests.
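As a small example of exercising an A2A agent, the sketch below fetches the agent’s self-description (its agent card) from the well-known discovery path and checks the skills it advertises before any message exchange is tested. The base URL and the invoice-lookup skill are hypothetical, and the discovery path and field names follow published A2A examples, so verify them against the spec revision your agents implement; request/response testing then follows the same JSON-RPC pattern shown for MCP above.

```python
import requests

AGENT_BASE_URL = "http://localhost:9999"  # hypothetical A2A agent under test

# Fetch the agent card from the agent's well-known discovery path.
card = requests.get(f"{AGENT_BASE_URL}/.well-known/agent.json", timeout=10).json()

# Assert on what the rest of the test suite relies on: a name and the expected skill.
assert card["name"], "agent card should declare a name"
assert card.get("skills"), "agent card should advertise at least one skill"
skill_ids = {skill["id"] for skill in card["skills"]}
assert "invoice-lookup" in skill_ids, f"expected skill missing; found: {sorted(skill_ids)}"
print(f"{card['name']} advertises skills: {sorted(skill_ids)}")
```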
Conclusion
As AI continues to reshape the software landscape, testing strategies must evolve to address the unique challenges of LLM-driven and agent-based workflows. By combining advanced testing tools, service virtualization, learning proxies, techniques to handle nondeterministic outputs, and testing of MCP and A2A endpoints, teams can ensure their applications remain robust and reliable—even as the underlying AI models and integrations change. Embracing these modern testing practices not only stabilizes development and reduces risk, but also empowers organizations to innovate confidently in an era where AI is moving to the core of application functionality.




