
Chaos engineering, the practice of proactively injecting failure to test system resilience, has evolved. For enterprises today, the focus has shifted from chaos to reliability testing at scale.
“Chaos testing, chaos engineering is a little bit of a misnomer,” Kolton Andrus, founder and CEO of Gremlin, told SD Times about the term with which he launched the company. “It was cool and hot for a little while, but a lot of companies aren’t really interested in chaos. They’re interested in reliability.”
For large enterprises, disaster recovery testing, such as evacuating a data center or testing the failure of a cloud region, is a massive undertaking. Customers have spent hundreds of engineer-months putting these exercises together, so the tests are run infrequently, leaving organizations exposed to risks that only surface under load.
The new focus is on building scaffolding to make this testing repeatable and easy to run across a whole company by clicking a few buttons. Andrus noted that a crucial element is safety, with Gremlin integrating into system health signals to ensure that if anything goes wrong, the changes are cleaned up, rolled back, or reverted immediately, preventing actual customer risk.
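The pattern Andrus describes, where an experiment is halted and cleaned up the moment health signals degrade, can be sketched roughly as follows. This is a minimal illustration, not Gremlin’s implementation; the inject_fault, revert_fault, and error_rate callables are hypothetical stand-ins for whatever fault injection and monitoring a team already has in place.

```python
import time

def run_guarded_experiment(inject_fault, revert_fault, error_rate,
                           max_error_rate=0.05, duration_s=300, poll_s=5):
    """Run a fault-injection experiment, reverting immediately if health degrades.

    inject_fault / revert_fault: hypothetical callables that start and undo the fault.
    error_rate: hypothetical callable returning the current error rate from monitoring.
    """
    inject_fault()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if error_rate() > max_error_rate:
                print("Health signal breached -- halting experiment")
                return "aborted"
            time.sleep(poll_s)
        return "completed"
    finally:
        # Always clean up, whether the experiment completed, aborted, or crashed.
        revert_fault()
```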
How to Test Against a Cloud Data Center Outage
A key question for any company is how to simulate a major failure, such as an AWS data center outage. “Ultimately, we are doing some disruption in production because that’s what you’re testing,” Andrus explained. Gremlin’s tooling can essentially create a network partition around a data center or availability zone. “So if I’ve got three zones, I can make one zone a true split brain. It can only see itself, it can only talk to itself.”
By doing the testing at the network layer, he said, organizations gain the ability to undo things quickly if something goes wrong. “We’re not making an API call to AWS and saying ‘Shut down Dynamo, and remove these buckets,’ or ‘Shut down all my EC2 instances in this zone for an hour,’ because that’s hard to revert and you might get throttled by the AWS API when you’re bringing it back up.”
Gremlin itself was built to be zone redundant from the beginning, Andrus said, so if one zone’s data centers fail, the application can keep running in another zone.
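The kind of network-layer partition Andrus describes can be approximated with ordinary packet filtering, which is what makes it quick to undo. The sketch below is not how Gremlin does it; it simply illustrates the idea by dropping traffic to the other zones’ address ranges with iptables and deleting the same rules to revert. The CIDR blocks are invented for the example.

```python
import subprocess

# Hypothetical CIDR blocks for the "other" availability zones we want to cut off.
OTHER_ZONE_CIDRS = ["10.0.1.0/24", "10.0.2.0/24"]

def partition_zone():
    """Drop all outbound traffic to the other zones (requires root)."""
    for cidr in OTHER_ZONE_CIDRS:
        subprocess.run(["iptables", "-A", "OUTPUT", "-d", cidr, "-j", "DROP"], check=True)

def heal_partition():
    """Delete the same rules, restoring connectivity almost instantly."""
    for cidr in OTHER_ZONE_CIDRS:
        subprocess.run(["iptables", "-D", "OUTPUT", "-d", cidr, "-j", "DROP"], check=True)
```

Because the fault lives entirely in local firewall rules, reverting is a matter of deleting them rather than waiting on cloud APIs to recreate instances or stop throttling requests.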
The direct revenue impact, calculated by comparing the estimated number of expected orders against the drop in actual orders, is only the floor of an outage’s cost; the total impact is much greater. It includes a substantial engineering cost: teams spending days triaging, finding the root cause, and fixing the issue, followed by meetings and follow-up work.
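That floor can be estimated with simple arithmetic. The figures below are invented purely for illustration.

```python
def direct_revenue_impact(expected_orders, actual_orders, avg_order_value):
    """Lower bound on an outage's cost: orders lost during the window times their value."""
    return max(expected_orders - actual_orders, 0) * avg_order_value

# Hypothetical numbers: 12,000 orders expected during the outage window, 7,500 completed.
print(direct_revenue_impact(12_000, 7_500, 42.50))  # 191250.0
```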
When tests fail, the remediation is guided by reliability intelligence, which draws from millions of previous experiments run through Gremlin to deduce likely causes and provide concrete, concise recommendations on how to fix the issues.
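The article does not detail how that intelligence works internally, but the general idea of mapping an observed failure pattern to a recommended fix can be sketched as below. The catalog entries are invented examples, not Gremlin’s actual recommendations.

```python
# Invented catalog: failure signature -> suggested remediation.
KNOWN_FAILURE_PATTERNS = {
    "single_zone_database": "Add a read replica or failover target in a second zone.",
    "dns_ttl_too_high": "Lower DNS TTLs so traffic can shift zones quickly.",
    "state_not_replicated": "Replicate session and cart state across zones.",
}

def recommend_fixes(observed_signatures):
    """Return concrete recommendations for the failure signatures an experiment surfaced."""
    return [KNOWN_FAILURE_PATTERNS[s] for s in observed_signatures if s in KNOWN_FAILURE_PATTERNS]

print(recommend_fixes(["single_zone_database", "dns_ttl_too_high"]))
```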
The biggest risks are often not the network itself, but the resulting failures in microservices. Subtle gaps, such as running in multiple regions while relying on a database in only one, or failing to distribute state across zones, can mean lost customer carts or transactions. Company-wide testing focuses on the “glue and all the wiring” that connects services: DNS, traffic routing, and propagating important data across zones.
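One of those subtle risks, a database that lives in only one zone, is the kind of thing that can be audited directly. A minimal sketch, assuming the AWS boto3 SDK and credentials are available, that flags RDS instances not configured for Multi-AZ:

```python
import boto3

def single_az_databases(region="us-east-1"):
    """Return RDS instance identifiers that are not Multi-AZ (a single-zone dependency)."""
    rds = boto3.client("rds", region_name=region)
    risky = []
    for db in rds.describe_db_instances()["DBInstances"]:
        if not db.get("MultiAZ", False):
            risky.append(db["DBInstanceIdentifier"])
    return risky

if __name__ == "__main__":
    for name in single_az_databases():
        print(f"Single-AZ database dependency: {name}")
```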
Ultimately, Andrus said, it’s about “finding those risks and fixing them so when the real thing happens, you don’t get surprised by this alternate behavior.”