Chaos Engineering

Chaos engineering is the discipline of experimentally injecting failures into a system in order to build confidence in the system's capability to withstand unexpected disruptions in production.

Goal

The goal of chaos engineering is to identify weaknesses and potential points of failure in software systems before they lead to outages, so that they can be fixed and the system's resilience and reliability improved.

Context

As systems grow in complexity, their components become more interconnected, and it becomes harder to predict how they will behave under failure conditions. Chaos engineering provides a structured approach to understanding system behaviour under stress and identifying areas for improvement.

Chaos Test Types

Type	Description	Examples
Fault Injection	This is the core technique of chaos engineering, where you deliberately introduce faults or errors into the system. Tools like Gremlin, Chaos Monkey (Netflix), or Hey!Chaos can be used to automate fault injection at various layers of the system (network, infrastructure, application).	Latency: Simulate slow network connections or delays in processing requests. Packet Loss: Introduce data loss during communication between system components. Resource Exhaustion: Deplete resources like CPU, memory, or disk space to test how the system handles overload. Process Termination: Simulate unexpected crashes or restarts of applications or services.
State Transitions	This approach focuses on manipulating the state of system components to trigger unexpected behaviours.	Sudden Scaling: Rapidly scale system resources (up or down) to observe how the system adapts to changing loads. Database Corruption: Introduce temporary corruption in data stores (in a controlled environment) to test data integrity and recovery mechanisms. Cache Invalidation: Invalidate cache entries to test how the system fetches fresh data from the origin.
Network Partitioning	This technique simulates network disruptions by isolating parts of the system from each other. This can expose issues with communication or distributed functionalities.	Simulating Network Outages: Isolate specific components or entire sections of the network to test how the system handles communication failures. Partitioning Services: Temporarily disconnect communication between services or microservices to assess their resilience and failover mechanisms.
Limit Testing	This approach involves pushing the system beyond its normal operating limits to identify potential bottlenecks or resource constraints.	Load Testing: Simulate high volumes of traffic or concurrent requests to test system performance and scalability under heavy load. Data Floods: Introduce large amounts of data into the system to test its ability to handle data ingestion and processing efficiently. Resource Starvation: Deliberately limit available resources (CPU, memory) to observe how the system handles resource constraints and prioritises tasks.

Common pitfalls when running these tests:
Lack of Observability: Failing to monitor system behaviour during experiments can prevent issues from being identified.
Overlooking Post-Experiment Analysis: Skipping thorough analysis after experiments misses opportunities for learning and improvement.
Overly Aggressive Experiments: Running experiments that are too disruptive creates unnecessary risk and can damage the system.
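
The fault injection row of the table above (latency and simulated failures) can be illustrated with a small application-level sketch in Python. This is not the API of Gremlin, Chaos Monkey, or any other tool named here; the names FaultInjector, fetch_price, fetch_with_fallback, and run_experiment are hypothetical and exist only for this example. The sketch wraps a stand-in service call, injects delays and errors with configurable probabilities, and records counts so the experiment's effect can be observed afterwards.

```python
import random
import time


class FaultInjector:
    """Wraps a callable and injects latency or errors with given probabilities.

    Hypothetical helper for illustration; not part of any chaos tool.
    """

    def __init__(self, latency_s=0.0, latency_rate=0.0, error_rate=0.0, seed=None):
        self.latency_s = latency_s        # delay added when latency is injected
        self.latency_rate = latency_rate  # probability of injecting a delay
        self.error_rate = error_rate      # probability of injecting a failure
        self.rng = random.Random(seed)    # seeded so experiments are repeatable
        self.stats = {"calls": 0, "delayed": 0, "failed": 0}

    def wrap(self, func):
        def wrapper(*args, **kwargs):
            self.stats["calls"] += 1
            if self.rng.random() < self.latency_rate:
                self.stats["delayed"] += 1
                time.sleep(self.latency_s)
            if self.rng.random() < self.error_rate:
                self.stats["failed"] += 1
                raise ConnectionError("injected fault")
            return func(*args, **kwargs)
        return wrapper


def fetch_price(item):
    """Hypothetical downstream call standing in for a real service."""
    return {"item": item, "price": 10}


def fetch_with_fallback(call, item, retries=2):
    """Retry the call a few times, then fall back to a degraded response."""
    for _ in range(retries + 1):
        try:
            return call(item)
        except ConnectionError:
            continue
    return {"item": item, "price": None}


def run_experiment(error_rate, n=200, seed=42):
    """Run n requests under the given error rate and report what happened."""
    injector = FaultInjector(error_rate=error_rate, seed=seed)
    flaky = injector.wrap(fetch_price)
    results = [fetch_with_fallback(flaky, f"sku-{i}") for i in range(n)]
    degraded = sum(1 for r in results if r["price"] is None)
    return injector.stats, degraded


if __name__ == "__main__":
    for rate in (0.0, 0.2, 0.5):
        stats, degraded = run_experiment(error_rate=rate)
        print(f"error_rate={rate}: stats={stats}, degraded_responses={degraded}")
```

Comparing the number of degraded responses across error rates gives a rough view of how well the retry-and-fallback path absorbs injected failures. Keeping the rates low at first and recording the counters also speaks to the observability and overly aggressive experiment concerns noted above.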
