This paper presents the design, implementation, and evaluation of STSGym Chaos Monkey, a custom chaos engineering tool for testing infrastructure resilience on the STSGym platform. Unlike traditional chaos engineering tools that focus on cloud orchestration platforms, our implementation targets Docker containers directly, enabling chaos experiments on bare-metal and VPS deployments. We demonstrate a container-kill experiment in which the target recovered in about one second through its Docker restart policy. The tool integrates with Wazuh SIEM for security event monitoring and provides safety guardrails including minimum container thresholds, blackout windows, and protected service lists. Our results show that chaos engineering can reveal hidden infrastructure weaknesses before they cause production incidents.
Keywords: chaos engineering, Docker, resilience testing, infrastructure, site reliability engineering
Modern distributed systems must handle failures gracefully, yet many organizations discover resilience problems only during production incidents. Chaos engineering proactively injects failures to identify weaknesses before they impact users. While tools like Netflix’s Chaos Monkey and Gremlin provide comprehensive solutions, they often require specific orchestration platforms (Spinnaker, Kubernetes) or expensive commercial licenses.
The STSGym infrastructure presents unique challenges:
- 28+ Docker containers running on a single VPS
- Multiple interdependent services (auth, market, trading, research platforms)
- No Kubernetes orchestration
- Limited budget for commercial tools
This work addresses the gap between enterprise chaos engineering tools and practical infrastructure resilience testing for smaller deployments.
This paper covers:
- Chaos engineering principles and their application to Docker containers
- Architecture and implementation of STSGym Chaos Monkey
- Safety mechanisms for controlled experimentation
- Integration with Wazuh SIEM for monitoring
- Evaluation methodology and experimental results
- Roadmap for future chaos experiment types
Chaos engineering follows these core principles:
| Tool | Platform | Open Source | Focus |
|---|---|---|---|
| Netflix Chaos Monkey | Spinnaker | Yes | VM termination |
| Chaos Mesh | Kubernetes | Yes | Pod chaos |
| LitmusChaos | Kubernetes | Yes | Comprehensive |
| Gremlin | Multi-platform | No | Commercial |
| Pumba | Docker | Yes | Network chaos |
| STSGym Chaos Monkey | Docker | Yes | Container chaos |
Existing tools focus on:
- Kubernetes pod disruption
- Cloud VM termination
- Commercial enterprise features
Missing capabilities:
- Direct Docker container chaos
- Integration with non-Kubernetes environments
- SIEM integration for security monitoring
- Cost-effective deployment for small teams
The STSGym Chaos Monkey consists of four main components:
┌─────────────────────────────────────────────────────────────────────┐
│                         STSGym Chaos Monkey                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │  Scheduler  │───▶│   Target    │───▶│  Chaos Experiments      │  │
│  │   (Cron)    │    │  Selector   │    │  - container-kill       │  │
│  └─────────────┘    └─────────────┘    │  - cpu-stress           │  │
│         │                              │  - memory-stress        │  │
│         ▼                              │  - network-delay        │  │
│  ┌─────────────┐                       │  - network-packet-loss  │  │
│  │   Config    │                       └─────────────────────────┘  │
│  │   (YAML)    │                                                    │
│  └─────────────┘                       ┌─────────────────────────┐  │
│         │                              │  Safety Guardrails      │  │
│         ▼                              │  - Min containers: 20   │  │
│  ┌─────────────┐                       │  - Blackout windows     │  │
│  │Notifications│◀─────────────────────▶│  - Protected services   │  │
│  │  - Wazuh    │                       │  - Blast radius: 10%    │  │
│  │  - Telegram │                       │  - Auto-recovery        │  │
│  └─────────────┘                       └─────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
                   ┌───────────────────────────────┐
                   │       Docker Containers       │
                   │    (28+ services on miner)    │
                   │                               │
                   │  auth-service    market-app   │
                   │  trade-stsgym    fiftyone     │
                   │  photos-node     bedimsec     │
                   │  norad-sim       ...          │
                   └───────────────────────────────┘
The scheduler (cron) triggers chaos experiments at defined intervals:
# /etc/cron.d/chaos-monkey
# Entries in /etc/cron.d need a user field before the command; root is assumed here
0 */4 * * * root /opt/stsgym-chaos/chaos-monkey.sh
The target selector chooses experiment targets based on:
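The configuration assigns each target a weight (for example auth-service: 5, market-app: 3), which suggests weighted random selection. The following is a minimal sketch under that assumption; the function name pick_weighted_target and its stdin format are hypothetical and not taken from the tool itself.

```shell
#!/bin/bash
# Weighted random target selection (illustrative sketch, not the tool's actual selector).
# Input: lines of "<name> <weight>" on stdin. Output: one selected name.
pick_weighted_target() {
    local -a names=() weights=()
    local name weight total=0
    while read -r name weight; do
        [ -n "$name" ] || continue
        names+=("$name")
        weights+=("$weight")
        total=$((total + weight))
    done
    # Nothing to choose from
    if [ "$total" -le 0 ]; then
        return 1
    fi

    # Draw a ticket in [0, total); RANDOM is 0..32767, adequate for small weights
    local ticket=$((RANDOM % total))
    local i
    for i in "${!names[@]}"; do
        if [ "$ticket" -lt "${weights[$i]}" ]; then
            echo "${names[$i]}"
            return 0
        fi
        ticket=$((ticket - weights[i]))
    done
}

# Example using the weights from the configuration
printf '%s\n' "auth-service 5" "market-app 3" | pick_weighted_target
```

With these weights, auth-service would be drawn in roughly 5 of every 8 selections. A higher weight therefore means a service is exercised more often, which fits critical services whose recovery matters most.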
Safety mechanisms prevent catastrophic failures:
| Guardrail | Purpose | Configuration |
|---|---|---|
| Min Containers | Never go below N containers | min_containers: 20 |
| Blackout Windows | No chaos during peak hours | 9:00-17:00 UTC weekdays |
| Protected Services | Never target critical services | wazuh-agent, docker |
| Blast Radius | Limit % of services affected | max_blast_radius: 10% |
| Auto-Recovery | Restore state after failure | auto_rollback: true |
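The blast radius and minimum-container guardrails interact: each one caps how many containers a single run may disrupt. As a rough sketch of the arithmetic, assuming the percentage is floored and the stricter limit wins (max_targets is a hypothetical helper name, not part of the tool):

```shell
#!/bin/bash
# Sketch: translate the percentage guardrails into a concrete target budget.
# Arguments: <running containers> <max_blast_radius %> <min_containers>
max_targets() {
    local running="$1" blast_pct="$2" min_containers="$3"
    local by_radius=$(( running * blast_pct / 100 ))   # integer floor
    local by_floor=$(( running - min_containers ))     # headroom above the minimum
    if [ "$by_floor" -lt 0 ]; then
        by_floor=0
    fi
    # The stricter of the two limits wins
    if [ "$by_radius" -lt "$by_floor" ]; then
        echo "$by_radius"
    else
        echo "$by_floor"
    fi
}

max_targets 28 10 20   # 28 * 10 / 100 = 2 (floored); headroom is 8; prints 2
```

With 28 running containers, a 10% blast radius allows at most 2 simultaneous targets. Under this flooring assumption, a host with fewer than 10 containers would never be targeted at all, so a real implementation might round up instead.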
Integration with existing monitoring:
# Wazuh integration via syslog
logger -t "chaos-monkey" -p local0.info "Experiment started: container-kill on auth-service"
# Telegram notification
curl -X POST "https://api.telegram.org/bot${TOKEN}/sendMessage" \
-d "chat_id=${CHAT_ID}" \
-d "text=🔴 Chaos: container-kill on auth-service"
/opt/stsgym-chaos/
├── config/
│   └── chaos.yaml                 # Main configuration
├── experiments/
│   ├── container-kill.sh          # Kill container experiment
│   ├── cpu-stress.sh              # CPU stress experiment
│   ├── memory-stress.sh           # Memory pressure experiment
│   ├── network-delay.sh           # Network latency experiment
│   └── network-packet-loss.sh     # Packet loss experiment
├── lib/
│   ├── safety.sh                  # Safety checks
│   ├── notify.sh                  # Notification functions
│   └── rollback.sh                # Recovery functions
└── reports/
    └── YYYY-MM-DD/                # Experiment reports
chaos_monkey:
  enabled: true
  schedule: "0 */4 * * *"  # Every 4 hours
  safety:
    min_containers: 20
    max_blast_radius: 10
    blackout_windows:
      - start: "09:00"
        end: "17:00"
        timezone: "UTC"
        days: ["monday", "tuesday", "wednesday", "thursday", "friday"]
    protected_services:
      - "wazuh-agent"
      - "docker"
    recovery_timeout: 120
  experiments:
    container_kill:
      enabled: true
      probability: 0.3
      duration: 60
  targets:
    - name: "auth-service"
      weight: 5
      group: "auth"
    - name: "market-app"
      weight: 3
      group: "market"
The container kill experiment tests service restart policies:
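#!/bin/bash
# container-kill.sh - Chaos experiment: kill a container and wait for recovery
# Usage: container-kill.sh <container-name>
source /opt/stsgym-chaos/lib/safety.sh

CONTAINER_NAME="$1"
MAX_CHECKS=${MAX_CHECKS:-12}    # Number of recovery polls (assumed default)
CHECK_DELAY=${CHECK_DELAY:-5}   # Seconds between polls (assumed default)

# Safety checks: abort before touching anything if any check fails
check_min_containers || exit 1                      # Must have >= 20 running
check_blackout || exit 1                            # Not during business hours
is_protected_service "$CONTAINER_NAME" && exit 1    # Not in protected list

# Kill container
docker kill "$CONTAINER_NAME"

# Wait for recovery: poll the container state rather than grepping docker ps,
# which would also match other containers whose names contain the target name
for i in $(seq 1 "$MAX_CHECKS"); do
    state=$(docker inspect --format '{{.State.Running}}' "$CONTAINER_NAME" 2>/dev/null)
    if [ "$state" = "true" ]; then
        echo "Container recovered!"
        exit 0
    fi
    sleep "$CHECK_DELAY"
done

# Recovery failed - attempt manual recovery
docker start "$CONTAINER_NAME"
exit 1
#!/bin/bash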
# safety.sh - Safety checks for chaos experiments
MIN_CONTAINERS=${MIN_CONTAINERS:-20}
PROTECTED_SERVICES=("wazuh-agent" "docker" "containerd")

# Fail if fewer than MIN_CONTAINERS containers are running
check_min_containers() {
    local running
    running=$(docker ps --format '{{.Names}}' | wc -l)
    if [ "$running" -lt "$MIN_CONTAINERS" ]; then
        echo "ERROR: Only $running containers running, need $MIN_CONTAINERS"
        return 1
    fi
    return 0
}

# Fail during the blackout window (weekdays 9-17 UTC)
check_blackout() {
    local hour day
    hour=$(date -u +%H)   # -u: evaluate the window in UTC, not host local time
    day=$(date -u +%u)    # 1-7, Monday is 1
    # No chaos on weekdays 9-17 UTC
    if [ "$day" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 17 ]; then
        echo "ERROR: Blackout window (weekdays 9-17 UTC)"
        return 1
    fi
    return 0
}

# Return 0 if the given service is on the protected list
is_protected_service() {
    local service="$1"
    for protected in "${PROTECTED_SERVICES[@]}"; do
        [ "$service" = "$protected" ] && return 0
    done
    return 1
}
The chaos monkey logs experiments to syslog, which Wazuh agents forward to the SIEM:
<!-- /var/ossec/etc/rules/local_rules.xml -->
<group name="chaos,">
  <rule id="110100" level="5">
    <match>chaos-monkey</match>
    <description>Chaos Monkey experiment started</description>
    <group>chaos_experiment</group>
  </rule>
  <!-- Child rules chain on 110100 via if_sid: the default match syntax does not support ".*" -->
  <rule id="110101" level="3">
    <if_sid>110100</if_sid>
    <match>completed</match>
    <description>Chaos Monkey experiment completed</description>
    <group>chaos_experiment</group>
  </rule>
  <rule id="110102" level="12">
    <if_sid>110100</if_sid>
    <match>failed</match>
    <description>Chaos Monkey experiment failed - investigate</description>
    <group>chaos_failure</group>
  </rule>
</group>
| Component | Specification |
|---|---|
| Host | miner (207.244.226.151) |
| OS | Ubuntu 24.04.4 LTS |
| Docker | 29.3.1 |
| Containers | 28+ services |
| Experiment | container-kill |
| Target | bedimsecurity-web |
=== STSGym Chaos Monkey: container-kill ===
Target: bedimsecurity-web
Timeout: 60s
Dry run: false
Container info:
Name: bedimsecurity-web
Image: bedimsecurity_web
ID: dcf33677eb0d
2026-03-27 23:20:59 - Killing container bedimsecurity-web...
2026-03-27 23:20:59 - Container killed
2026-03-27 23:21:00 - Container recovered (Docker restart policy: unless-stopped)
Recovery Time: ~1 second
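The recovery time above follows from the two log timestamps. A small sketch of that calculation, assuming GNU date (for the -d option) and a hypothetical helper name recovery_seconds:

```shell
#!/bin/bash
# Sketch: derive recovery time from two log timestamps (requires GNU date for -d).
recovery_seconds() {
    local killed_at="$1" recovered_at="$2"
    local t0 t1
    t0=$(date -u -d "$killed_at" +%s) || return 1
    t1=$(date -u -d "$recovered_at" +%s) || return 1
    echo $(( t1 - t0 ))
}

recovery_seconds "2026-03-27 23:20:59" "2026-03-27 23:21:00"   # prints 1
```

Because the log has one-second resolution, a difference of 1 only bounds the true recovery time to somewhere between 0 and 2 seconds. Logging with sub-second timestamps (for example date +%s.%N) would be needed to measure it more precisely.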
Docker Restart Policy:
$ docker inspect bedimsecurity-web --format '{{.HostConfig.RestartPolicy.Name}}'
unless-stopped
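The inspect command above checks a single container. Before running container-kill more broadly, it may help to list every container whose policy would not request a restart. The sketch below is one way to do that; find_unprotected is a hypothetical helper, and scratch-job is made-up sample data.

```shell
#!/bin/bash
# Sketch: flag containers whose restart policy does not request a restart.
# Reads "<name> <policy>" lines on stdin; prints names with policy "no" or none.
find_unprotected() {
    local name policy
    while read -r name policy; do
        [ -n "$name" ] || continue
        case "$policy" in
            always|unless-stopped|on-failure) ;;   # policies that request a restart
            *) echo "${name#/}" ;;                 # "no" or missing; strip leading "/"
        esac
    done
}

# Live usage (assumes the docker CLI is available on the host):
#   docker ps -q | xargs docker inspect \
#     --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' | find_unprotected

# Offline example with sample data
printf '%s\n' "/bedimsecurity-web unless-stopped" "/scratch-job no" | find_unprotected
```

Any name this prints is a container that a kill experiment would leave down until the manual docker start fallback runs.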
Containers with unless-stopped or always restart policies recovered in < 2 seconds.
| Metric | Value |
|---|---|
| Total Experiments | 1 |
| Success Rate | 100% |
| Average Recovery Time | 1s |
| Services Tested | 1 |
| Blackout Violations | 0 |
┌─────────────────────────────────────────────────────────────────────────────┐
│ STSGym Chaos Monkey Roadmap │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: Docker Chaos (Week 1-2) ████████████ 100% │
│ ───────────────────────────────────────────────────────────────────── │
│ ✓ container-kill Kill containers, test restart policies │
│ ⏳ cpu-stress Stress CPU with stress-ng │
│ ⏳ memory-stress Consume memory to trigger OOM │
│ ⏳ network-delay Add network latency with tc │
│ ⏳ network-packet-loss Drop packets with tc netem │
│ │
│ Phase 2: Kubernetes Chaos (Week 3-4) ░░░░░░░░░░░░ 0% │
│ ───────────────────────────────────────────────────────────────────── │
│ ⏳ Install Chaos Mesh on darth │
│ ⏳ Pod kill experiments │
│ ⏳ Network partition experiments │
│ ⏳ Resource stress (CPU/memory limits) │
│ │
│ Phase 3: Advanced Experiments (Week 5-6) ░░░░░░░░░░░░ 0% │
│ ───────────────────────────────────────────────────────────────────── │
│ ⏳ State corruption experiments │
│ ⏳ Dependency failure injection │
│ ⏳ Time skew experiments │
│ ⏳ Disk I/O chaos │
│ │
│ Phase 4: Integration & Reporting (Week 7-8) ░░░░░░░░░░░░ 0% │
│ ───────────────────────────────────────────────────────────────────── │
│ ⏳ Full Wazuh integration │
│ ⏳ Telegram notifications │
│ ⏳ Experiment dashboard │
│ ⏳ Automated reporting │
│ │
│ Phase 5: Automation (Week 9-10) ░░░░░░░░░░░░ 0% │
│ ───────────────────────────────────────────────────────────────────── │
│ ⏳ Cron scheduling │
│ ⏳ Gradual blast radius increase │
│ ⏳ Game day automation │
│ ⏳ MTTR tracking │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
| Experiment | Target Layer | Risk Level | Implementation |
|---|---|---|---|
| container-kill | Application | Low | Docker kill |
| cpu-stress | Infrastructure | Medium | stress-ng |
| memory-stress | Infrastructure | Medium | stress-ng |
| network-delay | Network | Low | tc netem |
| network-packet-loss | Network | Medium | tc netem |
| disk-io | Infrastructure | High | fio |
| process-kill | Application | Medium | kill signal |
| dns-failure | Network | Low | iptables |
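The tc netem experiments in this table change kernel queueing state that persists until it is removed, so they need a rollback step even when the run is interrupted. A minimal sketch of one approach follows, using an EXIT trap scoped to a subshell. The run and network_delay helpers and the DRY_RUN switch are assumptions for illustration, not part of the published tool.

```shell
#!/bin/bash
# Sketch: network-delay experiment that always removes its netem qdisc.
# DRY_RUN=1 (the default here) prints commands instead of running them,
# since tc requires root privileges.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The function body is a subshell, so the EXIT trap fires when the
# experiment ends (normally or on error) and not at script exit.
network_delay() (
    dev="$1"; delay="$2"; duration="$3"
    trap 'run tc qdisc del dev "$dev" root netem' EXIT
    run tc qdisc add dev "$dev" root netem delay "$delay"
    run sleep "$duration"
)

network_delay eth0 100ms 30
```

In dry-run mode this prints the add, sleep, and del commands in order, which makes the rollback path easy to review before running it with DRY_RUN=0 as root.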
# CPU Stress Experiment
stress-ng --cpu 4 --cpu-load 50 --timeout 30s
# Network Delay Experiment
tc qdisc add dev eth0 root netem delay 100ms 50ms
# Network Packet Loss Experiment
tc qdisc add dev eth0 root netem loss 5%
# Memory Stress Experiment
stress-ng --vm 2 --vm-bytes 512M --timeout 30s
# Disk I/O Chaos
# --rw=randwrite selects the random-write workload; --name only labels the job
fio --name=randwrite --rw=randwrite --ioengine=sync --bs=4k --numjobs=4 \
--size=1G --runtime=30 --time_based --end_fsync=1
This paper presented STSGym Chaos Monkey, a custom chaos engineering tool designed for Docker container environments. Our implementation demonstrates that:
- The kill command triggers restart policies, but the same container ID is retained
This work was inspired by Netflix’s Chaos Monkey and the Principles of Chaos Engineering. We thank the open source community for tools like Chaos Mesh and LitmusChaos that advance the practice of resilience testing.
# /opt/stsgym-chaos/config/chaos.yaml
chaos_monkey:
  enabled: true
  version: "1.0.0"
  log_level: "INFO"
  schedule: "0 */4 * * *"  # Every 4 hours
  safety:
    min_containers: 20
    max_blast_radius: 10
    blackout_windows:
      - start: "09:00"
        end: "17:00"
        timezone: "UTC"
        days: ["monday", "tuesday", "wednesday", "thursday", "friday"]
    protected_services:
      - "wazuh-agent"
      - "docker"
    recovery_timeout: 120
  experiments:
    container_kill:
      enabled: true
      probability: 0.3
      duration: 60
  targets:
    - name: "auth-service"
      weight: 5
      group: "auth"
    - name: "market-app"
      weight: 3
      group: "market"
  notifications:
    wazuh:
      enabled: true
      manager_host: "10.0.0.117"
      manager_port: 55000
    telegram:
      enabled: true
      chat_id: "8318706992"
| Service | Criticality | Restart Policy | Recovery Time |
|---|---|---|---|
| auth-service | HIGH | unless-stopped | ~5s |
| market-app | HIGH | unless-stopped | ~3s |
| trade-stsgym | HIGH | unless-stopped | ~4s |
| fiftyone-app | MEDIUM | unless-stopped | ~8s |
| photos-node | MEDIUM | unless-stopped | ~6s |
| bedimsecurity-web | LOW | unless-stopped | ~1s |
Document Version: 1.0 Created: 2026-03-27 Author: OpenClaw AI Assistant License: MIT