# Bug Report

## Summary

The `ec2-terminate-by-tag` experiment consistently fails with an "instance is not in stopped state" error during the `WaitForEC2Down` phase, even though the target EC2 instance is confirmed to be in the `stopped` state via the AWS Console.
## Environment

- LitmusChaos version: 3.28 (ChaosCenter on EKS)
- go-runner image: `litmuschaos/go-runner:3.28.0`
- AWS Region: ap-northeast-2
- Target: EC2 instances managed by an Auto Scaling Group (ASG)
- EC2 instances are tagged and run a Go web server connected to RDS PostgreSQL
- Chaos was injected directly into an EC2 instance, not into an EC2 instance provisioned as an EKS node
## Steps to Reproduce

- Set up 2 EC2 instances in an ASG with the tag `ChaosTarget:True`
- Configure the `ec2-terminate-by-tag` experiment with:
  - Total Chaos Duration: 300
  - Chaos Interval: 300
  - Instance Affected Perc: 100 (also tested with 50)
  - Sequence: parallel (also tested with serial)
  - Managed Nodegroup: enable (also tested with disable)
  - Default Health Check: false
  - TIMEOUT: 300
  - DELAY: 10
- Run the experiment
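For completeness, the settings above correspond roughly to the following ChaosEngine env block. This is a sketch rather than the exact manifest applied; the env names follow the LitmusChaos docs for `ec2-terminate-by-tag`, and the tag/region values mirror the setup described above:

```yaml
# Sketch of the experiment env used (names per the ec2-terminate-by-tag docs;
# verify against your chart version before applying)
spec:
  experiments:
    - name: ec2-terminate-by-tag
      spec:
        components:
          env:
            - name: INSTANCE_TAG
              value: "ChaosTarget:True"
            - name: REGION
              value: "ap-northeast-2"
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "300"
            - name: INSTANCE_AFFECTED_PERC
              value: "100"
            - name: SEQUENCE
              value: "parallel"
            - name: MANAGED_NODEGROUP
              value: "enable"
```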
## Expected Behavior

The experiment should:

- Stop the target EC2 instance(s)
- Detect the `stopped` state via `WaitForEC2Down`
- Wait for the chaos duration
- Complete successfully (or start the instance back up if Managed Nodegroup is disabled)
## Actual Behavior

- The EC2 instance is successfully stopped (confirmed via the AWS Console showing the `stopped` state)
- The experiment pod logs show it polling for the `stopped` state
- However, `WaitForEC2Down` reports failure with "instance is not in stopped state"
- The experiment ends with `CHAOS_INJECT_ERROR`
## Error Log

```
time="2026-04-28T02:29:58Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:00Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:02Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:04Z" level=info msg="[Probe]: {Actual value: 2}, {Expected value: 0}, {Operator: >=}"
time="2026-04-28T02:30:04Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:06Z" level=error msg="Chaos injection failed: could not run chaos in parallel mode\n --- at /litmus-go/chaoslib/litmus/ec2-terminate-by-tag/lib/ec2-terminate-by-tag.go:63 (PrepareEC2TerminateByTag) ---\nCaused by: ec2 instance failed to stop\n --- at /litmus-go/chaoslib/litmus/ec2-terminate-by-tag/lib/ec2-terminate-by-tag.go:188 (injectChaosInParallelMode) ---\nCaused by: {"errorCode":"CHAOS_INJECT_ERROR","reason":"instance is not in stopped state","target":"{EC2 Instance ID: i-xxxxx, Region: ap-northeast-2}"}"
```
## Configurations Tested (all produced the same error)

| Setting | Values Tested |
| --- | --- |
| Sequence | parallel, serial |
| Managed Nodegroup | enable, disable |
| Instance Affected Perc | 50, 100 |
| Total Chaos Duration | 120, 300, 900 |
| Chaos Interval | 30, 60, 300 |
| TIMEOUT (env var) | default, 300 |
| DELAY (env var) | default, 10 |
| Default Health Check | false |
## Additional Context

- When Managed Nodegroup is `disable`, LitmusChaos starts the stopped instance after the chaos duration. This conflicts with the ASG, which has already launched a replacement instance; the ASG then terminates the excess instance.
- When Managed Nodegroup is `enable`, the same `WaitForEC2Down` error occurs.
- The actual chaos injection (the EC2 stop) works correctly; the issue is only in the status verification logic within `WaitForEC2Down`.
- A similar but separate issue exists for `rds-instance-stop`, where the RDS Instance Identifier and Region parameters appear to be swapped internally during serial-mode execution.
- With the ASG's desired, minimum, and maximum capacity all set to 2, I injected chaos into one of the two instances. Even though I can manually confirm that the instance transitioned to the `stopped` state, LitmusChaos fails to verify the successful stop and throws the error "instance is not in stopped state".
## Suspected Root Cause

The `WaitForEC2Down` function may have an internal polling limit or state-check logic that fails when an ASG is involved, possibly because the ASG changes the instance state (e.g., terminates the stopped instance) before `WaitForEC2Down` can confirm the `stopped` state.
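To make the suspected race concrete, here is a minimal, self-contained Go sketch of a `WaitForEC2Down`-style poll. This is not the actual litmus-go code (retry limits and delays are elided, and the state sequences are simulated rather than fetched via DescribeInstances); it only illustrates how a verifier that accepts nothing but `stopped` fails once the ASG moves the instance on to `shutting-down`/`terminated`:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotStopped mirrors the "instance is not in stopped state" failure.
var errNotStopped = errors.New("instance is not in stopped state")

// waitForStopped is a simplified stand-in for WaitForEC2Down: it consumes
// successive state observations (as a DescribeInstances poll would) and
// succeeds only if the instance is observed in "stopped". Any other
// terminal observation, e.g. "terminated" after an ASG scale-in reclaimed
// the instance, fails the check even though the stop itself worked.
func waitForStopped(observations []string) error {
	for _, state := range observations {
		switch state {
		case "stopped":
			return nil // instance reached the expected state
		case "running", "stopping":
			continue // transient states: keep polling
		default:
			// "shutting-down"/"terminated": the ASG has already moved the
			// instance past "stopped", so the check can never succeed.
			return errNotStopped
		}
	}
	return errNotStopped // poll budget exhausted without seeing "stopped"
}

func main() {
	// Without ASG interference the poll converges.
	fmt.Println(waitForStopped([]string{"stopping", "stopped"})) // <nil>
	// With an ASG scale-in racing the poll, the stop succeeds on the AWS
	// side but verification fails, matching the reported behavior.
	fmt.Println(waitForStopped([]string{"stopping", "terminated"}))
}
```

If something like this is the cause, the fix would likely be to treat ASG-driven transitions past `stopped` as success (or to pin verification to the originally targeted instance IDs rather than re-evaluating state late).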