ec2-terminate-by-tag: WaitForEC2Down reports "instance is not in stopped state" even when EC2 is confirmed stopped in AWS Console #5491

@Park-Jeong-Hyeok

Description

Bug Report

Summary

The ec2-terminate-by-tag experiment consistently fails with "instance is not in stopped state" error during the WaitForEC2Down phase, even though the target EC2 instance is confirmed to be in stopped state via the AWS Console.

Environment

  • LitmusChaos version: 3.28 (ChaosCenter on EKS)
  • go-runner image: litmuschaos/go-runner:3.28.0
  • AWS Region: ap-northeast-2
  • Target: EC2 instances managed by Auto Scaling Group (ASG)
  • EC2 instances are tagged and running a Go web server connected to RDS PostgreSQL
  • Chaos was injected into a standalone EC2 instance, not into an EC2 instance provisioned as an EKS node

Steps to Reproduce

  1. Set up 2 EC2 instances in an ASG with tag ChaosTarget:True
  2. Configure ec2-terminate-by-tag experiment with:
    • Total Chaos Duration: 300
    • Chaos Interval: 300
    • Instance Affected Perc: 100 (also tested with 50)
    • Sequence: parallel (also tested with serial)
    • Managed Nodegroup: enable (also tested with disable)
    • Default Health Check: false
    • TIMEOUT: 300
    • DELAY: 10
  3. Run the experiment
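
For reference, the settings above map onto the experiment's env vars roughly as follows (a sketch, not my exact manifest; the `INSTANCE_TAG` key/value is illustrative and the field layout follows the standard ChaosEngine format):

```yaml
# Sketch of the ec2-terminate-by-tag experiment spec; values taken
# from the steps above, tag key/value is illustrative.
experiments:
  - name: ec2-terminate-by-tag
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "300"
          - name: CHAOS_INTERVAL
            value: "300"
          - name: INSTANCE_AFFECTED_PERC
            value: "100"   # also tested with "50"
          - name: SEQUENCE
            value: "parallel"   # also tested with "serial"
          - name: MANAGED_NODEGROUP
            value: "enable"   # also tested with "disable"
          - name: INSTANCE_TAG
            value: "ChaosTarget:True"
          - name: REGION
            value: "ap-northeast-2"
```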

Expected Behavior

The experiment should:

  1. Stop the target EC2 instance(s)
  2. Detect the stopped state via WaitForEC2Down
  3. Wait for chaos duration
  4. Complete successfully (or start the instance back if Managed Nodegroup is disabled)

Actual Behavior

  • The EC2 instance is successfully stopped (confirmed via AWS Console showing stopped state)
  • The experiment pod logs show it is polling for the stopped state
  • However, WaitForEC2Down reports failure with "instance is not in stopped state"
  • The experiment ends with CHAOS_INJECT_ERROR

Error Log

```
time="2026-04-28T02:29:58Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:00Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:02Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:04Z" level=info msg="[Probe]: {Actual value: 2}, {Expected value: 0}, {Operator: >=}"
time="2026-04-28T02:30:04Z" level=info msg="The instance state is stopped"
time="2026-04-28T02:30:06Z" level=error msg="Chaos injection failed: could not run chaos in parallel mode\n --- at /litmus-go/chaoslib/litmus/ec2-terminate-by-tag/lib/ec2-terminate-by-tag.go:63 (PrepareEC2TerminateByTag) ---\nCaused by: ec2 instance failed to stop\n --- at /litmus-go/chaoslib/litmus/ec2-terminate-by-tag/lib/ec2-terminate-by-tag.go:188 (injectChaosInParallelMode) ---\nCaused by: {"errorCode":"CHAOS_INJECT_ERROR","reason":"instance is not in stopped state","target":"{EC2 Instance ID: i-xxxxx, Region: ap-northeast-2}"}"
```

Configurations Tested (all produced the same error)

| Setting | Values Tested |
| --- | --- |
| Sequence | parallel, serial |
| Managed Nodegroup | enable, disable |
| Instance Affected Perc | 50, 100 |
| Total Chaos Duration | 120, 300, 900 |
| Chaos Interval | 30, 60, 300 |
| TIMEOUT (env var) | default, 300 |
| DELAY (env var) | default, 10 |
| Default Health Check | false |

Additional Context

  • When Managed Nodegroup: disable, LitmusChaos starts the stopped instance after the chaos duration, causing a conflict with the ASG, which has already launched a replacement instance. The ASG then terminates the excess instance.
  • When Managed Nodegroup: enable, the same WaitForEC2Down error occurs.
  • The actual chaos injection (EC2 stop) works correctly. The issue is only in the status verification logic within WaitForEC2Down.
  • A similar but separate issue exists for rds-instance-stop where the RDS Instance Identifier and Region parameters appear to be swapped internally during serial mode execution.
  • With the ASG's desired, minimum, and maximum capacity all set to 2, I injected chaos into one of the two instances. Even though I can manually confirm the instance transitioned to the stopped state, LitmusChaos fails to verify the stop and throws the error message: 'instance is not in stopped state'.

Suspected Root Cause

The WaitForEC2Down function may have an internal polling limit or state-check logic that fails when an ASG is involved, possibly because the ASG changes the instance state (e.g., terminates the stopped instance) before WaitForEC2Down can confirm the stopped state.
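
The suspected race can be sketched with a simplified poll loop. This is not the actual litmus-go code: `waitForState` and the fake state sequence are hypothetical stand-ins for WaitForEC2Down and DescribeInstances, just to illustrate why accepting only "stopped" can fail once the ASG steps in.

```go
package main

import (
	"errors"
	"fmt"
)

// waitForState is a simplified stand-in for WaitForEC2Down: it polls
// next() (standing in for a DescribeInstances call) until the returned
// state is in accept, or the attempt budget runs out.
func waitForState(next func() string, accept map[string]bool, attempts int) error {
	for i := 0; i < attempts; i++ {
		if accept[next()] {
			return nil
		}
	}
	return errors.New("instance is not in stopped state")
}

func main() {
	// Hypothetical state sequence: the ASG terminates the stopped
	// instance between polls, so the poller never observes "stopped".
	states := []string{"stopping", "shutting-down", "terminated", "terminated"}
	i := 0
	next := func() string {
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return s
	}

	// Accepting only "stopped" fails under this sequence...
	fmt.Println("stopped-only:", waitForState(next, map[string]bool{"stopped": true}, len(states)))

	// ...while also treating "terminated" as "down" would succeed.
	i = 0
	fmt.Println("stopped-or-terminated:", waitForState(next, map[string]bool{"stopped": true, "terminated": true}, len(states)))
}
```

If this is the actual failure mode, treating "terminated" (and possibly "shutting-down" followed by "terminated") as a successful outcome of the down-check would make the experiment ASG-safe.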
