Jmeter-Terraform - Dealing with AWS ELB IP changes during load testing

We ran into an issue when trying to load test our 'new' production environment - the ELB IP addresses change as it silently auto-scales.  And since you're throwing load at it, of course those IPs will change!

When we started the tests, we had updated the hosts file on the master/slave Jmeter nodes, but of course at some point during the day the IPs changed and we got a pile of timeouts.  After much gnashing of teeth, we found the DNS Cache Manager in Jmeter, and figured there were two solutions:

  1. Use the DNS Cache Manager to point to a custom DNS server (that we set up), and have that DNS server deal with keeping the IPs up to date.
  2. Write a cron job to run on each node that updates the hosts file, then use the DNS Cache Manager to 'clear cache on each iteration'.

We elected to do #2 (since our Jmeter infra is built/destroyed a lot, it didn't make sense to have another server in the mix), and here's what we ended up with.

Terraform applies userdata to each EC2 instance, so we simply added a file (in the cloud-config write_files section) and a command to that. (don't judge me on Linux security, k? :) )

- encoding: b64
  content: base64encodedgoobliehere
  owner: root:root
  path: /root/updatedns.sh
  permissions: 0644
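
Since the entry above uses 'encoding: b64', the content value has to be the base64 of the script. A minimal sketch of producing it, assuming a base64 utility is on the path; the function name encode_for_userdata is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print FILE as a single-line base64 string,
# suitable for pasting into the 'content:' field.
encode_for_userdata() {
    # base64 may wrap long output, so strip the newlines afterwards.
    base64 < "$1" | tr -d '\n'
}
```

Something like encode_for_userdata updatedns.sh would then print the string to paste in.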

Then in the runcmd section:
- (crontab -l ; echo "* * * * * /bin/sh /root/updatedns.sh") | sort - | uniq - | crontab -
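
The sort/uniq part is what makes that command safe to run more than once: if the line is already in the crontab, the duplicate gets collapsed. A small sketch of the same idea as a function that only works on stdin/stdout (the name merge_cron_line is hypothetical, not something from our userdata):

```shell
#!/bin/sh
# Hypothetical helper: merge one cron line into an existing crontab listing.
# Reads the current crontab on stdin, prints the merged result on stdout.
merge_cron_line() {
    new_line=$1
    # Append the new line, then sort + uniq so a repeated run cannot
    # leave the same entry in the crontab twice.
    { cat; printf '%s\n' "$new_line"; } | sort | uniq
}
```

It would be wired up as: crontab -l 2>/dev/null | merge_cron_line "* * * * * /bin/sh /root/updatedns.sh" | crontab -. One side effect to be aware of: sort reorders every line in the crontab, which is fine on a throwaway load-test node but probably not what you want on a box with a hand-maintained crontab.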

The file contents:
# Look up the ELB's current IP (first A record only).
IP=$(dig +short elb-name-and-id.aws-region.elb.amazonaws.com | grep -E '^[0-9.]+$' | head -n 1)
# If the lookup came back empty, leave /etc/hosts alone.
[ -n "$IP" ] || exit 1
# Replace the existing entry if there is one, otherwise append it.
if grep -q 'myrealproddomain.com' /etc/hosts; then
  sed -i "s/.*myrealproddomain\.com/$IP myrealproddomain.com/" /etc/hosts
else
  echo "$IP myrealproddomain.com" >> /etc/hosts
fi

Where 'myrealproddomain.com' is your production domain name (in our case, we only had one domain to override), and 'elb-name-and-id.aws-region.elb.amazonaws.com' is of course the DNS name that AWS provides you for the ELB.
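
If you want to try the replace-or-append logic without touching the real /etc/hosts, it can be sketched as a function that takes the file, domain and IP as arguments. This is an illustration rather than what we deployed: the name update_hosts_entry is made up, it assumes GNU sed (for -i), and it assumes the entry is a line ending in the domain name.

```shell
#!/bin/sh
# Hypothetical helper: point DOMAIN at IP in HOSTS_FILE.
# Usage: update_hosts_entry HOSTS_FILE DOMAIN IP
update_hosts_entry() {
    hosts_file=$1 domain=$2 ip=$3

    # Refuse an empty value or anything containing characters other than
    # digits and dots, so a failed lookup can never blank out the entry.
    case $ip in
        ''|*[!0-9.]*) return 1 ;;
    esac

    # Escape dots so the domain matches literally in the patterns below.
    pattern=$(printf '%s' "$domain" | sed 's/\./\\./g')

    if grep -q "[[:space:]]$pattern\$" "$hosts_file"; then
        # Entry exists: rewrite the whole line in place.
        sed -i "s/^.*[[:space:]]$pattern\$/$ip $domain/" "$hosts_file"
    else
        # No entry yet: append one.
        printf '%s %s\n' "$ip" "$domain" >> "$hosts_file"
    fi
}
```

Pointing it at a scratch copy of the hosts file and calling it twice with different IPs should leave exactly one line for the domain, carrying the second IP.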

It's hacky, but darn it if it doesn't work.  Since the cron job only runs once a minute, there's still a chance that the IPs will change mid-test and you'll have to wait up to 60s for the hosts file to catch up, but that's better than an entire test run bombing out.
