Terraform/JMeter performance testing - practical experience
Over the last while we've had the opportunity to put our new JMeter learnings to work.
- A bug came up that only showed under load, and we were able to reproduce it in our dev environments! The dev ran JMeter off his laptop, and that was enough load to trigger the bug.
- I think I mentioned last time how the simple act of mapping out a JMeter script revealed excess calls to our middleware - tickets were created to address this.
- QA has used it to help draw out issues with a new production environment, but...
...the other day they ran out of steam on their laptops. So we got to come back to the Terraform/JMeter setup we built a few months back. Thankfully everything still worked, and we were on our feet quickly (about 15 minutes) with 1 master and 6 slaves (c4.large) raising heck.
This is where I will talk about the lessons we learned today...
- Terraform is amazing and was totally worth the time investment
- If you are testing a cold production environment - ASK ABOUT HOSTS FILE CHANGES!! Without them our hostnames still pointed at live production, and we hammered 'real' production for a good 5 minutes before realizing it. "Huh, where are the logs?" After the cold sweats passed, everyone had a good giggle.
- Killing the JMeter master test run doesn't also kill the slave processes! Oops.
- Manually updating the master/slave nodes quickly gets old (just git pulls, but definitely worth some automation)
- Manually uploading the reports/data dumps to S3 quickly gets old (even if it's just pasted commands)
- Manually linking S3 reports/dumpfiles quickly gets old (even if it's just 3 files)
- Not having a parameterized JMeter script (i.e. for the thread count variable) leads to wasted time (although I'm not sure how thread counts work in a master/slave setup - it probably won't help that much)
- There is a ceiling on thread count/thread groups beyond which JMeter blows up - GC overload, heap dump, etc. We had 15 thread groups with 3-4 requests each, and accidentally put in 417 threads per group. Oops. (fwiw, 417 threads with 5 thread groups was ok)
- If you are using AWS ELB IPs and hosts file changes on the slaves - check the IP validity before each test run! We had an IP expire and didn't realize, which corrupted two of our test runs. Better yet, find a better solution than hosts file changes!!
- Don't use test accounts/organizations/users for performance testing! Use real data! Better yet, log replay!!! (this is our next big step, has to happen)
- Start with a CLEAR idea of the targets you are testing against. Don't assume you can easily translate a prod figure into a Jmeter figure! Better yet, use log replay!!
- It's hard to learn while performing!
- Performance testing is hard
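On the thread count question above: in JMeter's distributed mode every slave runs the full test plan, so 417 threads per group on 6 slaves is really 6 x 417. Here is a minimal sketch of working out a per-slave number and passing it in as a property. It assumes each thread group reads its count from `${__P(threads,1)}`; the `threads` property name, the file names and the slave IPs are all made up for illustration.

```python
import math
import shlex


def threads_per_slave(total_threads: int, slave_count: int) -> int:
    """Each slave runs the full plan, so split the target across slaves (rounding up)."""
    if total_threads < 1 or slave_count < 1:
        raise ValueError("total_threads and slave_count must be positive")
    return math.ceil(total_threads / slave_count)


def jmeter_command(plan: str, results: str, slaves: list[str], total_threads: int) -> list[str]:
    """Build a non-GUI distributed run; -G sends the property to every slave."""
    per_slave = threads_per_slave(total_threads, len(slaves))
    return [
        "jmeter",
        "-n",                      # non-GUI mode
        "-t", plan,                # test plan (.jmx)
        "-l", results,             # results file
        "-R", ",".join(slaves),    # remote servers to start
        f"-Gthreads={per_slave}",  # hypothetical property name, read via __P in the plan
    ]


if __name__ == "__main__":
    slaves = [f"10.0.0.{n}" for n in range(11, 17)]  # six hypothetical slave IPs
    cmd = jmeter_command("plan.jmx", "results.jtl", slaves, total_threads=417)
    print(shlex.join(cmd))
```

With a target of 417 threads per group spread over six slaves, this hands each slave 70. Rounding up means the total can slightly overshoot the target, which seemed the safer direction for a load test.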
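The ELB IP check is an obvious candidate for a pre-flight script. A rough sketch, assuming the slaves pin the target hostname in a hosts file and that the ELB's own DNS name can still be resolved normally. It compares the pinned IPs against what DNS currently returns and reports any that the ELB no longer advertises. The function names and the example hostnames are ours, not anything from AWS or JMeter.

```python
import socket


def pinned_ips(hosts_text: str, hostname: str) -> set[str]:
    """Return the IPs that a hosts file maps to `hostname`."""
    ips = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        if hostname in names:
            ips.add(ip)
    return ips


def current_ips(elb_dns_name: str) -> set[str]:
    """Resolve the ELB's DNS name to its current IPv4 addresses."""
    infos = socket.getaddrinfo(
        elb_dns_name, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def stale_ips(hosts_text: str, hostname: str, elb_dns_name: str, resolve=current_ips) -> set[str]:
    """Pinned IPs that the ELB no longer advertises (empty set means good to go)."""
    return pinned_ips(hosts_text, hostname) - resolve(elb_dns_name)
```

On each slave you would read `/etc/hosts`, call `stale_ips` with the pinned hostname and the ELB's DNS name, and refuse to start the run if the result is non-empty. The `resolve` parameter is only there so the check can be tried out without touching real DNS.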
Our next big steps are probably going to be:
- Put a front-end on the Terraform/JMeter automation, make it accessible to dev/QA
- Log replay. Log replay. Log replay.
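We haven't built log replay yet, so the following is only a sketch of what a first step might look like: turning access log lines into a CSV that a JMeter CSV Data Set Config could feed to an HTTP sampler. It assumes the logs are in Common/Combined Log Format, which may not match ours, and it only replays GETs by default because access logs don't carry request bodies. The function names and column names are made up.

```python
import csv
import io
import re

# Matches the request part of a Common/Combined Log Format line, e.g.
# ... "GET /api/items?id=3 HTTP/1.1" 200 512 ...
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def log_to_rows(lines, methods=("GET",)):
    """Yield (method, path) for each parseable request with an allowed method."""
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match["method"] in methods:
            yield match["method"], match["path"]


def write_csv(rows, out):
    """Write the rows with a header line that can double as JMeter variable names."""
    writer = csv.writer(out)
    writer.writerow(["method", "path"])
    writer.writerows(rows)


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [10/Oct/2016:13:55:36 +0000] "GET /api/items?id=3 HTTP/1.1" 200 512',
        '1.2.3.4 - - [10/Oct/2016:13:55:37 +0000] "POST /api/items HTTP/1.1" 201 17',
    ]
    buf = io.StringIO()
    write_csv(log_to_rows(sample), buf)
    print(buf.getvalue())
```

Unparseable lines are skipped silently here; for a real run it would be worth counting them, since a high skip rate means the regex doesn't fit the log format.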