Using our everyday dev tools for effective Load and Performance testing

This is a repost from the RealEstate.com.au tech blog, the original post is by Andrew Midgley.

Previously at REA we’d had very specialised tools for Load and Performance (L&P) testing that were expensive and richly featured but completely disconnected from our everyday development tools. The main outcome was that a couple of engineers became quite good at L&P testing with our enterprise tools, while the majority of engineers found the barriers too great. We have since moved to an approach that is far more inclusive and uses many of the tools our engineers work with on a daily basis. I’ll talk about how we did this on the most recent project I worked on.

Developing an application simulation model

Our project was for a brand new application, so we didn’t have hard numbers to use for simulating expected traffic. We were, however, able to look at similar public-facing apps and use these as our basis. It’s important that we closely simulate actual production traffic: doing so lets us tune our application stack, gives us confidence it’ll hold up under peak load, and saves us from over-resourcing it. There are two main metrics we need to gather to create an application simulation model, both of which we can get through our regular tools:

  1. Transaction Rate (requests per minute). We need to work out what the transaction rate is for our different web pages during peak load. In the past I’d used New Relic for this, but had at times found it difficult to match individual requests to the controllers shown there. Using Splunk proved far more productive, but other ways of analyzing your access log files can work nicely too. There is something very nice about dealing with the raw requests and being able to query them.
  2. Concurrency. It’s all well and good knowing how many individual transactions we need to simulate, but it’s also important to know how many concurrent users are required to generate this load. Matching the expected concurrent user levels means we accurately simulate things like open sessions on our servers and TCP ports on our network devices. We have end user stats collected for us by Omniture, and using these we could establish our peak hourly unique visitors and average session duration. With this simple equation we can work out our peak concurrency: hourly unique visitors / (60 minutes / average session duration in minutes)
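
As a rough illustration of both metrics, here is a minimal Ruby sketch. It assumes access logs in the common log format (the post itself used Splunk rather than a script like this), and the visitor numbers, sample log lines and method names are made up for the example.

```ruby
# Matches the timestamp of a common-log-format line down to the minute,
# e.g. [10/Oct/2013:13:55:36 +1000] captures "10/Oct/2013:13:55".
MINUTE_PATTERN = %r{\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} }

# Returns the highest number of requests seen in any single minute.
def peak_requests_per_minute(lines)
  per_minute = Hash.new(0)
  lines.each do |line|
    match = MINUTE_PATTERN.match(line)
    per_minute[match[1]] += 1 if match
  end
  per_minute.values.max || 0
end

# Peak concurrency using the equation above:
# hourly unique visitors / (60 minutes / average session duration in minutes)
def peak_concurrency(hourly_unique_visitors, avg_session_minutes)
  hourly_unique_visitors / (60.0 / avg_session_minutes)
end

# Hypothetical sample data, not real traffic.
SAMPLE_LOG = [
  '203.0.113.7 - - [10/Oct/2013:13:55:36 +1000] "HEAD /agent/a HTTP/1.1" 200 0',
  '203.0.113.8 - - [10/Oct/2013:13:55:48 +1000] "GET /agent/b HTTP/1.1" 200 5120',
  '203.0.113.9 - - [10/Oct/2013:13:56:02 +1000] "HEAD /agent/c HTTP/1.1" 200 0'
].freeze

puts peak_requests_per_minute(SAMPLE_LOG)
puts peak_concurrency(12_000, 5).round
```

With the made-up figures of 12,000 hourly unique visitors and a 5 minute average session, the equation gives 1,000 concurrent users.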

Writing a user friendly L&P script

We used a DSL provided by the ruby-jmeter gem to capture a representative user flow rather succinctly. Throughput percentages were again calculated from Splunk data. It would be possible to script this up directly in JMeter itself, but having the script written in this DSL is beneficial for these reasons:

  1. It goes nicely into source control, unlike JMeter’s JMX (XML) files.
  2. I find it far cleaner and easier to understand than JMeter itself. The DSL presumes various common sense defaults and keeps you away from some of the more arcane elements of JMeter.
  3. I find the L & P knowledge easier to share in this format. It makes it really easy to share and copy snippets of L & P script logic.
  4. Our developers are generally familiar with Ruby, but not necessarily with JMeter. Being in Ruby, the script can also do smart things outside the L&P script itself, such as setting up test data.
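
To illustrate the last point, here is a small sketch of test data set-up in plain Ruby. The agent IDs and the pairing rule are hypothetical; the only thing taken from the script is the column order (primary agent, then secondary agent) that its CSV data set expects.

```ruby
require 'csv'

# Hypothetical agent IDs; in practice these might come from a database or an API.
AGENT_IDS = %w[agent-1001 agent-1002 agent-1003 agent-1004].freeze

# Pair each agent with the next one in the list (wrapping around at the end)
# to produce [primary_agent, secondary_agent] rows.
def agent_pairs(ids)
  ids.each_with_index.map { |id, i| [id, ids[(i + 1) % ids.size]] }
end

# Render the rows as CSV text without a header row, since the variable
# names are supplied separately in the load script.
def agents_csv(ids)
  CSV.generate do |csv|
    agent_pairs(ids).each { |pair| csv << pair }
  end
end

puts agents_csv(AGENT_IDS)
```

Calling File.write('agents.csv', agents_csv(AGENT_IDS)) before the test definition would then produce the data file the script reads.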

Here is a slightly cut down version of the script we used:

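require 'ruby-jmeter'

test do

  # Defaults for every request; don't fetch embedded resources such as images
  defaults domain: 'www.realestate.com.au',
    image_parser: false
  with_user_agent :ie9

  # Headers sent with every request
  header [
    { name: 'Accept-Encoding', value: 'gzip,deflate,sdch' },
    { name: 'Accept', value: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' }
  ]

  # Rows of agents.csv populate ${primary_agent} and ${secondary_agent}
  csv_data_set_config filename: 'agents.csv',
    variableNames: 'primary_agent,secondary_agent'

  # 1000 concurrent users, ramped up over 600 seconds, looping until the test is stopped
  threads count: 1000, rampup: 600, scheduler: false, continue_forever: true do
    # Randomised think time before each request (values in milliseconds)
    random_timer 60_000, 40_000
    head name: 'Check Primary Agent Is Published', url: '/agent/${primary_agent}', protocol: 'https'

    # Only 24% of iterations also check the secondary agent
    Throughput percent: 24 do
      head name: 'Check Secondary Agent Is Published', url: '/agent/${secondary_agent}', protocol: 'https'
    end

    # Only 5% of iterations fetch the full profile page and assert on its content
    Throughput percent: 5 do
      get name: 'Get Agent Profile Page', url: '/agent/${primary_agent}', protocol: 'https' do
        assert substring: 'Share this agent', scope: 'main'
      end
    end
  end

end.flood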
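
The throughput percentages in a script like this can be derived from per-transaction peak rates. The sketch below shows one way to do that. The rates are hypothetical numbers chosen so that they reproduce the 24% and 5% used above, and the method name is invented for the example.

```ruby
# Hypothetical peak rates (requests per minute) per transaction,
# e.g. as read off an access log query.
PEAK_RATES = {
  'Check Primary Agent Is Published' => 1000,
  'Check Secondary Agent Is Published' => 240,
  'Get Agent Profile Page' => 50
}.freeze

# Express every other transaction as a percentage of the baseline
# transaction, which runs on every iteration of the thread group.
def throughput_percentages(rates, baseline)
  base = rates.fetch(baseline).to_f
  rates.reject { |name, _| name == baseline }
       .transform_values { |rate| (rate / base * 100).round }
end

throughput_percentages(PEAK_RATES, 'Check Primary Agent Is Published').each do |name, percent|
  puts "#{name}: #{percent}%"
end
```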

Running the script and diagnostics

Generally we’ll generate load from nodes in the cloud, and we use a service from flood.io to make this easier for us. The service takes care of provisioning load generators (in our AWS account or theirs) and aggregating results, and it provides pretty handy reporting and statistics. These reports are great at letting us know if something is going wrong, but we’ll generally use other tools such as New Relic and CloudWatch at the same time to monitor what’s happening on our servers. Here is an example of a failed L&P test:
[Figure: load test results from Flood.io]

Generally we’d expect the transaction rate to track the ramping concurrency, but in this case the servers couldn’t keep up and as a result the response time blew out. We were able to align these blips with garbage collection events identified using New Relic.
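
The expectation that transaction rate tracks concurrency can also be checked mechanically. The following is a sketch only, using made-up samples and an arbitrary 80% tolerance; it is not how the flood.io reports work.

```ruby
# One measurement taken during the ramp-up.
Sample = Struct.new(:users, :requests_per_minute)

# Returns the samples whose per-user throughput fell below `tolerance`
# times the per-user throughput of the first sample.
def saturated_samples(samples, tolerance: 0.8)
  first = samples.first
  baseline = first.requests_per_minute.to_f / first.users
  samples.select do |s|
    s.requests_per_minute.to_f / s.users < baseline * tolerance
  end
end

# Hypothetical measurements: throughput stops tracking concurrency
# somewhere between 400 and 700 users.
SAMPLES = [
  Sample.new(100, 100),
  Sample.new(400, 398),
  Sample.new(700, 520),
  Sample.new(1000, 510)
].freeze

saturated_samples(SAMPLES).each do |s|
  puts "#{s.users} users: only #{s.requests_per_minute} requests/min"
end
```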

Ultimately we were able to establish that the default JVM thread count used by our Scala application was far too low and limited what our servers should have been able to handle. By changing this and rerunning the test we were able to prove that our servers would handle the expected peak load fine. This was a far better solution than simply provisioning extra servers.

Reporting and Source Control

We try to keep everything from the load script (and accompanying test data etc) to the test plan/reports in source control. By writing the plan/reports in Markdown these can easily be tracked in git and presented on our GitHub appliance.

When it’s time for a fresh L&P testing session I’ll create a new directory (with the date or another meaningful string in the name), so the test plan/report and the script can evolve over time.
