Performance tuning with Flood IO and New Relic, part 3

In our first post of this series, we introduced the basic concepts of performance tuning and demonstrated how you can simulate load using Flood IO and analyse performance using New Relic.

The second post used slow transaction and database traces to help identify and tune obvious problems in our application under test.

In this last post we use New Relic to track down the remaining problems and Flood IO to confirm that they are fixed.

Custom Instrumentation

New Relic includes an API you can use to collect additional metrics about your application. If you see large "Application Code" segments in transaction trace details, custom metrics can give a more complete picture of what is going on in your application.

Flood IO is still showing problems around caching. Looking at the source code, we see an existing tracer method in the CachingController.

This lets us create custom dashboards within New Relic that consume this data. It's evident that no matter how many times this method is called, the minimum response time is always above 30 ms.
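
To make the idea of a method tracer concrete, here is a minimal plain-Ruby sketch. It is not the New Relic agent and not the code from the application under test: SimpleTracer, trace_method and slow_action are hypothetical names, and the 30 ms sleep only simulates the behaviour described above. The real agent offers the same kind of wrapping through add_method_tracer in NewRelic::Agent::MethodTracer.

```ruby
# Stand-in for an APM agent's method tracer: wraps a method, times each call,
# and stores the timings under a metric name. Hypothetical, not New Relic code.
module SimpleTracer
  # Metric name => array of call durations in seconds.
  METRICS = Hash.new { |hash, key| hash[key] = [] }

  def trace_method(method_name, metric_name)
    original = instance_method(method_name)
    define_method(method_name) do |*args, &block|
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      begin
        original.bind(self).call(*args, &block)
      ensure
        # Record the duration even if the traced method raises.
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
        METRICS[metric_name] << elapsed
      end
    end
  end
end

# Plain Ruby class standing in for the Rails controller.
class CachingController
  extend SimpleTracer

  # Hypothetical action; the sleep simulates roughly 30 ms of work.
  def slow_action
    sleep 0.03
    :done
  end
  trace_method :slow_action, 'Custom/slow_action'
end

controller = CachingController.new
3.times { controller.slow_action }
timings = SimpleTracer::METRICS['Custom/slow_action']
puts format('calls=%d min=%.1fms', timings.size, timings.min * 1000)
```

With timings collected per metric name like this, a floor in the minimum response time stands out regardless of how many calls are made.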

Looking at the code, we can see this method is trying to make use of Rails.cache. However, closer inspection shows that the key name being read differs from the key name being written, so every read is a cache miss.
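
The effect of a mismatched key is easy to reproduce. The sketch below uses a tiny in-memory MemoryCache as a stand-in for Rails.cache (only read, write and fetch are modelled), and the key names and action methods are made up for illustration; the actual controller code is not shown in this post.

```ruby
# Minimal in-memory stand-in for Rails.cache (read/write/fetch only).
class MemoryCache
  def initialize
    @store = {}
  end

  def read(key)
    @store[key]
  end

  def write(key, value)
    @store[key] = value
  end

  # Return the cached value, or compute it with the block and store it.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

# Buggy: reads one key but writes another, so every call is a miss.
def buggy_action(cache, lookup)
  cached = cache.read('report_v1') # key being read
  return cached if cached

  value = lookup.call
  cache.write('report-v1', value) # different key being written
  value
end

# Fixed: fetch uses a single key for both the read and the write.
def fixed_action(cache, lookup)
  cache.fetch('report_v1') { lookup.call }
end

calls = 0
lookup = -> { calls += 1; 'expensive result' }

buggy_cache = MemoryCache.new
3.times { buggy_action(buggy_cache, lookup) }
buggy_calls = calls

calls = 0
fixed_cache = MemoryCache.new
3.times { fixed_action(fixed_cache, lookup) }
fixed_calls = calls

puts "buggy lookups: #{buggy_calls}, fixed lookups: #{fixed_calls}"
# prints "buggy lookups: 3, fixed lookups: 1"
```

Using a single fetch call with a block, which Rails.cache also supports, removes the opportunity for the read and write keys to drift apart.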

We can quickly deploy a fix and confirm success manually with a single browser session.

External Services

The tweets transaction is also slow, and further investigation shows the majority of time is spent in a call to an external service: Net::HTTP[twitter.com]: GET

Outbound calls to Twitter from the TweetsController are going to be expensive under concurrent load.

By simply caching at the page level, we can get away with not having to execute the controller code for every request, thereby limiting the number of outbound calls made to an external service.
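
As a rough illustration of why this helps, the sketch below caches a rendered page per path for a fixed time and counts how often the expensive block runs. PageCache, its ttl and clock parameters, and the fetch_tweets lambda are all hypothetical; this is not Rails page caching itself, only the same idea in plain Ruby, and it makes no attempt at thread safety.

```ruby
# Hypothetical page cache: stores a rendered body per path for ttl seconds.
class PageCache
  Entry = Struct.new(:body, :expires_at)

  def initialize(ttl:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @clock = clock
    @pages = {}
  end

  # Serve the cached body if it is still fresh; otherwise run the block
  # (the "controller code") and cache what it renders.
  def serve(path)
    now = @clock.call
    entry = @pages[path]
    return entry.body if entry && entry.expires_at > now

    body = yield
    @pages[path] = Entry.new(body, now + @ttl)
    body
  end
end

outbound_calls = 0
fetch_tweets = lambda do
  outbound_calls += 1 # stands in for the Net::HTTP call to twitter.com
  '<ul><li>tweet</li></ul>'
end

cache = PageCache.new(ttl: 60)
100.times { cache.serve('/tweets') { fetch_tweets.call } }
puts "requests: 100, outbound calls: #{outbound_calls}"
# prints "requests: 100, outbound calls: 1"
```

The trade-off is staleness: the page can be up to ttl seconds out of date, which is usually acceptable for a list of tweets.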

Errors

Last but not least, we want to track down error events. New Relic makes this easy with its event monitor. We can get an idea of the error rate and of when errors are occurring under load.

We can also get a breakdown of the types of errors that occurred.

The stack trace pinpoints exactly where in our application code things are going wrong.

These are simply functional errors, but the cost of serialising stack traces and handling those errors in a production environment can still be high. So a suitable outcome here is to resolve the division by zero error being reported.
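
The offending code is not reproduced in this post, so the following is only a sketch of the usual shape of such a fix. Both methods and their arguments are hypothetical: one divides integers without checking the divisor, the other guards against zero before dividing.

```ruby
# Hypothetical calculation; the real code from the stack trace is not shown.
def unsafe_ratio(hits, total)
  hits / total # Integer division raises ZeroDivisionError when total is 0
end

# Guarded version: returns 0.0 instead of raising when there is nothing to divide by.
def safe_ratio(hits, total)
  return 0.0 if total.zero?

  hits.to_f / total
end

begin
  unsafe_ratio(5, 0)
rescue ZeroDivisionError => e
  puts "unsafe_ratio raised #{e.class}"
end

puts safe_ratio(5, 0) # prints 0.0
puts safe_ratio(1, 4) # prints 0.25
```

Whether 0.0 is the right fallback depends on what the value means to the caller; the point is that the request no longer ends in an exception and a serialised stack trace.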

Confirmation

The last part of a performance tuning test effort is to confirm all the iterative changes made to date hang together.

In our last baseline we see much better response time averages across the board, and are now easily satisfying the 4s target. We've also eliminated any errors under load.

Now that we've whipped the application under test into shape, it's time to start load testing. We choose an arbitrary concurrency of 1000 users with a response time target of less than 4s. We scale out with 6x Heroku dynos and 3x grid nodes in Flood IO across the US East and West Coast as well as Australia. We also add the Flood IO dashboard to New Relic.

The great thing about this is we have all the information in one place. Now that we've 'fixed' the initial round of performance defects, we can identify new problems under sustained load. Looks like request queuing is happening on the Heroku dynos, but resolution of that is for a different blog post.

Flood IO and New Relic are a powerful combination. We hope you get to use both platforms for your next performance test effort.