Sunday 20 July 2008

Tech Tip du Jour: Benchmarking

Non-geeks may want to look away now.

Yes, kids, today we're going to talk about benchmarking. And I know you're all salivating at the prospect. If there is one thing that is guaranteed to get every Informix DBA moist and emotional, it's benchmarking.

Why doesn't IBM release a benchmark? Why doesn't IBM release this benchmark, or that benchmark?

But I'm not talking about that! I'm talking about when you last benchmarked your own environment. Have you ever? At 99 out of every 100 sites I've been to, the answer is "no".

So when users come up to you and say: "the system is slower than it was last week", can you refute them? Do you have a representative suite of tests that you can run that reflects the consequences of increased user load and of increased data size? Can you plot a trend to show when you're likely to break your SLA, so that you can get kit upgraded in time?

No? You answered "no" to all of the above?

Well, then, perhaps you need to think about writing a set of benchmarks -- they don't need to be complex; some timed dbaccess scripts will be fine! -- that will slow down as your database grows. You also need to set up a separate, static database that you can use to show the consequences of increased user load.

So: take a number of your most common queries, preferably including some that aggregate all your data (what is our average order value? who is our most valuable customer?) as well as some that pull out OLTP data (which items belong to order 4285? what product is associated with product code 1037?). Write dbaccess scripts to time these, execute the scripts and record the times.
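
To make that concrete, here's the sort of thing I mean -- a minimal sketch, not gospel. The database name, the query files and the log path are all invented; substitute your own.

    #!/bin/sh
    # time_bench.sh -- time a handful of canned dbaccess queries and log the results.
    # DB, the .sql files and LOG are placeholders; substitute your own.
    # (date +%s is GNU date; use perl -e 'print time' if your Unix lacks it.)
    DB=live
    LOG=$HOME/bench/results.log

    cd $HOME/bench || exit 1
    for SQL in avg_order.sql top_customer.sql order_items.sql
    do
        START=`date +%s`
        dbaccess $DB $SQL > /dev/null 2>&1
        END=`date +%s`
        echo "`date '+%Y-%m-%d %H:%M:%S'`|$DB|$SQL|`expr $END - $START`" >> $LOG
    done

Each .sql file holds one of your chosen queries. One pipe-delimited line per query per run makes the log trivial to graph later.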

Then, create a reasonable subset database, which stays static for all time. Execute the same queries against the subset database and record the times.
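
The shape of the one-off build is something like this. Every name below is invented, and the dbaccess-reads-stdin idiom may vary by version; dbschema -d <db> -t <table> will print the real DDL for each table you want to copy.

    #!/bin/sh
    # build_static.sh -- one-off creation of a frozen benchmark database.
    # Run it once, then never reload the data. All names are invented.
    dbaccess - - <<'EOF'
    CREATE DATABASE bench_static WITH LOG;

    -- recreate the schema; "dbschema -d live -t orders" prints the real DDL
    CREATE TABLE orders (
        order_num   INTEGER,
        order_date  DATE,
        ship_charge MONEY(12,2)
    );

    -- copy across a fixed slice of live data (a cross-database SELECT
    -- within the same instance), then leave it alone forever
    INSERT INTO orders
        SELECT order_num, order_date, ship_charge
          FROM live:orders
         WHERE order_num <= 100000;
    EOF

Point the same timing script at bench_static as well: since its data never changes, any drift in those numbers is load or environment, never data growth.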

These performance tests should be run every day, via cron: when the system is idle, when it's busy, and when load is about average. Make sure the results are kept in a format that's easy to graph. Review the graphs regularly; you might be surprised at what you find.
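
The cron side is equally unexciting. Pick times that match your own quiet, busy and middling periods; the times and paths below are purely illustrative.

    # crontab entries for the informix user -- three timed runs a day
    0 6  * * *  /home/informix/bench/time_bench.sh   # idle: before the users arrive
    0 11 * * *  /home/informix/bench/time_bench.sh   # busy: mid-morning peak
    0 15 * * *  /home/informix/bench/time_bench.sh   # average: mid-afternoon

A pipe-delimited log loads straight into a spreadsheet or gnuplot.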

(If you've got 11.1 or 11.5, you'll find nice ways of integrating this with OAT and dbcron. That's also good. But if you're on 10.0, 9.40, 7.31, 5.20 or SE, do it the old-fashioned way.)

Do your users use ODBC or JDBC? Make sure you have a full suite of network-based tests as well. I can't tell you how many times having a ready suite of network-based tests (simple things, like SQL statements in cells in Excel connected via ODBC) saved my bacon when people came round and said "The database is slow." A quick run of the Excel suite, compared against historical data, showed not only that the database was fine, but that the network was fine too. So the problem had to be in the application code. Quelle horreur!
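
If you'd rather script the network-path test than keep an Excel workbook handy, the same trick works from the command line. A sketch, assuming unixODBC's isql is installed; the DSN and credentials are placeholders.

    #!/bin/sh
    # net_bench.sh -- time a trivial query through the ODBC stack rather
    # than shared memory. "live_odbc", myuser and mypass are placeholders.
    LOG=$HOME/bench/net_results.log
    START=`date +%s`
    echo "SELECT COUNT(*) FROM orders;" | isql live_odbc myuser mypass -b > /dev/null
    END=`date +%s`
    echo "`date '+%Y-%m-%d %H:%M:%S'`|odbc|`expr $END - $START`" >> $LOG

Because this timing includes the client library and the wire, a divergence between this log and the shared-memory log points straight at the network.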

I can't be more specific than that, because each database is different, each application requirement is different. But when you have done it, you will have a tool that is worth a thousand TPC benchmarks to your organisation.

Update: I managed to raise the ire of Fernando Nunes over at the Informix Technology blog; he wrote on comp.databases.informix:

I would not disagree with the points you make, but one particular point almost makes me angry... Let me quote you:

"the system is slower than it was last week", can you refute them?


Why the hell should we refute users' claims? This reminds me of my "daily wars" with people from all areas (DB, developers, OS, network, SANs etc.). Everybody's concern is to make clear that it's not their fault, when everyone should provide information that would make it easier to find out what is causing the issues... I strongly believe that the users don't complain without a reason... So what we should "benchmark" are activity metrics... disk reads, full scans etc.

On IDS 11.x it's easier to collect them, but it's relatively easy to collect them in any informix version. Some simple onstats or sysmaster queries can be very valuable...


My responses are:
Why the hell should we refute users' claims?


Mostly, because they're talking out of their arseholes. You have to understand this is only my experience, but after twenty-five years of doing this for a living, I can count the number of genuine instances of the system actually being demonstrably slower on the fingers of one hand.

By contrast, I get told that "the system is slower" about once a week: call it 1,300 complaints over twenty-five years, against perhaps five genuine cases. So, on average, one out of every 260 times I get told the system is slower, it actually IS slower. The other 259 times, the user is talking shit.

Everybody's concern is to make clear that it's not their fault, when everyone should provide information that would make it easier to find out what is causing the issues...


Well, I think I highlighted that in my example: I exonerated the database and the network by demonstrating something that even a CIO can understand, namely how long it takes to do a repeatable operation. That was important for finding the root cause.

what we should "benchmark" are activity metrics... disk reads, full scans etc.


onstat information might mean something to you and me, but it don't mean shit to a CIO. Time is something everyone can understand. Both things have their place.
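
For what it's worth, Fernando's activity metrics are easy to capture alongside the timings. A sketch -- sysmaster:sysprofile is real, but which counters matter is your call:

    -- snapshot the engine's cumulative activity counters; run this via
    -- dbaccess from the same cron jobs and log it next to the timings
    SELECT name, value
      FROM sysmaster:sysprofile
     WHERE name IN ('dskreads', 'dskwrites', 'bufreads', 'seqscans');

When the response-time graph goes wrong, the counter graph usually tells you why.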

In an ideal world Fernando would be completely right and I'd be completely wrong. But in my experience, we're not in an ideal world, and you're probably better off listening to both of us, rather than one of us.
