In the last post, we looked at Minitest::Benchmark as a high-level concept and sketched out a basic structure for how we might use it in our tests. I don’t know about anyone else, but I’ve gotta say: the process of researching and writing that post gave me a lot of perspective on what Benchmarks do and where my own tests could benefit from them. There’s a limit, though, to what you can learn from looking at technology from 10,000 feet. What usually delivers the most bang for my buck is seeing how a thing is used in practice, trying it out myself, and finding where I get value from it.
In this post, we’re going to take a first crack at a practical benchmarking example to get started down that path. When we’re finished, we’ll reflect on what we’ve observed and what, if anything, we’ve learned from it.
The code under test
Recall from the previous post that Benchmarks run the code in a given block against progressively larger workloads. So as much as I didn’t want it to come to this, the example that I chose for this post is sorting algorithms. Why?
- They’re familiar to a lot of programmers from CS courses they took years ago. (Or in some cases, even decades…)
- Their performance characteristics are comprehensible and predictable.
- The workload is easily quantifiable as the size of the list of items to be sorted.
For demonstration purposes, it’s enough to implement just one or two sorting algorithms, preferably with differing performance characteristics, and to build out a practical example project. In this case, I took insertion sort and merge sort as examples.
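Here’s a sketch of what those two implementations can look like. The class names (InsertionSorter, MergeSorter) are my shorthand for this walkthrough; the gist linked below has the actual source.

```ruby
class InsertionSorter
  # O(n^2): walk each element leftward until it finds its place.
  def self.sort(items)
    result = items.dup
    (1...result.length).each do |i|
      j = i
      while j > 0 && result[j - 1] > result[j]
        result[j - 1], result[j] = result[j], result[j - 1]
        j -= 1
      end
    end
    result
  end
end

class MergeSorter
  # O(n log n): recursively split the list, then merge the sorted halves.
  def self.sort(items)
    return items.dup if items.length <= 1
    mid = items.length / 2
    merge(sort(items[0...mid]), sort(items[mid..-1]))
  end

  def self.merge(left, right)
    merged = []
    until left.empty? || right.empty?
      merged << (left.first <= right.first ? left.shift : right.shift)
    end
    merged + left + right
  end
end
```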
(For those who are interested, all of the main source files for this post can be found in a gist.)
Getting set up to test this code is similar to other projects you may have seen. First, we’ll want to create a Rakefile with tasks for running our tests. While it’s possible to run all tests with a single task, that’s not going to be very useful. Why? Because when you’re using unit tests to guide your development, you want to keep the feedback loop between running tests and writing code as tight as possible. Not unlike acceptance tests in your Rails applications, a properly conceived Benchmark will most likely take more time to execute than you’ll want to spend during regular development, TDD or otherwise. For this reason, I’d recommend setting up a separate :bench task that you’ll use to run your benchmarks.
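A Rakefile along those lines can look like the sketch below. The file layout is an assumption on my part - unit tests in files matching *_test.rb, benchmarks in *_bench.rb - so adjust the patterns to fit your own project.

```ruby
require "rake"
require "rake/testtask"

# Fast unit tests: the default task, run constantly during development.
Rake::TestTask.new(:test) do |t|
  t.libs << "lib" << "test"
  t.pattern = "test/**/*_test.rb"
end

# Slow performance benchmarks: run deliberately via `rake bench`.
Rake::TestTask.new(:bench) do |t|
  t.libs << "lib" << "test"
  t.pattern = "test/**/*_bench.rb"
end

task default: :test
```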
Next, you’ll want to set up the test helper file that will be required at the top of each of your tests. This will be used both for regular unit tests and performance benchmarks, so you’ll want to make sure that you explicitly require minitest/benchmark, which is not otherwise loaded by minitest/autorun.
For the purposes of this example, I’ve also added a RandomArrayGenerator helper module that I can use to generate lists of randomized numbers and Strings, which my tests will use as targets for sorting.
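Here’s a sketch of that helper. The method names inside RandomArrayGenerator are my own guesses at the shape of such a mixin; the gist has the real thing.

```ruby
# test/test_helper.rb
require "minitest/autorun"
require "minitest/benchmark"  # not pulled in by minitest/autorun

# Mixin for building randomized sort targets.
module RandomArrayGenerator
  # An array of `size` random integers in 0...max.
  def random_int_array(size, max = 1_000_000)
    Array.new(size) { rand(max) }
  end

  # An array of `size` random lowercase strings, `length` chars each.
  def random_string_array(size, length = 8)
    letters = ("a".."z").to_a
    Array.new(size) { Array.new(length) { letters.sample }.join }
  end
end
```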
At this point, I realized that the Benchmarks I’m about to write don’t mean much if the algorithms I’ve implemented don’t work as expected, so I threw together a simple unit test that checks the results of my Sorter classes against the results of Ruby’s built-in Array#sort.
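That test looks something like this sketch. A minimal insertion sort is inlined here so the snippet stands alone; the real test would instead require the project’s sorter classes from lib/.

```ruby
require "minitest/autorun"

# Stand-in so this sketch runs on its own.
class InsertionSorter
  def self.sort(items)
    items.each_with_object([]) do |x, out|
      out.insert(out.index { |y| y > x } || out.length, x)
    end
  end
end

class SorterTest < Minitest::Test
  def setup
    @numbers = Array.new(100) { rand(10_000) }
    @strings = Array.new(100) { rand(36**8).to_s(36) }
  end

  # Each Sorter must agree with Ruby's built-in Array#sort.
  def test_sorts_numbers_like_ruby
    assert_equal @numbers.sort, InsertionSorter.sort(@numbers)
  end

  def test_sorts_strings_like_ruby
    assert_equal @strings.sort, InsertionSorter.sort(@strings)
  end
end
```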
Once we’ve seen that this test passes, we’re ready to implement our Benchmark. As a first step, we need to define the range of values and test data that can be used across the various runs. As mentioned in the previous post, you define the range by overriding the bench_range class method as we’ve done below. The resulting values will be: [10, 100, 1_000, 10_000, 25_000].
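In code, that override is just a few lines, sketched here without the rest of the benchmark class:

```ruby
require "minitest/benchmark"  # the real file would load the test helper

class SorterBenchmark < Minitest::Benchmark
  # Workload sizes for each timed run. Without this override, the
  # default range is powers of ten: [1, 10, 100, 1_000, 10_000].
  def self.bench_range
    [10, 100, 1_000, 10_000, 25_000]
  end
end
```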
Practically speaking, it takes a little trial and error to find the right range for each Benchmark. In some cases the default range (powers of 10 up to 10,000) might be just what you need, but my experiments so far have shown that while larger values may produce better regression function fit, they can also be prohibitively slow to run. For example, an attempt at running bubble sort for arrays of 100,000 elements took several minutes to execute. Ain’t nobody got time fo dat.
Next, we’ll initialize collections of test data with the number of elements required for each value from our bench_range. To do that, we want to use a normal Minitest setup method implementation and the helper mixin that we wrote into the test helper.
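Sketched out, that setup looks something like the following; a minimal version of the RandomArrayGenerator mixin is folded in so the snippet stands alone.

```ruby
require "minitest/benchmark"  # the real file would load the test helper

module RandomArrayGenerator  # minimal version of the helper mixin
  def random_int_array(size)
    Array.new(size) { rand(1_000_000) }
  end
end

class SorterBenchmark < Minitest::Benchmark
  include RandomArrayGenerator

  def self.bench_range
    [10, 100, 1_000, 10_000, 25_000]
  end

  # Build one input array per workload size up front, so the timed
  # blocks measure sorting and nothing else.
  def setup
    @inputs = self.class.bench_range.each_with_object({}) do |n, h|
      h[n] = random_int_array(n)
    end
  end
end
```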
We’re finally ready to specify our benchmark tests, but first we need to have a hypothesis about the performance characteristics for each algorithm. Fortunately for us, both of these have been studied by generations of CS students, so we already have some expectations:
- Insertion Sort - varies as the square of the number of elements (O(n²)) → assert_performance_power
- Merge Sort - varies as the number of elements times its logarithm (O(n log n)) → assert_performance_linear (the closest fit Minitest offers, since there’s no n log n assertion and a linear fit is a close approximation over this range)
So our tests end up looking like this:
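Here’s a sketch of those two tests. assert_performance_power fails unless the timings fit t = a·n^b with an R-squared of at least the given threshold, and assert_performance_linear does the same for a straight-line fit. InsertionSorter, MergeSorter, and the @inputs hash from setup are assumed from the surrounding class.

```ruby
require "minitest/benchmark"  # the real file would load the test helper

class SorterBenchmark < Minitest::Benchmark
  def self.bench_range
    [10, 100, 1_000, 10_000, 25_000]
  end

  def bench_insertion_sort
    # Fails unless times fit a power curve with R^2 >= 0.99.
    assert_performance_power 0.99 do |n|
      InsertionSorter.sort(@inputs[n])
    end
  end

  def bench_merge_sort
    # O(n log n) treated as approximately linear over this range.
    assert_performance_linear 0.99 do |n|
      MergeSorter.sort(@inputs[n])
    end
  end
end
```

Each block is handed the current bench_range value n, so it sorts the pre-generated array of that size and nothing more.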
Let’s see how our assumptions measure up to reality:
It looks like our merge sort implementation has matched expectations, but insertion sort is off. There are plenty of possible reasons for the results we’re getting here:
- The algorithm isn’t optimal. To know that, though, we’d need to better understand whether we’re overperforming or underperforming with respect to the regression function.
- The test data isn’t as random as expected. Insertion sort performs better (closer to linear) on data sets that are already partially sorted. (And in fact, further testing showed a better correlation with linear functions than power functions for the values tested, though still not with an R-squared of 0.99 or more.)
- Natural variability in randomized data and different runs will produce different results. That’s unlikely to overcome the kind of gap you see here with our insertion sort, but it could result in failures on certain test runs for merge sort.
- External and interpreter-specific factors (e.g. garbage collection) could affect results.
And even though I’ve got a failing test here, I’m still getting some valuable feedback from it. I can see how closely my results map to a specific fit type which gives me some insight into how well the code is likely to perform with larger and larger inputs. Like any failing test, it could be telling me that my code needs to improve or it could be telling me that my tests need to improve.
I’ll be honest with you: I’m just getting my feet wet with Minitest::Benchmark. I still have plenty to learn, but I can see opportunities to include it in my suites in the future as:
- An initial target for defining code performance characteristics
- A defensive measure to catch potential future regressions
In the next installment of this series, I’m going to take a look at how we might use this kind of testing to run benchmarks against a typical Rails application.