Automated testing has become ubiquitous in large software systems, to the point where not having it is considered bad practice.
Automated testing brings several benefits to a project. Not only does it help prevent bugs and regressions, where old bugs are reintroduced, but it also forces you to organize the code base so that it can be tested, which typically reduces coupling. Being able to test pieces of the software in isolation is called unit testing, which is the main focus of this article.
There is also integration testing, which aims to test the whole system end to end. If unit testing checks that the bricks perform to specification, integration testing checks that the bricks are glued together correctly.
However, automated testing is still mostly absent when developing machine learning models. There are good reasons for this. Machine learning routines are usually exploratory and change frequently as you test different hypotheses. And when you have a model that you're happy with, you finalize it and integrate it into a production pipeline, which is unit and integration tested.
There are also social factors. Data scientists, who develop new machine learning models, often don't have a background in software engineering, and so may be less familiar with the practice of automated testing. This is completely fine, but I do think there might be some value to be gained by exploring automated testing when developing models.
My experience with automated testing in machine learning
For a project I'm working on, I tried to use unit testing, and to a smaller extent integration testing, to validate the assumptions I make.
Deep learning routines are often painful to debug because they are largely numerical manipulations: you're passing around batches of matrices which are not necessarily immediately interpretable.
Usually, when I find a bug, I manually try to verify the assumptions that might have been invalidated by it. For example, I think that the shape of a tensor is [batch, channel, width, height], but it's actually permuted, or something else entirely.
The manual checking is tedious and takes a lot of time, and if there are a lot of assumptions, I sometimes forget whether I have already checked something, and have to check it again. Or maybe I have changed some things in the code, and want to make really sure that the change hasn't affected something else. Good ol' print debugging often comes in handy here.
But for this project, I tried to encode my assumptions into automated tests that I can run with a single keyboard shortcut. That way I can instantly check all the things I would usually spend a lot of time verifying manually, and have confidence that I didn't overlook anything. This lets me quickly find the error or regression, or at least vastly narrow down the things I need to verify manually to locate it.
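To make that concrete, here is a minimal sketch of what encoding a shape assumption as a test could look like with PyTorch and pytest. The layer and the shapes are stand-ins I chose for illustration, not the actual model from the project:

```python
import torch

# Toy stand-in for the real network; padding=1 with a 3x3 kernel preserves
# the spatial dimensions, which is exactly the assumption the test writes down.
model = torch.nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

def test_output_shape():
    x = torch.randn(4, 3, 32, 32)       # [batch, channel, width, height]
    y = model(x)
    assert y.shape == (4, 8, 32, 32)    # the assumption, written down once
```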
Some examples of unit tests I have which can be used in most projects (a small sketch follows the list):
- Does the Dataset produce the data shapes that I expect?
- Does the DataLoader collate the individual samples correctly into batches?
- Does the model input shape match the expected data shape? Are there any symmetries in the model that should be adhered to? For example, in an Autoencoder the input shape should be equal to the output shape.
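Here is a rough sketch of what these could look like. The dataset, model, and shapes below are toy stand-ins I made up for illustration; only the structure of the tests carries over to a real project:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the project's real Dataset and model.
dataset = TensorDataset(torch.randn(100, 3, 64, 64))
autoencoder = nn.Sequential(            # trivial autoencoder with matching input/output shapes
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 32),
    nn.Linear(32, 3 * 64 * 64),
    nn.Unflatten(1, (3, 64, 64)),
)

def test_dataset_sample_shape():
    (sample,) = dataset[0]
    assert sample.shape == (3, 64, 64)

def test_dataloader_collates_samples_into_batches():
    (batch,) = next(iter(DataLoader(dataset, batch_size=8)))
    assert batch.shape == (8, 3, 64, 64)

def test_autoencoder_output_shape_matches_input_shape():
    (batch,) = next(iter(DataLoader(dataset, batch_size=8)))
    assert autoencoder(batch).shape == batch.shape
```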
I'm also working with some symmetries within the data. A symmetry, generally speaking, is a value that is conserved under some transformation, for example the energy-momentum relation.
I have unit tests that verify that those values are conserved in the data up to numerical precision.
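As an illustration, such a test could look like the sketch below. I'm assuming a particle-physics setting with columns [E, px, py, pz] and using the invariant mass from the energy-momentum relation as the conserved value; the toy data is constructed so the relation holds, whereas in the project the check runs against the real data:

```python
import numpy as np

def invariant_mass_squared(particles: np.ndarray) -> np.ndarray:
    """m^2 = E^2 - |p|^2 in natural units; assumes columns [E, px, py, pz]."""
    E, px, py, pz = particles.T
    return E**2 - (px**2 + py**2 + pz**2)

def test_energy_momentum_relation_holds():
    # Toy data: massless particles, so E = |p| and m^2 should be ~0.
    p = np.random.randn(1000, 3)
    E = np.linalg.norm(p, axis=1)
    particles = np.column_stack([E, p])
    # The conserved value should hold up to numerical precision.
    assert np.allclose(invariant_mass_squared(particles), 0.0, atol=1e-9)
```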
These tests help me gain and keep confidence that I compute the symmetries correctly, so if I see that the model doesn't do well on these symmetries, I can more quickly conclude that it's a problem with the model as opposed to a bug in the code.
The tests also serve as a bit of an integration test that I am loading the data correctly. If I had permuted some of the columns, then the symmetries would no longer hold, but the primary objective of the tests is to verify that I compute the symmetries correctly.
The tests also help deal with the experimenter's regress, which is the uncertainty about what to attribute an unexpected result to. Is it due to an error in the experiment, or is my theory wrong?
All this is well and good, but does it actually help write better code?
A concrete example where unittests helped me
I tried to impose a constraint on the latent space of an Autoencoder such that it would be forced onto a spherical manifold which I had hypothesized would be useful with respect to some symmetries in the data.
However, when I implemented the spherical constraint, I reduced along the wrong axis, so that I was forcing each coordinate to have a certain length independently instead of constraining the length of the vector as a whole, which is what the spherical constraint requires. In 3 dimensions, a sphere is defined as

$$x_1^2 + x_2^2 + x_3^2 = r^2$$

Instead of summing across the coordinates, I forced each coordinate to satisfy the constraint

$$x_i^2 = r^2$$

which greatly limits the model's expressivity.
However, with a unit test, I could spot that the constraint wasn't satisfied for points that were on the sphere, and could therefore quickly correct the error. That was much faster than if I had just seen that the model performed poorly during training and had to find the root cause without the hint that something was wrong with the constraint.
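Here is a sketch of what this bug and the test that caught it could look like in PyTorch, assuming the latent batch has shape [batch, latent_dim] (the function name and shapes are mine, chosen for illustration):

```python
import torch

def spherical_penalty(z: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Penalize deviation of each latent vector's length from the sphere radius.

    z is assumed to have shape [batch, latent_dim]. The buggy version reduced
    along the wrong axis, constraining individual coordinates instead of the
    length of the whole vector.
    """
    squared_norm = (z ** 2).sum(dim=1)   # correct: sum across the coordinates
    return ((squared_norm - radius ** 2) ** 2).mean()

def test_penalty_is_zero_for_points_on_the_sphere():
    z = torch.randn(128, 3)
    z = z / z.norm(dim=1, keepdim=True)  # project the points onto the unit sphere
    assert torch.isclose(spherical_penalty(z), torch.tensor(0.0), atol=1e-6)
```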
Exploring what integration testing could look like in machine learning
Arguably, the distinction between integration and unit testing is not as clear-cut as I first made it seem.
For example, testing that the data has certain symmetries can be seen as an integration test for the data loader. In principle, you could mock the data, which is often done in traditional software engineering, but mocking also carries the risk of introducing bugs in the mocks themselves. If you verify that the mocked data has a certain symmetry, have you shown that the dataset has the correct symmetries, or have you only verified that you mocked the data correctly?
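To illustrate the dilemma, a mocked dataset might look like the sketch below (reusing the particle setup from the earlier sketch, which is purely illustrative). The symmetry holds by construction, so a test against this mock exercises the symmetry computation but says nothing about the real data:

```python
import numpy as np
from torch.utils.data import Dataset

class MockParticleDataset(Dataset):
    """Synthetic massless particles, so the energy-momentum relation holds by construction."""

    def __init__(self, n: int = 100):
        p = np.random.randn(n, 3)
        E = np.linalg.norm(p, axis=1)
        self.data = np.column_stack([E, p]).astype(np.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
```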
Keeping the unclear distinction in mind, I think it's important to keep unit tests fast, so you can run them frequently - maybe even on every save. Integration tests, on the other hand, may take longer to run, as they exercise a larger part of the system, or maybe even the whole system. So what are some examples of integration tests in machine learning? (A small sketch follows the list.)
- Does the training routine run at all?
- Does running the training routine once on a small subset of the data reduce the loss?
- Does the distribution of predictions/data/parameters follow what you expect?
- Are the various metrics being generated and reported?
- Can you save and load checkpoints?
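As a rough sketch, the first two items and the checkpoint test could look something like this. The model, data, and training step are toy stand-ins; only the structure of the tests is the point:

```python
import torch
from torch import nn

def train_one_step(model, data, targets, lr=0.1):
    """Single gradient step; a stand-in for the real training routine."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    loss = nn.MSELoss()(model(data), targets)
    loss.backward()
    opt.step()
    return loss.item()

def test_training_reduces_loss_on_a_small_subset():
    torch.manual_seed(0)
    model = nn.Linear(4, 1)
    data, targets = torch.randn(16, 4), torch.randn(16, 1)
    initial_loss = nn.MSELoss()(model(data), targets).item()
    losses = [train_one_step(model, data, targets) for _ in range(20)]
    assert losses[-1] < initial_loss

def test_checkpoint_round_trip(tmp_path):
    model = nn.Linear(4, 1)
    path = tmp_path / "checkpoint.pt"
    torch.save(model.state_dict(), path)
    restored = nn.Linear(4, 1)
    restored.load_state_dict(torch.load(path))
    for original, loaded in zip(model.parameters(), restored.parameters()):
        assert torch.equal(original, loaded)
```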
How to get started with automated testing
You might now be curious about automated testing, but unsure how to start and how to fit it into your workflow.
A gentle way of getting started, which I used for my project, is to write a test whenever I verify some property using the debugger, so I don't have to verify it manually again. Most of my tests are added this way, but if I'm writing something that I think has a high likelihood of containing bugs (a tricky function, for example), I write some tests even before I verify the properties in the debugger. This is how I was able to catch the bug in the example above.
Taking the preemptive approach to the extreme, you get something similar to Test Driven Development (TDD). In TDD, you write all your tests before you write any implementation, and then write the implementation to satisfy the tests. The idea is that the tests should cover the entire range of expected behaviors and inputs, as any implementation satisfying the tests is considered correct.
The benefit of TDD is that it forces you to write your code in a testable way, but dogmatic adherence to it can also lead to situations where you're working in a highly time-inefficient manner, having to assert all the properties before you can even test the idea. It might also lead to overtesting - testing even trivial properties. This may be useful when writing enterprise software, but is less desirable when doing exploratory experimentation.
This is why I tend towards the gentler method during exploration to keep the development velocity high, and progressively add more tests to cut down on debugging time.
It's a trade-off between investing time now or in the future. Writing tests can save time later by making debugging easier, but if the code is short-lived and experimental, you have to consider carefully which tests will be worth writing. However, my experience from this project is that even for highly exploratory code, having some tests is better than having none.