Ignore Django, write testable code

One of the best talks at PyCon 2013 was Gary Bernhardt's Boundaries talk. One aspect of the talk was avoiding mocking by basing the data interchange format on value types. Value types are distinctly separate from entity types (where the identity of the object is more than its value). The canonical example of a value type is money: a $5 bill is a value type because it can be evenly exchanged with any other $5 bill. Entity types, on the other hand, are closely tied to their identity. Even if a user has all the same preferences as another user, its profile is distinct.
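The distinction is easy to demonstrate in a few lines. This is a generic illustration (the Money type here is hypothetical, not from the talk or from GitStreams): two value objects with the same fields compare equal, the same way two $5 bills are interchangeable.

```python
from collections import namedtuple

# A hypothetical Money value type: equality is defined entirely by the
# values of its fields, not by object identity.
Money = namedtuple("Money", "amount currency")

five_a = Money(amount=5, currency="USD")
five_b = Money(amount=5, currency="USD")

print(five_a == five_b)   # equal by value, even though they are two objects
print(five_a is five_b)   # not the same object
```
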

I attempted to use value types when building GitStreams, and it resulted in some interesting code patterns. The core of the idea is to use namedtuples as the main data interchange format, with a couple of goals in mind.

Goals

I had two simple goals with this change:

  1. I wanted faster tests.
  2. I wanted a clearly documented API for interchange.

Faster Tests

Achieving faster tests meant one thing: avoiding the database as much as possible. We all know by now that the slowest thing in test suites is the creation, setup, and teardown of a database. This can easily increase the runtime of tests by an order of magnitude. At Google, we had a concept of test sizes which set reasonable expectations for what you were allowed to do within the system. Database access bumped the smallest test a level higher into the "medium" category, alongside sleep statements, filesystem access, and multiple threads.

Clearly documented interchange API

On the goal of interchange, I was interested in playing around with the idea of clearly documented interfaces. While a bit verbose, clearly documenting (in code) the protocol you expect has a few nice advantages. First, you get a clearly defined surface to test against. You'll know what the code you're calling into is allowed to expect from you, rather than having to read its implementation for pointers.

Interfaces are also nice because they allow users to conform to them in ways that make sense to them. Maybe some future user of your app is going to be powered by MongoDB. If they eventually switch off of that data store, they merely need to adapt their new persistence layer into the value objects that your system already speaks. You don't have to make large, sweeping changes to the codebase to accommodate this. I've done this very thing (swapping storage layers) in GitStreams already.
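A sketch of what that adaptation might look like. Everything here is illustrative rather than GitStreams code: the dict stands in for a MongoDB-style document, the field names are invented, and UserT is simplified down to two fields.

```python
from collections import namedtuple

# Simplified interchange type (the real UserT in this post has more fields).
UserT = namedtuple("UserT", "email last_mailed")

# A record as a different storage layer might return it; here a plain dict
# stands in for e.g. a MongoDB document with its own naming conventions.
doc = {"email": "alice@example.com", "lastMailedAt": None}

def doc_to_user_t(doc):
    # Adapt the storage layer's shape to the interchange format.
    # Callers only ever see UserT, never the underlying document.
    return UserT(email=doc["email"], last_mailed=doc["lastMailedAt"])

user = doc_to_user_t(doc)
```

Swapping the persistence layer then means rewriting only this adapter, not the code that consumes UserT.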

In Practice

The approach I took was to use namedtuples. Namedtuples are one of those things I had heard about in passing, but not really paid much attention to. Turns out, they're tuples (immutable sequences) whose elements can be addressed by property names (rather than by index) given to them up front. They can't be updated, so they need to be built all at once. That said, they provide the nice access mechanism (just like normal Python objects) that I was looking for.
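Those properties are easy to see in a quick, generic example (Point here is just an illustration, not part of GitStreams):

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")
p = Point(x=1, y=2)

print(p.x)    # attribute access instead of p[0]
print(p[1])   # still a tuple, so indexing works too

try:
    p.x = 10  # immutable: assignment raises AttributeError
except AttributeError:
    pass

# "Updating" means building a new tuple; _replace is the stdlib helper for it.
p2 = p._replace(x=10)
```
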

    from collections import namedtuple

    UserT = namedtuple("User",
                       "email last_mailed followed_users followed_repos mail_interval")

    # convenience function for building new UserT's
    def gen_user_t(email=None, last_mailed=None,
                   followed_users=None, followed_repos=None, mail_interval=None):
        return UserT(email=email, last_mailed=last_mailed,
                     followed_repos=followed_repos, followed_users=followed_users,
                     mail_interval=mail_interval)

    class UserProfile(models.Model):
        # More code here...

        def to_user_t(self):
            return UserT(
                email=self.user.email,
                last_mailed=self.last_email_received,
                followed_users=[],
                followed_repos=[],
                mail_interval=self.max_time_interval_between_emails
            )

From here, I gave my Django models (in this case the one for UserProfile) a method to generate these namedtuples. I then pass them around my system like normal Python objects. One example of this is the function which determines whether we should send an email out to the user.

    from datetime import timedelta

    from django.utils.timezone import now

    def should_mail_user(user_t, seed_time=None):
        """Returns True if the time since the user was last mailed
        exceeds their mail_interval."""
        if seed_time is None:
            seed_time = now()
        if user_t.last_mailed is None:
            return True
        td = timedelta(days=interval_to_days(user_t.mail_interval))

        # Last mailing happened more than mail_interval days ago.
        return user_t.last_mailed < seed_time - td
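This is where the fast tests show up: should_mail_user can be exercised entirely in memory, with no database in sight. A sketch of such a test, using a stand-in interval_to_days (the real helper's conversion isn't shown in this post, so here the interval is assumed to already be a day count) and a fixed seed_time to keep the test deterministic:

```python
from collections import namedtuple
from datetime import datetime, timedelta

UserT = namedtuple("User",
                   "email last_mailed followed_users followed_repos mail_interval")

# Stand-in for the real helper: assume the interval is already a day count.
def interval_to_days(interval):
    return interval

def should_mail_user(user_t, seed_time=None):
    if seed_time is None:
        seed_time = datetime.utcnow()
    if user_t.last_mailed is None:
        return True
    td = timedelta(days=interval_to_days(user_t.mail_interval))
    return user_t.last_mailed < seed_time - td

# Build plain value objects; no ORM, no fixtures, no database.
seed = datetime(2013, 4, 1)
never_mailed = UserT("a@example.com", None, [], [], 7)
mailed_recently = UserT("b@example.com", seed - timedelta(days=2), [], [], 7)
mailed_long_ago = UserT("c@example.com", seed - timedelta(days=30), [], [], 7)

assert should_mail_user(never_mailed, seed_time=seed)
assert not should_mail_user(mailed_recently, seed_time=seed)
assert should_mail_user(mailed_long_ago, seed_time=seed)
```
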

If I were writing typical Django code, I'd likely encode this business logic in an ORM call, something like the snippet below. Unfortunately, that buries business logic in a SQL query, which is very slow to unit test.

    min_threshold = now() - timedelta(days=interval_to_days(user_t.mail_interval))
    User.objects.filter(Q(last_mailed__isnull=True) | Q(last_mailed__lt=min_threshold))

My application code now runs generic SQL queries (e.g. "give me all users") in the main method of the app (a management command, in my case), converts the results into the namedtuples above, and the rest of the application deals with those instead, filtering as necessary in plain Python code.
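The overall shape of that main method can be sketched as below. The names fetch_all_user_ts and users_to_mail are hypothetical, and a stub stands in for the ORM query so the sketch runs on its own; in the real command the fetch would be something like [p.to_user_t() for p in UserProfile.objects.all()].

```python
from collections import namedtuple

UserT = namedtuple("User", "email last_mailed mail_interval")

# Edge of the system: the one place that talks to the database.
# Stubbed here with literal values in place of an ORM query.
def fetch_all_user_ts():
    return [
        UserT("a@example.com", None, 7),
        UserT("b@example.com", "2013-03-01", 7),
    ]

# Core of the system: plain-Python filtering over value types,
# trivially testable without a database.
def users_to_mail(user_ts, should_mail):
    return [u for u in user_ts if should_mail(u)]

# Toy policy for the sketch: mail anyone we've never mailed before.
to_mail = users_to_mail(fetch_all_user_ts(), lambda u: u.last_mailed is None)
```
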

As I joked with a few folks I discussed this with, the path to faster Django tests is to not use any of the things in Django. Push the use of the ORM to the edges of your application code and deal with simpler types to make things easier to test.

The result of this exercise was easily (and quickly) testable code which isolated me from the ORM, such that I could add and change fields with little chance of breaking the surrounding code. Mission accomplished.