First- and Second-Order Metrics
As part of our continual improvement efforts, we have been taking a close look at the metrics we want to track at an organizational level. This has been happening at all levels of the organization, which has resulted in multiple concurrent conversations about which metrics we should track. In these discussions with engineers and leaders, we’ve lacked the vocabulary to classify the metrics we were discussing. I believe that classifying each metric as either first-order or second-order is a useful distinction.
A first-order metric is one that is tied directly to outcomes for our customers. Those outcomes are the reason we’re here to do work. First-order metrics answer questions like: When I open the mobile app, does it crash? When I go to walmart.com, does it give me an error page? If I load an item page, does it hang forever while loading?
Second-order metrics serve as leading indicators of our outcomes. You could also think of these as internal metrics, which we use to better understand our sociotechnical systems. If we want to improve latency, we might break down how much latency is contributed by our own system as compared to upstream systems. If we want a lower defect rate, we might track code coverage, because we believe that higher coverage correlates with fewer defects. Code coverage isn’t an outcome. We aren’t hired to increase code coverage, and customers don’t purchase something because the software had 100% test coverage. That said, it’s still useful as a shorthand to understand how testable (and, hopefully by extension, how well-written) our software is.
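To make the latency example concrete, here is a minimal sketch of that kind of breakdown. The function name, the service names, and the numbers are all hypothetical, and it assumes upstream calls happen one after another, so their durations can simply be subtracted from the total request time.

```python
def latency_breakdown(total_ms, upstream_ms):
    """Return each component's share of total request latency as a fraction.

    total_ms: end-to-end latency of a request, in milliseconds.
    upstream_ms: mapping of upstream system name -> time spent waiting on it.
    Assumes upstream calls are sequential (hypothetical simplification).
    """
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    upstream_total = sum(upstream_ms.values())
    if upstream_total > total_ms:
        raise ValueError("upstream time cannot exceed total time")

    # Whatever is not spent waiting on upstream systems is attributed to us.
    own_ms = total_ms - upstream_total
    shares = {name: ms / total_ms for name, ms in upstream_ms.items()}
    shares["own_system"] = own_ms / total_ms
    return shares


# Hypothetical request: 400 ms total, 300 ms of it spent waiting upstream.
breakdown = latency_breakdown(400.0, {"pricing": 100.0, "inventory": 200.0})
```

The customer only experiences the 400 ms, which is the first-order number. The shares are second-order: they tell us where to look if we want to bring that number down.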
Second-order metrics also extend to team-oriented metrics. We can’t decrease the number of high-priority bugs if we don’t have a team that works well together. These metrics include how consistently teams deliver features, as well as softer, NPS-like scores for how team members view their own team, and more.
Second-order metrics are often time-limited. While we track code coverage as a quality barometer today, we’ll eventually hit a point where everyone pretty much understands that they need to write unit tests (and thereby increase coverage) as their normal way of working. It’s reasonable to then drop this metric and pursue other metrics in service of our eventual outcome of reducing production bugs. As a counterpoint, we’ll never stop tracking high-priority bugs in production or the mobile crash rate, which is a signal that those metrics are tied to outcomes.
This distinction has been a useful framework for us to think about metrics. In general, we want to focus primarily on the first-order, outcome-oriented metrics. We can use second-order metrics where they are useful to us, but we won’t lose sight of what we’re here for: to provide a wonderful customer experience to the folks using our products.
Note: This originally appeared on the Walmart Global Tech Blog.