Understanding Etsy's search service migration

15 Nov 2022

Etsy published a recent article which documented how they migrated from their old search service to one which was based on deep learning. The majority of that article made little sense to me, so I wanted to take a moment to document what the heck that actually meant.

Originally, Etsy's search service was based on a gradient-boosted decision tree model, which drove personalized search. They had to manually engineer the features of the tree model and their relevancy gains began to plateau. Deep learning models offered several exciting potential improvements, especially because they could embed features directly, incorporate multi-modal data and alter the network architecture itself. The search team ran two passes of ranking and their work was primarily targeted at the second layer.

Starting from the beginning, a decision tree is a supervised learning algorithm (definition). The "supervised" keyword means that it takes a human to pre-categorize data so that we know how to bucket things correctly. From there, the machine learning algorithm will extrapolate the results to larger data sets.

Each node of the graph is either a root (the starting point), an internal node (also known as decision nodes) and leaf notes (aka terminal nodes). The intent is to find a path from the root to a leaf node based on a variety of predicates which let you know which outgoing edge to follow. By the time we get into leaf nodes, we want the supervised results to be homogenous.

As an example, we may have a simple question like "Should I wear a hat today?". There are clear cases when you should definitely wear a hat: It's sunny at noon in the desert, or it's going to be rainy. The question "What's the day's weather forecast?" isn't adequate to firmly bucket you into "yes, wear a hat" or "no, don't wear a hat". If it's going to be sunny today but you're going out at 9pm.. the answer is unlikely to be the same as if you're going out at noon.

Figuring out these decision points is a key aspect of decision tree models. The algorithm determines how to reach a decision through a process called "node splitting". There are a variety of algorithms for this which are rooted in math and statistics. The general idea, however, is that you split the known values up based on their features (i.e. properties) and seek to split them up based on those properties until you get homogenous results. You may only be able to get it more homogenous.. at which point you continue to do it until you reach some final homogenous grouping.

There's a real risk of overfitting with this approach. There are a few ways to address it, including one called "random forests". Overfitting feels out of scope for understanding the original article though.

The "boosted" adjective applies outside the context of decision trees in general. It's a machine learning term which seems to mean "take a bunch of guesses and apply them one-after-another until you get the right answer". In the algorithm's case, it means "take a bunch of crappy learning algorithms and apply them in series". The result is a better algorithm.

"Gradient", in this context, refers to gradient decent. It's a mechanism used to determine the minimum value of a function. In our case, the function is something like an error-rate function. We want to find the minimum error rate that we can within the boosting process.

For the "gradient-boosted decision tree" example, you take the output of one kinda janky decision tree and create another decision tree which corrects any errors found in the first one. You continue this process until you have a better model than you started with. These errors that are left over from a decision tree are called "residuals" and we use a loss function to detect hem (similar to other classification tasks). If you add a bunch of these decision trees in series.. you similarly risk overfitting.

Okay, so we know that a gradient descent decision tree is a bunch of smaller decisions combined to result in a low error rate, which in our algorithm's terms, means a bunch of leaf nodes which will correctly determine the output of something. Awesome.

When building out their deep learning model, they considered combining their decision tree with a deep neural-network (DNN). The process of combining two models into a super-model is called "ensembling". After testing, however, they didn't see meaningful improvements over the raw decision tree, so they dropped the combined approach.

Dropping ensembling was really helpful because DNNs are much faster to train than gradient-boosted decision trees (GBDT). They got it to 1/8th the speed, thanks to the use of GPUs. Not only that, the runtime latency is better with strict DNNs.

To start the migration, they brought over many of their engineered features to the new model (listing price, click rate, etc), then they had to normalize them. Normalizing is a process where you take multiple features and scale them down to the same scale. I believe this makes it so the model can learn faster because it isn't jumping all over the number line, but I'm not certain.

Not all of their features were numbers. They also had things like text-based features they needed to incorporate. To solve this, they created "custom embeddings", which appear to be a way to turn words into numbers. They noted that they had to build it themselves because most embeddings were trained on prose, not item data for an ecommerce website.

As they began to roll the feature out, they ran into a bunch of common software engineering issues, like migrating a live system. Ultimately, the DNN approach feels a bit like a black box to me, but I better understand the GBDT case and have some interesting areas to look into going forward.

Some remaining open questions:

Why are search results ranked in multiple passes? Are there limits to the number of passes that improve things? Is it slower to use more passes? How much?
How does search ranking work? Normalized discounted cumulative gain (NDCG) seems to be a common way of doing it, but I don't know anything about it.
How does custom embedding work? That seems like magic.