How To Evaluate The Models You Train
I promise we'll get back to building recommender systems around models in the next post or two, but I want to take a brief interlude to talk about evaluation. Once you've found some interaction data and figured out how to train a model (e.g. matrix factorization), the next question that arises will be "Is this model any good?"
For a great introduction to and overview of how to answer that question, I recommend the Coursera "Recommender Systems: Evaluation and Metrics" class. There are still some practical issues and potential pitfalls though, and it seems like the more comfortable someone is with machine learning in other areas, the more likely they are to fall into them. The questions to ask are mostly the same as in other ML domains, but the answers are a bit different for recommender systems:
- How should we test our models?
- What metrics should we use?
- Which baselines should we compare against?
- How do we know which model is better?
As we'll quickly see, the recurring theme here is "recommender system evaluation is hard and offline evaluation should be taken with several grains of salt."
How To Test?
The first thing most people will reach for when evaluating recommendation models is the standard ML practice of randomly splitting the available data into training/tuning/testing sets. Recommender systems introduce a few additional wrinkles that complicate that process:
- The examples aren't independent. What we observe is interaction with items but what we're interested in learning are user preferences. This implicitly links the examples for each user together with each other, to which a random split is oblivious.
- Time matters. Predicting past behavior based on future behavior is easier than the reverse, and a random split leaks data from the future into predictions about the past.
In order to avoid those issues, we want to split our dataset such that every user has examples in each of the subsets, and we compute evaluation metrics on the more recent examples.
We could accomplish the first part—every user has examples in every set—using leave-one-out cross-validation. The idea is that we randomly select one example per user to use for evaluating performance, train a model on the rest of the data, compute our metrics, and repeat the process with different random seeds to get a more robust estimate. However, cross-validation is fairly expensive since it requires training multiple models.
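As a rough illustration of that procedure, here's a minimal sketch in plain Python. The `leave_one_out_split` function, the `(user, item)` tuple format, and the toy data are all assumptions made for the example, and the choice to keep single-interaction users entirely in the training set is just one reasonable option.

```python
import random
from collections import defaultdict


def leave_one_out_split(interactions, seed):
    """Hold out one randomly chosen interaction per user.

    `interactions` is a list of (user, item) pairs. Users with only one
    interaction stay entirely in the training set (an assumption of this
    sketch, not a rule).
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for index, (user, _item) in enumerate(interactions):
        by_user[user].append(index)

    held_out = set()
    for indices in by_user.values():
        if len(indices) > 1:
            held_out.add(rng.choice(indices))

    train = [pair for i, pair in enumerate(interactions) if i not in held_out]
    test = [pair for i, pair in enumerate(interactions) if i in held_out]
    return train, test


interactions = [
    ("ann", "a"), ("ann", "b"), ("ann", "c"),
    ("bob", "a"), ("bob", "d"),
    ("cat", "b"),
]

# Each seed gives a different held-out set; in practice you would train and
# score a model inside this loop and average the resulting metrics.
for seed in range(3):
    train, test = leave_one_out_split(interactions, seed)
    print(seed, test)
```

The loop is where the cost comes from: every seed means another full training run.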
We could accomplish the second part (evaluation metrics are computed on recent data) by using a temporal split, so that all the training data comes from before a particular moment in time and all the tuning or test data comes from after. This also has some issues: we're not guaranteed to get examples for every user in every split, and there's now a gap between the end of our training data and the present, which is not ideal for a model we'd like to deploy and use to make real recommendations.
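A temporal split is only a couple of lines, so this sketch also checks for one of the issues just mentioned: users who land in the test set with no training history. The function names, the `(user, item, timestamp)` format, and the integer timestamps are assumptions for illustration.

```python
def temporal_split(interactions, cutoff):
    """Split (user, item, timestamp) rows at a single moment in time.

    Rows strictly before `cutoff` go to training; the rest go to testing.
    """
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test


def users_without_history(train, test):
    """Users who appear in the test set but have no training examples."""
    train_users = {user for user, _item, _ts in train}
    return {user for user, _item, _ts in test} - train_users


# Timestamps are plain integers here (think "days since launch").
interactions = [
    ("ann", "a", 1), ("ann", "b", 4), ("ann", "c", 9),
    ("bob", "a", 2), ("bob", "d", 3),
    ("cat", "b", 8),
]

train, test = temporal_split(interactions, cutoff=5)
print(users_without_history(train, test))  # cat only shows up after the cutoff
```

In this toy data, one user has no training examples and another has no test examples, which is exactly the coverage problem a temporal split can't rule out.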
Combining those two ideas, we could instead use a last-one-out evaluation procedure, where we reserve the most recent interaction for each user as test data and train the model on everything else. This is a fairly pragmatic hybrid: we're guaranteed to get an example for every user, we're evaluating on the most recent data for each user, but we accept that some future information from other users' interactions leaks into our training set and influences the model. It also limits the size of the test set to the number of users in the dataset, which may not be large enough to reliably detect performance differences between models.
Each of these methods (random split, leave-one-out, temporal split, or last-one-out) has drawbacks that result in biased evaluations. More sophisticated evaluation approaches (like sliding window evaluation) have been proposed, but not widely adopted.
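Here's a minimal sketch of last-one-out in plain Python. As before, `last_one_out_split`, the `(user, item, timestamp)` tuples, and the toy data are assumptions for the example, and leaving single-interaction users in training is a choice rather than a requirement.

```python
from collections import Counter


def last_one_out_split(interactions):
    """Hold out each user's most recent interaction as test data.

    `interactions` is a list of (user, item, timestamp) tuples. Users with a
    single interaction stay in training, since holding it out would leave
    nothing to learn from for that user.
    """
    counts = Counter(user for user, _item, _ts in interactions)

    latest = {}  # user -> index of that user's most recent interaction
    for index, (user, _item, ts) in enumerate(interactions):
        if user not in latest or ts > interactions[latest[user]][2]:
            latest[user] = index

    held_out = {index for user, index in latest.items() if counts[user] > 1}
    train = [row for i, row in enumerate(interactions) if i not in held_out]
    test = [row for i, row in enumerate(interactions) if i in held_out]
    return train, test


interactions = [
    ("ann", "a", 1), ("ann", "b", 4), ("ann", "c", 9),
    ("bob", "a", 2), ("bob", "d", 3),
    ("cat", "b", 8),
]

train, test = last_one_out_split(interactions)
print(test)
```

The toy data also shows the leak: one user's held-out interaction happens at time 3, but the training set still contains other users' interactions from times 4 and 8.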
Another wrinkle to be aware of is that implicit feedback data (like clicks) can end up biasing evaluations (and models!) toward click-bait items that appear interesting on the surface but don't lead to deeper engagement. I'll mention two quick hacks for dealing with this:
- Apply a "depth of engagement" cutoff when determining what counts as a positive example. A raw threshold (e.g. watched for at least 5 minutes) can work, but if possible, consider defining the threshold as a fraction of the length of the content (e.g. watched at least half) to avoid inadvertently promoting longer content and penalizing shorter content.
- Weight the examples based on the depth of engagement, so that longer interactions count more and shorter interactions count less. (The same idea about interaction length relative to the length of the content applies here too.)
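To make the difference between the raw and relative versions concrete, here's a small sketch. The function names, the 5-minute and 50% thresholds, and the use of the watched fraction directly as an example weight are assumptions for illustration, not a recommendation for specific values.

```python
def engagement_fraction(watched_seconds, content_seconds):
    """Fraction of the content consumed, capped at 1.0 (rewatches don't count extra)."""
    if content_seconds <= 0:
        raise ValueError("content_seconds must be positive")
    return min(watched_seconds / content_seconds, 1.0)


def is_positive_raw(watched_seconds, min_seconds=300):
    """Raw cutoff: at least `min_seconds` watched, regardless of content length."""
    return watched_seconds >= min_seconds


def is_positive_relative(watched_seconds, content_seconds, min_fraction=0.5):
    """Relative cutoff: at least `min_fraction` of the content watched."""
    return engagement_fraction(watched_seconds, content_seconds) >= min_fraction


# (title, seconds watched, content length in seconds)
views = [
    ("two-hour movie, abandoned after 6 minutes", 360, 7200),
    ("three-minute clip, watched for 2 minutes", 120, 180),
]

for title, watched, length in views:
    weight = engagement_fraction(watched, length)  # usable as an example weight
    print(title, is_positive_raw(watched), is_positive_relative(watched, length), weight)
```

The two cutoffs disagree on both rows: the raw threshold counts the abandoned movie as a positive and discards the mostly-watched clip, while the relative threshold does the opposite.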
Taking all this into account, it's fair to say that there are no perfect answers for constructing training and test sets. For practical recommender system evaluation, the best thing to do is choose an approach that's feasible in your context and remember what the biases and drawbacks are when interpreting your evaluations.
Because recommender systems serve a variety of user and business goals, there is no best metric to use across the board, and you shouldn't expect a single metric to capture all there is to know about model (and system) performance. Nonetheless, some potential metrics have more desirable properties than others, and it's worth knowing what those are.
If you're going to read one paper on this topic, it should probably be "On the Robustness and Discriminative Power of Information Retrieval Metrics for Top-N Recommendation". (If you're going to read more than one, check out papers from the RecSys REVEAL workshops to go deeper.)
The authors evaluate a variety of metrics at different ranked list lengths on their resilience to sparsity and their ability to accurately reflect statistically significant differences in performance. They find that longer list lengths (>50) work better (even if the interface will display fewer items), and that two metrics are more robust and discriminative than others: Precision@k and NDCG@k.
Both metrics are measured on a scale of 0 to 1.0, and Precision@k seems like the simpler of the two at first. However, it has a drawback that creates some hidden complexity: at longer list lengths, many users won't have enough relevant items for the model to fill the first k positions with relevant items, so the maximum possible score isn't actually 1.0. It's not too hard to compute what the maximum is (for each user, divide the number of relevant items, capped at k, by the list length, then average across users), but it's another thing to do in order to make your metrics interpretable.
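The ceiling on Precision@k is a one-liner to compute. In this sketch, `max_precision_at_k` and the toy counts are assumptions for illustration; the cap at k is there because a user with more than k relevant items can still only fill k positions.

```python
def max_precision_at_k(relevant_counts, k):
    """Best achievable mean Precision@k, given each user's number of relevant items."""
    if not relevant_counts:
        raise ValueError("relevant_counts must not be empty")
    return sum(min(count, k) / k for count in relevant_counts) / len(relevant_counts)


# Three users with 1, 5, and 200 relevant items in the test set.
relevant_counts = [1, 5, 200]
print(max_precision_at_k(relevant_counts, k=100))
```

With these counts, even a perfect model tops out at roughly 0.35 for Precision@100, so a score of 0.2 is a lot better than it looks next to a nominal maximum of 1.0.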
In comparison, NDCG is...well, normalized...so it already takes the number of relevant items per user into account and the maximum score really is 1.0. It's also rank-aware, so placing relevant items higher on the list produces a better score, which is especially desirable with longer ranked lists. Although the NDCG computation is a bit more complex, there are off-the-shelf packages that will do the computation, and it ends up being a bit easier to interpret.
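If you'd like to see what's inside the box before reaching for a package, here's a minimal sketch of NDCG@k for a single user. It assumes binary relevance (an item is either relevant or not), the common log2 rank discount, and a ranked list without duplicates; the function names are made up for the example.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k for one user with binary relevance.

    The ideal ranking puts every relevant item (up to k of them) at the top,
    which is what normalizes for the number of relevant items per user.
    """
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:k]]
    ideal_gains = [1.0] * min(len(relevant_items), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items: treated as zero in this sketch
    return dcg_at_k(gains, k) / ideal_dcg


relevant = {"a", "b"}
print(ndcg_at_k(["a", "b", "c", "d"], relevant, k=4))  # relevant items first
print(ndcg_at_k(["c", "d", "a", "b"], relevant, k=4))  # relevant items last
```

Both lists contain the same two relevant items, so Precision@4 can't tell them apart, but NDCG@4 scores the first list higher because the relevant items sit at the top. To report a single number, average the per-user scores.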
So: in the beginning, you should probably use NDCG. As the system architecture evolves, there are reasons for particular parts to specialize in optimizing different metrics (e.g. recall for candidate selection), but NDCG is a good place to start.
Once a recommender system is up and running, you'll be able to compare metrics for new models to previous models. In the beginning though, there won't yet be previous models, so you'll need some other baselines to compare against. Two fairly easy-to-implement baselines are random and popularity.
The random baseline selects items uniformly at random. It's a terrible recommender that models should beat by a wide margin—if not, the model isn't learning anything useful, and you should go back to look for issues in the model, features, or training process.
The popularity baseline selects items according to the number of interactions with them in the dataset. Popularity is a more difficult baseline to beat, but it's still possible with classic models like matrix factorization, an appropriate choice of loss function, and some tuning. Don't despair if you don't get it right away—there's a reason it has been known as a tough baseline to beat. (Once your model is beating the "popularity by total number of interactions" strategy, there are some time-based improvements you can make to get an even stronger popularity baseline.)
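Both baselines fit in a few lines. This is a minimal sketch under some assumptions: training data as `(user, item)` pairs, the same list recommended to every user, and no filtering of items a user has already interacted with. The function names are hypothetical.

```python
import random
from collections import Counter


def popularity_baseline(train, k):
    """Top-k items by total number of interactions in the training data."""
    counts = Counter(item for _user, item in train)
    return [item for item, _count in counts.most_common(k)]


def random_baseline(catalog, k, seed=0):
    """k distinct items drawn uniformly at random from the catalog."""
    rng = random.Random(seed)
    items = sorted(catalog)  # sort so that a given seed is reproducible
    return rng.sample(items, min(k, len(items)))


train = [
    ("ann", "a"), ("bob", "a"), ("cat", "a"),
    ("ann", "b"), ("bob", "b"),
    ("cat", "c"),
]
catalog = {"a", "b", "c", "d", "e", "f"}

print(popularity_baseline(train, k=2))
print(random_baseline(catalog, k=2))
```

Score these with the same split and metric you use for your real models so the comparison is apples to apples.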
There are various other relatively simple models that make good baselines too. You might implement a few different methods while you're trying to choose a general approach, but there's no need to go crazy and try everything before committing. As you eventually move beyond simple models, keep them around as baselines for your later efforts and don't forget to keep tuning them.
Suppose I have two models with NDCG@100 scores that look like this:
- Model A: 0.21
- Model B: 0.18
Which model is better? Pretty easy, it's A, right? Okay, how about if the metrics look like this:
- Model A: 0.21
- Model B: 0.20
Still A by a little bit, right? Okay, how about these:
- Model A: 0.21
- Model B: 0.21
Hmm, hard to tell. Could be either one, yeah?
This is where experience with other machine learning domains may lead you astray, because with recommenders, the most accurate model isn't necessarily the best. The answer in every case above is "not enough information has been provided to answer the question." There are several traps here:
- Offline evaluation metrics computed from historical data tend to be biased toward models that make recommendations similar to those used in the data collection process.
- Offline metrics tend to be biased toward popular or trendy items.
- Offline evaluation metrics are only loosely correlated with online performance metrics measured when real users are exposed to the recommendations.
- It matters how the model's performance is distributed across different sub-groups of the user base and item catalog.
- Accuracy is only one of the many qualities that make effective recommendations.
There are ways to mitigate these issues, but the most important thing when you're getting started is just to know that they exist. As a result, you can use metrics like NDCG@k as a "first cut" evaluation to narrow down which models are likely to be among the best, but you can't be entirely confident that the model that scores highest on an offline metric is necessarily the model you should deploy.
Beyond holding offline accuracy metrics lightly, you should also be prepared to trade off some accuracy for other qualities that improve the user experience or further business interests. For example, you might care about coverage, the percentage of the item catalog that is actually recommended. Returning to the baselines from above, the popularity baseline is fairly accurate (on average) but has terrible coverage, while the random baseline has excellent coverage but is wildly inaccurate. An ideal recommender would have both, achieving high accuracy and coverage by finding the right audience for each item and the right items for each audience. In practice though, there's going to be a trade-off between the two, resulting in a possibility frontier along which you can choose a point that best meets the needs of your particular product.
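Coverage, as defined above, is straightforward to measure. In this sketch, `catalog_coverage`, the dictionary of per-user recommendation lists, and the toy data are assumptions for illustration.

```python
def catalog_coverage(recommendations, catalog):
    """Fraction of the catalog that appears in at least one user's recommendations.

    `recommendations` maps each user to their list of recommended items.
    """
    if not catalog:
        raise ValueError("catalog must not be empty")
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)


catalog = {"a", "b", "c", "d", "e", "f"}

# A popularity-style recommender gives everyone the same list...
same_for_everyone = {"ann": ["a", "b"], "bob": ["a", "b"], "cat": ["a", "b"]}
# ...while a personalized one spreads recommendations across the catalog.
personalized = {"ann": ["a", "c"], "bob": ["b", "d"], "cat": ["e", "a"]}

print(catalog_coverage(same_for_everyone, catalog))
print(catalog_coverage(personalized, catalog))
```

Computing this alongside NDCG for each candidate model gives you the two axes of the trade-off, so you can see where each model sits relative to the baselines.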
Do your best to build sensible offline evaluation metrics, but don't go overboard with how much faith you put in them. They're helpful for getting pointed in the right general direction, but you won't know how a recommender really performs until you put the recommendations in front of real users.
Instead of blindly shipping whichever model has the highest offline accuracy metrics, here are a few steps you can take:
- Take some time to think about what other qualities of the recommendations might be important to measure in your particular domain or business context, and find or craft metrics that reflect them.
- Test several of the best performing models at a time and compare the results of offline and online evaluation. This allows you to evaluate how closely correlated they are, which provides a sense of how much trust to place in your offline metrics.
- Track what's being recommended in addition to capturing aggregated performance metrics, so that you can analyze what your recommender is actually doing. This provides a good foundation for discovering unexpected behavior and motivating further iterations on your approach.
- Segment your online and offline metrics by available user and item attributes to see how performance is distributed (e.g. frequent users vs. occasional users, dramas vs. comedies). This helps you to identify underserved groups and make relevant improvements for them.
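The segmentation step in particular is cheap to do once you have per-user scores. Here's a minimal sketch, assuming you already have a per-user metric (say, NDCG@100) and a user-to-segment mapping; the function name, the segment labels, and the numbers are all made up for the example.

```python
from collections import defaultdict


def mean_metric_by_segment(per_user_scores, user_segments):
    """Average a per-user metric within each user segment.

    Users missing from `user_segments` are grouped under "unknown".
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, score in per_user_scores.items():
        segment = user_segments.get(user, "unknown")
        totals[segment] += score
        counts[segment] += 1
    return {segment: totals[segment] / counts[segment] for segment in totals}


# Hypothetical per-user scores for one model.
per_user_scores = {"ann": 0.4, "bob": 0.2, "cat": 0.1, "dan": 0.1}
user_segments = {"ann": "frequent", "bob": "frequent", "cat": "occasional", "dan": "occasional"}

overall = sum(per_user_scores.values()) / len(per_user_scores)
print(overall)
print(mean_metric_by_segment(per_user_scores, user_segments))
```

In this made-up example the overall average hides a threefold gap between frequent and occasional users, which is the kind of thing an aggregate metric won't show you.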
Go forth and evaluate models!