SHOPPER: A Probabilistic Model of Consumer Choice with Substitutes and Complements
Imagine that you are the manager of a grocery store. Every day you collect valuable data from your customers, for example via loyalty cards. How can you exploit this data to make better business decisions? In this post, we will describe SHOPPER [1], a model that analyzes basket data to extract valuable insights about both products and consumer behavior.
The Model
In Machine Learning, we can often distinguish two levels of abstraction: modeling and optimizing. In the former, a human introduces some unknown parameters and specifies how they relate to the data. In the latter, we pick an optimization algorithm so that the computer can infer the optimal parameters (according to some metric). In this post, we will focus on the modeling phase only, and we refer to the paper for the details about inference.
SHOPPER assumes that we observe data in the following form:

{(y_t, u_t, w_t, r_t)}, for t = 1, …, T,

where, for each shopping trip t in the dataset, y_t is a vector containing the purchased items, u_t is the ID of the customer, w_t is the week of the purchase, and r_t is a vector with the prices of the items in the basket.
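To make this concrete, here is a minimal sketch (in Python, with illustrative field names that are not from the paper) of how one observation of this kind could be represented:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Trip:
    """One observed shopping trip, i.e. one element (y, u, w, r) of the dataset."""
    items: List[str]          # y: the purchased items
    customer_id: str          # u: ID of the customer (e.g. from the loyalty card)
    week: int                 # w: calendar week of the purchase
    prices: Dict[str, float]  # r: prices of the items during that week

# A toy dataset with two trips.
dataset = [
    Trip(items=["hot dogs", "hot dog buns", "mustard"],
         customer_id="customer_42", week=27,
         prices={"hot dogs": 3.49, "hot dog buns": 1.99, "mustard": 2.29}),
    Trip(items=["turkey", "cranberry sauce"],
         customer_id="customer_7", week=47,
         prices={"turkey": 18.90, "cranberry sauce": 2.79}),
]
```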
The main goal of SHOPPER is modeling the available data in order to answer counterfactual queries, i.e. queries such as “What happens if I discount product X?” or “What happens if I permanently change the price of product Y?”. To achieve this goal, the authors of the paper need to estimate the likelihood of each possible basket. If you are not familiar with the concept of likelihood, think of it as the probability that a certain customer buys a certain basket, given some additional information (the customer ID, the week of the year, and the current prices).
A more careful analysis suggests that estimating this likelihood is a difficult task: the number of possible baskets grows exponentially with the number of available items (with just 1,000 distinct items there are already 2^1000 possible baskets), and we have to estimate the likelihood of each of them starting from a relatively small set of example purchases. Note that this does not mean that our dataset is small in absolute terms, but for a large number of available items it will be much smaller than the number of possible baskets. For this reason, the authors of the paper propose a heuristic that, within certain limits, reflects the behavior of a customer who walks into our grocery store.
SHOPPER assumes that a customer enters the grocery store and, more or less unconsciously, assigns a utility value to each available product. The item with the highest utility is picked and added to the basket. The customer then reassigns the utilities to the products, conditioning on this first choice, and again adds the product with the highest utility to the basket. This process of reassigning utilities conditioned on the elements in the basket and picking the best product is applied iteratively, until the customer decides that paying and leaving the store is the best choice. Of course, this is just a heuristic: it certainly does not reflect human behavior perfectly, but it still gives satisfying performance.
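To see the shape of this heuristic, here is a deterministic sketch of the shopping loop. The utility function and the stopping rule are toy placeholders of mine (in SHOPPER the utilities are learned from data and include a random component); the point is only the sequential “pick the best item, then recondition” structure:

```python
def simulate_basket(available_items, utility, stop_utility=0.0, max_items=50):
    """Greedy sketch of the shopping heuristic described above.

    `utility(item, basket)` scores an item given what is already in the basket;
    the customer keeps adding the best item until every remaining item scores
    below `stop_utility`, i.e. paying and leaving becomes the best choice.
    """
    basket = []
    remaining = set(available_items)
    while remaining and len(basket) < max_items:
        # Re-score every remaining item, conditioning on the current basket.
        best_item = max(remaining, key=lambda item: utility(item, basket))
        if utility(best_item, basket) < stop_utility:
            break
        basket.append(best_item)
        remaining.remove(best_item)
    return basket

# Toy utility: buns become attractive only once hot dogs are in the basket.
def toy_utility(item, basket):
    base = {"hot dogs": 1.0, "hot dog buns": -0.5, "milk": 0.2}[item]
    if item == "hot dog buns" and "hot dogs" in basket:
        base += 1.5
    return base

print(simulate_basket(["hot dogs", "hot dog buns", "milk"], toy_utility))
# -> ['hot dogs', 'hot dog buns', 'milk']
```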
At this point, we still have to describe the core of SHOPPER: how does a customer assign a utility to each of the available items, given the set of items that are already in the basket? The model assumes that the consumer’s behavior has a non-deterministic part; in other words, we will use tools from probability theory. The probability that the customer chooses item c as the i-th element of the basket, given the items y_1, …, y_{i−1} already in the basket, is:

p(y_i = c | y_1, …, y_{i−1}, u, w, r) ∝ exp{ λ_c + θ_u·α_c − (γ_u·β_c)·log(r_c) + δ_w·μ_c + ρ_c·( (1/(i−1)) Σ_{j<i} α_{y_j} ) }
Wait, wait, wait: that’s a lot! Let’s digest it together, one piece at a time.
Given the elements that we have already chosen, the probability that we will add c as the next item (i.e. the probability that c will have the highest utility among all available items) is proportional to the exponential of a sum of several terms. Let’s look at the interpretation of each term; a small code sketch of the whole scoring rule follows the list.
- λ models the item’s popularity: if an item is more popular, we assume it has a higher chance of being chosen.
- θ is a latent vector of the customer that we are considering: each dimension describes “something” about their preferences. Similarly, α is a latent vector of item c. The scalar product θ_u·α_c describes the preferences of customer u regarding item c: a large scalar product suggests that the customer likes this particular item, a small one suggests that they do not.
- γ and β are latent parameters that model the price sensitivity of the customer with respect to product c. Of course, we also need to introduce the price of item c; it enters through a logarithm, and the whole term is subtracted, so increasing the price never increases the probability of buying c. This last assumption is somewhat open to debate, since for luxury products it may not always hold.
- δ is a latent vector of the week in which the shopping takes place, while μ accounts for the seasonal popularity of item c. This scalar product is used to model seasonal effects: for example, for a Christmas dessert we expect this term to be large in December and low in the middle of the summer.
- The last scalar product is used to capture complements and substitutes. Since this is stated in the title of the paper, we can guess that it is a crucial part of the model. Here we introduce an additional vector ρ of interaction coefficients for item c, which is multiplied with the arithmetic mean of the latent representations of the elements already in the basket. When this scalar product is large, we say that c is a complement of what is already in the basket; when it is small, the basket already contains some substitutes of c.
Note that the previous formula relies on a trick called matrix factorization: customers, items, and weeks are represented by low-dimensional latent vectors instead of huge (and sparse) matrices of pairwise interactions. In particular, this allows the model to scale to larger datasets.
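Putting the pieces together, here is a minimal NumPy sketch of the scoring rule described above. The parameter dictionary, its keys, and all shapes are illustrative assumptions of mine; in SHOPPER these latent vectors are learned from the data rather than set by hand:

```python
import numpy as np

def item_probabilities(basket, candidate_items, params, customer, week, prices):
    """Probability of each candidate item being picked next, given the basket.

    score(c) = lambda_c + theta_u . alpha_c
               - (gamma_u . beta_c) * log(price_c)
               + delta_w . mu_c
               + rho_c . mean(alpha of the items already in the basket)
    The probabilities are a softmax of these scores over the candidate items.
    """
    lam, alpha, theta = params["lambda"], params["alpha"], params["theta"]
    gamma, beta = params["gamma"], params["beta"]
    delta, mu, rho = params["delta"], params["mu"], params["rho"]

    if basket:
        basket_mean = np.mean([alpha[i] for i in basket], axis=0)
    else:  # empty basket: simply drop the interaction term
        basket_mean = np.zeros_like(next(iter(alpha.values())))

    scores = np.array([
        lam[c]
        + theta[customer] @ alpha[c]
        - (gamma[customer] @ beta[c]) * np.log(prices[c])
        + delta[week] @ mu[c]
        + rho[c] @ basket_mean
        for c in candidate_items
    ])
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    return dict(zip(candidate_items, probs / probs.sum()))
```

In the paper the average is taken over the items chosen so far, y_1, …, y_{i−1}; the empty-basket case above simply drops the interaction term.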
Now that we have decided how to specify the probability of item c given the elements already in the basket, we just need to find the optimal set of parameters Θ := {λ, α, θ, γ, β, δ, μ, ρ} according to some metric. For simplicity, let’s consider the likelihood of the dataset. This means that we need an algorithm that, using the observations in the training data, solves the following problem:

Θ* = argmax_Θ Σ_t log p(y_t | u_t, w_t, r_t, Θ)
Optimizing the formula above is not easy, but the authors of the paper propose an efficient algorithm (based on variational inference, in case you like fancy terms) that solves the task. We will omit the details here and refer the curious reader to the supplementary material of the paper.
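For intuition only, here is a naive sketch of that objective, reusing the hypothetical `item_probabilities` function and the `Trip` records sketched above. The likelihood of a basket is factorized over the sequence of choices, and the “checkout” decision and all the tricks that make inference tractable are ignored:

```python
import numpy as np

def dataset_log_likelihood(dataset, params, all_items):
    """Sum over trips of log p(basket | customer, week, prices, parameters)."""
    total = 0.0
    for trip in dataset:
        basket = []
        for item in trip.items:  # one sequential choice per purchased item
            candidates = [c for c in all_items if c not in basket]
            # Assumes trip.prices contains a price for every candidate item.
            probs = item_probabilities(basket, candidates, params,
                                       trip.customer_id, trip.week, trip.prices)
            total += np.log(probs[item])
            basket.append(item)
    return total
```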
Sample of Results
In the previous section, we described SHOPPER. Now we want to discuss the quality of the set of parameters {λ, α, θ, γ, β, δ, μ, ρ} returned by the proposed algorithm. We mention two types of results: the predictive performance, measured by the likelihood of held-out test data, and the interpretation of the latent parameters.
The following table compares the log-likelihood of test data obtained with SHOPPER to the one obtained by other models (including Hierarchical Poisson Factorization, the previous state of the art). Observe that a higher value implies better performance (a log-likelihood of zero would be a perfect score).
We observe that SHOPPER performs better than the previous models, in particular on baskets containing items whose prices differ from their average. Since we want a model that is able to answer counterfactual queries, performing well on “unusual” prices is crucial. This is one of the main properties that make SHOPPER shine.
The paper also studies the interpretation of the latent factors found by the algorithm. This is particularly interesting because these vectors are informative: they provide insights about both products and customers. Note that this is a property that is not common to every Machine Learning model. For example, neural networks, the models at the heart of Deep Learning, provide state-of-the-art performance on many tasks but are in general difficult to interpret. In this context, interpretability means, for example, that customers with similar latent representations θ will also have similar preferences. The same happens with products with similar values of α, ρ, and μ. Here we say that two vectors are similar if their distance (e.g. the Euclidean distance) is lower than a certain threshold.

Another interesting interpretation comes from looking at the scalar products between ρ and α: the authors propose metrics for deciding whether two products are complements or substitutes by just looking at their values of these parameters. The paper includes several experiments showing the high quality of the parameters found on a particular dataset. In the remainder of this post, we will assume that we have parameters of high quality, and we discuss how to use them to make business decisions.
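As a rough illustration of how one could probe the learned vectors, here is a small sketch. The specific scores below are simplified assumptions of mine, not the exact metrics proposed in the paper:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two latent vectors, e.g. to find 'similar' items."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def interaction_score(item_a, item_b, alpha, rho):
    """Symmetrized interaction between two items based on rho and alpha.

    Large positive values suggest complements (each item raises the other's
    utility once it is in the basket); negative values suggest substitutes.
    """
    return float(rho[item_a] @ alpha[item_b] + rho[item_b] @ alpha[item_a])

# Toy latent vectors, purely for illustration.
alpha = {"hot dogs": np.array([1.0, 0.2]),
         "hot dog buns": np.array([0.9, 0.1]),
         "veggie sausages": np.array([-0.8, 0.3])}
rho = {"hot dogs": np.array([0.8, 0.0]),
       "hot dog buns": np.array([0.7, 0.1]),
       "veggie sausages": np.array([-0.5, 0.2])}

print(cosine_similarity(alpha["hot dogs"], alpha["hot dog buns"]))   # similar items
print(interaction_score("hot dogs", "hot dog buns", alpha, rho))     # > 0: complements
print(interaction_score("hot dogs", "veggie sausages", alpha, rho))  # < 0: substitutes
```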
How to Apply the Model for Business Decisions
Before we dive into this last section, let’s wrap up what we have discussed so far.
- We have presented SHOPPER. In particular, we have discussed a heuristic where each customer walks into the store and repeatedly adds the highest-utility item to their basket, conditioning on what they have already decided to buy.
- We have mentioned that it is possible to find the best parameters of the model. Here, “best” is relative to some metric (above we talked about the likelihood; truth be told, the goal is to maximize the so-called MAP estimate, but I hid the details for the convenience of the non-technical reader). We haven’t discussed how the parameters can be estimated, but the paper proposes an efficient algorithm that scales to large datasets.
- We have seen that the parameters estimated from the data are informative. They provide many interesting insights about our data, such as which products are complements or substitutes, what role seasonal effects play, which items are more or less popular, which items are most similar to each other, and what the preferences of the customers are (including their price sensitivity to different products).
That’s very nice, but now the question is… what can we do with these parameters? How do they help us make business decisions? The paper only scratches the surface in answering these questions, but since I think they are very interesting (after all, Machine Learning is only a tool for solving real-world problems), I will share my personal considerations.
- The main goal of SHOPPER is answering counterfactual questions: what happens if…? This goal is achieved: once we have a good estimate of the parameters, we can easily change the price of some products and see how the probability of observing a certain basket changes (a small sketch of such a price counterfactual follows this list). However, it is not clear whether we can answer questions like “What is the best price for product X?” or “When should I discount product Y, and by how much?”. Of course, there are alternative models for determining prices (e.g. models that analyze demand and supply curves), so the fact that SHOPPER answers counterfactuals is already a success.
- The paper shows how one can interpret the latent parameters. However, the insights the authors gained from their dataset are, in my opinion, not too exciting. For instance, they showed that hot dogs and hot dog buns are complements, or that turkey is more popular near Thanksgiving. Getting this kind of insight would be more interesting in other contexts, for example in a huge online store (this is at the heart of recommender systems). However, it is not clear whether the heuristic used by SHOPPER generalizes to such datasets.
- It could be beneficial to find out whether there are complementary products that are physically located far away from each other in the store. Complements should generally be located close to each other, so by discovering such insights the manager could reorganize part of the store in a way that allows customers to easily find products that they might want to co-purchase. Note, however, that by doing this the (unobserved) true underlying probability distribution of the baskets would change, and hence one would need to collect a new dataset in order to properly repeat the analysis with Machine Learning tools. For this reason, we would move towards a reinforcement learning approach, where one learns which actions should be taken in order to maximize a reward function.
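To make the first point of this list concrete, here is a minimal sketch of a price counterfactual, again reusing the hypothetical `item_probabilities` function from the modeling section: we score the same choice situation twice, once with the observed prices and once with a discounted price, and compare the resulting probabilities.

```python
def price_counterfactual(item, discount, basket, candidates, params,
                         customer, week, prices):
    """Next-item probability of `item` before and after a price discount."""
    discounted = dict(prices)
    discounted[item] = prices[item] * (1.0 - discount)  # e.g. discount=0.2 is 20% off
    before = item_probabilities(basket, candidates, params, customer, week, prices)
    after = item_probabilities(basket, candidates, params, customer, week, discounted)
    return before[item], after[item]

# Example (with fitted `params`, a candidate list and a full price dictionary at hand):
# how does a 20% discount on hot dog buns change their choice probability for
# customer_42 in week 27, given that hot dogs are already in the basket?
# p_before, p_after = price_counterfactual("hot dog buns", 0.20, ["hot dogs"],
#                                          candidates, params, "customer_42", 27, prices)
```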
Conclusion
SHOPPER is a success story of collaboration between different scientific communities: the power of Machine Learning at the service of econometrics. The paper elegantly describes an interpretable model that scales to large datasets and leads to high-quality results. Using the model to make business decisions still poses some challenges, but analyzing counterfactuals remains a promising line of research.
References
[1] Francisco J. R. Ruiz, Susan Athey, David M. Blei. SHOPPER: A Probabilistic Model of Consumer Choice with Substitutes and Complements. The Annals of Applied Statistics, 2020.