What attributes have the highest influence on a person’s choice to respond to a certain Starbucks offer?
Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Moreover, not all users receive the same offer.
The task is to determine which demographic groups respond best to which offer type. The dataset we use is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days.
We are given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
We have to keep in mind that someone using the app might make a purchase through the app without having received or viewed an offer. Therefore, it is critical to do a good job preprocessing the data.
Imagine a customer who is going to make a 10 dollar purchase anyway, with or without an offer. From a business perspective, you wouldn’t want to send a “buy 10 dollars get 2 dollars off” offer to this specific customer. Therefore, the question is:
What attributes have the highest influence on a person’s choice to respond to a certain Starbucks offer?
We are going to build a machine learning model to predict if a person responds to a certain offer. The performance of the model will be measured using the f1-score. This metric seems appropriate since it combines precision and recall into a single number, which suits this binary classification problem.
The data is contained in three files:
- portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
- profile.json — demographic data for each customer
- transcript.json — records for transactions, offers received, offers viewed, and offers completed
Here is the schema and explanation of each variable in the files:

portfolio.json
- id (string) — offer id
- offer_type (string) — type of offer, i.e. BOGO, discount, informational
- difficulty (int) — minimum required spend to complete an offer
- reward (int) — reward given for completing an offer
- duration (int) — time for offer to be open, in days
- channels (list of strings)

profile.json
- age (int) — age of the customer
- became_member_on (int) — date when customer created an app account
- gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
- id (str) — customer id
- income (float) — customer’s income

transcript.json
- event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
- person (str) — customer id
- time (int) — time in hours since start of test; the data begins at time t=0
- value (dict of strings) — either an offer id or transaction amount depending on the record
In order to analyse the data, we first need to understand and clean the data. We take a look at the 3 different datasets that were read in and converted to pandas dataframes (short: df).
1. Portfolio (10 rows, 6 columns)
- There are no missing values and the data types are correct
- We want to convert the duration from days to hours (because the time stamps in the transaction dataset are also in hours)
- We also want to use a one-hot encoding for the channels column (which is better for a future machine learning model).
Doing so yields the cleaned portfolio dataframe.
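The two cleaning steps above can be sketched in pandas as follows (toy data and illustrative code, not the original implementation):

```python
import pandas as pd

# Toy portfolio rows mirroring the schema (id, offer_type, difficulty, reward, duration, channels)
portfolio = pd.DataFrame({
    'id': ['o1', 'o2'],
    'offer_type': ['bogo', 'discount'],
    'difficulty': [10, 7],
    'reward': [10, 3],
    'duration': [5, 7],  # in days
    'channels': [['email', 'mobile'], ['email', 'web', 'social']],
})

# 1. Convert duration from days to hours to match the transcript timestamps
portfolio['duration'] = portfolio['duration'] * 24

# 2. One-hot encode the channels column (one indicator column per channel)
channel_dummies = (portfolio['channels'].str.join('|')
                                        .str.get_dummies()
                                        .add_prefix('channel_'))
portfolio = pd.concat([portfolio.drop(columns='channels'), channel_dummies], axis=1)

print(portfolio[['id', 'duration', 'channel_email', 'channel_web']])
```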
2. Profile (17,000 rows, 5 columns)
- Taking a closer look at the data we notice “age=118” always corresponds to “gender=None” and “income=NaN”. We want to remove those rows as they are not useful.
- We also want to transform the “became_member_on” column to values that live on a ratio scale. We choose to count the days of membership, where the 0 of the scale corresponds to the most recent sign-up in the dataset.
- Eventually, we use a one-hot encoding for the gender column
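These profile cleaning steps might look roughly like this (toy data; the code is a sketch, not the original implementation):

```python
import numpy as np
import pandas as pd

# Toy profile rows; age=118 marks the rows with missing demographics
profile = pd.DataFrame({
    'id': ['c1', 'c2', 'c3'],
    'age': [35, 118, 60],
    'gender': ['F', None, 'M'],
    'income': [72000.0, np.nan, 55000.0],
    'became_member_on': [20170512, 20160101, 20180726],
})

# 1. Drop rows without demographic information (age=118 <=> gender=None <=> income=NaN)
profile = profile[profile['age'] != 118].copy()

# 2. Days of membership: 0 corresponds to the most recent sign-up in the dataset
member_date = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')
profile['days_of_membership'] = (member_date.max() - member_date).dt.days
profile = profile.drop(columns='became_member_on')

# 3. One-hot encode gender (M, F, O)
profile = pd.get_dummies(profile, columns=['gender'], prefix='gender')
```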
Applying all these steps yields
The “days_of_membership” column has the following distribution
3. Transcript (306534 rows, 4 columns)
- The value column can contain an “offer_id”, an “amount” or a “reward”. We are not interested in the “amount” (only “responded to offer” = yes/no is of interest for us) and the “reward” information is already contained in the portfolio df, so we drop both and keep only the “offer_id”.
Doing so and ordering the observations by “person” and “time” yields
We notice there are still some missing values in the “offer_id” column which always appear for “event=transaction”. However, we do not care for these observations because our focus will be on “event=offer_completed” instead of “event=transaction”.
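Unpacking the value dict into an offer_id column can be sketched as below. One assumption here: in the original dataset the dict key spelling differs by event (“offer id” for received/viewed, “offer_id” for completed), so both spellings are coalesced.

```python
import pandas as pd

# Toy transcript rows; the value column holds a dict whose key depends on the event
transcript = pd.DataFrame({
    'person': ['c1', 'c1', 'c1', 'c2'],
    'event': ['offer received', 'offer viewed', 'transaction', 'offer completed'],
    'time': [0, 6, 12, 30],
    'value': [{'offer id': 'o1'}, {'offer id': 'o1'},
              {'amount': 9.5}, {'offer_id': 'o2', 'reward': 3}],
})

# Coalesce the two key spellings into a single offer_id column;
# transactions have neither key and end up with a missing offer_id
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))
transcript = transcript.drop(columns='value')

# Order by person and time to reconstruct each customer's offer timeline
transcript = transcript.sort_values(['person', 'time']).reset_index(drop=True)
```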
We consider the following groups of people
Group 1: people that are influenced by the offer
- offer received → offer viewed → offer completed
Group 2: people that are NOT influenced by the offer
- offer received → offer viewed
Group 3: people that receive an offer but take no action
- offer received
Group 4: people that buy products regardless of an offer
- offer received → offer completed ( → offer viewed)
Our focus lies on group 1 (responded to the offer) and group 2 (does not respond to the offer). Group 3 doesn’t yield much information. Group 4 buys products regardless of an offer which means they should not receive offers (from a business perspective).
Therefore, in the next step we want to assign labels to the corresponding person-offer combinations if they belong to group 1 or group 2. The other groups are omitted.
Merge all 3 df into 1 single df
First of all, we merge all 3 mentioned df’s into 1 single df containing all the information. Because this df has many columns, I had to split it into 2 images here. Now we have an impression of what this df looks like.
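The merge can be sketched as two left joins onto the transcript records (toy stand-ins below; the real dataframes have more columns):

```python
import pandas as pd

# Minimal cleaned stand-ins for the three dataframes
transcript = pd.DataFrame({'person': ['c1', 'c1'],
                           'event': ['offer received', 'offer viewed'],
                           'time': [0, 6], 'offer_id': ['o1', 'o1']})
portfolio = pd.DataFrame({'id': ['o1'], 'offer_type': ['bogo'], 'duration': [120]})
profile = pd.DataFrame({'id': ['c1'], 'income': [72000.0], 'days_of_membership': [440]})

# Left-join the offer metadata onto each transcript record, then the demographics
df = (transcript
      .merge(portfolio.rename(columns={'id': 'offer_id'}), on='offer_id', how='left')
      .merge(profile.rename(columns={'id': 'person'}), on='person', how='left'))
```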
Assign labels to group1 and group 2
Assigning the labels needs a few steps and includes a lot of different “groupby” and “merge” commands.
1. Mark observations where the offer was completed after it was viewed. This is important because having seen the offer is what distinguishes group 1 from group 4. An example is shown just below (again in 2 separate images due to the many columns).
2. Mark observations where the offer was only received and viewed but not completed (group 2).
3. Group the observations by “person” and “offer_id” and add a label if a person is in group 1 or not.
However, this is not completely correct yet because observations should only count towards group 1 if they completed the offer within the given duration of the offer. Otherwise we can assume they do not care for the offer and should belong to group 2!
4. Find time difference between “offer received” and “offer completed” of the corresponding offer for each person and add a column “time_diff_rec_compl”.
5. Add a column that contains a value of 1 if the time difference between “offer received” and “offer completed” is less than or equal to the duration of the offer, and a 0 otherwise.
6. Add a column “responded_to_offer” that is 1 if an offer was completed after it was viewed and within the duration of the offer, and 0 otherwise. This column will later be our response variable. But first we need to do one last step.
7. Group by “person” and “offer_id” to get a response (responded_to_offer) only once per person and offer_id.
In the table above, each person-offer_id combination only appears once (either with a positive or negative response).
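A condensed sketch of this labeling logic on toy data. The actual implementation uses several separate groupby and merge steps as described above; the function and column names here are illustrative only.

```python
import pandas as pd

# Toy merged data: one person-offer timeline per group
df = pd.DataFrame({
    'person':   ['c1', 'c1', 'c1', 'c2', 'c2'],
    'offer_id': ['o1', 'o1', 'o1', 'o1', 'o1'],
    'event':    ['offer received', 'offer viewed', 'offer completed',
                 'offer received', 'offer completed'],
    'time':     [0, 6, 30, 0, 30],
    'duration': [120, 120, 120, 120, 120],  # offer validity in hours
})

def label_group(g):
    received = g.loc[g['event'] == 'offer received', 'time'].min()
    viewed = g.loc[g['event'] == 'offer viewed', 'time'].min()
    completed = g.loc[g['event'] == 'offer completed', 'time'].min()
    duration = g['duration'].iloc[0]
    # Group 1: viewed before completing, and completed within the validity period
    responded = (pd.notna(viewed) and pd.notna(completed)
                 and viewed <= completed
                 and completed - received <= duration)
    if responded:
        return 1
    # Group 2: viewed but not completed (in time)
    if pd.notna(viewed):
        return 0
    return None  # groups 3 and 4 are dropped

# One response per person-offer_id combination
labels = (df.groupby(['person', 'offer_id'])[['event', 'time', 'duration']]
            .apply(label_group)
            .dropna()
            .rename('responded_to_offer')
            .reset_index())
```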
This df that we obtain now has 31782 rows and 17 columns. We split the data according to the different offer types (bogo and discount); “offer_type=informational” is removed. We also remove the column “channel_email” because it always contains a value of 1, so this information is useless.
Split data according to “offer_type”
We split the data into “offer_type=bogo” and “offer_type=discount” and create a classification model for each of the 2 datasets.
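Dropping the constant column and splitting by offer type can be sketched as (toy dataframe; the real one has 17 columns):

```python
import pandas as pd

# Toy labeled dataframe
df = pd.DataFrame({
    'offer_type': ['bogo', 'discount', 'informational', 'bogo'],
    'channel_email': [1, 1, 1, 1],
    'responded_to_offer': [1, 0, 0, 1],
})

# channel_email is constant (always 1), so it carries no information
df = df.drop(columns='channel_email')

# Keep only bogo and discount offers and build one dataset per type
df_bogo = df[df['offer_type'] == 'bogo'].drop(columns='offer_type')
df_discount = df[df['offer_type'] == 'discount'].drop(columns='offer_type')
```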
Take a look at the distribution of some attributes
Now take a look at the distribution of the genders and the response variable “responded_to_offer”.
The same is done for the discount data. The 3 distribution plots for the discount data look very similar to the bogo data, so we do not show them here again.
Each dataset contains roughly half the observations. In the bogo dataset the response is almost equally distributed (58% to 42%), whereas in the discount dataset it is more unevenly distributed (74% to 26%). Therefore, we might expect both models to better predict a positive response (responded_to_offer=1). “gender=M” and “gender=F” are almost evenly distributed; “gender=O” only appears a few times.
Machine learning model
Now we have finally prepared the data such that we can train a machine learning model and make predictions. We decide to use a Random Forest Classifier (in combination with the f1-score as a performance metric) because I have had good experiences with it and it seems appropriate for classifying this binary-response data. We use the standard steps in machine learning:
- Split data into training and test set
- Scale data (I used a MinMaxScaler)
- Initialize the classifier
- Train the classifier using the training data
- Predict values with the test data
- Score predictions (using f1-score)
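The steps above can be sketched as follows. The feature matrix here is a synthetic stand-in, since the real data is not reproduced in this post:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and responded_to_offer labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 1. Split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 2. Scale the data (fit the scaler on the training data only to avoid leakage)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3.-5. Initialize the classifier, train it, predict on the test data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# 6. Score the predictions using the f1-score
score = f1_score(y_test, y_pred)
print(f'f1-score: {score:.2f}')
```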
Model for bogo data
Using the standard parameters, we achieve the following result for the bogo data
We are not satisfied with the result and want to improve it using a GridSearch over the following parameters
Using the best parameters from the GridSearch, we can improve the model to an f1-score of 0.72 (which is 0.04 better than above). For the positive response we even get a score of 0.78.
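A GridSearchCV sketch on synthetic stand-in data; the parameter grid below is a hypothetical example, not necessarily the grid actually searched in the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] > 0).astype(int)

# Hypothetical parameter grid (an assumption for illustration)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
}

# Cross-validated grid search optimizing the f1-score
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='f1', cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` then gives the refit model with the best parameter combination.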
0.72 is not bad but not great either. As expected, the model predicts a positive response better than a negative one. However, this should not be a problem because sending offers to people who will not use them anyway is not as bad as not sending offers to people who would have made a purchase only if they had had an offer.
We can also display the influence of each variable on the response
Although the model is not super great, we still get an impression of the feature importance. The attributes that have by far the most influence on the outcome are “days_of_membership” and “income”.
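A random forest exposes these importances via its `feature_importances_` attribute; a sketch on synthetic data (the feature names mirror the dataset, the values do not):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where the first feature carries most of the signal
rng = np.random.default_rng(2)
feature_names = ['days_of_membership', 'income', 'age',
                 'gender_F', 'gender_M', 'gender_O']
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(random_state=42).fit(X, y)

# feature_importances_ holds the mean impurity-based importance per feature
importances = (pd.Series(clf.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances)
```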
Model for discount data
For the discount data we take the same steps as above (model with default parameters and GridSearch for fine-tuning)
Again, we notice the predictions are better for the positive response, which is not problematic (same argumentation as above). The optimised model shows an f1-score of 0.73, but for the positive response we obtain a value of 0.85.
The feature importance shows a similar picture as above.
“Days_of_membership”, “income” and “age” are the attributes that have the most influence on the outcome. The remaining variables seem to have little influence on the response.
We came up with two different models, one for “bogo” and one for “discount”. Both had similar distributions of the main attributes and roughly the same number of observations. The results were quite similar, with “days_of_membership” clearly being the largest influence on the response. “Income” was found to be the second largest influence.
Both models better predict a positive response (responded_to_offer=’yes’), which is ok because sending offers to people who will not use them anyway is not as bad as not sending offers to people who would have made a purchase only if they had had an offer.
Predicting human behaviour is both interesting and difficult because it does not always follow a clear pattern. I still liked the project as it was quite a challenge for me, especially the preprocessing steps of assigning the response labels. Feel free to try to come up with a smarter or better solution ;-).
Comments and outlook
- Maybe other models (e.g. logistic regression) would yield better results.
- We could also try to group the age variable (e.g. young adults/adults/seniors) and the income variable (e.g. low/medium/high income). Maybe this would improve the result.
- Another possibility could be changing the response to (number of completed offers) / (number of received offers) and running a regression model on this response.
As we can see, there are several other possibilities that could be tested to try to find a better model. Personally, I think any of the suggestions might at least be worth trying. Feel free to build on these ideas.
On the one hand, we could probably do better (see Comments and outlook); on the other hand, it can also be difficult to predict human behaviour because it does not always follow a clear pattern or structure. People with a similar time of membership, income and age might react differently to the same offer. Therefore, it is probably impossible to find a model that is almost perfect.
I really liked this project but I found some preprocessing steps really challenging, especially assigning group/response labels to observations. It took me a little while to figure all the steps out.