What attributes have the highest influence on a person’s choice to respond to a certain Starbucks offer?


Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Moreover, not all users receive the same offer.

The task is to determine which demographic groups respond best to which offer type. The dataset we use is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days.

We are given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

We have to keep in mind that someone using the app might make a purchase without having received or seen an offer. Such purchases should not be counted as responses to an offer, so it is critical to preprocess the data carefully.

Project Goal

Imagine a customer who is going to make a 10 dollar purchase anyway, with or without an offer. From a business perspective, you wouldn’t want to send a “buy 10 dollars, get 2 dollars off” offer to this customer. Therefore, the question is:

What attributes have the highest influence on a person’s choice to respond to a certain Starbucks offer?

We are going to build a machine learning model that predicts whether a person responds to a certain offer. The performance of the model will be measured using the F1 score. This metric seems appropriate because it combines precision and recall into a single number (their harmonic mean).
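
To make the metric concrete, here is a minimal sketch of how the F1 score follows from precision and recall. The function name and the example counts are made up for illustration and are not taken from the project.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correctly predicted responders,
# 2 false alarms, 4 missed responders
precision, recall, f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.67 f1=0.73
```

Because it is a harmonic mean, the F1 score is only high when both precision and recall are reasonably high.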

The Data

The data is contained in three files:

  • portfolio.json: offer ids and metadata about each offer (duration, type, etc.)
  • profile.json: demographic data for each customer
  • transcript.json: records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:


portfolio.json

  • id (string): offer id
  • offer_type (string): type of offer, i.e. BOGO, discount, or informational
  • difficulty (int): minimum required spend to complete an offer
  • reward (int): reward given for completing an offer
  • duration (int): time the offer stays open, in days
  • channels (list of strings)


profile.json

  • age (int): age of the customer
  • became_member_on (int): date when the customer created an app account
  • gender (str): gender of the customer (note that some entries contain ‘O’ for other rather than ‘M’ or ‘F’)
  • id (str): customer id
  • income (float): customer’s income


transcript.json

  • event (str): record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str): customer id
  • time (int): time in hours since the start of the test. The data begins at time t=0
  • value (dict of strings): either an offer id or a transaction amount, depending on the record

Data Inspection

In order to analyse the data, we first need to understand and clean it. We take a look at the 3 datasets, which were read in and converted to pandas dataframes (df for short).

1. Portfolio (10 rows, 6 columns)

Portfolio df


  • There are no missing values and the data types are correct
  • We want to convert the duration from days to hours (because the time stamps in the transaction dataset are also in hours)
  • We also want to use a one-hot encoding for the channels column, because a machine learning model needs numeric columns rather than lists of strings.

Doing so yields the cleaned portfolio dataframe

Cleaned portfolio df
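
The two cleaning steps can be sketched as follows. This is a minimal illustration on a two-row toy dataframe, not the original notebook code. The offer ids, the function name clean_portfolio, and the channel_ column prefix are assumptions (the prefix is chosen to match the “channel_email” column mentioned later).

```python
import pandas as pd

# Toy stand-in for the portfolio data (values are made up)
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "informational"],
    "difficulty": [10, 0],
    "reward": [10, 0],
    "duration": [7, 3],  # in days
    "channels": [["email", "mobile", "social"], ["email", "web"]],
})


def clean_portfolio(df):
    out = df.copy()
    # Days -> hours, to match the "time" column of the transcript
    out["duration"] = out["duration"] * 24
    # One-hot encode the list-valued channels column
    all_channels = sorted({ch for chans in out["channels"] for ch in chans})
    for ch in all_channels:
        out[f"channel_{ch}"] = out["channels"].apply(
            lambda chans, ch=ch: int(ch in chans)
        )
    return out.drop(columns="channels")


cleaned_portfolio = clean_portfolio(portfolio)
print(cleaned_portfolio)
```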

2. Profile (17,000 rows, 5 columns)


  • Taking a closer look at the data, we notice that “age=118” always corresponds to “gender=None” and “income=NaN”. We remove those rows because they carry no usable demographic information.
  • We also want to transform the “became_member_on” column to values that live on a “ratio scale”. We chose to count the days of membership, where the 0 of the scale corresponds to the latest observation in the dataset.
  • Finally, we use a one-hot encoding for the gender column

Applying all these steps yields

Cleaned Profile df
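
A possible sketch of these three steps is shown below. The toy rows and the function name clean_profile are made up, and the code assumes that “became_member_on” is stored as an integer of the form YYYYMMDD.

```python
import pandas as pd

# Toy stand-in for the profile data (values are made up)
profile = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "age": [118, 55, 40],
    "gender": [None, "F", "M"],
    "income": [float("nan"), 112000.0, 60000.0],
    "became_member_on": [20170212, 20170715, 20180726],
})


def clean_profile(df):
    # Drop the rows where age=118 (gender and income are missing there)
    out = df[df["age"] != 118].copy()
    # Integer date (assumed YYYYMMDD) -> days of membership,
    # counted back from the most recent sign-up date
    joined = pd.to_datetime(out["became_member_on"].astype(str), format="%Y%m%d")
    out["days_of_membership"] = (joined.max() - joined).dt.days
    out = out.drop(columns="became_member_on")
    # One-hot encode gender
    gender_dummies = pd.get_dummies(out["gender"], prefix="gender", dtype=int)
    return pd.concat([out.drop(columns="gender"), gender_dummies], axis=1)


cleaned_profile = clean_profile(profile)
print(cleaned_profile)
```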

The “days_of_membership” column has the following distribution

“Days_of_membership” distribution

3. Transcript (306534 rows, 4 columns)

Transcript df


  • The value column can contain “offer_id”, “amount” and “reward”. We are not interested in the “amount” (only “responded to offer = yes/no” is of interest to us), and the “reward” information is already contained in the portfolio df. We therefore keep only the “offer_id” as its own column.

Extracting the “offer_id” and ordering the observations by “person” and “time” yields

Cleaned Transcript df
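
One way to do this extraction is sketched below on a few made-up records. The sketch assumes that the dictionaries in the value column store the offer id under either the key "offer id" or the key "offer_id"; adjust the keys if your copy of the data differs. The function name extract_offer_id is hypothetical.

```python
import pandas as pd

# Toy stand-in for the transcript data (values are made up)
transcript = pd.DataFrame({
    "person": ["p2", "p1", "p1", "p1"],
    "event": ["offer received", "offer received", "transaction", "offer completed"],
    "time": [0, 0, 18, 18],
    "value": [
        {"offer id": "offer_a"},
        {"offer id": "offer_a"},
        {"amount": 12.5},
        {"offer_id": "offer_a", "reward": 5},
    ],
})


def extract_offer_id(value):
    # Returns None for transactions, which carry only an amount
    return value.get("offer id", value.get("offer_id"))


cleaned_transcript = (
    transcript.assign(offer_id=transcript["value"].apply(extract_offer_id))
    .drop(columns="value")
    .sort_values(["person", "time"])
    .reset_index(drop=True)
)
print(cleaned_transcript)
```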

We notice there are still some missing values in the “offer_id” column, which always appear for “event=transaction”. However, we can ignore these observations because our focus will be on “event=offer_completed” instead of “event=transaction”.

Grouping observations

We consider the following groups of people

Group 1: people that are influenced by the offer

  • offer received → offer viewed → offer completed

Group 2: people that are NOT influenced by the offer

  • offer received → offer viewed

Group 3: people that receive an offer but take no action

  • offer received

Group 4: people that buy products regardless of an offer

  • offer received → offer completed (→ offer viewed)

Our focus lies on group 1 (responded to the offer) and group 2 (did not respond to the offer). Group 3 doesn’t yield much information. Group 4 buys products regardless of an offer, which means they should not receive offers (from a business perspective).

Therefore, in the next step we want to assign labels to the person-offer combinations that belong to group 1 or group 2. The other groups are omitted.

Merge all 3 df into 1 single df

First of all, we merge all 3 df’s into 1 single df containing all the information. Because this df has many columns, I had to split it into 2 images here to give an impression of what it looks like.

Merged df
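
The merge itself might look like the sketch below. The toy frames are made up, and the join choices are my assumptions: a left join on the offer (so transactions without an offer id are kept) and an inner join on the customer (so customers removed during cleaning drop out).

```python
import pandas as pd

# Toy stand-ins for the three cleaned dataframes (values are made up)
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p9"],
    "event": ["offer received", "transaction", "offer received"],
    "time": [0, 6, 0],
    "offer_id": ["offer_a", None, "offer_a"],
})
portfolio = pd.DataFrame({
    "id": ["offer_a"],
    "offer_type": ["bogo"],
    "duration": [168],
})
profile = pd.DataFrame({
    "id": ["p1"],
    "age": [55],
    "income": [50000.0],
})

merged = (
    transcript
    # Attach the offer attributes; rows without an offer id get NaN
    .merge(portfolio, how="left", left_on="offer_id", right_on="id")
    .drop(columns="id")
    # Attach the demographics; customers missing from profile are dropped
    .merge(profile, how="inner", left_on="person", right_on="id")
    .drop(columns="id")
)
print(merged)
```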

Assign labels to group1 and group 2

Assigning the labels needs a few steps and includes a lot of different “groupby” and “merge” commands.

1. Mark observations where the offer was completed after it was viewed. This is important because having seen the offer is what distinguishes group 1 from group 4. An example is shown just below (again in 2 separate images because of the many columns).

Offer completed after viewed

2. Mark observations where the offer was only received and viewed but not completed (group 2).

Only offer received and viewed (without completion)

3. Group the observations by “person” and “offer_id” and add a label that indicates whether a person is in group 1 or not.

Group 1 observations
Group 2 observations

However, this is not completely correct yet, because observations should only count towards group 1 if the offer was completed within its duration. Otherwise we can assume the person did not care about the offer and belongs to group 2.

4. Find time difference between “offer received” and “offer completed” of the corresponding offer for each person and add a column “time_diff_rec_compl”.

Time difference between “offer received” and “offer completed”

5. Add a column that contains a value of 1 if the time difference between “offer received” and “offer completed” is less than or equal to the duration of the offer, and a 0 otherwise.

Valid duration

6. Add a column “responded_to_offer” that is 1 if the offer was completed after it was viewed and within the duration of the offer, and 0 otherwise. This column will later be our response variable. But first we need one last step.

Responded to offer

7. Group by “person” and “offer_id” to get a response (responded_to_offer) only once per person and offer_id.

Grouped by “person” and “offer_id”

In the table above, each person-offer_id combination only appears once (either with a positive or negative response).
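
The labelling logic of steps 1 to 7 can be condensed into the sketch below. It is a simplified illustration, not the original chain of “groupby” and “merge” commands: it assumes each person receives a given offer only once (it uses the first timestamp of each event), and the function name label_responses and the toy data are made up. The column names “time_diff_rec_compl” and “responded_to_offer” follow the text; “valid_duration” is borrowed from the figure caption.

```python
import pandas as pd

# Toy event log: (person, offer_id, event, time in hours, offer duration in hours)
rows = [
    ("p1", "A", "offer received", 0, 120),
    ("p1", "A", "offer viewed", 6, 120),
    ("p1", "A", "offer completed", 30, 120),   # group 1
    ("p2", "A", "offer received", 0, 120),
    ("p2", "A", "offer viewed", 12, 120),      # group 2
    ("p3", "A", "offer received", 0, 120),     # group 3
    ("p4", "A", "offer received", 0, 120),
    ("p4", "A", "offer completed", 10, 120),
    ("p4", "A", "offer viewed", 20, 120),      # group 4
    ("p5", "B", "offer received", 0, 72),
    ("p5", "B", "offer viewed", 5, 72),
    ("p5", "B", "offer completed", 100, 72),   # completed too late
]
events = pd.DataFrame(rows, columns=["person", "offer_id", "event", "time", "duration"])

EVENTS = ["offer received", "offer viewed", "offer completed"]


def label_responses(df):
    # One row per person-offer pair, one column per event (first timestamp)
    t = (
        df.groupby(["person", "offer_id", "duration", "event"])["time"]
        .min()
        .unstack("event")
        .reindex(columns=EVENTS)
        .reset_index()
    )
    t.columns.name = None
    received, viewed, completed = (t[e] for e in EVENTS)

    # Steps 1-3: completed after (or at the same time as) viewing
    completed_after_viewed = completed.notna() & viewed.notna() & (completed >= viewed)
    # Completed without having viewed the offer first (group 4)
    completed_unseen = completed.notna() & ~completed_after_viewed

    # Steps 4-5: time between receiving and completing vs. offer duration
    t["time_diff_rec_compl"] = completed - received
    t["valid_duration"] = (t["time_diff_rec_compl"] <= t["duration"]).astype(int)

    # Step 6: the response variable
    t["responded_to_offer"] = (
        completed_after_viewed & (t["valid_duration"] == 1)
    ).astype(int)

    # Step 7: keep only groups 1 and 2 (viewed, and not completed unseen)
    keep = viewed.notna() & ~completed_unseen
    cols = ["person", "offer_id", "time_diff_rec_compl", "valid_duration", "responded_to_offer"]
    return t.loc[keep, cols].reset_index(drop=True)


labels = label_responses(events)
print(labels)
```

In this toy example only p1 gets a positive label. p2 and p5 end up with a negative label, while p3 and p4 are dropped because they belong to groups 3 and 4.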

The df that we obtain now has 31782 rows and 17 columns. We split the data according to the different offer types (bogo and discount). “Offer_type=informational” is removed. We also remove the column “channel_email” because it contains a value of 1 for every observation and therefore carries no information.

Split data according to “offer_type”

We split the data into “offer_type=bogo” and “offer_type=discount” and create a classification model for each of the 2 datasets.
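
A minimal sketch of this split on a made-up labelled dataframe (the variable names labelled and datasets are assumptions):

```python
import pandas as pd

# Toy stand-in for the labelled person-offer dataframe (values are made up)
labelled = pd.DataFrame({
    "offer_type": ["bogo", "discount", "informational", "bogo"],
    "channel_email": [1, 1, 1, 1],
    "channel_web": [1, 0, 1, 1],
    "responded_to_offer": [1, 0, 0, 0],
})

# Remove informational offers and the constant channel_email column
labelled = labelled[labelled["offer_type"] != "informational"]
labelled = labelled.drop(columns="channel_email")

# One dataframe per remaining offer type
datasets = {
    offer_type: group.drop(columns="offer_type").reset_index(drop=True)
    for offer_type, group in labelled.groupby("offer_type")
}

for offer_type, data in datasets.items():
    share = data["responded_to_offer"].mean()
    print(f"{offer_type}: {len(data)} rows, {share:.0%} positive responses")
```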

Take a look at the distribution of some attributes

Age distribution
days_of_membership distribution
Income distribution

Now take a look at the distribution of the genders and the response variable “responded_to_offer”.

The same is done for the discount data. The 3 distribution plots for the discount data look very similar to those for bogo, which is why we do not show them here again.

Each dataset contains roughly half the observations. In the bogo dataset the response is fairly evenly distributed (58% to 42%), whereas in the discount dataset it is more unevenly distributed (74% to 26%). Therefore, we might expect both models to predict a positive response (responded_to_offer=1) better than a negative one. “gender=M” and “gender=F” are almost evenly distributed. “gender=O” only appears a few times.

Machine learning model

Now we have finally prepared the data such that we can train a machine learning model and make predictions. We decide to use a Random Forest Classifier (in combination with the F1 score as a performance metric) because I have had good experiences with it and it seems appropriate for classifying this binary response. We use the standard steps in machine learning

  • Split data into training and test set
  • Scale data (I used a MinMaxScaler)
  • Initialize the classifier
  • Train the classifier using the training data
  • Predict values with the test data
  • Score predictions (using f1-score)

Model for bogo data

Using the default parameters, we achieve the following result for the bogo data

Random Forest with default parameters for bogo data

We are not satisfied with this result and try to improve it with a grid search over the following parameters

Parameters grid search
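
As an illustration of such a grid search, here is a sketch using scikit-learn's GridSearchCV on synthetic data. The parameter grid below is hypothetical and deliberately small; it is not the grid used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical grid, kept small so the example runs quickly
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # optimise the F1 score of the positive class
    cv=3,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.2f}")
```

After fitting, search.best_estimator_ holds a model refitted with the best parameter combination, which can then be scored on the test set.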

Using the best parameters from the grid search, we can improve the model to an F1 score of 0.72 (which is 0.04 better than above). For the positive response we even get a score of 0.78

Best model

An F1 score of 0.72 is not bad but not great either. As expected, the model predicts a positive response better than a negative one. However, this should not be a problem, because sending offers to people who will not use them anyway is not as bad as failing to send offers to people who would have made a purchase only if they had received one.

We can also display the influence of each variable on the response

Although the model is far from perfect, we still get an impression of the feature importance. The attributes that have by far the most influence on the outcome are “days_of_membership” and “income”.
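
For reference, a fitted scikit-learn random forest exposes these values through its feature_importances_ attribute. The sketch below uses synthetic data with placeholder feature names, so the ranking it prints says nothing about the Starbucks data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```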

Model for discount data

For the discount data we take the same steps as above (a model with default parameters, followed by a grid search for fine-tuning)

Random Forest with default parameters for discount data
Optimised parameters

Again, we notice that the predictions are better for the positive response, which is not problematic (same argument as above). The optimised model shows an F1 score of 0.73, but for the positive response we obtain a value of 0.85.

The feature importance shows a similar picture as above.

“Days_of_membership”, “income” and “age” are the attributes that have the most influence on the outcome. The remaining variables seem to have little influence on the response.


We came up with two different models, one for “bogo” and one for “discount”. Both datasets had similar distributions of the main attributes and roughly the same number of observations. The results were quite similar, with “days_of_membership” clearly being the largest influence on the response. “Income” was found to be the second largest influence.

Both models predict a positive response (responded_to_offer=1) better than a negative one, which is ok because sending offers to people who will not use them anyway is not as bad as failing to send offers to people who would have made a purchase only if they had received one.

Predicting human behaviour is both interesting and difficult because it does not always follow a clear pattern. I still liked the project as it was quite a challenge for me, especially the preprocessing steps of assigning the response labels. Feel free to try to come up with a smarter or better solution ;-).

Comments and outlook

  • Maybe other models (e.g. logistic regression) would yield better results.
  • We could also try to group the age variable (e.g. young adults/adults/seniors) and the income variable (e.g. low/medium/high income). Maybe this would improve the result.
  • Another possibility could be changing the response to (number of completed offers) / (number of received offers) and running a regression model on this response.

As we can see, there are several other possibilities that could be tested to try to find a better model. Personally, I think any of these suggestions might at least be worth trying. Feel free to build on these ideas.
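
Two of these ideas are easy to prototype with pandas, as the sketch below shows. The bin edges, group labels, and toy data are purely hypothetical and would need to be chosen with the real distributions in mind.

```python
import pandas as pd

# Toy customers (values are made up)
customers = pd.DataFrame({
    "age": [22, 45, 70],
    "income": [35000.0, 72000.0, 110000.0],
})

# Group age and income into coarse categories (hypothetical bin edges)
customers["age_group"] = pd.cut(
    customers["age"],
    bins=[17, 34, 64, 120],
    labels=["young adult", "adult", "senior"],
)
customers["income_group"] = pd.cut(
    customers["income"],
    bins=[0, 50000, 90000, float("inf")],
    labels=["low", "medium", "high"],
)
print(customers)

# Alternative response: completed offers / received offers per person
offers = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "completed": [1, 0, 1],  # one row per received offer
})
completion_rate = offers.groupby("person")["completed"].mean()
print(completion_rate)
```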


On the one hand, we could probably do better (see Comments and outlook). On the other hand, it can be difficult to predict human behaviour because it does not always follow a clear pattern or structure. People with a similar length of membership, income and age might react differently to the same offer. Therefore, it is probably impossible to find a model that is almost perfect.

I really liked this project but I found some preprocessing steps really challenging, especially assigning group/response labels to observations. It took me a little while to figure all the steps out.




Tobias Merk
