# What are the largest influences on Airbnb prices in Boston?

# Introduction

Have you ever thought about making a few extra bucks and wondered how much you could charge for renting out your place in Boston on Airbnb for a couple of nights?

Obviously the price would depend on many factors such as:

- In what part of the city is your place located?
- What type of property is it? An entire house, just one room, or even something very special like a houseboat?
- How many people can stay at your place and what kind of beds do you offer? A real bed, a couch or something else?
- Based on review scores for other Airbnbs, what do guests consider important in a place to stay for one or more nights?

These are only some questions you might want to ask yourself before determining a price.

# The Data and Questions of Interest

Based on an Airbnb dataset with close to 3600 observations, we want to answer the following questions:

**1. What neighbourhoods are the most expensive/cheap ones?**

**2. How does the price of an Airbnb depend on the review scores from the guests?**

**3. What other attributes have a large influence on the prices?**

Of course we all have some intuitive answers for these questions. The price for an entire house would probably be higher than for a bed in a dorm. Moreover, areas around downtown or close to the coast are likely to be more expensive than neighbourhoods further away from the center.

To answer these questions, we are going to investigate the dataset with mathematical methods and models.

Before getting started with modeling, we first had to **clean up the data** since the dataset was a bit messy. Columns including prices and fees were listed as strings and not numbers, some columns had a high number of missing values, etc.
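The string-to-number step could look roughly like the sketch below. The toy rows and the `to_number` helper are made up for illustration, and treating a missing cleaning fee as zero is an assumption, not something the dataset documents.

```python
import pandas as pd

# Toy rows standing in for the real listings table (values are made up).
listings = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", "$40.00"],
    "cleaning_fee": ["$50.00", None, "$15.00"],
})


def to_number(column: pd.Series) -> pd.Series:
    """Strip '$' and thousands separators, then parse as float (NaN if missing)."""
    cleaned = column.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


for name in ["price", "cleaning_fee"]:
    listings[name] = to_number(listings[name])

# Assumption: a missing cleaning fee means that no fee is charged.
listings["cleaning_fee"] = listings["cleaning_fee"].fillna(0.0)

print(listings)
```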

Moreover, for some attributes we had to make assumptions. For example, the data did not clearly say whether the price was per night or per minimum number of nights that guests had to stay. We assumed that the price is per night.

After preprocessing the data, we chose to distribute the cleaning fee evenly over the nightly price and came up with the following histogram of the total price per night.

It turns out there are a few very expensive Airbnbs in Boston. In order to obtain a good model, we want to remove those “outliers”. We decided to only consider prices between 20 USD and 500 USD. These cutoffs were chosen arbitrarily; in my opinion, Airbnbs are mostly used by younger people, who are usually not the “rich age group”.
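As a rough sketch, the total price and the price filter might be computed as follows. The article does not say exactly how the cleaning fee was distributed, so dividing it by `minimum_nights` is an assumption here, as are the toy values and the choice to keep both cutoffs inclusive.

```python
import pandas as pd

# Toy rows with already numeric columns (values are made up).
listings = pd.DataFrame({
    "price": [15.0, 85.0, 450.0, 1250.0],
    "cleaning_fee": [0.0, 30.0, 100.0, 200.0],
    "minimum_nights": [1, 2, 2, 1],
})

# Assumption: the cleaning fee is spread evenly over the minimum stay.
listings["total_price_per_night"] = (
    listings["price"] + listings["cleaning_fee"] / listings["minimum_nights"]
)

# Keep only listings between 20 USD and 500 USD (both ends inclusive here).
in_range = listings["total_price_per_night"].between(20, 500)
filtered = listings[in_range]

print(filtered)
```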

Taking another look at the data, we found that some places have a lot more bathrooms than beds, which seems a bit weird to me. Why would a place have 1 bed, accommodate 2 people, but have 6 bathrooms? Observations with 5 or more bathrooms were therefore removed (which more or less corresponds to the table shown just below).

In a similar way we removed some strange observations in the attributes “cleaning fee”, “minimum_nights”, “security_deposit” and “accommodates” (number of people that can stay). For more details, we refer to the GitHub content here.

Removing a few implausible observations like these is fine: after the data preprocessing we were still left with close to 91% of the original observations (~3250 data points).
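The bathroom rule and the check on how much data survives can be sketched like this. Only the bathroom threshold is taken from the text above; the toy rows are made up, and the cutoffs used for the other attributes are not reproduced because they are not stated here.

```python
import pandas as pd

# Toy rows (values are made up); two of them have 5 or more bathrooms.
listings = pd.DataFrame({
    "beds": [1, 2, 1, 3, 2],
    "accommodates": [2, 4, 2, 6, 3],
    "bathrooms": [1.0, 1.5, 6.0, 2.0, 5.0],
})

# Remove observations with 5 or more bathrooms.
cleaned = listings[listings["bathrooms"] < 5]

# Share of the original observations that is still available for modeling.
retained_share = len(cleaned) / len(listings)
print(f"kept {len(cleaned)} of {len(listings)} rows ({retained_share:.0%})")
```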

# The Linear Regression Model

We want to model the “total_price_per_night” described above as a function of neighbourhood, type of property, number of beds, review scores and many more predictors.

Using linear regression (from the Python package **scikit-learn**), we obtained a reasonably good model. The problem here is the coefficients…

Some coefficients of our linear regression model were extremely large in absolute value (both positive and negative, see both tables below). This makes it **difficult to interpret the results**, and maybe we can obtain a better fit using a different model.
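A minimal sketch of fitting such a model and listing its coefficients is shown here. The six toy listings and the two predictors are made up, and one-hot encoding the neighbourhood with `pd.get_dummies` is an assumption about the preprocessing, not necessarily what was done for the real model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy listings (values are made up for illustration).
listings = pd.DataFrame({
    "neighbourhood": ["Downtown", "Hyde Park", "Back Bay",
                      "Hyde Park", "Downtown", "Back Bay"],
    "bedrooms": [2, 1, 3, 2, 1, 1],
    "total_price_per_night": [260.0, 70.0, 330.0, 95.0, 190.0, 180.0],
})

# One-hot encode the categorical predictor; numeric columns pass through.
X = pd.get_dummies(
    listings[["neighbourhood", "bedrooms"]],
    columns=["neighbourhood"],
    dtype=float,
)
y = listings["total_price_per_night"]

model = LinearRegression().fit(X, y)

# Pair every coefficient with its column name and sort from most negative
# to most positive, which is how the coefficient tables can be read.
coefficients = pd.Series(model.coef_, index=X.columns).sort_values()
print(coefficients)
```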

# The Lasso Regression Model

In the hope of obtaining a better model, we used another approach called “**Lasso Regression**”. Lasso regression works in a similar way to linear regression, but it discourages large coefficients by penalizing them in the optimization.
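To illustrate the penalty, the sketch below fits both models on synthetic data in which two predictors are almost identical, a setting where ordinary least squares is known to give unstable coefficients. Whether that is what happened with the Boston data is not something we have checked, and the value `alpha=0.1` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: x2 is almost an exact copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)  # alpha sets the strength of the penalty

# Total absolute size of the coefficients under each model.
ols_size = np.abs(ols.coef_).sum()
lasso_size = np.abs(lasso.coef_).sum()

print("OLS coefficients:  ", ols.coef_)
print("Lasso coefficients:", lasso.coef_)
```

Because the Lasso adds the sum of absolute coefficient values to the squared error it minimizes, its coefficients can never be larger in total absolute size than those of the unpenalized fit.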

With this approach we obtain a model with a similar fit to the linear regression. The plot just below shows the **true prices vs. the predicted prices**. The closer the red dots lie to the blue line, the better our model is. The axes show the square root of the prices, because the model fit better after a square root transformation of the response.
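One way to apply a square root transformation of the response in scikit-learn is `TransformedTargetRegressor`, which fits on the transformed prices and squares the predictions back automatically. This is a sketch on synthetic data and not necessarily how the model above was built; the predictors, the noise level and `alpha=0.01` are all assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso

# Synthetic listings whose square-root price is roughly linear in the predictors.
rng = np.random.default_rng(1)
bedrooms = rng.integers(1, 5, size=300).astype(float)
accommodates = bedrooms * 2 + rng.integers(0, 2, size=300)
X = np.column_stack([bedrooms, accommodates])
sqrt_price = 6.0 + 2.0 * bedrooms + 0.5 * accommodates + rng.normal(scale=0.5, size=300)
price = sqrt_price ** 2

# Fit the Lasso on sqrt(price); predictions are squared back to USD.
model = TransformedTargetRegressor(
    regressor=Lasso(alpha=0.01),
    func=np.sqrt,
    inverse_func=np.square,
)
model.fit(X, price)

predicted_price = model.predict(X)
r2 = model.score(X, price)  # R^2 measured on the original price scale
print(f"R^2 on the original scale: {r2:.2f}")
```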

# Conclusion

# 1. What neighbourhoods are the most expensive/cheap ones?

→ **South Boston Waterfront, Downtown and Back Bay** (probably unsurprisingly) seem to be neighbourhoods with the **more expensive Airbnbs** (which can be seen from the rather high positive coefficients for those variables).

→ **Hyde Park, Roslindale and Mattapan** seem to be the **cheaper neighbourhoods** (due to the rather large negative coefficients for those variables).

# 2. How does the price of an Airbnb depend on the review scores from the guests?

→ Good **reviews for cleanliness** in particular seem to make Airbnbs more expensive. **Accuracy** and **Location** also seem to have a comparably large **positive influence on the prices**.

→ Apparently the **number of reviews per month, the number of reviews and the review_scores_value** (whatever that stands for → need more information on the data) seem to have a comparably **strong negative influence on the price** (maybe because many people like to stay in cheaper places and therefore those places have more reviews?).

# 3. What other attributes do in general have the largest influences on the prices?

The coefficients with the largest absolute values have the highest influence on the price. The **number of bedrooms, property_type=Other** (whatever that means → need more information about the data), **the number of included guests, property_type=boat, room_type=entire home/apt** and the **number of people the Airbnb can accommodate** tend to make Airbnbs more expensive.

On the other hand, the **minimum number of nights people have to stay**, the **number of reviews per month** and e.g. staying in a **dorm** tend to make Airbnbs cheaper.

In conclusion, staying in a clean entire home/apt or houseboat with many bedrooms in South Boston Waterfront or Downtown is more expensive than staying in a dorm in Hyde Park. All of this makes perfect sense.

**Surprisingly**, more reviews per month seem to make Airbnbs cheaper (maybe because people like to stay in cheaper places and therefore frequent those places more and leave more reviews?).

# Price prediction

Using this model we can also make predictions for a given home. Let’s say you wanted to rent out your place: we simply enter the information about neighbourhood, number of beds, etc. and evaluate our model on this set of variables to obtain an average price as a prediction.
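A prediction for a single home might look like the sketch below. The training rows, the predictor names and `alpha=0.1` are made up for illustration; the important detail is that the new home has to be encoded with exactly the same columns as the training data, which `reindex` takes care of here.

```python
import pandas as pd
from sklearn.linear_model import Lasso

# Toy training data (values are made up for illustration).
train = pd.DataFrame({
    "neighbourhood": ["Downtown", "Hyde Park", "Back Bay",
                      "Hyde Park", "Downtown", "Back Bay"],
    "beds": [2, 1, 3, 2, 1, 1],
    "accommodates": [4, 2, 6, 3, 2, 2],
    "total_price_per_night": [260.0, 70.0, 330.0, 95.0, 190.0, 180.0],
})

X_train = pd.get_dummies(
    train.drop(columns="total_price_per_night"),
    columns=["neighbourhood"],
    dtype=float,
)
y_train = train["total_price_per_night"]
model = Lasso(alpha=0.1).fit(X_train, y_train)

# The home we want a price for, described with the same attributes.
my_place = pd.DataFrame(
    [{"neighbourhood": "Back Bay", "beds": 2, "accommodates": 4}]
)

# Encode it like the training data; dummy columns that do not occur
# in this single row are filled with 0.
X_new = pd.get_dummies(my_place, columns=["neighbourhood"], dtype=float)
X_new = X_new.reindex(columns=X_train.columns, fill_value=0.0)

predicted = float(model.predict(X_new)[0])
print(f"predicted total price per night: {predicted:.0f} USD")
```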

# Discussion

**Our model is neither bad nor perfect**. Therefore many results we derived from it need to be treated with caution. Many tendencies may be correct, but we certainly cannot fully trust our model.

To come up with a better model, we would have to understand the data better, which requires more information about it: is the listed price really per night, did we make the right assumptions about which observations to remove, and so on.

More observations would probably also increase the reliability of our model. We could also try different models, e.g. a Random Forest, to maybe obtain a better fit.
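Comparing candidate models could be done with cross-validation, roughly as sketched below. The data is synthetic and contains an interaction that a linear model cannot capture, so it says nothing about which model would win on the Boston listings; the hyperparameters are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data with one linear effect and one interaction effect.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 3))
y = (
    100
    + 80 * X[:, 0]
    + 60 * (X[:, 1] > 0.5) * X[:, 2]
    + rng.normal(scale=5, size=300)
)

candidates = {
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean R^2 over 5 folds for every candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean cross-validated R^2 = {score:.2f}")
```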

How would YOU have worked yourself through the data? Can YOU find a model that better fits the data?