library(httr)
library(xml2)
library(rvest)
library(janitor)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lintr)
library(styler)
library(remedy)
library(tidylog)
library(stringr)
library(stats)
library(knitr)
library(rmarkdown)
library(infer)
Scrape the data contained in https://epfl-exts.github.io/rental-scrape/ and get a tibble of 612 rows and 9 columns.
Since I am scrapping data from a website, I will need the {httr}, {xml2} and {rvest} packages. Once the URL is converted into a xml document, I have to inspect the html code from the page to find the information I want to use in my tibble (as there are no tables we can directly scrape form). As a second step, I have to understand the structure of the html code, to identify the tags and attributes which will allow me to select the data I need as precisely as possible with the help of html_nodes(). Thirdly I use html_text() which creates a simple vector of character strings needed for my future tibble.
rental_website <- GET("https://epfl-exts.github.io/rental-scrape/")
rental_html <- content(rental_website, as = "parsed")
address <- html_nodes(rental_html, css = ".sec-info .address") %>%
html_text()
floor <- html_nodes(rental_html, css = ".sec-info .floor strong") %>%
html_text()
availability <- html_nodes(rental_html, css = ".availability strong") %>%
html_text()
price <- html_nodes(rental_html, css = ".bg-text .price .float-right strong") %>%
html_text()
currency <- html_nodes(rental_html, css = ".bg-text .price .float-right") %>%
html_text() %>%
str_extract_all("\\b[A-Z]{2,4}\\b") %>% # \\b = start/end of a word, [A-Z] = letters, {2,4} = number of characters
unlist() # The function str_extract_all returns a list and I want a vector. That's why I need unlist()
living_space <- html_nodes(rental_html, css = ".sec-info .living-space strong") %>%
html_text()
usable_surface <- html_nodes(rental_html, css = ".sec-info .usable-surface strong") %>%
html_text()
object_type <- html_nodes(rental_html, css = ".bg-text .object-type") %>%
html_text()
rooms <- html_nodes(rental_html, css = ".bg-text .rooms strong") %>%
html_text()
rental <- tibble(address,
price,
currency,
object_type,
rooms,
living_space,floor,
availability,
usable_surface) %>%
rename(location = address)
## rename: renamed one variable (location)
Explore the data and clean the tibble.
I want to make sure I’m working with a tibble and not a data frame and observe the data types of the columns. In this case, all columns are characters which means I will certainly have to change some. I will not change the “location” column to a date data type, because I don’t need to use it further and don’t know how to treat the “immediately” value.
rental <- rental %>%
tibble::as_tibble()
glimpse(rental)
## Rows: 612
## Columns: 9
## $ location <chr> "Rue de la Terrassière 58, 1207 Genève", "Address on re…
## $ price <chr> "1900", "4500", "2100", "5500", "2340", "2895", "1980",…
## $ currency <chr> "CHF", "CHF", "CHF", "CHF", "CHF", "CHF", "CHF", "NA", …
## $ object_type <chr> "Apartment", "Apartment", "Apartment", "Single house", …
## $ rooms <chr> "3", "4", "4", "6", "4", "4.5", "4", "5", "4", "4", "5"…
## $ living_space <chr> "63 m2", "185 m2", "NA", "170 m2", "73 m2", "NA", "73 m…
## $ floor <chr> "4", "NA", "50", "3", "1", "2", "11", "6", "5", "2", "6…
## $ availability <chr> "01 Oct 2018", "01 Aug 2018", "01 Aug 2018", "NA", "01 …
## $ usable_surface <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "…
Before I modify the original tibble I want to get a sense of the unique values and their proportion, using tabyl().
rental %>%
tabyl(floor) %>%
kable()
| floor | n | percent |
|---|---|---|
| 1 | 84 | 0.1372549 |
| 10 | 3 | 0.0049020 |
| 11 | 2 | 0.0032680 |
| 12 | 1 | 0.0016340 |
| 18 | 1 | 0.0016340 |
| 2 | 65 | 0.1062092 |
| 3 | 67 | 0.1094771 |
| 4 | 53 | 0.0866013 |
| 5 | 35 | 0.0571895 |
| 50 | 1 | 0.0016340 |
| 6 | 45 | 0.0735294 |
| 7 | 15 | 0.0245098 |
| 8 | 12 | 0.0196078 |
| 9 | 6 | 0.0098039 |
| NA | 214 | 0.3496732 |
| Underground | 8 | 0.0130719 |
The majority of values in the “floor” column are digits, except for “Underground” that I could turn into “-1”. 34% of the values are NA, this result will impact any future analysis and hinder the interpretation. Given the importance of NA I will not exclude them from the analysis. I can’t infer on what the value could be either as the “object_types” are of Villa, Apartment or Single House type. Also, there seems to be an anomaly with the floor number 50. I could exclude this value as well.
rental %>%
tabyl(object_type) %>%
kable()
| object_type | n | percent |
|---|---|---|
| Apartment | 440 | 0.7189542 |
| Attic flat | 8 | 0.0130719 |
| Bifamiliar house | 8 | 0.0130719 |
| Cellar compartment | 1 | 0.0016340 |
| Duplex | 16 | 0.0261438 |
| Farm house | 1 | 0.0016340 |
| Hobby room | 8 | 0.0130719 |
| Loft | 4 | 0.0065359 |
| Roof flat | 3 | 0.0049020 |
| Row house | 14 | 0.0228758 |
| Single house | 65 | 0.1062092 |
| Studio | 9 | 0.0147059 |
| Terrace flat | 2 | 0.0032680 |
| Villa | 33 | 0.0539216 |
The currency column is 97% composed of “CHF”. The rest can be considered as NA without compromising the interpretation. I could ignore the NA values.
rental %>%
tabyl(rooms) %>%
paged_table()
I don’t see any anomaly here, I could turn this column into a numeric one. The NA values represent only 4%.
rental %>%
tabyl(price) %>%
paged_table()
“Price on request” is the only non numerical value in the price column and represents roughly 2% of the values. I can replace them by “NA”. It should not hinder the interpretation and I could ignore them.
rental %>%
tabyl(living_space) %>%
paged_table()
I could turn this column into a numeric one. The NA represent 15% so can’t be excluded from the analysis. I see one anomaly here: 1m2 for a living space seems like an error and could be discarded.
rental %>%
tabyl(usable_surface) %>%
paged_table()
The majority (95%) of the data for the usable surface is NA, I don’t think the values are missing or that the data was mishandled. Depending on the definition of usable surface, I understand it means that only 5% of this data set is considered to have space dedicated for something else than living.
In the following step, I will mutate the columns needed for the next exercises and create a new tibble called “rental2”.
rental2 <- rental %>%
mutate(price = as.numeric(price)) %>%
mutate(rooms = as.numeric(rooms)) %>%
mutate(floor = if_else(floor == "Underground", "-1", floor)) %>%
mutate(floor = as.numeric(floor)) %>%
mutate(living_space = str_remove(living_space, "m2")) %>%
mutate(living_space = as.numeric(living_space)) %>%
mutate(usable_surface = str_remove(usable_surface, "m2")) %>%
mutate(usable_surface = as.numeric(usable_surface))
Now that I have standardized my data in the “rental2” variable I want to have a vizualization of the data types and missing values.
visdat::vis_dat(rental2)
Create a scatterplot showing how price evolves with living space, for all the listed properties.
ggplot(data = rental2,
mapping=aes(x=price,
y=living_space)) +
geom_point(alpha =0.7) +
labs(title = "Evolution of price depending on living space by listed proprerties",
x = "Price (CHF)",
y = "Living space (m2)")+
theme_minimal()
The observations are concentrated when the prices and living spaces are low i.e. under 100m2. The more the living space increases, the more the prices vary. Note: 106 rows were removed because they contained missing values or values outside the scale range.
Create a bar plot showing the number of properties by postcode.
The first step here is to create a separate column for the postcode which are mentioned in the “location” column. I have to extract 4 digits and I can use str_extract() to do so. The count() functions will return a simple tibble with the number of observations grouped per postcode which is what we need to create the barplot and understand the most popular postcodes in this rental agency. I flipped the coordinates thinking the postcodes might be more readable horizontally, but they are still too close from one another.
library(grid)
rental3 <- rental2 %>%
mutate(postcode = str_extract(location, "\\b\\d{4}\\b")) %>%
count(postcode)
rental3 %>%
ggplot(data = rental3,
mapping = aes(x=forcats::fct_reorder(postcode, desc(n)),
y = n)) +
geom_col(alpha = 0.7) +
coord_flip() +
labs(title = "Number of properties by postcode",
x = "Postcode",
y = "Number of properties")+
theme_minimal()+
theme(plot.margin = unit(c(1, 1, 2, 1), "cm"))
Most of the rental locations are for the postcode 1206. Overall most of the rental locations are in the city center i.e. 1200-1209.
Create a more complex scatterplot, showing how price evolves with living space by postcode and by floor.
For the previous exercise I had created a column for the postcodes in “rental3”, but this tibble only contained 2 variables: n and postcodes. For this exercise I need the price, living space, postcode and floor variable. Therefore, I will add the postcode column to “rental2” and have a new tibble “rental4”. For the vizualisation, I will exclude the floor 50 as I think it is an error and exclude the NAs. I also focus on floors 1 to 6 as the other data points have less data and can be less representative of a tendency.
rental_4 <- rental2 %>%
mutate(postcode = str_extract(location, "\\b\\d{4}\\b"))
rental_4 %>%
filter(floor != "50") %>%
filter(!is.na(floor)) %>%
filter(floor %in% 1:6) %>%
ggplot(aes(x=price,
y=living_space,
colour = postcode)) +
geom_point() +
facet_wrap(vars(floor)) +
labs(title = "Price of flats over living space",
subtitle = "Separated by floor using facets and postcodes using colors",
x = "Price in CHF",
y = "Living space in m2")+
theme_minimal()
The tendency of prices increasing as the surface increases can be observed for all floors.
Integrate the most expensive and least expensive mean postcode/floor combo in your text with inline code.
To obtain the least and most expensive mean per postcode-floor combo, I have to group my data by the combo and use max() and min().
rental_min <- rental_4 %>%
filter(floor != -1) %>%
group_by(floor, postcode) %>%
summarise(mean_combo = mean(price)) %>%
ungroup() %>%
slice_min(order_by = mean_combo, with_ties = TRUE) %>%
select(floor, postcode, mean_combo)
kable(rental_min)
| floor | postcode | mean_combo |
|---|---|---|
| 2 | 1219 | 1245 |
min_mean_f <- rental_min %>%
pull(floor) %>%
.[1]
min_mean_p1 <- rental_min %>%
pull(postcode) %>%
.[1]
min_mean_p2 <- rental_min %>%
pull(postcode) %>%
.[2]
min_mean_p <- rental_min %>%
pull(mean_combo) %>%
.[1]
The floor/postcode combo with the least expensive mean of 1245 are floor: 2 and postcode 1219 and floor 2 and postcode NA. I decided to exclude the underground, because they are either hobby rooms or cellar compartments, therefore it is logical that the price is lower and these spaces are not comparable to living spaces in apartments.
Next; I repeat the operation to obtain the maximum mean.
rental_max <- rental_4 %>%
group_by(floor, postcode) %>%
summarise(mean_combo = mean(price)) %>%
ungroup() %>%
slice_max(order_by = mean_combo, with_ties = TRUE) %>%
select(floor, postcode, mean_combo)
kable(rental_max)
| floor | postcode | mean_combo |
|---|---|---|
| 2 | 1292 | 12854 |
max_mean_f <- rental_max %>%
pull(floor)
max_mean_p1 <- rental_max %>%
pull(postcode)
max_mean_p <- rental_max %>%
pull(mean_combo)
The floor/postcode combo with the most expensive mean of 12854 are floor: 2 and postcode 1292.
Can you see any trends for listings with addresses only available on demand? Are they more expensive or less? Bigger living space? Higher floor?
I am interested in observing whether or not there is a trend for listings addresses only available on demand regarding their living space, price or floors. At the moment my tibble doesn’t distinguish the listings which are on demand or not, and the information is in in the “location” column written as “on request” or “sur demande”.
As a first step I need to create a column “status” with the following binary values: “available” / “on request”. I will use mutate(), case_when() and str_detect() to do identify the locations which include the key words “request” and “demande”.
rental5 <- rental_4 %>%
mutate(status = case_when(
str_detect(location, regex("request|demande")) ~ "on request",
TRUE ~ "available"
)) %>%
select(location, status, postcode, everything())
Now that I have my tibble as “rental5” with the “status” column I can use it to create the different boxplots and violin plots.
This boxplot shows the relationship between the availability status and the living space.
ggplot(rental5, mapping = aes(x = status,
y = living_space)) +
geom_boxplot(width = 0.4, fill = "purple", alpha = 0.5) +
geom_violin(fill = "skyblue", alpha = 0.5)+
theme_minimal() +
labs(title = "Surface of the listings by availability status",
x = "Status",
y = " Surface (m2)")
The living space median for the “on request” group is slightly above 150 m2 whereas, the living space median for the “available” group is slightly below 100 m2. The majority of observations for the “on request” group corresponds to a higher living space than the “available” group.
The following box plot shows the relationship between the availability status and the price.
ggplot(rental5, mapping = aes(x = status,
y = price)) +
geom_boxplot(fill = "orange", alpha = 0.5)+
geom_violin(width = 0.7, fill = "yellow", alpha = 0.5)+
theme_minimal()+
labs(title = "Price of the listings by availability status",
x = "Status",
y= "Price (CHF)")
The distribution of observations for the “on request” locations match more expensive prices than the “available” ones. The median from the “on request” group is close to 5000 CHF whereas, the median for the “available” locations is around 3000 CHF. Most of the observations for the “on request” group start in the last quartile from the “available” group.
The following violin plot shows the relationship between the availability status and the number of floors.
rental5 %>%
filter(floor!=50) %>%
ggplot(aes(x = status,
y = floor)) +
geom_violin(fill = "skyblue", alpha = 0.5)+
geom_boxplot(width = 0.6, fill = "red", alpha = 0.5)+
theme_minimal()+
labs(title = "Number of floors of the listings by availability status",
x = "Status",
y = "Number of floors")
The violin plot shows that most observations for the “available” group lie between 0 and 4 floors and then progressively decrease the higher the floors, whereas most observations from the on request group indicate a strong presence for floors 1,2,3. The median for both groups is very similar, and is apparently floor 3.
Make a table summarising group size, median, average, standard-deviation, minimum and maximum of the variable price per square-meter (expressed in CHF/𝑚2), by address status (on request or available).
As a first step I need to add a column for the price per square-meter which will be computed from “price” and “living_space”.
rental6 <- rental5 %>%
mutate(price_m2 = price/living_space)
Next, I will do a descriptive analysis (group size, median, average, standard deviation, minimum and maximum) of the price/m2 by availability status. I will start by dropping the NAs, as they only represented 2% of the “price” column. I use group_by() to separate observations by availability status since I want to see if the status impacts the price/m2. I can then use summarise to create a new tibble with the descriptive statistics.
rental6 %>%
drop_na(price_m2) %>%
group_by(status) %>%
summarise(group_size = n(),
median = median(price_m2),
average = mean(price_m2),
std_dev = sd(price_m2),
min = min(price_m2),
max = max(price_m2)) %>%
ungroup()
The group size of the “on request” group is smaller than the “available” group by more than 50%. Both medians are very similar, whereas the average of price/m2 is higher by 6 point for the “available” group. The min values are higher for the “on request” group by 9 points. The max values however vary disproportionately, as it is the case for the standard deviation. My understanding is that there is a very extreme value for “price/m2” coming from the “price” and “living_space” variables. This value influences the standard deviation and max value, since it is very distant from the mean and median. I had identified when exploring the data that there was a surface of 1m2, which seemed like an anomaly and could be the cause of this wide difference in the max value and standard deviation.
Hypothesis: Locations with the “on demand” status have on average a higher price/m2 than the locations with the “available” status.
Null Hypothesis: Locations with the “on demand” status have on average an equal price/m2 with the locations with the “available” status
After excluding the “1m2” value and any NAs, I will compute a t-test to confirm or exclude the Null Hypothesis with the {infer} package and the t_test() function.
rental7 <- rental6 %>%
drop_na(price_m2) %>%
filter(living_space !=1)
rental7 %>%
t_test(price_m2 ~ status,
order = c("available", "on request"))
The p_value 0.6268068 is higher than 0.05 which is the threshold from which it is very unlikely (95% interval confidence) that the Null Hypothesis is observed. With this result I can’t exclude the Null Hypothesis. The t-test is not statistically significant. We can’t reject the null hypothesis: “Locations with the”on demand” status have on average an equal price/m2 with the locations with the “available” status”.
Do the same as in part 6, but now using the variable price (expressed in CHF).
Compare the results of the analysis carried on the price per square-meter with those for the price variables. Are there differences between the results? Write down your interpretation and a general conclusion.
In this part I want to explore the relationship between the status availability and the price. I start by creating a new tibble with the descriptive statistics, grouped by status.
rental6 %>%
drop_na(price) %>%
group_by(status) %>%
summarise(group_size = n(),
median = median(price),
average = mean(price),
std_dev = sd(price),
min = min(price),
max = max(price)) %>%
ungroup()
Here the median and average from the “on request” group are higher by more than 1000. These results differ from the values for the price/m2 where they were very close. In this case the standard deviation is higher for the “on request” group by 249 points. Which means there is more disparity in prices for the “on request” group. The min values are drastically different: 30 for the “available” group and 1180 for the “on request” group.
The Hypothesis could be that locations “on request” are on average more expensive than locations “available”. And the Null Hypothesis could be that regardless of their availability status, the locations have on average the same price.
rental6 %>%
t_test(price ~ status,
order = c("available", "on request"))
The p-value is in this case statistically significant since it’s lower than 0.05, which means that the Null Hypothesis is very unlikely to happen and can be rejected. This is a difference from the previous t-test with the price/m2. And I am not sure how to interpret this difference or explain its cause. One might think that the variables in price/m2 - less extreme values - have the effect of reducing the difference between the “on demand” and “available” group.
Convert a subset of 30 addresses to latitude and longitude using an API or {tidygeocoder} and plot them on a map using {ggmap} (or {leaflet} if you cannot get {ggmap} to work).
I need a few packages before realizing this operation {tidygeocoder} and {ggmap}. I also need an API key to have access to the maps, in this case the map from Geneva because that’s where the addresses from the data set are. I create a variable for Geneva, with its latitudes and longitudes thanks to the BoundingBox website. Then I can download it with get_stadiamap().
I need to create a new column in my tibble with the coordinates from the locations so they can be shown on the Geneva map I just downloaded. To do that, I can use the geocode() function from {tidygeocoder} (and not from {ggmap}!). This function automatically adds a longitude and latitude in my tibble for each location. I need the addresses so I need them to be in the “available” status and I also just want a sample of 30 rather than the total of observations. This new tibble is called “rental8”
rental8 <- rental6 %>%
filter(status == "available") %>%
sample_n(30) %>%
tidygeocoder::geocode(address = location)
The final step is to create the map with the data from “rental8” thanks to ggmap(), the map of Geneva and geom_point().
ggmap(geneva_map) +
geom_point(data=rental8,
aes(x = long, y = lat), alpha =0.5) +
labs(title = "Rental locations",
x = "Longitude",
y = "Latitude")
I also created a leaflet alternative.
library(leaflet)
rental8 %>%
sample_n(30) %>%
leaflet() %>%
addTiles() %>%
addMarkers(lng=~long,lat=~lat,
label=~as.character(price),
popup=~location)
Comment
There are 11 unique values in the “object_type” column, and all are characters. 71% of them are “Apartment” types.