While Toronto is a beautiful place to live, no one would claim it is easy to find a rental there. To help current and prospective residents better understand the rental options available to them, we created a database containing the most recent rental listings as well as other indicators of what it might be like to live at each address.
The following outlines the process for extracting, transforming, and loading data into the database.
Two sources were used to extract rental listing data: Craigslist and Kijiji.
Craigslist provides a well-structured URL for querying rental posting data by area (e.g., Toronto) and site (e.g., Downtown Toronto). We used the `craigslist` Python module to collect the direct URLs of all posts in the Toronto area, then scraped each post using CSS selectors in Selenium. See code.
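The selector-based extraction step can be sketched offline. The snippet below applies CSS selectors to a trimmed, made-up post page using BeautifulSoup's `select_one`, which accepts the same selector syntax Selenium does; the class names and IDs here are illustrative, not the project's actual selectors.

```python
from bs4 import BeautifulSoup

# A trimmed Craigslist-style post (markup and class names are illustrative only).
post_html = """
<section class="body">
  <span class="price">$2,100</span>
  <h1 id="titletextonly">Bright 1BR near High Park</h1>
  <div id="map" data-latitude="43.6532" data-longitude="-79.3832"></div>
</section>
"""

soup = BeautifulSoup(post_html, "html.parser")
listing = {
    # CSS selectors pull out one field each, just as the Selenium scraper did
    "price": soup.select_one("span.price").get_text(strip=True),
    "title": soup.select_one("#titletextonly").get_text(strip=True),
    "lat": float(soup.select_one("#map")["data-latitude"]),
    "lng": float(soup.select_one("#map")["data-longitude"]),
}
print(listing)
```

In the real pipeline the HTML comes from Selenium-rendered pages rather than an inline string.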
Kijiji provides rental information that includes apartment features such as apartment location and cost. Rentals fall into three main categories: long-term rentals, short-term rentals, and rooms/roommates. The Kijiji website is not dynamic and is quite easy to scrape; a simple URL request combined with the BeautifulSoup library was used for the scraping.
See Code.
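The first step, harvesting listing URLs from a search-results page, can be sketched as follows. The markup and class names are illustrative, not Kijiji's actual page structure; in the real pipeline the HTML would come from a URL request (e.g. `urllib.request.urlopen`) rather than an inline string.

```python
from bs4 import BeautifulSoup

BASE = "https://www.kijiji.ca"

# A trimmed stand-in for a Kijiji search-results page (markup is illustrative only).
results_html = """
<div class="search-results">
  <a class="title" href="/v-apartments-condos/listing-1">Bright 1BR</a>
  <a class="title" href="/v-apartments-condos/listing-2">2BR near subway</a>
</div>
"""

soup = BeautifulSoup(results_html, "html.parser")
# Collect absolute listing URLs; each is later fetched and parsed for features.
urls = [BASE + a["href"] for a in soup.select("a.title")]
print(urls)
```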
Two sources of data were extracted from Toronto Police Services: major crime indicators and community services.
Toronto Police Services (TPS) provides a catalogue of data sources that includes the incidence of crime. Crimes are tracked in a number of categories and made available in a consolidated dataset, Major Crime Indicators (MCI). TPS provides an API for easy access to the MCI; we used this API and merged the results with homicide data available on the TPS website. See code.
The TPS data catalogue includes tables identifying organizations that provide services to Toronto communities. Unfortunately, this data source is not accessible via an API, so web scraping was used to extract the community asset data.
The website is dynamic, meaning it must be interacted with before the data of interest is fetched from the back-end server and displayed on the webpage.
The Selenium and BeautifulSoup libraries were used to interact with the page and extract the data table rows. See Code.
The webpage provides access to different tables via a drop-down menu; Selenium's `Select` helper was used to move between tables.
The static HTML contains only a subset of the table rows at any given time. To access all rows, the code iteratively scrolls through each table and scrapes the static HTML. Scrolling was achieved by identifying the last row in the static HTML and clicking it via Selenium's click() method; this triggered the webpage's JavaScript to treat that row as the new top row of the table, updating the static HTML.
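The scroll-and-scrape loop can be sketched framework-agnostically. Below, the Selenium driver is replaced by two injected callables (names are hypothetical) so the control flow is clear and the sketch runs without a browser; in the real code, `get_visible_rows` would wrap `driver.find_elements(...)` and `click_last_row` would call `.click()` on the last row element.

```python
def scrape_all_rows(get_visible_rows, click_last_row):
    """Collect every row of a table that only renders a window of rows.

    get_visible_rows -- returns the rows currently present in the static HTML
    click_last_row   -- clicks the last visible row, so the page's JavaScript
                        scrolls it to the top and renders the next rows
    """
    seen, ordered = set(), []
    while True:
        new = [r for r in get_visible_rows() if r not in seen]
        if not new:          # no new rows appeared: we've reached the end
            break
        seen.update(new)
        ordered.extend(new)
        click_last_row()
    return ordered

# Simulate a table of 10 rows rendered 4 at a time.
data = [f"row{i}" for i in range(10)]
pos = [0]
visible = lambda: data[pos[0]:pos[0] + 4]
def click():
    pos[0] = min(pos[0] + 3, len(data) - 1)

rows = scrape_all_rows(visible, click)
print(len(rows))  # 10
```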
A CSV file containing individual 2015 tax statistics by forward sortation area (FSA) was downloaded from the Government of Canada website. An FSA is the first three characters of a postal code. The Pandas library was used to read the file into a dataframe for transformation. See Code.
One of the challenges with the Craigslist data is that posts have no fixed structure: users can describe rental features however they please. We made extensive use of regular expressions to extract meaningful attributes such as square footage, housing type, and the number of bedrooms and bathrooms.
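The regex-based extraction can be sketched as below. The patterns shown are illustrative; the real pipeline used a much larger set of expressions to cover the many ways posters write these attributes.

```python
import re

def parse_features(text):
    """Pull structured attributes out of a free-form rental post body."""
    t = text.lower()
    feats = {}
    # "2BR", "2 bed", "2 bedrooms", ...
    m = re.search(r"(\d+)\s*(?:br|bed(?:room)?s?)\b", t)
    if m:
        feats["bedrooms"] = int(m.group(1))
    # "1.5 bath", "2 bathrooms", ...
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:ba|bath(?:room)?s?)\b", t)
    if m:
        feats["bathrooms"] = float(m.group(1))
    # "850 sqft", "850 sq. ft", "850 square feet", ...
    m = re.search(r"(\d{3,5})\s*(?:sq\.?\s*ft|sqft|square feet)", t)
    if m:
        feats["sqft"] = int(m.group(1))
    return feats

print(parse_features("Sunny 2BR / 1.5 bath condo, 850 sqft, steps to TTC"))
# {'bedrooms': 2, 'bathrooms': 1.5, 'sqft': 850}
```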
Beyond the regular cleaning process, this data had some null values for location. Because location is needed to connect with the crime data, we retrieved latitude/longitude coordinates from the address using the Google Geocoding API. Similarly, wherever a geotag (latitude/longitude) was present but no address, we retrieved the postal code using Google's reverse geocoding API. The postal code was later converted to an FSA (the first three characters of the postal code) and used to connect to the community and income data. See code.
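The reverse-geocode-then-FSA step can be sketched as below. The lookup function is a sketch against the Google Geocoding API's JSON response format and needs a valid API key and network access (both assumptions here, so it is not executed); the FSA helper is trivial.

```python
import json
import urllib.parse
import urllib.request

def reverse_geocode_postal(lat, lng, api_key):
    """Look up a postal code from a lat/lng pair via the Google Geocoding API.

    Sketch only: requires a real API key and network access, and assumes at
    least one result carries a postal_code address component.
    """
    qs = urllib.parse.urlencode({"latlng": f"{lat},{lng}", "key": api_key})
    url = "https://maps.googleapis.com/maps/api/geocode/json?" + qs
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    for result in payload.get("results", []):
        for comp in result["address_components"]:
            if "postal_code" in comp["types"]:
                return comp["short_name"]
    return None

def to_fsa(postal_code):
    """The FSA is simply the first three characters of a Canadian postal code."""
    return postal_code.replace(" ", "").upper()[:3] if postal_code else None

print(to_fsa("m5v 2t6"))  # M5V
```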
A list of URLs was created by scraping the main rentals webpage. Using these URLs, we extracted apartment features from each listing page, creating a DataFrame for the rental_kijiji_table.
Further transformations included removing extra columns; extracting columns such as FSA and postal code from the address; removing rows without postal information; and updating headers to follow SQL naming best practices and allow merging with the Craigslist data.
Beyond the regular cleaning, timestamps were converted to ISO format and only the last year of crime data was kept, so that it connects logically with the rental data. Because the crime data has two sources, the API and the homicide CSV file, both were transformed into a single common template. See code.
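The unification of the two sources into one template, plus the last-year filter, can be sketched with pandas on toy records. The column names below are illustrative, not the actual TPS field names.

```python
import pandas as pd

# Toy stand-ins for the two sources (column names and values are illustrative).
mci = pd.DataFrame({
    "occurrencedate": ["2019-06-01T12:00:00", "2017-01-15T08:30:00"],
    "MCI": ["Assault", "Robbery"],
    "Lat": [43.65, 43.70], "Long": [-79.38, -79.40],
})
homicide = pd.DataFrame({
    "Occurrence_Date": ["2019-03-10"],
    "Lat": [43.68], "Long": [-79.42],
})

# Map both sources onto one template: crime_type, occurrence (ISO), lat, lng.
a = pd.DataFrame({
    "crime_type": mci["MCI"],
    "occurrence": pd.to_datetime(mci["occurrencedate"]),
    "lat": mci["Lat"], "lng": mci["Long"],
})
b = pd.DataFrame({
    "crime_type": "Homicide",
    "occurrence": pd.to_datetime(homicide["Occurrence_Date"]),
    "lat": homicide["Lat"], "lng": homicide["Long"],
})
crimes = pd.concat([a, b], ignore_index=True)

# Keep only the last year of data (relative to the newest record here).
cutoff = crimes["occurrence"].max() - pd.DateOffset(years=1)
recent = crimes[crimes["occurrence"] >= cutoff]
print(len(recent))  # 2
```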
Each data table was transformed into a dataframe. Once duplicates were removed, the individual dataframes were incorporated into a master dataframe.
Further transformations then included removing rows without postal information; normalizing how fees for services are described; and updating headers to reflect best practices for SQL databases.
Rows for FSAs outside the City of Toronto (Toronto FSAs begin with 'M') were removed from the dataframe. Average income was calculated per FSA by dividing "Total Income" by "Total" (population).
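Both steps can be sketched with pandas. The "Total" and "Total Income" column names come from the text above; the rows and values are made up.

```python
import pandas as pd

# Toy rows shaped like the tax-statistics table (values are made up).
income = pd.DataFrame({
    "FSA": ["M5V", "M4C", "L4B"],
    "Total": [20000, 15000, 12000],                       # population of filers
    "Total Income": [1_200_000_000, 600_000_000, 480_000_000],
})

# Keep only City of Toronto FSAs (those beginning with 'M').
toronto = income[income["FSA"].str.startswith("M")].copy()

# Average income per FSA = Total Income / Total (population).
toronto["avg_income"] = toronto["Total Income"] / toronto["Total"]
print(toronto[["FSA", "avg_income"]].to_string(index=False))
```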
This was a challenging task: the crime data has no postal code to connect it with the rental data, but it does have geocodes. Since geocodes alone cannot be joined directly, we built logic around them and created a bridge table that connects each rental posting to every crime event that occurred within a 3 km radius in the last year. Distances between two locations are measured with the haversine formula. The result is a bridge table of 1.4 million rows. Although this method has advantages over static postal codes (for instance, a user can examine crime data at any granularity within 3 km), we plan to migrate to postal-code logic to keep the table from growing.
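The bridge-table construction can be sketched as a haversine-filtered cross join. The function and tuple shapes below are illustrative; at 1.4 million rows, a real pipeline would pre-filter candidate pairs (e.g. by bounding box) rather than comparing every pair.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat/lng points, in kilometres."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

def build_bridge(rentals, crimes, radius_km=3.0):
    """Pair each rental with every crime within radius_km.

    rentals and crimes are (id, lat, lng) tuples.
    """
    return [
        (r_id, c_id)
        for r_id, r_lat, r_lng in rentals
        for c_id, c_lat, c_lng in crimes
        if haversine_km(r_lat, r_lng, c_lat, c_lng) <= radius_km
    ]

rentals = [("rental_1", 43.6532, -79.3832)]   # downtown Toronto
crimes = [("crime_a", 43.6629, -79.3957),     # roughly 1.5 km away
          ("crime_b", 43.7615, -79.4111)]     # roughly 12 km away
print(build_bridge(rentals, crimes))  # [('rental_1', 'crime_a')]
```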
Rental data connects with community data through FSA. To handle this many-to-many relationship, we used the income table, whose unique FSA codes connect to both tables.
We used a PostgreSQL database to load the data with this schema, and SQLAlchemy to load the data into the corresponding tables. See Code here. All tables have the desired foreign-key relationships, as shown in the ERD. All the final data tables are here, except those uploaded to Google Drive because of space limitations.
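The load step follows the standard `DataFrame.to_sql` pattern. The sketch below swaps the project's PostgreSQL/SQLAlchemy engine for an in-memory SQLite connection so it is self-contained; the table and column names are illustrative.

```python
import sqlite3
import pandas as pd

rentals = pd.DataFrame({
    "rental_id": [1, 2],
    "fsa": ["M5V", "M4C"],
    "price": [2100, 1750],
})

# The project loaded into PostgreSQL via a SQLAlchemy engine, e.g.
#   engine = create_engine("postgresql://user:pass@host/dbname")
# An in-memory SQLite connection keeps this sketch runnable anywhere.
con = sqlite3.connect(":memory:")
rentals.to_sql("rental_craigslist", con, index=False, if_exists="replace")

loaded = pd.read_sql("SELECT COUNT(*) AS n FROM rental_craigslist", con)
print(loaded["n"].iloc[0])  # 2
```

Foreign-key constraints would be declared in the schema DDL before loading, since `to_sql` alone does not create them.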
Code base here. We built an application with the following features:
- A front page explaining the project details
- Rental data API: returns rental listings up to a user-provided count (default 100)
- Community services API: returns community services up to a user-provided count (default 100)
- Crime data API: lists all crimes of a given crime type (default: Assault)
- Income data API: lists the average income for all FSAs in Toronto
- Scan for other available sources of geographic data that could enrich the database as indicators of what it is like to live at each rental location.
- Improve the application we have built to incorporate additional features.