11 May 2018
Data Analytics seems like rocket science at first glance, especially when confused with Data Science. However, it can be simpler than you think, even without knowledge in coding or advanced Excel. In this blog, we are going to show you how by demonstrating a real-life example.In this demonstration, we will be doing data processing on Hong Kong Real Estate Market. We will go through 4 steps of data analytics - collect, clean, find insights and visualise the data. We will be using tools such as Webscraper, Power Query as Add-in for Excel and Microsoft Power BI. which are all very easy-to-use tools for data analytics. We will leave the specific tools walkthrough in later blog posts, and focus on the full process and how it looks like. Feel free to try along the steps!
We will first need to collect the data we need to work on. We will be using a Google Chrome add-on called Webscraper to collect our data from the websites we desire. Mind that not all websites are deemed to be “easy to scrape” - generally websites with data mapped out without embedding would be the easiest websites to scrape.
With Webscraper, you can run your scrapping process in the background with decent speed. From the website, I managed to obtain a total of 13k real estate ads. You can see the scraped data snippet below:
In this case, the data shows floor types, rent, size, price and so on; the data collected from the website is quite clean, but would still need further clean-up before we can analyse it.
Excel & Power Query are both very useful tools for that purpose. In this exercise, we are going to use Power Query to clean our data. Power Query is available as an Add-in in Excel 2010 and later version, and is also part of Power BI Desktop. It has a very easy-to-use user interface, and can enable you to do ETL (Extract Transform Load) process on data within clicks, even without advanced knowledge in Excel. We will tell you more in-depth how this all works in a separate post, and of course in our classes.
With Power Query, I managed to extract information such as floor types, the direction the flat is facing, and some quick descriptions on the offer. I also categorised my data by districts, using multiple queries. From this point on, any data added will fall into categories automatically. Here is a snapshot of the cleaned data below:
Here are some of the quick insights generated by Power BI initially:
At this stage, you will be able to get some insights from the data already. You will also have a brief idea of which analysis you would like to include in the dashboard.
For the final step, we will create an interactive dashboard with Power BI. I included the following in the dashboard:
One of the advantages of using Power BI to build your dashboard is that it is interactive. You can click on any information you need and see the respective data.
I also included a Hexbin plot, a new form of visualisation available in Power BI, where you can read the density of data with different colors of hexagon, in relation of the price/size in this dashboard. However, the current data we have is not enough for the hexbin to show insights. The following R code produced Hexbin plot give you a sense of basic visualization understanding of interpretation of the plot.
The last thing I included in the dashboard is a wordcloud. From there, you can see the trending words that are used to describe the properties in the district. Within the Cheung Sha Wan/Sham Shui Po district, the trending words are Decoration, Convenient, Quiet… since the size of words represent the word count frequency in the corpus.
Once the dashboard is completed, you can get a better picture of the real estate market in one interface: you can learn the price, supply, characteristics, and quality of flats of different districts in one go. With these 4 steps, you can easily apply to any areas of interests you would like to visualise and research on. While Excel can serve similar purpose, the end result of Power BI dashboards are far more interactive.
The next steps would be enriching your data, to consolidate more visualisation and show a more complete picture. For example, in this dashboard, the hexbin is useful in showing size/price split, but lacks data points to serve its purpose. You can also collect more types of qualitative information, such as pollution, traffic and entertainment - data that helps people to draw more in-depth insights and decisions. We’ll go deeper into this on the next post, stay tuned!
11 May 2018