13 Aug 2018
In my last blog, we went through the process of Data Analysis with Power BI. In summary, these are the four steps:
In this blog, I am going to focus on Data Cleaning - an important process in Data Analytics. Without proper data cleaning, you will not be able to proceed with generating useful insights and visualisation with Pivot Tables or Power BI. Again, we will demonstrate the process step by step, feel free to follow through the steps and try it yourselves.
Before we start, What are we looking to achieve in the data cleaning process?
The objective of data cleaning is to arrange the data collected to “Tidy Data” - to arrange data in a way that each variable is a column and each observation is a row. The idea is thoroughly explained in Hadley Wickham’s “Tidy Data”.
In this exercise, we will be using data drawn from Hong Kong Census on Transportation Means. We will be using Unpivot, a feature available in both Power Query and Power BI. Note that these applications are Windows exclusive, so if you are working with Mac, you may still install Windows with Bootcamp or Parallel and use these tools.
We will be working with Hong Kong By-census on Transportation Means that tracks means of transport for each district. Search for the dataset "Working Population with Fixed Place of Work in Hong Kong by District Council District, Place of Work, Year and Main Mode of Transport to Place of Work".
The dataset is very extensive and detailed. As it is not following the tidy data principle, it is hard to generate any good insights. You can see variables are defined in both columns and rows. We are going to correct that in this exercise.
In PowerBI click on Get Data -> Excel, then select the file you just downloaded. Select the Result table on the left by ticking the box. Then click on Edit. Now we need to re-organize the file and rename the column accordingly. First Column will be for the district. We need to fill down the district name so that it's available for each row. Using Power Query it's very simple, just go on the transform tab after you select the column, then click on Fill -> Down, this will copy the data across (See Below Screenshot).
As you can see this is very easy to use and can save you a lot of time. You can then rename column 3 as Work Location, remove the top 4 rows and clear the unnecessary rows. Once you are done you should have two columns with the district names and work location. The other columns are the transport types with the observations The table should now look like this:
Now it's time to go with the Unpivot. Select the all the columns except the first two. Then go on Transform and select the unpivot option. This stores all the observations as rows. At this stage, the data is now properly cleaned, and can be used for further analysis:
Using the formatted data we have now, we can now use Power BI to perform exploratory analysis:
We can already see MTR is the most used form of transport, followed by bus, walking coming on third. The top right indicate the breakdown in terms of population. If you look at the bottom part, I used an interactive slicer where you can see the percentage of the population and where they live (same district or other district in the different regions).
We can see that people living in Central and Western for example tend to work in the same district, considering how many companies present there it is not surprising. Only Tuen Mun, Yau Tsim Mong and Kwun Tong have the same profile. We customize the bar chart to highlight people working in the same district so that it is easier to visualize.
Next, we tried to visualize how district rank by mode of transport. To do that, we created multiple bar chart and then use visual filter for each type of transport. Here, there are no surprising results. Other than Yau Tsim Mong and Central/Western are walking to get to work, which confirm the observation earlier that they tend to live and work in the same district. Central/Western and Wanchai are also the biggest users of Taxi.
Now, these are not ground breaking discoveries. It was just an exercise to see how simply you can go from a dataset that Is hard to read to clean data that is visualized and interactive. This information could be used for companies wanting to target specific demographics, with that they can see where they could put advertising (in MTR, Bus etc…). Using the Census data you could also combined other information like the age breakdown, industry they work in to make your profiling more detailed.
Another point to make, people will often say you need big data to perform better analysis. However we demonstrated here that a small data set (provided the information collected is good) can give us some good insights.
Do you want to learn how to harness data like this in your career? Checkout our upcoming Data Science & Machine Learning Immersive Course
13 Aug 2018