on August 12, 2020
Extracting chunks of data from web pages is a job in its own right, one that can be very resource- and time-intensive. At least that’s what newbie developers and data scientists go through when they manually comb websites to maintain hoards of Excel sheets. Because manual collection is error-prone, it also turns out to be a counterproductive first step when building any Machine Learning model.
Enter Web Scraping - an automated technique that can extract massive amounts of data from websites in a fraction of the time.
Feels like déjà vu? You might have used (or heard of) APIs to do the same. In fact, APIs remain the best way to access structured website data. But keep in mind that not all websites provide them. In such cases, Web Scraping comes to our rescue as a powerful alternative strategy!
Web Scraping with Python can be used for a multitude of reasons. Top use cases of the practice include:
The list goes on!
Whether web scraping is allowed is arguably the most common question that novice web scrapers come across. The short answer: it depends, largely on the website that you are scraping and the reason for the activity.
To better understand if a website allows scraping, your best bet is to read its robots.txt file (and cross-check it against the site’s terms and conditions to better understand your chances). This file is created by the webmaster and instructs search engine bots about how to crawl pages on the website. This includes indicators for whether specific web-crawling software is allowed or disallowed to crawl any section of the website.
To access the robots.txt file, simply append /robots.txt to the root URL of your target website, for example: www.example.com/robots.txt
Here is what you need to find (and steer clear of) in such a file:
Full access: a ‘User-agent: *’ line followed by an empty ‘Disallow:’ line states that all pages are crawlable by bots.
No access: a ‘User-agent: *’ line followed by ‘Disallow: /’ states that no part of the website can be accessed by an automated web crawler.
Partial access: ‘Disallow:’ and ‘Allow:’ lines that name specific paths declare which sections or files on the site are accessible and which are not.
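As a rough illustration, these three cases can be checked programmatically with Python's standard-library urllib.robotparser module. This is only a sketch: the robots.txt contents, the /private/ path, and the helper name is_allowed are hypothetical and chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for the three cases described above.
FULL_ACCESS = "User-agent: *\nDisallow:\n"
NO_ACCESS = "User-agent: *\nDisallow: /\n"
PARTIAL_ACCESS = "User-agent: *\nDisallow: /private/\n"


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://www.example.com/products"
private_page = "https://www.example.com/private/report"

print(is_allowed(FULL_ACCESS, page))             # every page is crawlable
print(is_allowed(NO_ACCESS, page))               # nothing is crawlable
print(is_allowed(PARTIAL_ACCESS, page))          # outside the blocked section
print(is_allowed(PARTIAL_ACCESS, private_page))  # inside the blocked section
```

To check a live site instead of a string, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.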
Packed with a large collection of libraries and an easy-to-understand syntax, scraping with Python can be intuitive and fun, especially for beginners. Moreover, libraries such as NumPy, Matplotlib, and Pandas open the gates for further manipulation of the extracted data.
So let’s cut to the chase with the basics!
We will be using the Google Chrome browser and the Ubuntu operating system for this tutorial. If you are on a different OS, that’s perfectly fine. Python is a high-level, cross-platform language, which means that code written on one OS can generally run on another without any hiccups.
Just head on to one of these links (based on your operating system) to download and install Python:
If you don’t have Python installed on your Ubuntu system, simply head on to the terminal and type in this command:
Next, we need to install pip, the Python package management system, in order to install and manage software packages that are written in Python.
Additionally, we will be using the following libraries in the tutorial. Here are some details of why we need them and how they can be installed through the terminal:
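Before moving on, it can help to confirm that the libraries are actually importable. The sketch below assumes the tutorial relies on Selenium, Beautiful Soup, and Pandas (inferred from the later steps, which use a webdriver, a soup object, and a data frame); the helper name missing_packages is hypothetical.

```python
import importlib.util

# Assumed package set for this tutorial: import name -> pip package name.
REQUIRED = {
    "selenium": "selenium",
    "bs4": "beautifulsoup4",
    "pandas": "pandas",
}


def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    # Suggest the install command rather than running it automatically.
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```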
Next, it is important to set the path to chromedriver. This will configure the webdriver to use the Chrome browser.
The best way to do this is to use the Webdriver Manager for Python with two simple steps:
Once you are set, follow this step-by-step process to begin scraping with Python:
Begin the scraping process by identifying the URL or website that you aim to extract the data from. This can be any URL on the Internet, as long as it does not violate any organizational policy against data security or web scraping.
For this tutorial and illustration purposes, we are going to scrape the URL: example.com/product-link
You can follow the steps discussed here for any URL of your choice.
The next step is to inspect the data on the web page and devise the approach to scraping accordingly. For instance, if the data on the web page is nested in tags (which is the most common scenario), you would need to pinpoint these tags.
To do this, simply right click on the web page and click on ‘Inspect’.
Notice that the <div> tag has multiple nested tags, each representing a set of data.
Select the type of data that you want to extract from the web page. For instance, if you are gathering details of products on an e-commerce website, you may want to extract product details such as Price, Name, and Rating.
And now, we code!
Begin by creating a Python file by opening the terminal and typing gedit
Before we go any further and begin writing code in this file, we need to import the requisite libraries first. Simply follow these commands:
If you’re using Chrome, run this:
For other browsers, you can refer to this link.
The last command opens the URL and lets it load completely before returning control to the script.
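A minimal sketch of this setup for Chrome might look as follows. It assumes Selenium 4 (which takes the driver path through a Service object) and the webdriver-manager package; the function name open_page is hypothetical, and the imports sit inside the function only so that the file can be loaded on a machine without Selenium.

```python
def open_page(url):
    """Launch Chrome through Selenium and return the driver once url has loaded."""
    # Imported here so this file can be imported without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # Webdriver Manager downloads a matching chromedriver and returns its path.
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)

    # get() waits for the page to finish loading before returning control.
    driver.get(url)
    return driver


# Usage (opens a real browser window, so it is left commented out):
# driver = open_page("https://example.com/product-link")
# content = driver.page_source
# driver.quit()
```

Calling driver.quit() when the scrape is finished closes the browser and releases the chromedriver process.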
Next, we need to open the target URL and specify the data lists that need to be created while scraping the web page.
We will be creating two lists, one for the product names and another for their prices.
content creates an instance, representing the content of the entire page source.
soup lets you navigate and search through the HTML code for the data that is to be scraped.
In our scenario, suppose that the web page that we are scraping is an e-commerce product page where the following tags are being used for our target data sets:
For every data set, we need to first find the tags that have the respective class names and extract the target data before storing it in variables. The following code is used to execute these steps.
Here, ‘s-iteminfo’, ‘s-productname’, and ‘s-product_price’ refer to the respective classes under which our target tags (data location) are nested.
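The extraction loop can be sketched roughly as below. The three class names come from the text above; everything else is an assumption for illustration: the tags are taken to be div elements, the HTML string stands in for the page source returned by the webdriver, and the product names and prices are placeholders.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source. The tag names are assumptions;
# only the class names are taken from the tutorial text.
content = """
<div class="s-iteminfo">
  <div class="s-productname">Placeholder Product A</div>
  <div class="s-product_price">10.00</div>
</div>
<div class="s-iteminfo">
  <div class="s-productname">Placeholder Product B</div>
  <div class="s-product_price">20.00</div>
</div>
"""

products = []  # product names
prices = []    # product prices

# html.parser is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(content, "html.parser")

for item in soup.find_all("div", attrs={"class": "s-iteminfo"}):
    name = item.find("div", attrs={"class": "s-productname"})
    price = item.find("div", attrs={"class": "s-product_price"})
    # Skip containers missing either field instead of raising AttributeError.
    if name is None or price is None:
        continue
    products.append(name.get_text(strip=True))
    prices.append(price.get_text(strip=True))

print(products)
print(prices)
```

The None check is a defensive addition: real pages often contain containers with a missing name or price, and calling .text on a failed find() would otherwise stop the script.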
The next step is to run the code with the following command:
Running the script returns our target data sets. It is now time to organize and store them for better readability and retrieval. This is where the Pandas data frame comes into the picture.
The following code does the trick.
‘pd.DataFrame’ creates a two-dimensional tabular data set that contains labeled axes (rows and columns). This essentially means that the extracted data is converted into a table with two columns - Product Name and Price.
The resulting Pandas data frame is then written to a .csv file with specifications for index and encoding. index=False passes a False boolean value to the index parameter since we generally do not need to store the preceding indices of every row.
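As a small sketch of this storage step, the snippet below builds the two-column data frame and writes it to products.csv. The list contents are placeholders standing in for scraped values, and utf-8 is an assumed choice of encoding.

```python
import pandas as pd

# Placeholder values standing in for the scraped lists.
products = ["Placeholder Product A", "Placeholder Product B"]
prices = ["10.00", "20.00"]

# Each dictionary key becomes a column label; each list becomes that column.
df = pd.DataFrame({"Product Name": products, "Price": prices})

# index=False leaves the row numbers out of the file.
df.to_csv("products.csv", index=False, encoding="utf-8")

# Read the file back to confirm what was written.
print(pd.read_csv("products.csv"))
```

Both lists must have the same length, otherwise pd.DataFrame raises a ValueError, so it is worth appending to them in pairs during scraping.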
The output CSV file is then created with columns and rows of the defined data sets. It will have the name ‘products.csv’ and should look something like this upon running:
*Prices are indicative examples only and do not represent real-world market values.
We hope this guide helped you get started with scraping with Python. As with every advanced Python concept, we have merely scratched the surface here. If you are planning to dive deeper into the universe of Python and its infinite capabilities, a structured course can help you better navigate the uncertainties and complexities. Our Python Training Course fits the bill in such situations and has been created to help novice developers become skilled at handling Python code!
Our magic sauce? Hybrid learning and real-life projects! Learn more about how such a course can help accelerate your career.