A Beginner's Guide To Web Scraping With Python

Jyoti Gupta
May 11, 2021
Last updated on June 25, 2024


Extracting chunks of data from web pages is a job in its own right, and it can be very resource- and time-intensive. That is what many newbie developers and data scientists go through, manually combing websites to maintain hoards of Excel sheets. Because manual collection is error-prone, it also turns out to be a counterproductive first step when building any Machine Learning model.

Enter Web Scraping - an automated technique that can extract massive amounts of data from websites in a fraction of the time.

Feels like déjà vu? You might have used (or heard of) APIs to do the same. In fact, an API remains the best method to access structured website data. But keep in mind that not all websites provide APIs. In such cases, Web Scraping comes to our rescue as a powerful alternative strategy!

Use Cases of Web Scraping

Web Scraping with Python can be used for a multitude of reasons. Top use cases of the practice include:

- Creating a lead generation engine with rich market data

- Monitoring approval ratings of brands by scraping reviews

- Conducting technical SEO audits with keyword analysis

- Building Machine Learning models with large data sets

- Conducting market research and analysis at scale

- Gathering feed data for end-to-end testing bots

- Collecting product information for price comparisons

- Analyzing social media data to understand what’s trending

- Collecting R&D data to carry out surveys and data analysis

- Curating data from multiple sources to create a unified platform

The list goes on!

Read Also: Top 10 Uses of Python in Business

Is Web Scraping Legal?

This is arguably the most common query that novice web scrapers come across. The short answer: it depends, largely on the website you are scraping and your reason for scraping it.

To find out whether a website allows scraping, your best bet is to read its robots.txt file (and cross-check with the site's terms and conditions). This file is created by webmasters and tells search engine bots how to crawl pages on the website, including whether specific web-crawling software is allowed or disallowed from crawling any section of the site.

To access the robots.txt file, simply append /robots.txt to the domain of your target website, for example: www.example.com/robots.txt

Here is what you need to find (and steer clear of) in such a file:

1. Allow full access

This is denoted by the following and states that all pages are crawlable by bots.

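A minimal example of such a file (an empty Disallow rule means no page is blocked):

```
User-agent: *
Disallow:
```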

2. Block all access

This states that no part of the website can be accessed by an automated web crawler. It is denoted by:

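A typical example looks like this (the lone forward slash blocks the entire site for all crawlers):

```
User-agent: *
Disallow: /
```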

3. Partial access

This declares specific sections or files on the site that are accessible and the ones that are not. Denoted by:

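A sketch of what this can look like; the directory names here are purely illustrative:

```
User-agent: *
Disallow: /checkout/
Allow: /products/
```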

How To Scrape Websites With Python

Packed with a large collection of libraries and an easy-to-understand syntax, Python makes scraping intuitive and fun, especially for beginners. Moreover, libraries such as NumPy, Matplotlib, and Pandas open the gates for further manipulation of the extracted data.

So let's cut to the chase and start with the basics!

Required Installations

We will be using the Google Chrome browser and the Ubuntu operating system for this tutorial. If you are on a different OS, that's perfectly fine: Python is a high-level, cross-platform language, so code written on one OS runs on another without a hiccup.

If you are on Windows or macOS, download and install Python from the official downloads page at python.org/downloads.

If you don’t have Python installed on your Ubuntu system, simply head on to the terminal and type in this command:

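On Ubuntu, the usual command is (assuming the default apt package manager):

```bash
sudo apt-get install python3
```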

Next, we need to install pip, the Python package management system, in order to install and manage software packages written in Python.

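On Ubuntu, pip for Python 3 can typically be installed with:

```bash
sudo apt-get install python3-pip
```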

Additionally, we will be using the following libraries in the tutorial. Here are some details of why we need them and how they can be installed through the terminal:

  • Beautiful Soup: To extract the data from markup languages such as HTML and XML.

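The PyPI package is named beautifulsoup4:

```bash
pip3 install beautifulsoup4
```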

  • Pandas: To extract data and store it in the desired format for data analysis.

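Install it with:

```bash
pip3 install pandas
```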

  • Selenium: To automate website testing and perform browser actions such as clicking elements, refreshing the page, navigating to a URL, and more.

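Install it with:

```bash
pip3 install selenium
```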

Next, it is important to set the path to chromedriver. This will configure the webdriver to use the Chrome browser.

The best way to do this is to use the Webdriver Manager for Python with two simple steps: install the manager, and then use it inside your script to fetch the right driver (as shown in the code section below).

  • Install the manager:

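The manager is published on PyPI as webdriver-manager:

```bash
pip3 install webdriver-manager
```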

Once you are set, follow this step-by-step process to begin scraping with Python:

1. Finding and Selecting the URL that You Want to Scrape

Begin the scraping process by identifying the URL or website you aim to extract data from. This can be any URL on the Internet, as long as scraping it does not violate the site's policies on data security or web scraping.

For this tutorial and illustration purposes, we are going to scrape the URL: example.com/product-link

You can follow the steps discussed here for any URL of your choice.

2. Inspect the Page that We Want to Scrape

The next step is to inspect the data on the web page and accordingly devise the approach to scraping. For instance, if the data on the web page is nested in tags (which is the most common scenario), you would need to pinpoint those tags.

To do this, simply right-click on the web page and click on ‘Inspect’.


Notice that the <div> tag has multiple nested tags, each representing a set of data.
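For illustration, the relevant product markup might look something like the snippet below; the structure is assumed, and the class names are the ones used later in this tutorial:

```html
<div class="s-iteminfo">
  <h1 class="s-productname">Example Product</h1>
  <span class="s-product_price">$49.99</span>
</div>
```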

3. Finding and Finalizing the Data We Want to Extract

Select the type of data that you want to extract from the web page. For instance, if you are gathering details of products on an e-commerce website, you may want to extract product details such as Price, Name, and Rating.

Read Also: Should I Learn Python? Finance Professionals

4. Writing the Code to Extract the Data

And now, we code!

Begin by creating a Python file: open the terminal and launch gedit with a filename that has a .py extension. For instance:

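For example, using a hypothetical file name (any name with a .py extension will do):

```bash
gedit web_scraper.py
```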

Before we go any further and write code in this file, we need to import the requisite libraries. Simply add the following imports at the top of the file:

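The three libraries installed earlier are imported like this:

```python
from selenium import webdriver   # drives the browser
from bs4 import BeautifulSoup    # parses the HTML
import pandas as pd              # stores and exports the scraped data
```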

If you’re using Chrome, add the following:

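One common way to do this, using the Webdriver Manager installed earlier (this snippet assumes Selenium 4 or newer):

```python
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Download (if needed) and locate the matching chromedriver, then start Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
```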

For other browsers, you can refer to the Webdriver Manager documentation.

Next, we need to open the target URL with the driver (this call lets the page load completely before returning control to the script) and specify the data lists that need to be created while scraping the web page.

We will be creating two lists, one for the product names and another for their prices.

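A sketch of these steps, using the illustrative URL from step 1 (the html.parser choice is an assumption; any parser supported by Beautiful Soup works):

```python
products = []  # names of the products
prices = []    # prices of the products

# Open the target URL; control returns to the script once the page has loaded
driver.get("https://www.example.com/product-link")

content = driver.page_source                   # full HTML of the rendered page
soup = BeautifulSoup(content, "html.parser")   # parse it so we can search it
```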

content holds the page source, i.e. the full HTML of the rendered page.

soup lets you navigate and search through that HTML for the data to be scraped.

In our scenario, suppose that the web page we are scraping is an e-commerce product page where the following tags hold our target data sets:

  • <h1> tag: name of the product
  • <span> tag: price of the product

For every data set, we first need to find the tags that carry the respective class names and extract the target data before storing it in variables. The following code executes these steps.

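A sketch of that loop, using the class names described below; the exact markup of your target page may differ:

```python
# Each product block on the page is wrapped in an element with class 's-iteminfo'
for element in soup.find_all(attrs={"class": "s-iteminfo"}):
    name = element.find("h1", attrs={"class": "s-productname"})
    price = element.find("span", attrs={"class": "s-product_price"})
    if name is not None and price is not None:
        products.append(name.text.strip())
        prices.append(price.text.strip())
```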

Here, ‘s-iteminfo’, ‘s-productname’, and ‘s-product_price’ refer to the respective classes under which our target tags (data location) are nested.

5. Storing the Data in the Most Appropriate Format

The next step is to run the code with the following command:

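Assuming the file name used earlier:

```bash
python3 web_scraper.py
```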

Running the script collects our target data sets. It is now time to organize and store them for better readability and retrieval. This is where the Pandas DataFrame comes into the picture.

The following code does the trick.

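A minimal version of that code, matching the column names and CSV options described below:

```python
# Build a two-column table from the scraped lists and write it to a CSV file
df = pd.DataFrame({"Product Name": products, "Price": prices})
df.to_csv("products.csv", index=False, encoding="utf-8")
```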

‘pd.DataFrame’ creates a two-dimensional tabular data set that contains labeled axes (rows and columns). This essentially means that the extracted data is converted into a table with two columns - Product Name and Price.

The resulting Pandas DataFrame is then written to a .csv file with specifications for index and encoding. index=False passes a False boolean value to the index parameter, since we generally do not need to store the row index alongside every row.

The output CSV file is then created with the columns and rows of the defined data sets. The file, named ‘products.csv’, should look something like this when opened:

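A purely illustrative example of the output:

```
Product Name,Price
Example Product A,$49.99
Example Product B,$29.99
```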

*Prices are indicative examples only and do not represent any real-world market data.

Why Do We Use Selenium?

To put it simply, Selenium helps us automate testing for web applications. Browser interactions such as link clicks and form submissions can be quickly automated with short Python scripts. It is especially helpful when a page fires multiple AJAX requests and the data has to be scraped from JavaScript-generated content. BeautifulSoup alone cannot handle this, because it only parses the HTML it is handed; when the page renders its data in the browser, driving a real browser with Selenium gives us the extra options we need.

Looking Forward

We hope this guide was useful in getting you started with scraping in Python. As with every advanced Python concept, we have merely grazed the surface here. If you are planning to dive deeper into the universe of Python and its capabilities, a structured course can help you navigate the uncertainties and complexities. Our Python Training Course fits the bill in such situations and has been created to help novice developers become skilled at handling Python code! In Hong Kong, Python developers earn a monthly salary of approximately HK$23,000.

Read Also: Python for Beginners with Examples

Related Course in Hong Kong

Introduction to Python

Need more advice?

If you are at a choice point in your career and need someone to help you navigate professional challenges, you can make an appointment for our free 1-on-1 Career Consultation and receive personalised career advice.