Introduction
Web scraping is a technique for extracting data from websites. Instead of copying and pasting data by hand, we can collect it automatically with a web scraper. Web scraping is useful for purposes such as data analytics, research, and marketing.
Python offers several libraries for web scraping, such as Beautiful Soup, Scrapy, and Selenium. A web scraper parses the HTML code of a page and uses its tags to locate and extract the relevant data. The extracted data can then be saved in whatever form is required with the help of other libraries.
Before scraping a website, check its robots.txt file and terms of service to make sure scraping is permitted. Also take care not to overload the website with requests, since heavy traffic can slow it down or crash it.
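The robots.txt check can be automated with the standard library's urllib.robotparser module. The following is a minimal sketch: the robots.txt content, the example.com URLs, and the user agent name "my-scraper" are all made up for illustration, and the rules are parsed from an inline string so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-scraper" is a made-up user agent name.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/articles/")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")

print(allowed_public)   # True: /articles/ is not disallowed
print(allowed_private)  # False: /private/ is disallowed for every agent
print(delay)            # 5: seconds the site asks crawlers to wait between requests
```

Against a live site, the rules would normally be loaded with set_url() followed by read() instead of parse(). Note that robots.txt does not replace reading the terms of service.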
In this article, we explore web scraping with the Beautiful Soup library. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
HTML elements
Before delving into web scraping, let's first understand HTML. HyperText Markup Language (HTML) is the most basic building block of the web. It defines the structure of web content. Hypertext refers to links that connect web pages, either within a single website or between websites. HTML uses markup to annotate text, images, and other content for display in a web browser. An HTML element is set off from other text by tags, which consist of the element name surrounded by "<" and ">".
Some common HTML tags are presented below.
Tag Name | Description |
<html> | It is the root element of the HTML document. All other elements must be embedded in this element. |
<head> | It contains metadata about the document. |
<title> | It defines the title of the document. |
<body> | It represents the content of the document. |
<div> | It's like a container or a section in the document. |
<p> | It represents a paragraph. |
<a> | It creates a hyperlink to web pages, email addresses, or locations on the same page. Its href attribute defines the URL that the hyperlink points to. |
<img> | It embeds an image into the document. Its src attribute contains the path to the image you want to embed. |
<nav> | It represents a section of a page whose purpose is to provide navigation links. |
<ol> | It represents an ordered list of items. |
<ul> | It represents an unordered list of items. |
<li> | It represents the list item in an ordered or unordered list. |
<table> | It represents tabular data. |
<thead> | It represents a set of rows defining the head of the columns of the table. |
<tbody> | It represents the body content of a table with a set of table rows. |
<th> | It defines a cell as the header of a group of table cells. |
<tr> | It defines a row of cells in a table. |
<td> | It defines a cell of a table that contains data. |
Link for HTML reference: https://developer.mozilla.org/en-US/docs/Web/HTML/Element
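To see how tags and attributes delimit elements, the following sketch walks a small, made-up HTML document with the standard library's html.parser module and lists every opening tag it encounters. The sample markup and the TagLister class are illustrative assumptions, not part of Beautiful Soup.

```python
from html.parser import HTMLParser

# A small, made-up document that uses a few of the tags from the table above.
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p>See the <a href="https://example.com">example site</a>.</p>
    <img src="logo.png">
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Collect every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE_HTML)

for name, attributes in lister.tags:
    print(name, attributes)
```

The output lists html, head, title, body, p, a, and img in document order, with the href and src attributes attached to the a and img tags.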
Working with Beautiful Soup
We can install Beautiful Soup as follows:
pip install beautifulsoup4
or pip install bs4 (a wrapper package that installs beautifulsoup4)
Use pip or pip3, depending on your Python installation.
Creating a Soup Object
We use the requests library to connect to the web page and fetch its content; the urllib module can be used instead. The content is then passed to the BeautifulSoup constructor, which parses the page.
>>> # Importing necessary libraries
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import pandas as pd
>>> # Creating soup object
>>> url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
>>> resp = requests.get(url)
>>> soup = BeautifulSoup(resp.content, 'html.parser')
>>> # Pretty-printing the soup with prettify()
>>> print(soup.prettify())
<!DOCTYPE html>
<html class="client-nojs........" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
List of countries by population (United Nations) - Wikipedia
</title>
....
..
</script>
</body>
</html>
>>> # Get only text with get_text()
>>> soup.get_text()
'\n\n\n\nList of countries by population (United Nations) - Wikipedia\n\n\n\n\n\n\n\n\n\n ..
.......\n\n\n\n\n\nToggle limited content width\n\n\n\n\n\n'
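As noted above, urllib can stand in for requests. The following is a rough sketch of that variant; fetch_html, make_soup, and the User-Agent value are hypothetical names chosen for illustration. An inline HTML string replaces the downloaded page so that the example runs without network access, and get_text() is called with its separator and strip arguments to avoid the long runs of newlines shown in the output above.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page with urllib; the User-Agent value is a made-up example."""
    request = Request(url, headers={"User-Agent": "my-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def make_soup(markup):
    """Parse bytes or text with the built-in html.parser."""
    return BeautifulSoup(markup, "html.parser")


# Inline markup stands in for a downloaded page so the sketch runs offline.
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello</p><p>World</p></body></html>"
)
soup = make_soup(sample)

# strip=True trims each piece of text; separator joins the pieces.
text = soup.get_text(separator=" ", strip=True)
print(text)  # Demo Hello World
```

For a real page, the call would be make_soup(fetch_html(url)).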
Objects in Soup
Beautiful Soup transforms an HTML document into a tree of objects. We mostly deal with the following objects:
Object Name | Description |
BeautifulSoup | It represents the parsed document as a whole. It doesn't correspond to an actual HTML or XML tag, so it has no attributes. |
NavigableString | It corresponds to a bit of text within a tag. It behaves like a Python string. It can't be edited in place, but it can be replaced with replace_with() and converted to a plain string with str(). |
Tag | A Tag object corresponds to an XML or HTML tag in the original document. Tags have attributes and methods. |
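The three object types can be inspected directly. This short sketch uses a made-up one-line document to show which type each expression yields, and how a NavigableString is converted to a plain string or replaced.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# A made-up one-line document
soup = BeautifulSoup('<p class="intro">Hello, <b>world</b></p>', "html.parser")

paragraph = soup.p                  # a Tag
first_text = paragraph.contents[0]  # a NavigableString holding 'Hello, '

print(type(soup).__name__)        # BeautifulSoup
print(type(paragraph).__name__)   # Tag
print(type(first_text).__name__)  # NavigableString

# Convert to a plain Python string
plain = str(first_text)

# A NavigableString cannot be edited in place, but it can be replaced
first_text.replace_with("Goodbye, ")
print(soup)  # <p class="intro">Goodbye, <b>world</b></p>
```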
Navigating the Document
As the HTML document is structured as a tree, we can go up and down to inspect the content.
>>> # Navigating using Tags
>>> # Using head tag
>>> soup.head
<head>
<meta charset="utf-8"/>
<title>List of countries by population (United Nations) - Wikipedia</title>
.....
.....
.....
<link href="//login.wikimedia.org" rel="dns-prefetch"/>
</link></head>
>>> # Using title tag
>>> soup.title
<title>List of countries by population (United Nations) - Wikipedia</title>
>>> # using title tag with head tag
>>> soup.head.title
<title>List of countries by population (United Nations) - Wikipedia</title>
>>> # Using .string to get the text when the tag has a single NavigableString child
>>> soup.head.title.string
'List of countries by population (United Nations) - Wikipedia'
>>> # Tag has name and attributes
>>> # Tag name
>>> soup.body.name
'body'
>>> # Tag attributes
>>> soup.body.attrs
{'class': ['skin-vector',
'skin-vector-search-vue',
'mediawiki',
'ltr',
'sitedir-ltr',
'mw-hide-empty-elt',
'ns-0',
'ns-subject',
'mw-editable',
'page-List_of_countries_by_population_United_Nations',
'rootpage-List_of_countries_by_population_United_Nations',
'skin-vector-2022',
'action-view']}
>>> # Tag attribute value
>>> soup.body['class']
['skin-vector',
'skin-vector-search-vue',
'mediawiki',
'ltr',
'sitedir-ltr',
'mw-hide-empty-elt',
'ns-0',
'ns-subject',
'mw-editable',
'page-List_of_countries_by_population_United_Nations',
'rootpage-List_of_countries_by_population_United_Nations',
'skin-vector-2022',
'action-view']
Tag names can also be chained, as in soup.head.title, to reach a nested element.
>>> # Get the a tag. It will give you the first tag by that name.
>>> soup.a
<a class="mw-jump-link" href="#bodyContent">Jump to content</a>
>>> # A tag's children are available with .contents,
>>> # which returns a list
>>> soup.head.contents
['\n',
<meta charset="utf-8"/>,
'\n',....]
>>> type(soup.head.contents)
list
>>> # We can use .children generator for iterating over children
>>> # instead of getting a list with .contents
>>> for child in soup.head.children:
...     print(child)
<meta charset="utf-8"/>
<title>List of countries by population (United Nations) - Wikipedia</title>
...
...
<link href="//login.wikimedia.org" rel="dns-prefetch"/>
</link>
>>> # .children and .contents show only direct children;
>>> # .descendants iterates over all of the tag's children recursively.
>>> for child in soup.head.descendants:
...     print(child)
<meta charset="utf-8"/>
<title>List of countries by population (United Nations) - Wikipedia</title>
...
...
<link href="//login.wikimedia.org" rel="dns-prefetch"/>
</link>
<link href="//login.wikimedia.org" rel="dns-prefetch"/>
We can use .parent and .parents to go up the tree. We can use .next_sibling, .previous_sibling, .next_element, and .previous_element to move sideways and back and forth.
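These attributes are easiest to follow on a tiny document. The sketch below uses a made-up three-item list written without whitespace between tags, so that each sibling is a tag rather than a '\n' string.

```python
from bs4 import BeautifulSoup

# No whitespace between tags, so siblings are tags rather than '\n' strings.
soup = BeautifulSoup(
    "<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser"
)

first = soup.li               # <li>one</li>
second = first.next_sibling   # <li>two</li>

# Going up the tree
print(first.parent.name)                # ul
print([p.name for p in first.parents])  # ['ul', '[document]']

# Going sideways
print(second)                   # <li>two</li>
print(second.previous_sibling)  # <li>one</li>
print(first.previous_sibling)   # None, because the first item has no earlier sibling

# .next_element follows parse order, so it enters the tag: the string 'one'
print(repr(first.next_element))
```

The difference between .next_sibling and .next_element is visible here: the first stays on the same level of the tree, while the second moves to whatever was parsed next, which is the text inside the tag.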
Searching the document
We can use find() and find_all() to search for a particular tag.
>>> # To find a particular tag, here 'a'. It returns a single result
>>> soup.find('a')
<a class="mw-jump-link" href="#bodyContent">Jump to content</a>
>>> # To find all tags use find_all, it will return a list
>>> soup.find_all('a')
[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
<a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
...
...
"List of countries by population in 1500">1500</a>,
...]
We can give find_all() various inputs, such as a string, a list, or a custom function. We can also specify tag attribute values to search for particular content.
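The different kinds of input can be compared side by side. This is a small sketch on made-up markup; has_id is a hypothetical helper written for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<h1 id="top">Title</h1>
<p class="lead">First</p>
<p>Second</p>
<a href="/one">One</a>
<a>No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A string matches tags by name
by_string = soup.find_all("p")

# A list matches any of the listed names
by_list = soup.find_all(["h1", "a"])


# A function receives each tag and returns True for a match
def has_id(tag):
    return tag.has_attr("id")


by_function = soup.find_all(has_id)

# Attribute filters: href=True keeps only tags that have an href attribute
by_attribute = soup.find_all("a", href=True)

# Name and class filters can be combined
by_class = soup.find_all("p", class_="lead")

print([t.name for t in by_list])          # ['h1', 'a', 'a']
print([t["id"] for t in by_function])     # ['top']
print([t["href"] for t in by_attribute])  # ['/one']
```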
>>> # Searching with id name
>>> soup.find_all(id='toc-Subregional')
[<li class="vector-toc-list-item vector-toc-level-2" id="toc-Subregional">
<a class="vector-toc-link" href="#Subregional">
<div class="vector-toc-text">
<span class="vector-toc-numb">1.4</span>Subregional</div>
</a>
<ul class="vector-toc-list" id="toc-Subregional-sublist">
</ul>
</li>]
>>> # Searching with a class name: we should use class_
>>> # to avoid an error, as class is a Python keyword
>>> soup.find_all(class_='vector-dropdown-label-text')
[<span class="vector-dropdown-label-text">Main menu</span>,
<span class="vector-dropdown-label-text">Personal tools</span>,
<span class="vector-dropdown-label-text">Toggle the table of contents</span>,
<span class="vector-dropdown-label-text">5 languages</span>,
<span class="vector-dropdown-label-text">English</span>,
<span class="vector-dropdown-label-text">Tools</span>]
Other methods, such as find_parent(), find_next_sibling(), find_previous_sibling(), find_next(), and find_all_next(), can be used as required.
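These relative searches start from a tag you already hold rather than from the whole document. A brief sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup, written without whitespace between tags
html = (
    '<div id="box">'
    "<h2>Heading</h2><p>First</p><p>Second</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

# Search upwards
print(first_p.find_parent("div")["id"])                # box

# Search sideways among siblings
print(first_p.find_next_sibling("p").get_text())       # Second
print(first_p.find_previous_sibling("h2").get_text())  # Heading

# Search forwards in parse order
print(first_p.find_next("p").get_text())               # Second
print([t.name for t in soup.h2.find_all_next()])       # ['p', 'p']
```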
Modifying the Document
Apart from navigating and searching, we can modify the document data with Beautiful Soup.
>>> # Get the title
>>> soup.title.string
'List of countries by population (United Nations) - Wikipedia'
>>> # Changing the title
>>> soup.title.string = 'population'
>>> # Check once again
>>> soup.title
<title>population</title>
>>> # Adding to tag's contents with append()
>>> soup.title.append(' Country wise')
>>> soup.title
<title>population Country wise</title>
There are other methods, such as extend(), insert(), clear(), extract(), decompose(), replace_with(), wrap(), unwrap(), and smooth(), for modifying the document.
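A few of these methods are sketched below on a made-up fragment: decompose() removes a tag entirely, unwrap() removes a tag but keeps its contents, replace_with() swaps one node for another, insert() adds content at a given position, and smooth() merges adjacent strings left behind by the other edits.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration
html = (
    "<div><p>Keep <b>bold</b> text</p>"
    "<script>track()</script><span>old</span></div>"
)
soup = BeautifulSoup(html, "html.parser")

soup.script.decompose()         # remove the tag and everything inside it
soup.b.unwrap()                 # drop the <b> tag but keep its text
soup.span.replace_with("new")   # swap the <span> for a plain string
soup.p.insert(0, "Note: ")      # add a string as the first child of <p>

print(soup)  # <div><p>Note: Keep bold text</p>new</div>

# The edits leave several adjacent strings inside <p>; smooth() merges them
pieces_before = len(soup.p.contents)
soup.smooth()
pieces_after = len(soup.p.contents)
print(pieces_before, pieces_after)  # 4 1
```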
HTML Parsers
We can use different parsers for parsing the HTML pages.
Parser Name | Description |
html.parser | Standard library parser, so nothing extra needs to be installed. Decent speed: slower than lxml but faster than html5lib. |
lxml | Needs to be installed (pip install lxml). Very fast. Supports both HTML and XML. Depends on external C libraries. |
html5lib | Needs to be installed (pip install html5lib). Very slow. Parses pages the way a web browser does and creates valid HTML5. |
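Parsers differ most visibly on invalid markup. The sketch below feeds the same broken fragment to each parser and skips any parser that is not installed; lxml and html5lib are optional, so their results depend on your environment.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: an unclosed <a> followed by a stray </p>
broken = "<a></p>"

results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser_name] = str(BeautifulSoup(broken, parser_name))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        results[parser_name] = None

for parser_name, output in results.items():
    print(parser_name, "->", output if output is not None else "not installed")
```

With html.parser the result is `<a></a>`: the stray closing tag is dropped and no html or body wrapper is added. The other parsers, when installed, wrap the fragment in a fuller document structure, so the same code can yield a different tree depending on the parser chosen.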
Extracting Table data
The following sample code shows a way to extract population table data from an HTML page. The web page used is the Wikipedia list of countries by population (United Nations).
>>> # In this web page 2 tables are present
>>> # Using class for finding table
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import pandas as pd
>>> url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
>>> resp = requests.get(url)
>>> soup = BeautifulSoup(resp.content, 'html.parser')
>>> table = soup.find_all('table', class_='wikitable')
>>> # Method 1: extract the rows, iterate over them, and split the text
>>> # into a list of lists, which is then converted to a DataFrame
>>> rows = table[0].find_all('tr')
>>> data = []
>>> # Skipping header row
>>> for row in rows[1:]:
...     data.append(row.get_text().strip().split('\n'))
>>> df1 = pd.DataFrame(data)
>>> # Method 2: pandas read_html, which returns a list of DataFrames.
>>> # It needs lxml (or html5lib) installed; recent pandas versions
>>> # expect the HTML wrapped in StringIO rather than a raw string.
>>> from io import StringIO
>>> df2 = pd.read_html(StringIO(str(table[0])))[0]
I find the pandas read_html() method a good choice, since it can parse HTML and extract tables directly. We can also pass the URL straight to read_html(), but it then extracts all the tables on the page.
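Splitting a row's text on '\n' depends on how the page happens to be formatted. As an alternative sketch, the cells can be read one by one with find_all(). The table below is made-up sample data, not the Wikipedia table, so that the example runs offline.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration; the country names and numbers are fictional.
html = """
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Atlantis</td><td>1,000</td></tr>
  <tr><td>El Dorado</td><td>2,500</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")
rows = table.find_all("tr")

# Column names come from the header cells of the first row
headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]

# Each remaining row becomes a dict keyed by column name
records = []
for row in rows[1:]:
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    records.append(dict(zip(headers, cells)))

print(headers)
print(records)
```

The resulting list of dicts can be passed to pd.DataFrame(records) if a DataFrame is needed. Real tables with merged cells (rowspan or colspan) need extra handling that this sketch does not attempt.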
Remember that, before navigating or extracting data, you should study the structure of the HTML page and identify the required tags, ids, and class names; this makes the process much easier.
Link for Beautiful Soup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/