Tabular data is one of the most common data formats on the web, storing massive amounts of useful information and acting as a gold mine for data-related projects. Are you looking for ways to access, parse, and extract data from HTML tables? Don't worry! Python has you covered!
In this blog, we discuss how to scrape HTML using Python's Requests and Beautiful Soup to extract HTML table data. We have split the process into five simple-to-follow steps to make the whole thing easy for you. So, let's get started!
HTML Table Structure
Tables are conventionally built using the tags <table>, <thead>, <tbody>, <tr>, <th>, and <td>. Though many developers respect these conventions while building a table, some don't follow them, making such projects harder than others. Python comes as a saviour here.
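For reference, here is a minimal example of that conventional structure (the cell contents are made up for illustration):

<table>
  <thead>
    <tr><th>Name</th><th>Age</th></tr>
  </thead>
  <tbody>
    <tr><td>Jane Doe</td><td>35</td></tr>
  </tbody>
</table>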
Use Python's Requests and Beautiful Soup To Extract Tabular Data
We'll use the page https://datatables.net/examples/styling/stripe.html to practice scraping tabular data with Python.
Let's use the Requests library to send the HTTP request and parse the response with Beautiful Soup. Here is a step-by-step guide for all our readers:
1. Send the Main Request
Start by creating a new project directory named python-html-table, as all the data we're working with lives in an HTML file. Inside it, create a "bs4-table-scraper" folder and, finally, a "python_table_scraper.py" file.
Next, install the dependencies from the terminal:
pip3 install requests beautifulsoup4
In python_table_scraper.py, we need to add:
import requests
from bs4 import BeautifulSoup
Now, to send HTTP requests with Requests, you should do the following:
- Set a URL
- Pass it through requests.get()
- Store returned HTML (in a response variable)
- Print response.status_code (to check for success or failure)
url = 'https://datatables.net/examples/styling/stripe.html'
response = requests.get(url)
print(response.status_code)
Now run your code from the terminal with the command python3 python_table_scraper.py.
If the request succeeds, it will return a 200 status code. If that is not the case, your IP may have been rejected by an anti-scraping system. You can also use a web scraping API to handle these complexities for you.
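If you do get blocked, a common first step is to send a browser-like User-Agent header along with the request. Here is a minimal sketch (the header string is only an example, and this won't defeat every anti-scraping system):

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
print(response.status_code)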
2. Build the Parser Using Beautiful Soup
In step 2, parse the raw HTML into a BeautifulSoup object.
soup = BeautifulSoup(response.text, 'html.parser')
This object lets you traverse the parse tree and query HTML tags and their attributes.
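For example, once the soup object exists, you can query the page directly (these calls are illustrative; the id lookup assumes the demo page's markup):

print(soup.title.text)           # text of the page's <title> tag
print(soup.find('table')['id'])  # an attribute of the first <table>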
As the table is enclosed between <table> tags with the class stripe dataTable, use the class to select the table:
table = soup.find('table', class_='stripe')
print(table)
You could also select it with id='example'. Either way, this grabs the table, and you can then loop through the rows to extract the tabular data.
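For illustration, the id-based alternative would look like this (assuming the demo page gives the table id='example'):

table = soup.find('table', id='example')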
3. Loop through the HTML Table
For tabular data, every row is wrapped in a pair of <tr> tags, and within each row, every cell is wrapped in a pair of <td> tags, all inside a <tbody> tag pair. To extract the data, run the following:
for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    print(rows)
Explanation:
We looked inside the table's <tbody> and stored all of its <tr> elements in rows. Each row is now a single object we can loop through to find the desired data.
find_all() returns the <td> elements as a list. Since Python lists are zero-indexed, we need each column's position: the first one, name, is at index 0.
Here's the code:
for row in rows:
    name = row.find_all('td')[0].text
    print(name)
Explanation:
We take each row and find all the cells inside it. Once we have the list, we grab only the first one (index 0) and finish with the .text attribute, which returns only the element's text and ignores the HTML markup we don't need.
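As a small aside, if a cell's text ever comes with stray whitespace or newlines, Beautiful Soup's .get_text(strip=True) method returns a trimmed version:

name = row.find_all('td')[0].get_text(strip=True)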
You can follow the same logic for the other columns (these lines also live inside the loop):
    position = row.find_all('td')[1].text
    office = row.find_all('td')[2].text
    age = row.find_all('td')[3].text
    start_date = row.find_all('td')[4].text
    salary = row.find_all('td')[5].text
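As a side note, calling find_all() once per row and indexing the resulting list avoids re-scanning the row for every column. A sketch of that variant (cells is a name we introduce here):

    cells = row.find_all('td')
    name = cells[0].text
    position = cells[1].text
    office = cells[2].text
    age = cells[3].text
    start_date = cells[4].text
    salary = cells[5].text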
4. Store Tabular Data in a JSON File
Python ships with a built-in json module for handling JSON objects, so you only need to import it.
import json
After importing the json module, collect all the scraped data into a list by creating an empty list outside the loop.
employee_list = []
Now, append the data to it inside the row loop, with each iteration adding a new object to the list.
    employee_list.append({
        'Name': name,
        'Position': position,
        'Office': office,
        'Age': age,
        'Start date': start_date,
        'Salary': salary
    })
You can now export the list to a JSON file.
with open('employee_data.json', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)
Explanation:
With this code, we open a new file, passing in the name we want for it (employee_data.json). We pass 'w' because we want to write data.
Then, we use the .dump() function to write the data from the list (employee_list), with indent=2 so the output is pretty-printed and each field gets its own line.
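To sanity-check the result, you can load the file back with the same module (assuming the filename used above):

with open('employee_data.json') as json_file:
    employees = json.load(json_file)
print(len(employees), 'rows saved')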
5. Run the Script
Here is the complete script:
#dependencies
import requests
from bs4 import BeautifulSoup
import json

#target URL - routed here through a scraping API (replace the api_key with your own), or use https://datatables.net/examples/styling/stripe.html directly
url = 'http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxx&url=https://datatables.net/examples/styling/stripe.html'

#empty list
employee_list = []

#requesting and parsing the HTML file
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

#selecting the table
table = soup.find('table', class_='stripe')

#looping through the HTML table to scrape the data
for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    for row in rows:
        name = row.find_all('td')[0].text
        position = row.find_all('td')[1].text
        office = row.find_all('td')[2].text
        age = row.find_all('td')[3].text
        start_date = row.find_all('td')[4].text
        salary = row.find_all('td')[5].text
        #sending scraped data to the list
        employee_list.append({
            'Name': name,
            'Position': position,
            'Office': office,
            'Age': age,
            'Start date': start_date,
            'Salary': salary
        })

#exporting the list to a JSON file
with open('employee_data.json', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)
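Run it from the project folder with python3 python_table_scraper.py. If everything goes well, you'll find an employee_data.json file next to the script containing every row of the table.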
Conclusion
This step-by-step guide covered scraping an HTML table and storing the scraped information in JSON format, which makes it easy to repurpose the tabular data for new applications. Along the way, you can enhance your coding skills and automate scraping with ease. For dynamically generated tables, you have to follow a different approach: extracting and scraping JavaScript tables with Python.