How to do webscraping using Python BeautifulSoup library
Test Environment
Fedora 32 with Python3 installed
What is Webscraping
It’s a technique for extracting data from a web page. The extracted data can then be saved to any storage location and used for data analysis.
What is an HTML Parser
It’s a tool that reads an HTML-formatted web page and extracts its contents for further analysis. In this article we will use Beautiful Soup, a Python library for pulling data out of HTML and XML files.
So, let’s explore this library by implementing a simple application with it.
Procedure
Step1: Find a web page to extract the details
To build this application, we are going to use a web page from which we will extract a set of words and store them in a list.
# the URL is a plain string, which is what requests.get() expects
url = "http://www.manythings.org/vocabulary/lists/l/words.php?f=3esl.01"
Here I am going to use the above web page to get a set of words starting with ‘a’ and store them in a list variable.
Please note that this list of words starting with ‘a’ is not complete. To extract all dictionary words, we could try a free open-source API, if one is available.
Step2: Request the content of the webpage
Now it’s time to submit a GET request to the above URL and save the response to a variable, as shown below. To use the requests library, we need to import its Python module. If it is not installed, we need to install it using pip (pip install requests).
import requests
url = "http://www.manythings.org/vocabulary/lists/l/words.php?f=3esl.01"
page = requests.get(url)
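A bare requests.get() call has no timeout, so it can wait indefinitely on an unresponsive server, and it does not raise an error when the server answers with a 404 or 500 page. The following is a minimal sketch of a slightly more defensive request. The helper name fetch_page and the 10-second timeout are assumptions chosen for illustration and are not part of the original walkthrough; the timeout argument and raise_for_status() are standard parts of the requests API.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the HTML text of `url` (hypothetical helper, not from the article)."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raises requests.exceptions.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

It could then be called as html = fetch_page(url), where url is the string defined in Step 1.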
Step3: Parse the request content using beautifulsoup
So far we have submitted a GET request and saved the response of the web page in the ‘page’ variable. Let us now use the Beautiful Soup library to parse this HTML content and save the result to a variable, as shown below. To use the library, we need to import its Python module. If it is not installed, we need to install it using pip (pip install beautifulsoup4).
import requests
from bs4 import BeautifulSoup
url = "http://www.manythings.org/vocabulary/lists/l/words.php?f=3esl.01"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
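To see what the parsed object offers without making a network request, the same parser can be run on a small HTML string. The snippet below is only a sketch: the sample markup is made up for illustration and is not the content of the page used in this article.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (an assumption, not the real page content)
sample_html = """
<html>
  <head><title>Word list</title></head>
  <body>
    <ul>
      <li>able</li>
      <li>about</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached as attributes of the parsed tree
page_title = soup.title.string

# find() returns the first matching tag, find_all() returns every match
first_item = soup.find('li').get_text()
item_count = len(soup.find_all('li'))

print(page_title, first_item, item_count)
```

Trying selectors on a small string like this first makes it easier to confirm which tags hold the data before running them against the live page.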
Step4: Extract the data from a particular html tag in the parsed content
From the parsed content stored in the ‘soup’ variable, let us find all the list item tags (i.e. <li>) and extract the text between each list item’s start and end tags.
import requests
from bs4 import BeautifulSoup
url = "http://www.manythings.org/vocabulary/lists/l/words.php?f=3esl.01"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
wordsList = []
# collect the text of every <li> tag on the page
for i in soup.find_all('li'):
    word = str(i.get_text())
    wordsList.append(word)
With this step we have completed the procedure to extract the details from the web page and save them in a list variable called wordsList. If everything goes fine, you should see the words starting with the letter ‘a’ stored in the list ‘wordsList’.
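The text returned by get_text() can carry surrounding whitespace, and a page may contain empty list items. As a hedged variation on the loop above, the sketch below strips each entry and skips empty ones. The function name extract_words and the sample markup are assumptions made for illustration; they do not come from the page used in this article.

```python
from bs4 import BeautifulSoup


def extract_words(html):
    """Return the stripped, non-empty text of every <li> tag in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    # get_text(strip=True) removes leading and trailing whitespace
    items = (li.get_text(strip=True) for li in soup.find_all('li'))
    # keep only entries that still contain text after stripping
    return [text for text in items if text]


# Made-up sample markup (an assumption, not the real page content)
sample_html = "<ul><li> able </li><li>about\n</li><li></li></ul>"
words_list = extract_words(sample_html)
print(words_list)
```

Wrapping the extraction in a function that takes an HTML string keeps it separate from the network request, so it can be tried on a small sample before being given page.text.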
Hope you enjoyed reading this article. Thank you.