Simple Python Web Scraping
Let’s build a simple web scraper in Python 3. For this we will need the BeautifulSoup4 package, which makes it easy to find elements on a web page.
A scraper is a powerful tool for collecting data and information: you can build one in many different ways and point it at many different sources.
We will write a basic scraper that prints out all the links it finds on a page.
First you need to install the package:
pip3 install beautifulsoup4
If you have cloned my repository:
pip3 install -r requirements.txt
To fetch the page we will use urllib to make the request and download the HTML, which we can then search for the information we need.
Remember to set a user-agent header, or it will be obvious that your requests come from a script rather than a browser. Many websites monitor for bots and quickly block clients that make many requests to a single site.
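Here is a minimal sketch of that request, using the same URL and urllib.request calls as the final code further down:

import urllib.request

url = 'https://wiese.xyz/'

# Send a browser-like User-Agent header so the request does not stand out
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) ' \
             'AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/35.0.1916.47 Safari/537.36'

request = urllib.request.Request(url, headers={'User-Agent': user_agent})
response = urllib.request.urlopen(request)
html = response.read()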
Once you have fetched the HTML, load it into BeautifulSoup so it is ready to be parsed.
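In code that is a single line; here html is the byte string fetched above, and we use Python's built-in html.parser, just as the final code does:

from bs4 import BeautifulSoup

# Parse the raw HTML into a searchable soup object
soup = BeautifulSoup(html, "html.parser")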
BeautifulSoup supports CSS selectors. If you want to find an element like this: <div class="my-custom-styling">…</div>, you can do it like this:
soup.select('div.my-custom-styling')
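Both select and find_all return a list of the matching tags, so you can loop over the results from either one in the same way.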
But finding all the links on a page takes just one simple line.
We will store all the results in a list. Instead of select we use find_all, which keeps searching until the entire HTML document has been looked through. (We call the variable links rather than list so we don't shadow Python's built-in list type.)
links = soup.find_all('a')
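To turn those tags into actual URLs, read each tag's href attribute. Using get() instead of indexing avoids a KeyError on <a> tags that happen to have no href:

for link in links:
    href = link.get("href")  # returns None when the tag has no href attribute
    if href:
        print(href)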
With this scraper you can now scrape any website, collect all the links from it, and start building your own index. If you need to save the information, then read my post on saving data to MongoDB here: How to save data in MongoDB with Python3
Here is the final code for the scraper:
# Import libraries
from bs4 import BeautifulSoup
import urllib.request

# The page we want to scrape
url = 'https://wiese.xyz/'

# Pretend to be a regular browser
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) ' \
             'AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/35.0.1916.47 Safari/537.36'

# Fetch the HTML
request = urllib.request.Request(url, headers={'User-Agent': user_agent})
response = urllib.request.urlopen(request)
html = response.read()

# Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Find all <a> tags
links = soup.find_all("a")

# Loop through the results and print each link's target
for link in links:
    href = link.get("href")  # skip <a> tags without an href attribute
    if href:
        print(href)
You can also clone the code from GitHub: