Web Scraping in Python using Requests and Beautiful Soup

Web scraping is used for extracting data from websites. Web scraping techniques work directly with the HTTP protocol and DOM parsing.

Common Uses of Web Scraping

  • Fetching product listings from shopping sites
  • Fetching and aggregating blog content
  • Fetching news articles
  • Collecting offers from different shopping sites
  • Automating downloads of new releases from other websites
  • Many more…

Now, moving on to the implementation: below are the basic steps for scraping in any programming language.

  1. Send an HTTP request to the website
  2. Get the content as HTML/XML/JSON
  3. Parse the DOM and find the element that contains your data
  4. Extract the content from that element and use it as required.
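Sketched in Python, the four steps above might look like this (a minimal sketch; https://example.com stands in as a placeholder page, and the page title stands in for the data we want):

```python
import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP request to the website
response = requests.get('https://example.com')

# Step 2: get the content (HTML in this case)
html = response.content

# Step 3: parse the DOM and find the element containing our data
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')

# Step 4: extract the content from that element and use it
print(title.get_text())
```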

Here we will use Requests and Beautiful Soup for scraping.

Create Environment

Make sure Python 3 is already installed on your system, then install the libraries below using the following commands.

pip install beautifulsoup4
pip install requests

Requests: Requests is a Python library for making HTTP requests and handling responses.
Beautiful Soup: Beautiful Soup is an HTML/XML parser for navigating, searching, and modifying the parse tree.

Start Scraping

Let's take a simple example: the GitHub repository list at https://github.com/amitbhoraniya?tab=repositories.

import requests
result = requests.get('https://github.com/amitbhoraniya?tab=repositories')

Make sure you got a result:

# for the status code
result.status_code

# for the headers
result.headers

# for the content
result.content
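One common pattern is to check the status code before parsing, and let Requests raise an exception on failures. A sketch, assuming the page is reachable:

```python
import requests

result = requests.get('https://github.com/amitbhoraniya?tab=repositories')

# A 200 status code means the request succeeded
if result.status_code == 200:
    print('OK:', result.headers.get('Content-Type'))
else:
    # raise_for_status() turns 4xx/5xx responses into exceptions
    result.raise_for_status()
```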

Store the content in a temporary variable.

content = result.content

Now parse this content string and fetch the data we need.

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
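To see how the parser works without any network traffic, here is a small self-contained sketch using hypothetical markup; `find()` returns the first matching element and `find_all()` returns every match:

```python
from bs4 import BeautifulSoup

# A small inline snippet to illustrate the parser (hypothetical markup)
html = '<ul><li class="repo">alpha</li><li class="repo">beta</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match, find_all() returns every match
first = soup.find('li')
items = [li.get_text() for li in soup.find_all('li', class_='repo')]
print(first.get_text())  # → alpha
print(items)             # → ['alpha', 'beta']
```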

Find our target repository elements using Beautiful Soup's find methods; refer to the Beautiful Soup documentation for details.

repositories = soup.find_all(itemprop='name codeRepository')
for repo in repositories:
    print(repo.get_text().strip())
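The same elements can also be matched with a CSS attribute selector via `soup.select()`. A sketch using inline markup that mimics the repository list (an assumption about the page's structure):

```python
from bs4 import BeautifulSoup

# Inline snippet mimicking the repository-list markup (assumed structure)
html = '''
<div>
  <a itemprop="name codeRepository" href="/user/alpha"> alpha </a>
  <a itemprop="name codeRepository" href="/user/beta"> beta </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS attribute selector matching the exact itemprop value
names = [a.get_text().strip() for a in soup.select('[itemprop="name codeRepository"]')]
print(names)  # → ['alpha', 'beta']
```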

The final code looks something like this:

import requests
from bs4 import BeautifulSoup

result = requests.get('https://github.com/amitbhoraniya?tab=repositories')
content = result.content

soup = BeautifulSoup(content, 'html.parser')
repositories = soup.find_all(itemprop='name codeRepository')

for repo in repositories:
    print(repo.get_text().strip())

Instead of printing, you can save this data to CSV, JSON, or a database.
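For example, saving to CSV with the standard library is a few lines. A sketch, using a hypothetical list of repository names in place of the scraped results:

```python
import csv

# Hypothetical list of repository names scraped earlier
repositories = ['scraper-demo', 'notes', 'dotfiles']

with open('repositories.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['repository'])  # header row
    for name in repositories:
        writer.writerow([name])
```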

That’s it, now go and scrape something 😉 Keep visiting 🙂
