Web scraping is a technique for extracting data from websites. Scrapers typically work directly with the HTTP protocol and parse the resulting DOM.
Uses of Web Scraping
- Fetching product details from shopping sites
- Fetching and copying blog content
- Fetching news articles
- Collecting offers from different shopping sites
- Automating downloads of new/latest releases from other websites
- Many more…
Now, moving on to the implementation: below are the basic steps for scraping in any programming language.
- Send an HTTP request to the website
- Get the content as HTML/XML/JSON
- Parse the DOM and find the element(s) that contain your data
- Extract the content from those elements and use it as required
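These four steps can be sketched with nothing but the standard library, before bringing in any third-party packages. The URL (example.com) and the crude string search below are purely illustrative; real pages need a proper parser, which is exactly what the rest of this article uses.

```python
from urllib.request import urlopen

# Steps 1-2: send an HTTP request and read the response body as text
html = urlopen('https://example.com').read().decode('utf-8')

# Step 3: locate the element that holds the data (here, the <title> tag)
start = html.find('<title>') + len('<title>')
end = html.find('</title>')

# Step 4: extract the content for further use
title = html[start:end]
print(title)
```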
Make sure Python 3 is already installed on your system, then install the libraries below:

```
pip install beautifulsoup4
pip install requests
```
Requests: Requests is a Python library for making HTTP requests and handling responses.
Beautiful Soup: Beautiful Soup is an HTML and XML parser that builds a parse tree you can iterate, search, and modify.
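To see what "searching the parse tree" means in practice, here is a tiny self-contained example; the HTML snippet and the `repo` class are made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML snippet to illustrate parsing
html = '<ul><li class="repo">alpha</li><li class="repo">beta</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every tag in the parse tree matching the filters
names = [li.get_text() for li in soup.find_all('li', class_='repo')]
print(names)  # → ['alpha', 'beta']
```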
Let's take a simple example: the GitHub repository list at https://github.com/amitbhoraniya?tab=repositories.
```python
import requests

result = requests.get('https://github.com/amitbhoraniya?tab=repositories')
```
Make sure you got a result:

```python
# for the status code
result.status_code
# for the headers
result.headers
# for the content
result.content
```
Store the content in a temporary variable:

```python
content = result.content
```
Now parse this content string and fetch the data we need:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
```
Find the target repository fields using Beautiful Soup's find methods (refer to the docs for details).
```python
repositories = soup.find_all(itemprop='name codeRepository')
for repo in repositories:
    print(repo.getText().strip())
```
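As a side note, the same elements can also be selected with a CSS selector via `soup.select`. The snippet below uses a small inline HTML fragment that merely mimics the shape of GitHub's repository markup, so it runs without a network call:

```python
from bs4 import BeautifulSoup

# A small inline snippet mimicking GitHub's repository markup (assumed shape)
html = '''
<a itemprop="name codeRepository" href="/user/demo-repo"> demo-repo </a>
<a itemprop="name codeRepository" href="/user/scraper"> scraper </a>
'''
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector; [attr="value"] matches the exact attribute value
repos = [a.get_text().strip() for a in soup.select('a[itemprop="name codeRepository"]')]
print(repos)  # → ['demo-repo', 'scraper']
```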
The final code looks something like this:

```python
import requests
from bs4 import BeautifulSoup

result = requests.get('https://github.com/amitbhoraniya?tab=repositories')
content = result.content
soup = BeautifulSoup(content, 'html.parser')
repositories = soup.find_all(itemprop='name codeRepository')
for repo in repositories:
    print(repo.getText().strip())
```
Instead of printing, you can save this data to CSV, JSON, or a database.
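For example, here is a minimal sketch of saving the names to a CSV file with the standard library `csv` module; the repository list is hypothetical, standing in for whatever the scraper collected:

```python
import csv

# Hypothetical list of scraped repository names (as collected above)
repositories = ['demo-repo', 'scraper', 'dotfiles']

# Write one repository name per row, with a header row
with open('repositories.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    writer.writerows([name] for name in repositories)
```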
That’s it! Now go and scrape something 😉 Keep visiting 🙂