How to Crawl a Website using Python
Web scraping, often called web crawling or web spidering, is a powerful tool for working with data on the web.
With a web scraper, we can mine data about a set of products, collect a large corpus of text or quantitative data, get data from a site that has no official API, or simply satisfy our own curiosity.
In this article, we will create a basic scraper. Scrapy is a Python framework for web scraping that gives developers a complete toolkit, so they do not have to write and maintain their own crawling code. Beautiful Soup is also widely used for web scraping. It is a Python package for parsing HTML and XML documents and extracting data from them. Older releases supported Python 2.6+ as well as Python 3; current releases are Python 3 only. We will be using Scrapy in this tutorial.
You can install Scrapy using the command:
pip install scrapy
Scrapy also provides an interactive web-crawling shell, called the Scrapy shell, which developers can use to test their assumptions about a site’s behavior.
Open your command line and start the shell with the following command:
scrapy shell
Next, we run a crawler on the web page by passing its URL to the fetch command in the Scrapy shell. A crawler, or spider, goes through a web page and downloads its text and metadata.
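To make "downloading its text and metadata" more concrete, the sketch below uses only the Python standard library, not Scrapy, to pull the title, meta tags, links, and visible text out of a page. It is not how Scrapy is implemented; the names PageSummary, download, and SAMPLE_HTML are made up for this illustration, and the sample HTML stands in for a real page so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url):
    """Fetch a page and return its HTML as text (not called in this offline demo)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class PageSummary(HTMLParser):
    """Collects the title, named <meta> tags, link targets, and visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


# Sample markup standing in for a downloaded page.
SAMPLE_HTML = """<html><head><title>Sample page</title>
<meta name="description" content="A tiny demo page">
</head><body><h1>Hello</h1><p>Some text.</p>
<a href="/about">About</a></body></html>"""

summary = PageSummary()
summary.feed(SAMPLE_HTML)
print(summary.title, summary.meta, summary.links, summary.text_parts)
```

In a real crawl, the string passed to feed() would come from download(url), and the collected links would become the next pages to visit. Scrapy handles that loop, along with scheduling and retries, for you.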
The crawler returns a response, which we can inspect by running the view(response) command in the shell. The web page will then open in the default browser.
We can view the raw HTML of the page by printing response.text in the Scrapy shell: