Table of Contents
- Introduction
- Prerequisites
- Setup
- Getting Started with Scrapy
- Building a Basic Spider
- Extracting Data
- Crawling Multiple Pages
- Storing Data
- Handling Dynamic Content
- Conclusion
Introduction
In this tutorial, you will learn how to use Python and Scrapy to build a web crawler that can scrape data from websites at scale. Web crawling is the process of automated data extraction from websites, and Scrapy is a powerful Python framework that makes it easy to build and run web crawlers.
By the end of this tutorial, you will have a good understanding of how to use Scrapy to crawl websites, extract data, handle dynamic content, and store the scraped data for further analysis.
Prerequisites
To follow along with this tutorial, you should have a basic understanding of the Python programming language. Familiarity with web development concepts such as HTML and CSS will be helpful but is not required.
Setup
Before we begin, make sure you have Scrapy installed. You can install it using pip:
```shell
pip install scrapy
```
Scrapy also requires a few additional dependencies, such as Twisted and lxml, which will be installed automatically with Scrapy.
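Once the installation finishes, you can confirm that the `scrapy` command is available on your path:
```shell
scrapy version
```
This prints the installed Scrapy version if everything went well.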
Getting Started with Scrapy
To create a new Scrapy project, you can use the following command:
```shell
scrapy startproject myproject
```
This will create a new directory called “myproject” with the basic structure of a Scrapy project.
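The exact files vary slightly between Scrapy versions, but the generated layout looks roughly like this (shown here as `tree` output, if you have that utility installed):
```shell
$ tree myproject
myproject/
├── scrapy.cfg          # deploy/configuration file
└── myproject/          # the project's Python package
    ├── __init__.py
    ├── items.py        # item definitions
    ├── middlewares.py  # spider and downloader middlewares
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/        # where your spiders live
        └── __init__.py
```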
Building a Basic Spider
A spider is the main component of a Scrapy project responsible for crawling and scraping webpages. Let’s create our first spider.
Inside the project’s “spiders” directory (“myproject/myproject/spiders”), create a new Python file called “quotes_spider.py”. Open the file in a text editor and add the following code:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
```
In this code, we define a spider class called "QuotesSpider" that inherits from Scrapy's base Spider class. We set the `name` attribute to "quotes" to uniquely identify this spider.
The `start_requests` method is a generator function that yields `scrapy.Request` objects for each URL we want to crawl. In this case, we provide a list of URLs to scrape from the “Quotes to Scrape” website. For each URL, we yield a request object and specify the `parse` method as the callback function to handle the response.
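As an aside, when the requests need no special handling, Scrapy lets you skip `start_requests` entirely and list the URLs in a `start_urls` class attribute; the default implementation then issues the requests and routes each response to `parse` for you. A minimal equivalent sketch:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls defined, Scrapy's default start_requests()
    # generates these requests and calls parse() on each response.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        self.log(f'Visited {response.url}')
```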
The `parse` method is called with the response returned by each request. Here, we extract the page number from the URL and create a filename based on it. We then save the response body to a file with the generated filename.
Save the file and navigate to the “myproject” directory in the terminal. Run the spider using the following command:
```shell
scrapy crawl quotes
```
Scrapy will start crawling the provided URLs, and you will see log statements indicating the progress. After the spider finishes, you will find HTML files named “quotes-1.html” and “quotes-2.html” in the “myproject” directory.
Congratulations! You have built your first spider using Scrapy.
Extracting Data
Now, let’s modify the spider to extract specific data from the webpages instead of saving the whole HTML response. We will extract the quotes and the authors from each page.
Update the `parse` method in the “quotes_spider.py” file as follows:
```python
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        }
```
In this code, we use CSS selectors to extract the desired data from the HTML response. We iterate over each quote element and extract the text and author using the `css` method.
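If you want to experiment with selectors before putting them in the spider, Scrapy’s interactive shell is handy:
```shell
scrapy shell 'http://quotes.toscrape.com/page/1/'
```
Inside the shell you can try expressions such as `response.css('span.text::text').getall()` and immediately see what they return.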
Save the file and run the spider again using the previous command.
You will notice that instead of saving HTML files, Scrapy will now print the extracted data to the console. Each quote is represented as a dictionary object with the keys “text” and “author”.
Crawling Multiple Pages
Scrapy makes it easy to crawl multiple pages within the same spider. Let’s modify our spider to scrape all the pages from the “Quotes to Scrape” website.
Update the `start_requests` method in “quotes_spider.py” as follows:
```python
def start_requests(self):
    base_url = 'http://quotes.toscrape.com/page/{}'
    num_pages = 5  # Number of pages to scrape
    urls = [base_url.format(page) for page in range(1, num_pages + 1)]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
```
We define a `base_url` that includes a placeholder for the page number. We then generate all the URLs by replacing the placeholder with the numbers from 1 to the desired number of pages.
Save the file and run the spider again.
Scrapy will now scrape all the pages specified by `num_pages` and extract the quotes and authors from each page.
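Hard-coding `num_pages` works for this site, but a more robust pattern is to follow the pagination link that appears on each page. A sketch of that approach, assuming the site’s “Next” button sits in an `li.next` element (as it does on “Quotes to Scrape”):
```python
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        }
    # Follow the "Next" link until there are no more pages.
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
```
Because `response.follow` resolves relative URLs for you, the spider keeps crawling until the last page no longer has a “Next” link.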
Storing Data
Instead of printing the scraped data to the console, let’s modify our spider to store it in a JSON file.
Update the `parse` method in “quotes_spider.py” as follows:
```python
import json  # add this import at the top of quotes_spider.py


def parse(self, response):
    data = []
    for quote in response.css('div.quote'):
        data.append({
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        })
    filename = 'quotes.json'
    # parse() runs once per crawled page, so each page appends its own
    # JSON array to this file.
    with open(filename, 'a') as f:
        json.dump(data, f, indent=4)
    self.log(f'Saved data to {filename}')
```
We initialize an empty list `data` and, for each quote, append a dictionary with the "text" and "author" keys to the list.
After processing all the quotes on a page, we open a JSON file in “append” mode and use the `json.dump` method to write the data list in JSON format. We set the `indent` parameter to 4 to make the output more readable. Note that because `parse` runs once per page and the file is opened in append mode, each page adds its own JSON array to the file rather than merging everything into a single array.
Save the file and run the spider again.
You will find a file named “quotes.json” in the “myproject” directory containing the extracted data in JSON format.
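Writing the file by hand works for a quick look at the data, but since each page appends a separate array, the result is not one valid JSON document. A more idiomatic alternative, sketched below, is to simply `yield` the items and let Scrapy’s feed exports handle serialization:
```python
def parse(self, response):
    # Yield items instead of writing files; Scrapy's feed exports
    # serialize them when an output file is passed on the command line.
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        }
```
You would then run the spider with an output file, e.g. `scrapy crawl quotes -O quotes.json` (in recent Scrapy versions, `-O` overwrites the file while lowercase `-o` appends to it).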
Handling Dynamic Content
Many websites use JavaScript to load content dynamically. Scrapy itself does not execute JavaScript, but it can be integrated with a headless rendering service to handle such pages.
One popular option is Splash, a lightweight headless browser designed for rendering JavaScript. The scrapy-splash plugin integrates Splash with Scrapy through a downloader middleware called “SplashMiddleware”.
To use Splash, you need to have Docker installed on your system. Please refer to the official Docker documentation for installation instructions.
Once you have Docker installed, you can start a Splash container by running the following command:
```shell
docker run -p 8050:8050 scrapinghub/splash
```
This command will start the Splash service on port 8050.
Next, install the required Python package to use Splash with Scrapy:
```shell
pip install scrapy-splash
```
To enable Splash in Scrapy, add the following settings to the “settings.py” file in your Scrapy project:
```python
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```
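The scrapy-splash documentation also recommends registering its spider middleware and duplicate filter so that Splash request arguments are deduplicated and filtered correctly. If you follow it, the extra settings look roughly like this:
```python
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Make Scrapy's duplicate filtering aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```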
In your spider, you can now make requests to dynamic websites using the `SplashRequest` class instead of Scrapy’s standard `Request` class.
To demonstrate dynamic content handling, let’s modify our spider to scrape quotes from the JS-based website “http://quotes.toscrape.com/js/”. We will use Splash to render the JavaScript and extract the quotes.
Update the `start_requests` method in “quotes_spider.py” as follows:
```python
# add this import at the top of quotes_spider.py
from scrapy_splash import SplashRequest


def start_requests(self):
    urls = [
        'http://quotes.toscrape.com/js/page/1/',
        'http://quotes.toscrape.com/js/page/2/',
    ]
    for url in urls:
        yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')
```
In this code, we import the `SplashRequest` class from the `scrapy_splash` module and use it instead of `scrapy.Request` to make the requests. We also set the `endpoint` parameter to `'render.html'`, which tells Splash to return the fully rendered HTML of the page.
Save the file and run the spider again.
Scrapy will now use Splash to render the pages and extract the quotes from the rendered HTML.
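If a page needs extra time for its JavaScript to finish, `SplashRequest` also accepts an `args` dictionary that is forwarded to the Splash endpoint. A small sketch (the 0.5-second value is just an illustrative choice, not a recommendation from this site):
```python
yield SplashRequest(
    url=url,
    callback=self.parse,
    endpoint='render.html',
    args={'wait': 0.5},  # let the page's JavaScript run before the HTML is returned
)
```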
Conclusion
In this tutorial, you learned how to use Python and Scrapy to build a web crawler capable of scraping data from websites at scale. You started by setting up Scrapy and building a basic spider. Then you learned how to extract data from webpages using CSS selectors and crawl multiple pages. You also explored storing the scraped data in a JSON file. Finally, you learned how to handle dynamic content using Splash.
Scrapy provides a powerful and flexible platform for web scraping. With its extensive features and community support, you can easily scale your web crawling projects to handle large-scale data extraction.
It’s important to note that web scraping should always be done ethically and in accordance with the terms of service of the websites you are scraping. Respect the website’s resources and be mindful of the impact your crawling may have on the site’s performance.