Table of Contents
- Introduction
- Prerequisites
- Installation
- Creating a Scrapy Project
- Understanding Scrapy Architecture
- Building Spiders
- Extracting Data
- Storing Data
- Handling Pagination
- Handling Ajax and JavaScript
- Running and Deploying the Spider
- Conclusion
Introduction
Web crawling is a technique used to automate the extraction of data from websites. Scrapy is a powerful and flexible web crawling framework written in Python. In this tutorial, you will learn advanced techniques and best practices for web crawling using Python and Scrapy.
By the end of this tutorial, you will be able to build sophisticated web crawlers to scrape data from websites, handle pagination, JavaScript, and Ajax requests, and store the extracted data in a structured format.
Prerequisites
To follow along with this tutorial, you should have a basic understanding of Python programming and web scraping concepts. Familiarity with HTML, CSS, and XPath will also be beneficial.
Installation
Before we begin, make sure you have Python and pip installed on your system. You can check if Python is installed by running the following command in the terminal:
```bash
python --version
```
To install Scrapy, run the following command:
```bash
pip install scrapy
```
Creating a Scrapy Project
To start working with Scrapy, we need to create a new Scrapy project. Open a terminal and navigate to the directory where you want to create the project. Then, run the following command:
```bash
scrapy startproject tutorial
```
This will create a new directory named “tutorial” with the basic structure of a Scrapy project. Change into the project directory:
```bash
cd tutorial
```
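The generated project consists of a top-level configuration file and a Python package containing placeholder modules. The layout looks roughly like this (details may vary slightly between Scrapy versions):

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py
```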
Understanding Scrapy Architecture
Scrapy follows a specific architecture that makes it efficient and scalable. Here are the key components of Scrapy:
- Spiders: Spiders are the core of Scrapy. They define how to navigate websites and extract data. Each spider is a Python class that subclasses `scrapy.Spider`.
- Requests and Responses: Scrapy uses asynchronous requests and responses to handle website navigation. When a spider makes a request, it receives a response from the website, which can be processed to extract data.
- Items: Items are the objects that hold the extracted data. You can think of them as simple containers.
- Item Pipelines: Item pipelines are used for processing the extracted data, such as validating, cleaning, and storing it.
- Middlewares: Middlewares are responsible for processing requests and responses in a customizable way. They can modify, redirect, or even drop requests.
- Settings: Settings allow you to customize the behavior of Scrapy and your spiders.
- Scrapy Shell: Scrapy provides a command-line tool called the Scrapy Shell, which is useful for testing and debugging selectors and spiders interactively (see the example below).
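For example, you can open the Scrapy Shell against the site we will crawl later in this tutorial and experiment with selectors interactively. This is an illustrative session; the exact output will differ on your machine:

```python
# Started from the command line with:  scrapy shell 'http://quotes.toscrape.com'
# Inside the shell, a `response` object for the fetched page is already available:
response.css('.quote .text::text').get()       # text of the first quote on the page
response.css('.quote .author::text').getall()  # all author names on the page
view(response)                                 # open the downloaded page in your browser
```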
Building Spiders
Now that we understand the basic architecture of Scrapy, let’s start building our first spider. In Scrapy, a spider is a class that defines how to follow links and extract data from a website.
To create a new spider, open the `spiders` directory in your project and create a new Python file called `quotes_spider.py`. Add the following code to define a simple spider:
```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tags': quote.css('.tag::text').getall()
            }

        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

In this spider, we define a `name` attribute which identifies the spider, a `start_urls` attribute which contains the URLs to start crawling from, and a `parse` method which is the entry point for processing responses.
In the `parse` method, we use CSS selectors to extract the desired data from the response. We extract the text, author, and tags of each quote and yield a dictionary containing this information.
We also check for a “Next” button on the page and follow the link if it exists. This allows us to crawl through multiple pages of quotes.
Extracting Data
Scrapy provides powerful selectors to extract data from HTML or XML documents. In our previous example, we used CSS selectors to extract the quote text, author, and tags.
Scrapy supports other types of selectors as well, such as XPath and regular expressions. You can use whichever selector is most convenient for your specific use case.
Here’s an example of using XPath selectors to extract the same data as before:
```python
def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        yield {
            'text': quote.xpath('.//span[@class="text"]/text()').get(),
            'author': quote.xpath('.//small[@class="author"]/text()').get(),
            'tags': quote.xpath('.//a[@class="tag"]/text()').getall()
        }
```
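Selectors also support regular expressions through the `re()` and `re_first()` methods, which apply a pattern to the text matched by a CSS or XPath expression. As a hypothetical illustration, the snippet below extracts only the first word of each author's name:

```python
def parse(self, response):
    for quote in response.css('.quote'):
        yield {
            # re_first() runs the regex over the selected text and returns
            # the first match, or None if nothing matches
            'author_first_name': quote.css('.author::text').re_first(r'^(\w+)'),
        }
```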
Storing Data
Scrapy provides built-in support for storing the extracted data in various formats, such as JSON, CSV, or databases. To store data, we need to define an item class and pipelines.
First, open the `items.py` file in your project package (Scrapy generates one when the project is created) and define an item class:
```python
import scrapy
class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```

Next, open the `pipelines.py` file and add the following code:
```python
import json
class JsonPipeline:
    def open_spider(self, spider):
        self.file = open('quotes.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
```

This pipeline will store the extracted items in a JSON file called `quotes.json`. To enable it, open the `settings.py` file and add the following setting (the generated file already contains a commented-out `ITEM_PIPELINES` block you can adapt):
```python
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonPipeline': 300,
}
```

Now, when you run your spider, the extracted data will be stored in the specified JSON file.
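If you prefer working with the `QuoteItem` class defined earlier instead of plain dictionaries, the spider can yield items directly. A minimal sketch, assuming the project is named `tutorial` as above:

```python
from tutorial.items import QuoteItem

def parse(self, response):
    for quote in response.css('.quote'):
        item = QuoteItem()
        item['text'] = quote.css('.text::text').get()
        item['author'] = quote.css('.author::text').get()
        item['tags'] = quote.css('.tag::text').getall()
        yield item  # items pass through the enabled pipelines, just like dicts
```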
Handling Pagination
Web pages often have multiple pages of data that we need to crawl. Scrapy provides a convenient way to handle pagination using the `scrapy.Request` object.
To demonstrate this, let’s modify our spider to crawl through multiple pages of quotes. Update the `parse` method as follows:
```python
def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        yield {
            'text': quote.css('.text::text').get(),
            'author': quote.css('.author::text').get(),
            'tags': quote.css('.tag::text').getall()
        }

    next_page = response.css('.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
```

In this updated code, we use `response.follow` to follow the link to the next page. This allows us to crawl through all the pages and extract data from each page.
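`response.follow` is a convenient shortcut that accepts relative URLs directly. If you want to use the `scrapy.Request` object mentioned above explicitly, an equivalent version of the pagination step might look like this:

```python
next_page = response.css('.next a::attr(href)').get()
if next_page:
    # Build an absolute URL from the relative href, then schedule a new request
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```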
Handling Ajax and JavaScript
Sometimes, the data we want to extract is loaded dynamically using Ajax or JavaScript. Scrapy does not execute JavaScript on its own, so such content may be missing from the responses it downloads. However, Scrapy can work well with tools like Selenium or Splash to handle JavaScript rendering.
To use Selenium with Scrapy, you need to install the Selenium library and a web driver, such as ChromeDriver or GeckoDriver. The sketch below illustrates the basic idea; refer to the Scrapy and Selenium documentation for detailed setup instructions.
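As an illustration only, here is a minimal sketch of a downloader middleware that renders pages with headless Chrome and hands the resulting HTML back to Scrapy. The class name is illustrative, and error handling, browser cleanup, and throttling are omitted:

```python
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class SeleniumMiddleware:
    def __init__(self):
        options = Options()
        options.add_argument('--headless')   # run Chrome without a visible window
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Let the browser load the page and execute JavaScript, then wrap the
        # rendered HTML so spiders can keep using .css() and .xpath() as usual.
        self.driver.get(request.url)
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )
```

To activate such a middleware, you would register it under `DOWNLOADER_MIDDLEWARES` in `settings.py`, much like the item pipeline was enabled earlier.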
Running and Deploying the Spider
To run your spider, use the following command:
```bash
scrapy crawl quotes
```
Replace “quotes” with the name of your spider if you’ve named it differently.
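As an alternative to the custom pipeline above, Scrapy’s built-in feed exports can write the scraped items straight to a file: for example, running `scrapy crawl quotes -o quotes.json` appends all scraped items to `quotes.json` without any extra pipeline code.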
To deploy your spider to a cloud-based scraping service, such as Zyte’s Scrapy Cloud (formerly Scrapinghub), refer to that service’s documentation.
Conclusion
In this tutorial, you learned how to use Scrapy, a powerful web crawling framework written in Python. You learned about Scrapy’s architecture, building spiders to extract data, storing the extracted data, handling pagination, and working with Ajax and JavaScript.
Scrapy provides a robust set of features and tools to handle various web scraping scenarios. With practice and experimentation, you can become an expert in web crawling using Python and Scrapy.
Happy crawling!