## Table of Contents
- Introduction
- Prerequisites
- Installation
- Creating a New Scrapy Project
- Writing a Spider
- Scraping a Website
- Data Extraction
- Pagination
- Handling JavaScript-Rendered Pages
- Data Storage
- Conclusion
## Introduction

Web scraping is the process of extracting data from websites using code. It allows us to automate the collection of information, saving time and effort. In this tutorial, we will explore `scrapy`, a powerful and flexible web scraping framework written in Python. By the end of this tutorial, you will know how to create a web scraper using `scrapy` and extract data from websites.
## Prerequisites

Before diving into `scrapy`, make sure you have the following:
- Python installed on your machine
- Basic knowledge of Python programming language
- Familiarity with HTML and CSS
## Installation

To install `scrapy`, open your terminal and run the following command:

```shell
pip install scrapy
```
Once the installation is complete, verify it by running:
```shell
scrapy --version
```

You should see the version information displayed, indicating that `scrapy` has been successfully installed.
## Creating a New Scrapy Project

To start a new `scrapy` project, run the following command in your terminal:

```shell
scrapy startproject myproject
```

This will create a new directory called `myproject` with the basic structure of a `scrapy` project.
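The generated layout looks roughly like this (file names can vary slightly between Scrapy versions):

```
myproject/
    scrapy.cfg            # deploy/configuration file
    myproject/            # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py
```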
## Writing a Spider

A spider is the component responsible for defining how to crawl a website and how to extract data from it. To create a new spider, navigate to the project directory and run the following command:
```shell
scrapy genspider myspider example.com
```
Replace `example.com` with the domain of the website you want to scrape. This will generate a new spider file called `myspider.py` inside the `spiders` directory.

Open `myspider.py` in your preferred text editor and you will see the generated code. The `parse` method is where you define the logic for extracting data from the website.
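The generated file looks roughly like this (the exact template differs slightly between Scrapy versions):

```python
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # extraction logic goes here
        pass
```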
## Scraping a Website

To scrape a website, we need to define the URLs to be crawled in our spider. Update the `start_urls` list in the spider file to include the URLs you want to scrape. For example:

```python
start_urls = ['http://example.com/page1', 'http://example.com/page2']
```
Next, define the data fields you want to extract. Inside the `parse` method, use `response.css` or `response.xpath` to select the data elements based on CSS selectors or XPath expressions.
For example, to extract the titles of all articles on a webpage, you can use the following code:
```python
def parse(self, response):
    for article in response.css('article'):
        title = article.css('h2::text').get()
        yield {
            'title': title
        }
```
The `yield` keyword produces a dictionary containing the extracted data; Scrapy collects every item the spider yields.
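To actually run the spider and see the yielded items, use the `crawl` command from inside the project directory (here `myspider` is the spider name generated earlier):

```shell
scrapy crawl myspider
```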
## Data Extraction

There are various methods in `scrapy` to extract data from a website. Some commonly used methods include:

- `response.css('selector')`: select elements based on CSS selectors
- `response.xpath('expression')`: select elements based on XPath expressions
- `response.css('selector::attr(attribute)')`: extract attribute values of selected elements
- `response.xpath('expression').getall()`: extract a list of all values matching the XPath expression
Experiment with different methods to extract the desired data from the website you are scraping.
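As a quick illustration, here is a hypothetical `parse` method exercising these calls; the selectors are placeholders, so adjust them to the page you are scraping:

```python
def parse(self, response):
    # text of the first <h1> element, selected via CSS
    heading = response.css('h1::text').get()
    # the same element selected via XPath
    heading_xpath = response.xpath('//h1/text()').get()
    # href attribute of every link on the page
    links = response.css('a::attr(href)').getall()

    yield {
        'heading': heading,
        'heading_xpath': heading_xpath,
        'links': links,
    }
```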
## Pagination

To crawl multiple pages of a website, you can use pagination. Start by extracting the URL of the next page from the current page, then create a new request to crawl the next page using the extracted URL.

For example, if the next page URL is located in a link with the class `next`, you can use the following code:
```python
def parse(self, response):
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
```
This will create a new request to crawl the next page and call the `parse` method again.
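Putting this together with the earlier article example, a complete `parse` method might look like the following sketch (the selectors are assumptions about the page structure):

```python
def parse(self, response):
    # extract the items on the current page
    for article in response.css('article'):
        yield {'title': article.css('h2::text').get()}

    # then follow the link to the next page, if there is one
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
```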
## Handling JavaScript-Rendered Pages

Some websites use JavaScript to render their content. In such cases, `scrapy` alone may not be able to extract the desired data. To handle JavaScript-rendered pages, we can use a headless browser like Selenium along with `scrapy`.

First, install Selenium using the following command:
```shell
pip install selenium
```
Next, install a web driver compatible with your browser. For example, if you are using Chrome on macOS with Homebrew, you can install `chromedriver` by running:

```shell
brew install --cask chromedriver
```
Finally, import Selenium in your spider file and use it to retrieve the content of JavaScript-rendered pages. Here’s an example of retrieving the content of a page using Selenium:
```python
from selenium import webdriver
from scrapy.selector import Selector


def parse(self, response):
    # open the page in a real Chrome browser so the JavaScript runs
    driver = webdriver.Chrome()
    driver.get(response.url)
    content = driver.page_source
    driver.quit()

    # wrap the rendered HTML in a Selector so the usual
    # .css()/.xpath() methods can be used on it
    selector = Selector(text=content)
    # continue with data extraction, e.g. selector.css(...)
```

## Data Storage
Once we have successfully scraped the data, we can store it in various ways. `scrapy` provides built-in support for exporting data in formats like CSV, JSON, or XML.
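The simplest route is a feed export from the command line; for example, assuming the spider is named `myspider`, the following writes every yielded item to a JSON file:

```shell
scrapy crawl myspider -o items.json
```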
Alternatively, to write the scraped data to a CSV file yourself, add the following code to your spider:

```python
import csv

def parse(self, response):
    # extract the data as before
    for article in response.css('article'):
        title = article.css('h2::text').get()
        # append one row per article to data.csv
        with open('data.csv', 'a', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=['title'])
            writer.writerow({'title': title})
```

This will create a CSV file called `data.csv` and append the extracted data to it.
## Conclusion

In this tutorial, you have learned how to use `scrapy` to scrape websites and extract data. We covered creating a new `scrapy` project, writing a spider, scraping a website, handling pagination, dealing with JavaScript-rendered pages, and storing the scraped data. You are now ready to put your web scraping skills to practical use and build powerful data collection tools with `scrapy`. Happy scraping!