Python and Scrapy: Web Crawling at Scale

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Setup
  4. Getting Started with Scrapy
  5. Building a Basic Spider
  6. Extracting Data
  7. Crawling Multiple Pages
  8. Storing Data
  9. Handling Dynamic Content
  10. Conclusion

Introduction

In this tutorial, you will learn how to use Python and Scrapy to build a web crawler that can scrape data from websites at scale. Web crawling is the automated process of visiting web pages and extracting data from them, and Scrapy is a powerful Python framework that makes it easy to build and run web crawlers.

By the end of this tutorial, you will have a good understanding of how to use Scrapy to crawl websites, extract data, handle dynamic content, and store the scraped data for further analysis.

Prerequisites

To follow along with this tutorial, you should have a basic understanding of the Python programming language. Familiarity with web development concepts such as HTML and CSS will be helpful but not required.

Setup

Before we begin, make sure you have Scrapy installed. You can install it using pip:

```shell
pip install scrapy
```

Scrapy also requires a few additional dependencies, such as Twisted and lxml, which will be installed automatically with Scrapy.
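If you want a quick sanity check, the `scrapy version` command (available once the package is installed) prints the installed version:

```shell
scrapy version
```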

Getting Started with Scrapy

To create a new Scrapy project, you can use the following command:

```shell
scrapy startproject myproject
```

This will create a new directory called “myproject” with the basic structure of a Scrapy project.
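The generated layout typically looks something like this (the exact files can differ slightly between Scrapy versions):

```
myproject/
    scrapy.cfg            # project configuration file
    myproject/            # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project-wide settings
        spiders/          # directory where your spiders live
            __init__.py
```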

Building a Basic Spider

A spider is the main component of a Scrapy project responsible for crawling and scraping webpages. Let’s create our first spider.

Inside the project’s “spiders” directory (“myproject/myproject/spiders/”), create a new Python file called “quotes_spider.py”. Open the file in a text editor and add the following code:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
```

In this code, we define a spider class called "QuotesSpider" that inherits from Scrapy's base Spider class. We set the `name` attribute to "quotes" to uniquely identify this spider.

The start_requests method is a generator function that yields scrapy.Request objects for each URL we want to crawl. In this case, we provide a list of URLs to scrape from the “Quotes to Scrape” website. For each URL, we yield a request object and specify the parse method as the callback function to handle the response.

The parse method is called with the response returned by each request. Here, we extract the page number from the URL and create a filename based on it. We then save the response body to a file with the generated filename.

Save the file and navigate to the “myproject” directory in the terminal. Run the spider using the following command:

```shell
scrapy crawl quotes
```

Scrapy will start crawling the provided URLs, and you will see log statements indicating the progress. After the spider finishes, you will find HTML files named “quotes-1.html” and “quotes-2.html” in the “myproject” directory.

Congratulations! You have built your first spider using Scrapy.

Extracting Data

Now, let’s modify the spider to extract specific data from the webpages instead of saving the whole HTML response. We will extract the quotes and the authors from each page.

Update the parse method in the “quotes_spider.py” file as follows:

```python
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        }
```

In this code, we use CSS selectors to extract the desired data from the HTML response. We iterate over each quote element and extract the text and author using the `css` method.

Save the file and run the spider again using the previous command.

You will notice that instead of saving HTML files, Scrapy will now print the extracted data to the console. Each quote is represented as a dictionary object with the keys “text” and “author”.
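As a side note, Scrapy ships with an interactive shell that is handy for experimenting with selectors before adding them to a spider. Start it with `scrapy shell 'http://quotes.toscrape.com/page/1/'` and try expressions like the ones below; this is just a sketch of a session, so the exact output will depend on the page:

```python
# Inside the Scrapy shell, `response` is already populated for you.
response.css('div.quote span.text::text').get()       # text of the first quote on the page
response.css('div.quote span small::text').getall()   # list of all author names on the page
```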

Crawling Multiple Pages

Scrapy makes it easy to crawl multiple pages within the same spider. Let’s modify our spider to scrape all the pages from the “Quotes to Scrape” website.

Update the start_requests method in “quotes_spider.py” as follows:

```python
def start_requests(self):
    base_url = 'http://quotes.toscrape.com/page/{}'
    num_pages = 5  # Number of pages to scrape
    urls = [base_url.format(page) for page in range(1, num_pages + 1)]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
```

We define a `base_url` that includes a placeholder for the page number. We then generate all the URLs by replacing the placeholder with the numbers from 1 to the desired number of pages.

Save the file and run the spider again.

Scrapy will now scrape all the pages specified by num_pages and extract the quotes and authors from each page.
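Hard-coding `num_pages` works for this small site, but a pattern that generalizes better is to follow the pagination link until there is none left. Below is a sketch of that approach; it assumes the “Next” button is reachable via a `li.next a` selector, which is how “Quotes to Scrape” marks it up:

```python
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        }

    # Follow the "Next" link, if there is one, and parse that page the same way.
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
```

With this version, `start_requests` only needs to yield the first page; the spider discovers the rest on its own.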

Storing Data

Instead of printing the scraped data to the console, let’s modify our spider to store it in a JSON file.

Update the parse method in “quotes_spider.py” as follows, adding the `import json` line at the top of the file next to `import scrapy`:

```python
import json

def parse(self, response):
    data = []
    for quote in response.css('div.quote'):
        data.append({
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
        })

    # Append one JSON object per line (JSON Lines) so the file stays valid
    # even though parse() runs once for every page that is crawled.
    filename = 'quotes.jsonl'
    with open(filename, 'a', encoding='utf-8') as f:
        for item in data:
            f.write(json.dumps(item) + '\n')

    self.log(f'Saved data to {filename}')
```

We initialize an empty list `data` and, for each quote, append a dictionary with the "text" and "author" keys to the list.

After processing all the quotes on a page, we open the output file in append mode and write each item as a single line of JSON (the JSON Lines format). Writing one object per line keeps the file valid even though the parse method is called once for every page the spider crawls; dumping a whole JSON array per page in append mode would produce a file that is not valid JSON.

Save the file and run the spider again.

You will find a file named “quotes.jsonl” in the “myproject” directory containing the extracted data, with one JSON object per line.
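Writing files by hand is not the only option. When a spider simply yields items, as in the “Extracting Data” section, Scrapy’s built-in feed exports can serialize them for you, which is usually the more idiomatic approach:

```shell
scrapy crawl quotes -O quotes.json
```

The `-O` flag (overwrite) is available in Scrapy 2.0 and later; older versions only support `-o`, which appends to an existing file.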

Handling Dynamic Content

Many websites use JavaScript to load content dynamically. Scrapy itself does not execute JavaScript, but it can handle such pages by delegating the rendering to an external service.

A common choice is Splash, a lightweight headless browser and JavaScript rendering service. The scrapy-splash package provides downloader middlewares (including “SplashMiddleware”) that integrate Scrapy with Splash.

To use Splash, you need to have Docker installed on your system. Please refer to the official Docker documentation for installation instructions.

Once you have Docker installed, you can start a Splash container by running the following command:

```shell
docker run -p 8050:8050 scrapinghub/splash
```

This command will start the Splash service on port 8050.

Next, install the required Python package to use Splash with Scrapy:

```shell
pip install scrapy-splash
```

To enable Splash in Scrapy, add the following settings to the “settings.py” file in your Scrapy project:

```python
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```

In your spider, you can now make requests to dynamic websites using the `SplashRequest` class instead of Scrapy’s standard `Request` class.
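One more configuration note: the scrapy-splash documentation also recommends registering its spider middleware and a Splash-aware duplicate filter. Check the README of the version you install, but the settings generally look like this:

```python
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```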

To demonstrate dynamic content handling, let’s modify our spider to scrape quotes from the JS-based website “http://quotes.toscrape.com/js/”. We will use Splash to render the JavaScript and extract the quotes.

Update the start_requests method in “quotes_spider.py” as follows:

```python
from scrapy_splash import SplashRequest

def start_requests(self):
    urls = [
        'http://quotes.toscrape.com/js/page/1/',
        'http://quotes.toscrape.com/js/page/2/',
    ]
    for url in urls:
        yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')
```

In this code, we import the `SplashRequest` class from the `scrapy_splash` module and use it instead of `scrapy.Request` to make the requests. We also set the `endpoint` parameter to `'render.html'`, which tells Splash to return the fully rendered HTML of each page.

Save the file and run the spider again.

Scrapy will now use Splash to render the pages and extract the quotes from the rendered HTML.
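If a page needs a moment for its scripts to finish, `SplashRequest` also accepts an `args` dictionary that is passed through to Splash. For example, a short wait before the HTML snapshot is taken often helps; the value below is only an illustration and depends on the site:

```python
yield SplashRequest(
    url=url,
    callback=self.parse,
    endpoint='render.html',
    args={'wait': 0.5},  # let Splash wait half a second for JavaScript to run
)
```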

Conclusion

In this tutorial, you learned how to use Python and Scrapy to build a web crawler capable of scraping data from websites at scale. You started by setting up Scrapy and building a basic spider. Then you learned how to extract data from webpages using CSS selectors and crawl multiple pages. You also explored storing the scraped data in a JSON file. Finally, you learned how to handle dynamic content using Splash.

Scrapy provides a powerful and flexible platform for web scraping. With its extensive features and community support, you can easily scale your web crawling projects to handle large-scale data extraction.

It’s important to note that web scraping should always be done ethically and in accordance with the terms of service of the websites you are scraping. Respect the website’s resources and be mindful of the impact your crawling may have on the site’s performance.

Happy crawling!