Creating a Web Crawler with Python and Scrapy

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Getting Started
  4. Creating a Spider
  5. Storing the Scraped Data
  6. Conclusion

Introduction

In this tutorial, we will learn how to create a web crawler using Python and Scrapy, a powerful web scraping framework. We will cover the process of setting up a Scrapy project, creating a spider to crawl websites, extracting data, and storing the scraped data in various formats. By the end of this tutorial, you will have a functional web crawler that can scrape data from websites of your choice.

Prerequisites

To follow along with this tutorial, you should have a basic understanding of the Python programming language. Familiarity with HTML and CSS will also be helpful, but is not mandatory. Additionally, you need to have Python and Scrapy installed on your machine.

Getting Started

Installing Scrapy

Before we begin, make sure you have Python installed on your machine. You can download the latest version of Python from the official website (https://www.python.org/downloads/). Once Python is installed, open your terminal/command prompt and install Scrapy using the following command:

```
pip install scrapy
```
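To confirm the installation succeeded, you can print the installed version:

```
scrapy version
```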

Setting Up a New Scrapy Project

To create a new Scrapy project, open a terminal/command prompt and navigate to the directory where you want to create the project. Then run the following command:

```
scrapy startproject mycrawler
```

This will create a new directory called “mycrawler” with the basic structure of a Scrapy project.
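The generated layout follows Scrapy’s standard template:

```
mycrawler/
    scrapy.cfg            # deploy/configuration file
    mycrawler/            # the project’s Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # where your spiders live
            __init__.py
```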

Creating a Spider

A spider is the core component of a web crawler in Scrapy. It defines how to follow links and extract data from websites. In this section, we will create a spider to crawl a specific website and extract information from it.

Defining the Spider

Open the “spiders” directory inside your Scrapy project and create a new Python file called “example_spider.py”. In this file, we will define our spider.

Start by importing the necessary module:

```python
import scrapy
```

Next, define a class for your spider:

```python
class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        "https://www.example.com",
    ]
```

Here, we have defined a spider named “example” and specified the starting URL as “https://www.example.com”. You can replace this URL with the website you want to crawl.

Now, let’s define the logic to extract data from the website. Add the following method to your spider class:

```python
def parse(self, response):
    # Extract data using XPath selectors
    data = response.xpath("//h1/text()").get()

    yield {
        "data": data,
    }
```

In this code, we use an XPath selector to extract the text of the first `<h1>` tag on the page (`.get()` returns the first match, or `None` if there is none). You can customize the XPath expression based on the specific elements you want to scrape. The extracted data is then yielded as a dictionary with the key "data".
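The spider above only scrapes its start URL. To crawl further, `parse()` can also yield new requests. Here is a minimal sketch, assuming you want to follow every link on the page and run the same extraction on each:

```python
def parse(self, response):
    # Extract the first <h1> heading on the current page
    yield {
        "data": response.xpath("//h1/text()").get(),
    }

    # Queue every linked page for crawling with this same callback
    for href in response.css("a::attr(href)").getall():
        yield response.follow(href, callback=self.parse)
```

Scrapy filters duplicate requests by default, so pages that are linked more than once are only crawled once. Be aware that following every link can make a crawl grow quickly; in practice you would usually restrict it, for example with the spider’s `allowed_domains` attribute.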

Running the Spider

To run the spider and see the extracted data, open your terminal/command prompt and navigate to your Scrapy project’s root directory. Then run the following command:

```
scrapy crawl example
```

Replace “example” with the name of your spider if you have chosen a different name. Scrapy will start crawling the website and display the extracted data on the console.

Storing the Scraped Data

Once we have extracted the data, we often need to store it for further analysis or use. Scrapy provides support for exporting the scraped data to various formats including CSV and JSON. In this section, we will explore how to export the data.
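The quickest option is Scrapy’s built-in feed exports, which need no extra code: pass an output file to the crawl command and Scrapy infers the format from the file extension:

```
scrapy crawl example -o data.csv
scrapy crawl example -o data.json
```

The same can be configured in “settings.py”; a minimal sketch, assuming Scrapy 2.1 or later (where the FEEDS setting was introduced):

```python
# settings.py: write all scraped items to data.csv in CSV format
FEEDS = {
    "data.csv": {"format": "csv"},
}
```

The subsections below show the manual alternative, driving Scrapy’s item exporters directly from the spider for finer control over the output.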

Exporting to CSV

To export the scraped data to a CSV file, the feed exports shown above are usually all you need. If you want manual control over the output file, you can instead drive Scrapy’s `CsvItemExporter` from the spider. Open your spider file (“example_spider.py”) and import it:

```python
from scrapy.exporters import CsvItemExporter
```

Then add the following code to your spider class:

```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Open a CSV file for writing (exporters expect a binary file)
    self.file = open("data.csv", "wb")
    self.exporter = CsvItemExporter(self.file)
    self.exporter.start_exporting()

def closed(self, reason):
    # Called by Scrapy when the spider finishes crawling
    self.exporter.finish_exporting()
    self.file.close()

def parse(self, response):
    # Extract data using XPath selectors
    data = response.xpath("//h1/text()").get()

    # Write the extracted item to the CSV file
    self.exporter.export_item({"data": data})
```

In this code, we open a file named "data.csv" and initialize a CsvItemExporter to write items into it. Each item extracted in `parse()` is passed to the exporter, and the `closed()` method, which Scrapy calls when the spider finishes, closes the file.

After making these changes, when you run the spider using the command mentioned in the previous section, the scraped data will be saved in the “data.csv” file.

Exporting to JSON

To export the scraped data to JSON, you can again rely on the feed exports shown above, or drive Scrapy’s `JsonLinesItemExporter` manually. This exporter writes one JSON object per line (the JSON Lines format). Open your spider file (“example_spider.py”) and import it:

```python
from scrapy.exporters import JsonLinesItemExporter
```

Then add the following code to your spider class:

```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Open the output file for writing (exporters expect a binary file)
    self.file = open("data.json", "wb")
    self.exporter = JsonLinesItemExporter(self.file)
    self.exporter.start_exporting()

def closed(self, reason):
    # Called by Scrapy when the spider finishes crawling
    self.exporter.finish_exporting()
    self.file.close()

def parse(self, response):
    # Extract data using XPath selectors
    data = response.xpath("//h1/text()").get()

    # Write the extracted item as one JSON line
    self.exporter.export_item({"data": data})
```

In this code, we open a file named "data.json" and initialize a JsonLinesItemExporter to write items into it, one JSON object per line.

After making these changes, when you run the spider using the command mentioned in the previous section, the scraped data will be saved in the “data.json” file.
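Because each line of the output is a standalone JSON object, the file can be read back with nothing but the standard library; a minimal sketch:

```python
import json

# Load the JSON Lines file produced by the spider: one item per line
with open("data.json") as f:
    items = [json.loads(line) for line in f]

print(items)
```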

Conclusion

In this tutorial, we have learned how to create a web crawler using Python and Scrapy. We covered the process of setting up a Scrapy project, creating a spider to crawl websites, extracting data using XPath selectors, and storing the scraped data in CSV and JSON formats. By applying the concepts and techniques discussed in this tutorial, you can dive deeper into web scraping and build more advanced crawlers to gather data from various websites.