Create Python Scrapy Project Example

When you develop a crawler with Scrapy, you usually need to create a Scrapy project first. You can create one with the command scrapy startproject TestScrapyProject, as shown below.

1. Create Scrapy Project.

$ scrapy startproject TestScrapyProject
New Scrapy project 'TestScrapyProject', using template directory '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/songzhao/Documents/WorkSpace/dev2qa.com-example-code/PythonExampleProject/com/dev2qa/example/crawler/TestScrapyProject

You can start your first spider with:
    cd TestScrapyProject
    scrapy genspider example example.com

In the above command, scrapy is the main command provided by Scrapy, startproject is the Scrapy subcommand used specifically to create projects, and TestScrapyProject is the name of the project to be created.

The above output shows that Scrapy has successfully created the TestScrapyProject project in the current directory. You can now see a TestScrapyProject folder there, which contains all the files of the project.

$ ls -l
total 0
drwxr-xr-x  4 songzhao  staff  128 Feb  3 17:40 TestScrapyProject

If you go into the Scrapy project folder, you can see the following file structure.

TestScrapyProject
|
|------scrapy.cfg
|
|------TestScrapyProject
             |
             |------__init__.py
             |------items.py
             |------middlewares.py
             |------pipelines.py
             |------settings.py
             |------spiders
                      |------__init__.py

2. Scrapy Project Generated Files Introduction.

  1. scrapy.cfg: The project-level configuration file, which usually does not need to be modified.
  2. TestScrapyProject: The project's python module folder, which contains all the python source files of the Scrapy project.
  3. TestScrapyProject/items.py: Defines the item classes used by the project. An item class is a DTO (data transfer object) that usually defines several fields, and it needs to be written by the developer (see the sketch after this list).
  4. TestScrapyProject/middlewares.py: Defines the project's spider middleware and downloader middleware classes, which can hook into how requests and responses are processed.
  5. TestScrapyProject/pipelines.py: The pipeline file of the project, which processes the crawled data. It needs to be written by the developer.
  6. TestScrapyProject/settings.py: The project configuration file. You configure the project and spider related settings in this file.
  7. TestScrapyProject/spiders: This folder stores the spiders needed by the project. Spiders are responsible for extracting the data the project is interested in.
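
For illustration, here is a minimal sketch of what items.py might contain. The class and field names (title, link) are illustrative assumptions, not the exact code generated by the template; define whatever fields your crawl actually needs.

# TestScrapyProject/items.py -- a minimal sketch; the title and link
# fields are hypothetical examples, not part of the generated template.
import scrapy

class TestScrapyProjectItem(scrapy.Item):
    # Each scrapy.Field() declares one property for the spider to fill in.
    title = scrapy.Field()
    link = scrapy.Field()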

3. Scrapy Core Components Introduction.

There are 4 core components in a Scrapy project.

  1. Scheduler: This component is implemented by the Scrapy framework. It receives requests from the Scrapy engine, queues them, and feeds them back to the engine when the downloader is ready to fetch the next request.
  2. Downloader: This component is implemented by the Scrapy framework. It is responsible for downloading data from the Internet, and the downloaded data is automatically handed over to the spider by the Scrapy engine.
  3. Spider: This component is implemented by the developer. The spider is responsible for extracting useful information from the downloaded data. The information extracted by the spider is passed to the pipeline in the form of Item objects by the Scrapy engine.
  4. Pipeline: This component is implemented by the developer. After receiving an Item object (which encapsulates the information extracted by the spider), this component can write the information to a file or a database server. Minimal spider and pipeline sketches follow this list.
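
To make the spider and pipeline roles concrete, below are minimal sketches of each. The spider name, the start URL and the CSS selectors are illustrative assumptions, not code generated by Scrapy.

# TestScrapyProject/spiders/example_spider.py -- a minimal sketch.
# The start URL and the CSS selectors are illustrative assumptions.
import scrapy

from TestScrapyProject.items import TestScrapyProjectItem

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Yield one Item per quote block found on the page.
        for quote in response.css("div.quote"):
            item = TestScrapyProjectItem()
            item["title"] = quote.css("span.text::text").get()
            item["link"] = response.url
            yield item

# TestScrapyProject/pipelines.py -- a minimal sketch that appends every
# received item to a JSON-lines file; the file name items.jl is arbitrary.
import json

class TestScrapyProjectPipeline:
    def open_spider(self, spider):
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works because scrapy.Item behaves like a dict.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Note that the pipeline only runs after it is enabled in settings.py, for example:

ITEM_PIPELINES = {
    "TestScrapyProject.pipelines.TestScrapyProjectPipeline": 300,
}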

4. Scrapy Sub Commands.

  1. startproject: Create a Scrapy project.
  2. fetch: Download the response for a specified URL using the Scrapy downloader.
  3. genspider: Create a new spider from a template.
  4. shell: Start the interactive scraping shell console.
  5. version: Print the Scrapy version number.
  6. You can list all Scrapy subcommands by running scrapy with no arguments, as shown below.
    $ scrapy
    Scrapy 2.4.1 - no active project
    
    Usage:
      scrapy <command> [options] [args]
    
    Available commands:
      bench         Run quick benchmark test
      commands      
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
    
      [ more ]      More commands available when run from project directory
    
    Use "scrapy <command> -h" to see more info about a command
    
