When using Scrapy to develop a crawler, you usually need to create a Scrapy project first. You can create one with the command scrapy startproject TestScrapyProject, like below.
1. Create Scrapy Project.
$ scrapy startproject TestScrapyProject
New Scrapy project 'TestScrapyProject', using template directory '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/songzhao/Documents/WorkSpace/dev2qa.com-example-code/PythonExampleProject/com/dev2qa/example/crawler/TestScrapyProject

You can start your first spider with:
    cd TestScrapyProject
    scrapy genspider example example.com
In the above command:
- scrapy is the primary command provided by the Scrapy framework.
- startproject is the Scrapy subcommand that is specially used for building projects.
- TestScrapyProject is the name of the project to be created.
The above information shows that Scrapy has successfully created the TestScrapyProject project in the current directory. At this time, you can see a TestScrapyProject directory in the current directory, which saves all the files of the project.
$ ls -l
total 0
drwxr-xr-x  4 songzhao  staff  128 Feb  3 17:40 TestScrapyProject
If you go into the Scrapy project folder, you can see the following file structure.
TestScrapyProject
|------scrapy.cfg
|------TestScrapyProject
       |------items.py
       |------middlewares.py
       |------pipelines.py
       |------settings.py
       |------spiders
              |------__init__.py
2. Scrapy Project Generated Files Introduction.
- scrapy.cfg: The project-level configuration file, which usually does not need to be modified.
- TestScrapyProject: The project's Python module folder, which holds all of the project's Python source files.
- TestScrapyProject/items.py: Defines the item classes used by the project. An item class is a DTO (data transfer object), which usually declares several properties. These classes need to be defined by the developer.
- TestScrapyProject/pipelines.py: The project's pipeline file, which processes the crawled data. It needs to be written by the developer.
- TestScrapyProject/settings.py: The project configuration file. We will configure the project's spider-related settings in this file.
- TestScrapyProject/spiders: This folder stores the spiders needed by the project. Spiders are responsible for extracting the data this project is interested in.
3. Scrapy Core Components Introduction.
- There are 4 core components in a Scrapy project.
- Scheduler: This component is implemented by the Scrapy framework. It receives the requests generated by the spiders, queues them, and decides the order in which they are handed to the downloader.
- Downloader: This component is implemented by the Scrapy framework. It is responsible for downloading data from the Internet, and the downloaded data is automatically handed over to the spider by the Scrapy engine.
- Spider: This component is implemented by the developer. The spider is responsible for extracting effective information from the downloaded data. The information extracted by the spider is transferred to the Pipeline in the form of Item objects by the Scrapy engine.
- Pipeline: This component is implemented by the developer. After receiving an Item object (which encapsulates the information extracted by the spider), this component can write the information to a file or database server.
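The pipeline side of this flow can be sketched as below. A Scrapy pipeline is just a plain Python class exposing process_item() (plus optional open_spider()/close_spider() hooks), so this sketch runs without the framework; the BookPipeline name and the books.jl output file are illustrative assumptions.

```python
import json


class BookPipeline:
    # Scrapy calls open_spider/close_spider once per spider run,
    # and process_item once per Item the spider yields.
    def open_spider(self, spider):
        self.lines = []

    def process_item(self, item, spider):
        # Buffer the item as a JSON line; a real pipeline might
        # write to a file or database server here instead.
        self.lines.append(json.dumps(dict(item)))
        return item  # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        with open("books.jl", "w") as f:
            f.write("\n".join(self.lines))
```

To activate a pipeline like this, its class path would also be listed under ITEM_PIPELINES in settings.py.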
4. Scrapy Sub Commands.
- startproject: Create a Scrapy project.
- fetch: Get response from specified URL.
- genspider: Create the spider.
- shell: Start interactive shell console.
- version: Return the Scrapy version number.
- You can get the full list of Scrapy subcommands by typing scrapy with no arguments, like below.
$ scrapy
Scrapy 2.4.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command