How To Use Python Scrapy To Implement Python Web Crawler

This article shows, with examples, how to use the Python Scrapy framework to create a web crawler.

1. Generate Python Scrapy Project.

Scrapy provides a command-line tool that generates a Scrapy project. The tool creates a set of default files in the generated project folder, and you then edit these files to implement the web crawler. Follow the steps below.

Open a terminal and execute the command scrapy startproject tutorial. It generates a project with the file structure below.

tutorial/
   scrapy.cfg # The Scrapy project's configuration file.
   tutorial/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/ # Spiders written by the user are placed under the spiders directory.
           __init__.py
           ...

Below is an example of a user-created spider class. Save this file in the tutorial/tutorial/spiders/ folder.

# First import the BaseSpider class.
# Note: BaseSpider is the legacy name of this class; newer Scrapy releases call it scrapy.Spider.
from scrapy.spider import BaseSpider

# Create a custom spider class.
class TestSpider(BaseSpider):
    # The name attribute is very important. Different spiders cannot use the same name.
    name = "test_spider"
    allowed_domains = ["test.org"]

    # start_urls is the starting point from which the spider fetches web pages; it can include multiple URLs.
    start_urls = [
        "http://www.test.org/Computers/Books/",
        "http://www.test.org/Computers/Resources/"
    ]

    # The parse method is the default callback that the spider calls when it has fetched a web page.
    # You should avoid using this name to define other methods in this class.
    def parse(self, response):
        # The code below just saves the fetched page to a local file. You can write code in this method to parse the fetched page.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

When the spider has fetched the content of a URL, it calls the parse() method and passes it a response object, which contains the content of the fetched web page. In the parse method, you can extract data from that page. The code above simply saves the content of the web page to a file.
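The file name in the parse method comes from the second-to-last segment of the URL. The short sketch below reproduces that step in plain Python so it can be tried without Scrapy; the helper name filename_from_url is hypothetical and is not part of Scrapy.

```python
def filename_from_url(url):
    """Return the second-to-last '/'-separated segment of the URL.

    This mirrors response.url.split("/")[-2] from the spider above.
    It relies on the URL ending with a trailing slash.
    """
    return url.split("/")[-2]


# The two start URLs from the example spider.
for url in ("http://www.test.org/Computers/Books/",
            "http://www.test.org/Computers/Resources/"):
    print(filename_from_url(url))
```

With these start URLs the pages would be saved as Books and Resources. If a URL has no trailing slash, the same expression picks the parent segment instead (Computers for http://www.test.org/Computers/Books), so this naming scheme is only a convenience for the example.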

2. Run The Python Scrapy Project To Execute The Python Web Crawler.

Open the command line, go to the generated project root folder tutorial/, and execute the command scrapy crawl test_spider, where test_spider is the name of the spider.

3. Parse Web Page Content.

Scrapy provides a convenient way to extract data from web pages: the HtmlXPathSelector class, which uses XPath expressions to select data.

# Import the BaseSpider and HtmlXPathSelector classes.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

# The custom spider class extends the BaseSpider class.
class TestSpider(BaseSpider):
    # The custom spider name.
    name = "test_spider"

    # The custom spider domain.
    allowed_domains = ["test.org"]

    # The start URLs list.
    start_urls = [
        "http://www.test.org/Computers/Books/",
        "http://www.test.org/Computers/Resources/"
    ]

    # Define the function that parses the web page content.
    def parse(self, response):

        # Create an HtmlXPathSelector object.
        hxs = HtmlXPathSelector(response)

        # Select all website elements with the XPath '//ul/li'.
        sites = hxs.select('//ul/li')

        # Loop over the website elements.
        for site in sites:

            # Extract the title, link and desc values from each website element.

            # Extract the text of the HTML a tag as the title value.
            title = site.select('a/text()').extract()

            # Extract the href attribute of the HTML a tag as the link value.
            link = site.select('a/@href').extract()

            # Extract the site element's own text as the description value.
            desc = site.select('text()').extract()

            # Print out the title, link and desc values (Python 2 print statement, matching this legacy Scrapy API).
            print title, link, desc
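
To see what the three XPath expressions pick out, the sketch below runs a similar selection on a small, well-formed sample page using only the standard library's xml.etree.ElementTree. This is a stand-in for illustration, not Scrapy's selector: the sample markup, the example.com URLs, and the function name extract_sites are all assumptions, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page shaped like the listing pages the spider targets.
SAMPLE_PAGE = """
<html><body><ul>
  <li><a href="http://example.com/a">Site A</a> - first description</li>
  <li><a href="http://example.com/b">Site B</a> - second description</li>
</ul></body></html>
"""


def extract_sites(page):
    """Return a (title, link, desc) tuple for every <li> found under a <ul>."""
    root = ET.fromstring(page.strip())
    results = []
    for li in root.findall(".//ul/li"):      # roughly '//ul/li'
        anchor = li.find("a")
        title = anchor.text                  # roughly 'a/text()'
        link = anchor.get("href")            # roughly 'a/@href'
        desc = (anchor.tail or "").strip()   # text that follows the <a> inside the <li>
        results.append((title, link, desc))
    return results


for site in extract_sites(SAMPLE_PAGE):
    print(site)
```

Unlike extract() in the spider, which returns a list of strings for each expression, this sketch returns single values, which keeps the output easy to read.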

We can store the parsed data in item objects that Scrapy understands, and Scrapy can then save those objects to a file for you. To do this, add a class to the items.py file as shown below.

# Import Item and Field class.
from scrapy.item import Item, Field

# Define the WebSiteItem which extends Item class.
class WebSiteItem(Item):
   
   # Each Field object declares one attribute of a website item.
   title = Field()

   link = Field()

   desc = Field()

Now we can save the parsed data in the WebSiteItem object in the parse method of TestSpider class.

# Import the BaseSpider class.
from scrapy.spider import BaseSpider

# Import the HtmlXPathSelector class.
from scrapy.selector import HtmlXPathSelector

# Import the WebSiteItem class that we have defined above.
from tutorial.items import WebSiteItem

# Define TestSpider, which extends the BaseSpider class.
class TestSpider(BaseSpider):
    name = "test_spider"
    allowed_domains = ["test.org"]
    start_urls = [
        "http://www.test.org/Computers/Books/",
        "http://www.test.org/Computers/Resources/"
    ]

    def parse(self, response):
        # Select the list of websites with an HtmlXPathSelector object.
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')

        # Create an empty item list.
        items = []

        # Loop over the list of websites.
        for site in sites:
            # Create a WebSiteItem object.
            item = WebSiteItem()

            # Set the WebSiteItem object's title, link and desc values.
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()

            # Append the WebSiteItem object to the item list.
            items.append(item)
        return items

When you run the crawler with the scrapy command in a console, you can add two arguments (-o and -t) to make Scrapy write the items returned by the parse method to a JSON file, as shown below. In this example, the items.json file is saved in the folder where the command is run, here the root folder of the project.

scrapy crawl test_spider -o items.json -t json
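
As a rough picture of what ends up in items.json, the sketch below serializes a list of item-like dictionaries with the standard json module. It is only an illustration under stated assumptions: the sample values and the helper names export_items and load_items are hypothetical, and the exact layout Scrapy writes may differ.

```python
import json

# Hypothetical items with the same fields as WebSiteItem.
# Each value is a list because extract() returns a list of matched strings.
SAMPLE_ITEMS = [
    {"title": ["Site A"], "link": ["http://example.com/a"], "desc": ["first description"]},
    {"title": ["Site B"], "link": ["http://example.com/b"], "desc": ["second description"]},
]


def export_items(items):
    """Serialize a list of item dictionaries to a JSON array string."""
    return json.dumps(items, indent=2)


def load_items(text):
    """Read the JSON array back into a list of dictionaries."""
    return json.loads(text)


print(export_items(SAMPLE_ITEMS))
```

Reading the file back with json.loads gives a list of dictionaries again, which is convenient for checking the crawl results in a separate script.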

4. Make Scrapy Automatically Grab All The Links On Web Page.

In the example above, Scrapy only fetches the contents of the two URLs in the start_urls list, but in general, what you want to do is find all the links on a page and then fetch and parse the contents of those links automatically.

To achieve this, we can extract the links we need in the parse method, construct request objects from those links, and return them. Scrapy then fetches those links automatically. Below is the example code.

from scrapy.http import Request
from scrapy.spider import BaseSpider

# MyItem is assumed to be an Item subclass defined in the project's items.py.

class MySpider(BaseSpider):

    name = 'test_spider'

    start_urls = (
        'http://test-spider.com/page1',
        'http://test-spider.com/page2',
    )

    # parse is the default callback. It yields Request objects, and Scrapy automatically fetches the pages they point to.
    def parse(self, response):
        # collect `item_urls`
        for item_url in item_urls:
            yield Request(url=item_url, callback=self.parse_item)

    # Each time one of those pages is fetched, parse_item is called. It yields a further Request and passes the item along in meta.
    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        yield Request(url=item_details_url, meta={'item': item},
            callback=self.parse_details)

    # Scrapy fetches the page requested by parse_item and passes its response to parse_details, which returns the Item object holding the parsed data.
    def parse_details(self, response):
        item = response.meta['item']
        # populate more `item` fields
        return item
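
The chain of callbacks can be hard to picture, so here is a toy, Scrapy-free simulation of it. Everything in it is an assumption made for illustration: FAKE_SITE stands in for the web, FakeRequest and FakeResponse stand in for Scrapy's request and response objects, and crawl is a very simplified stand-in for the engine that fetches requests and calls their callbacks.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
FAKE_SITE = {
    "http://test-spider.com/page1": ["http://test-spider.com/item/1"],
    "http://test-spider.com/item/1": ["http://test-spider.com/item/1/details"],
    "http://test-spider.com/item/1/details": [],
}


class FakeRequest:
    """Stand-in for a request: a URL, a callback, and optional meta data."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Stand-in for a fetched page."""
    def __init__(self, url, links, meta):
        self.url = url
        self.links = links
        self.meta = meta


def parse(response):
    # Default callback: request every link found on the start page.
    for item_url in response.links:
        yield FakeRequest(item_url, callback=parse_item)


def parse_item(response):
    # Start building the item, then hand it to the next callback through meta.
    item = {"url": response.url}
    for details_url in response.links:
        yield FakeRequest(details_url, callback=parse_details, meta={"item": item})


def parse_details(response):
    # Finish the item that parse_item started.
    item = response.meta["item"]
    item["details_url"] = response.url
    yield item


def crawl(start_urls):
    """Process requests in order; queue yielded requests, collect yielded items."""
    queue = deque(FakeRequest(url, callback=parse) for url in start_urls)
    items = []
    while queue:
        request = queue.popleft()
        response = FakeResponse(request.url, FAKE_SITE.get(request.url, []), request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


print(crawl(["http://test-spider.com/page1"]))
```

The point of the sketch is the hand-off: a callback can yield either more requests, which are fetched later, or finished items, and meta is the way a partly built item travels from one callback to the next.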

To make this work easier, Scrapy provides another base class, CrawlSpider. With this class, we can automatically extract and follow the links in a web page.

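from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# TorrentItem is assumed to be an Item subclass with url, name, description and size fields, defined in the project's items.py.

class MininovaSpider(CrawlSpider):
    name = 'test.org'
    allowed_domains = ['test.org']
    start_urls = ['http://www.test.org/today']

    # The first rule only follows matching links; the second rule passes matching pages to parse_torrent.
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+'])),
             Rule(SgmlLinkExtractor(allow=[r'/abc/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent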

Compared with the BaseSpider class, the new class adds a rules attribute. This attribute is a list, which can contain multiple rules. Each rule describes which links need to be crawled and which don't. For details, see the crawling rules section of the Scrapy documentation.

A rule may or may not have a callback function; when no callback function is provided, Scrapy simply follows the links.
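
The sketch below imitates, in plain Python, how the two allow patterns sort links into ones that are only followed and ones that are handed to a callback. It is a simplified stand-in rather than Scrapy's link extractor: the RULES table and the function classify_link are hypothetical, and the sketch only assumes that a pattern matches when it is found anywhere in the URL and that the first matching rule applies.

```python
import re

# Each entry mirrors one Rule above: an allow pattern and an optional callback name.
RULES = [
    (re.compile(r"/tor/\d+"), None),             # no callback: just follow the link
    (re.compile(r"/abc/\d+"), "parse_torrent"),  # callback: parse the fetched page
]


def classify_link(url):
    """Describe what the first matching rule would do with this URL."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return "follow only" if callback is None else "call " + callback
    return "ignored"


for url in ("http://www.test.org/tor/123",
            "http://www.test.org/abc/42",
            "http://www.test.org/about"):
    print(url, "->", classify_link(url))
```

Links that match no rule are not crawled at all, which is how the rules list keeps the spider from wandering over the whole site.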

5. The Use Of pipelines.py.

In the pipelines.py file, we can add classes that filter out unwanted items.

from scrapy.exceptions import DropItem

class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their description"""

    # Put all words in lowercase.
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            # If the item contains a forbidden word, an exception is raised and the item is not written to the JSON file.
            # 'desc' is the description field of the WebSiteItem defined earlier.
            if word in unicode(item['desc']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            # The loop finished without raising, so the item is kept.
            return item
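
To try the filtering logic without a Scrapy project, the sketch below applies the same idea to plain dictionaries. It is an illustration under stated assumptions: the local DropItem class is a stand-in for scrapy.exceptions.DropItem, run_pipeline is a hypothetical helper that plays the role Scrapy plays when it discards dropped items, and the sample items are made up.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


# Words are stored in lowercase because the description is lowercased before matching.
WORDS_TO_FILTER = ["politics", "religion"]


def process_item(item):
    """Return the item unchanged, or raise DropItem if its description has a forbidden word."""
    text = str(item["desc"]).lower()
    for word in WORDS_TO_FILTER:
        if word in text:
            raise DropItem("Contains forbidden word: %s" % word)
    return item


def run_pipeline(items):
    """Keep only the items that process_item accepts."""
    kept = []
    for item in items:
        try:
            kept.append(process_item(item))
        except DropItem:
            pass  # a dropped item is simply left out of the output
    return kept


sample = [
    {"title": ["Site A"], "desc": ["A site about Politics"]},
    {"title": ["Site B"], "desc": ["A site about books"]},
]
print(run_pipeline(sample))
```

Raising an exception rather than returning None makes the reason for dropping an item explicit, which is the same design the pipeline class uses.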

To make pipelines.py take effect, we first need to add the line below to the Scrapy project's settings.py file.

ITEM_PIPELINES = ['tutorial.pipelines.FilterWordsPipeline']

Now when we run the command scrapy crawl test_spider -o items.json -t json in a console, the unwanted items are filtered out.
