How To Crawl All URLs On A Web Page With A Python Crawler

When crawling web pages with a Python crawler, the first step is to collect the URLs. If there are many URLs on a web page, how do you crawl all of them? This article describes two methods for a Python crawler to crawl all the URLs on a web page.

1. Methods To Crawl All URLs On A Web Page.

  1. Use BeautifulSoup to extract all URLs quickly.
  2. Use the Scrapy framework to call the spider class’s parse method recursively.

2. Use BeautifulSoup To Extract All URLs Quickly.

BeautifulSoup is a Python library that can quickly extract content from HTML and XML documents.

# Import the urllib.request module.
import urllib.request

# Import the BeautifulSoup class from the bs4 package.
from bs4 import BeautifulSoup

# This function retrieves all URLs on the input URL's web page.
def getAllUrl(url):

    # Get the input URL web page's HTML content.
    html_data = urllib.request.urlopen(url).read().decode("utf-8")

    # Create a BeautifulSoup instance with the above HTML data.
    soup = BeautifulSoup(html_data, features='html.parser')

    # Get all the HTML <a> tags on the web page in a list.
    tag_list = soup.find_all('a')

    # Loop over the above tag list.
    for tag in tag_list:
        # Get the HTML <a> tag's href attribute value.
        href_value = str(tag.get('href'))

        # Print out the <a> tag's link URL.
        print(href_value.strip())
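
Below is a minimal usage sketch of the function above; the URL https://www.example.com/ is only a placeholder, replace it with the web page you actually want to crawl.

# A minimal usage sketch; https://www.example.com/ is a placeholder URL.
if __name__ == '__main__':
    getAllUrl('https://www.example.com/')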

3. Use The Scrapy Framework To Call The Spider Class’s Parse Method Recursively.

This method calls the Scrapy spider’s parse method recursively until all pages are crawled.

# Import the scrapy module.
import scrapy

class TestSpider(scrapy.Spider):

    # This is the spider name.
    name = 'test-spider'

    # Only allow this spider to crawl the domains in this list.
    allowed_domains = ['www.test.com']

    # The list of URLs to start crawling from.
    start_urls = ['https://www.test.com/']

    # A URL template used to build the URLs of subsequent pages.
    url = 'https://www.test.com/page/%d/'

    # The start page number.
    pageNum = 1

 
    # This method will be invoked by Scrapy framework to parse the web page elements.
    def parse(self, response):

        # Get all matching div elements in a list.
        div_list = response.xpath("//div[@id='content-left']/div")

        # Loop over the div list.
        for div in div_list:

            # Create each web element item object.
            # (The item construction code is omitted here; see the sketch after this example.)
            ...

            # Pass the web element item object to the Scrapy pipeline class.
            yield item

 

        # Parse the next page URLs.
        # Only parse 13 pages to avoid an infinite loop.
        if self.pageNum <= 13:  

            # Increase the page number by 1.
            self.pageNum += 1

            # Print out log data.
            print('Crawl:page %d ' % self.pageNum)

            # Create a new web page URL.
            new_url = self.url % self.pageNum

            ''' 
                Make Scrapy framework request the new web page URL.
                Pass self.parse method as the callback method.
                Then the parse method will parse the new URL web page content.
            '''
            yield scrapy.Request(url=new_url, callback=self.parse)
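
The item object yielded in the parse method is not defined in the snippet above, because its fields depend on your own project. Below is a minimal, hypothetical sketch of such an item class; the class name TestItem, the content field, and the example XPath are assumptions for illustration only.

# A hypothetical item class with a single field, for illustration only.
class TestItem(scrapy.Item):
    content = scrapy.Field()

# Inside the for loop of the parse method above, such an item could be
# created and filled before it is yielded, for example:
#     item = TestItem()
#     item['content'] = div.xpath('.//text()').extract_first()

After the spider is placed in a Scrapy project, you can start it from the project directory with the command scrapy crawl test-spider, where test-spider is the spider name defined above.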
