How To Use Python Parser jparser Module To Extract Web Page Title, Content, Image

jparser is a Python library that extracts structured data (text paragraphs and images) from HTML source code. This article shows how to install it and how to use it to extract a web page's title, content, and images.

1. Install Python jparser Module.

  1. Open a terminal and run the command pip install jparser to install the module.
    $ pip install jparser
    Collecting jparser
      Downloading jparser-0.0.20.tar.gz (3.5 kB)
    Requirement already satisfied: lxml>=3.7.1 in ./opt/anaconda3/lib/python3.7/site-packages (from jparser) (4.4.1)
    Building wheels for collected packages: jparser
      Building wheel for jparser (setup.py) ... done
      Created wheel for jparser: filename=jparser-0.0.20-py3-none-any.whl size=4426 sha256=1b440eab6d3b585d0d01fcafd81a74c9b46eeddca500897f7e7c22408c8eebcc
      Stored in directory: /Users/songzhao/Library/Caches/pip/wheels/50/d4/c9/b023a6fe076967d52b40c1e3ef60987e6374a52e6882d41d3b
    Successfully built jparser
    Installing collected packages: jparser
    Successfully installed jparser-0.0.20
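After the installation finishes, you may want to confirm that Python can actually find the module. The snippet below is a minimal sketch that uses only the standard library; the helper name is_module_installed and the printed messages are illustrative assumptions, not part of jparser.

```python
import importlib.util


def is_module_installed(module_name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "jparser" is the module installed above; the message text is illustrative.
    if is_module_installed("jparser"):
        print("jparser is installed")
    else:
        print("jparser is not installed; run: pip install jparser")
```

If the check fails even though pip reported success, pip and the python command are probably tied to different interpreters.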

2. Use Python Jparser To Extract Web Page Title, Content, Image.

  1. The Python source code below uses jparser to extract a web page's title, text content, and image sources.
    # jparser only parses HTML, so the standard library is used to download the page.
    # (The Python 2 urllib2 module became urllib.request in Python 3.)
    from urllib.request import urlopen
    
    # The PageModel class of the jparser module extracts a web page's title, content, and images.
    from jparser import PageModel
    
    # Download the web page and decode it as UTF-8.
    web_page_url = "http://www.test-abc.com/test.shtml"
    web_page_html = urlopen(web_page_url).read().decode('utf-8')
    
    # Create a PageModel instance from the page's HTML source.
    page_model = PageModel(web_page_html)
    
    # Extract the page title, content, and images.
    result = page_model.extract()
    
    # Print the page title.
    print("==title==")
    print(result['title'])
    
    # Print the page content.
    print("==content==")
    for item in result['content']:
        # If the item is text, print the text.
        if item['type'] == 'text':
            print(item['data'])
        # If the item is an image, print the image's src attribute.
        elif item['type'] == 'image':
            print("[IMAGE]", item['data']['src'])
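Instead of printing each item as it is read, it can be handy to collect the text and the image sources into separate lists. The sketch below assumes the result has the same shape as the one used above (a 'title' key and a 'content' list whose items have 'type' and 'data' keys). The helper name split_content and the sample data are hypothetical, so the example runs without downloading a page.

```python
def split_content(result):
    """Split an extract()-style result into text paragraphs and image URLs."""
    paragraphs = []
    image_urls = []
    for item in result.get('content', []):
        if item.get('type') == 'text':
            paragraphs.append(item['data'])
        elif item.get('type') == 'image':
            image_urls.append(item['data']['src'])
    return paragraphs, image_urls


# Hypothetical sample data in the same shape as the result used above.
sample_result = {
    'title': 'Sample page',
    'content': [
        {'type': 'text', 'data': 'First paragraph.'},
        {'type': 'image', 'data': {'src': 'http://www.test-abc.com/a.png'}},
        {'type': 'text', 'data': 'Second paragraph.'},
    ],
}

paragraphs, image_urls = split_content(sample_result)
print(sample_result['title'])
print("\n".join(paragraphs))
print(image_urls)
```

With a real page, pass the value returned by page_model.extract() in place of sample_result.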
