How To Use Python Parser jparser Module To Extract Web Page Title, Content, Image

jparser is a Python parsing library that extracts structured data (text paragraphs and images) from HTML source code. This article shows how to install it and how to use it to extract a web page's title, content, and images.

1. Install Python jparser Module.

Open a terminal and run the command pip install jparser to install the module.

$ pip install jparser
Collecting jparser
  Downloading jparser-0.0.20.tar.gz (3.5 kB)
Requirement already satisfied: lxml>=3.7.1 in ./opt/anaconda3/lib/python3.7/site-packages (from jparser) (4.4.1)
Building wheels for collected packages: jparser
  Building wheel for jparser (setup.py) ... done
  Created wheel for jparser: filename=jparser-0.0.20-py3-none-any.whl size=4426 sha256=1b440eab6d3b585d0d01fcafd81a74c9b46eeddca500897f7e7c22408c8eebcc
  Stored in directory: /Users/songzhao/Library/Caches/pip/wheels/50/d4/c9/b023a6fe076967d52b40c1e3ef60987e6374a52e6882d41d3b
Successfully built jparser
Installing collected packages: jparser
Successfully installed jparser-0.0.20
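
To confirm that the interpreter you plan to use can actually see the module after installation, a quick check like the one below may help. This is a minimal sketch: the helper name is_module_installed is hypothetical and not part of jparser; it relies only on the standard library's importlib.util.find_spec.

```python
import importlib.util


def is_module_installed(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Expect True once `pip install jparser` has run for this interpreter.
    print("jparser installed:", is_module_installed("jparser"))
```

If this prints False even though pip reported success, pip probably installed the package into a different Python environment than the one running the script.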

2. Use Python Jparser To Extract Web Page Title, Content, Image.

The following example uses jparser to extract the title, text content, and image sources from a web page's HTML.

# PageModel takes HTML source as input, so the page is downloaded first.
# The install log above shows Python 3, where urllib2 is named urllib.request.
import urllib.request

# The PageModel class in the jparser module provides the method that extracts the page title, content and images.
from jparser import PageModel

# Call urllib.request.urlopen to read the web page data, then decode the bytes to text.
# This assumes the page is UTF-8 encoded.
web_page_url = "http://www.test-abc.com/test.shtml"
web_page_html = urllib.request.urlopen(web_page_url).read().decode('utf-8')

# Create a PageModel instance from the web page HTML.
page_model = PageModel(web_page_html)

# Extract the page title, content and images.
result = page_model.extract()

# Print the page title.
print("==title==")
print(result['title'])

# Print the page content.
print("==content==")
for item in result['content']:
    # If the item is text, print the text.
    if item['type'] == 'text':
        print(item['data'])
    # If the item is an image, print the image's src attribute.
    elif item['type'] == 'image':
        print("[IMAGE]", item['data']['src'])
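
Instead of printing each item, you may want to collect the extracted pieces for later use. The sketch below assumes the result has the shape used above: a 'title' key and a 'content' list whose items carry 'type' and 'data', with image data holding a 'src' attribute. The function summarize_result and the sample dictionary are hypothetical and written by hand for illustration; they are not jparser APIs or real jparser output. Relative image paths are resolved against the page URL with the standard library's urljoin.

```python
from urllib.parse import urljoin


def summarize_result(result, base_url):
    """Split an extract()-style result into title, text, and absolute image URLs."""
    paragraphs = []
    image_urls = []
    for item in result.get('content', []):
        if item['type'] == 'text':
            paragraphs.append(item['data'])
        elif item['type'] == 'image':
            # Resolve relative src values against the page URL.
            image_urls.append(urljoin(base_url, item['data']['src']))
    return {
        'title': result.get('title', ''),
        'text': '\n'.join(paragraphs),
        'images': image_urls,
    }


# Hand-written sample in the assumed result shape (not real jparser output).
sample_result = {
    'title': 'Sample title',
    'content': [
        {'type': 'text', 'data': 'First paragraph.'},
        {'type': 'image', 'data': {'src': '/images/a.png'}},
        {'type': 'text', 'data': 'Second paragraph.'},
    ],
}

summary = summarize_result(sample_result, "http://www.test-abc.com/test.shtml")
print(summary['title'])
print(summary['text'])
print(summary['images'])
```

Returning a plain dictionary keeps the helper independent of jparser itself, so it can be exercised with sample data before being pointed at a live page.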
