Jparser is a Python parser library that extracts structured data (text paragraphs and images) from HTML source code. This article shows how to install it and use it to read a web page's title, content, and images.
1. Install Python jparser Module.
- Open a terminal and run the following command to install it:
pip install jparser
- The output looks like this:
$ pip install jparser
Collecting jparser
  Downloading jparser-0.0.20.tar.gz (3.5 kB)
Requirement already satisfied: lxml>=3.7.1 in ./opt/anaconda3/lib/python3.7/site-packages (from jparser) (4.4.1)
Building wheels for collected packages: jparser
  Building wheel for jparser (setup.py) ... done
  Created wheel for jparser: filename=jparser-0.0.20-py3-none-any.whl size=4426 sha256=1b440eab6d3b585d0d01fcafd81a74c9b46eeddca500897f7e7c22408c8eebcc
  Stored in directory: /Users/songzhao/Library/Caches/pip/wheels/50/d4/c9/b023a6fe076967d52b40c1e3ef60987e6374a52e6882d41d3b
Successfully built jparser
Installing collected packages: jparser
Successfully installed jparser-0.0.20
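To confirm that the module is visible to the interpreter you plan to use, a quick check along these lines may help. This is only a sketch: the helper name is_installed is hypothetical, and it relies solely on the standard library's importlib.util.find_spec, not on anything provided by jparser.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate module_name."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("jparser installed:", is_installed("jparser"))
```

If this reports False right after a successful install, pip probably installed the package into a different Python environment than the one running the check.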
2. Use Python Jparser To Extract Web Page Title, Content, Image.
- The example below uses jparser's PageModel class to extract a page's title, text content, and image sources from its HTML source code.
# urllib2 is used here only to download the page; it is a Python 2 module.
# On Python 3 the equivalent call is urllib.request.urlopen.
import urllib2

# The PageModel class provides the method that extracts a page's title, content and images.
from jparser import PageModel

# Download the page and decode it (this assumes the page is UTF-8 encoded).
web_page_url = "http://www.test-abc.com/test.shtml"
web_page_html = urllib2.urlopen(web_page_url).read().decode('utf-8')

# Create a PageModel instance from the page's HTML.
page_model = PageModel(web_page_html)

# Extract the page title, content and images.
result = page_model.extract()

# Print the page title.
print "==title=="
print result['title']

# Print the page content.
print "==content=="
for item in result['content']:
    # If the item is text, print the text.
    if item['type'] == 'text':
        print item['data']
    # If the item is an image, print its src attribute.
    if item['type'] == 'image':
        print "[IMAGE]", item['data']['src']
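On Python 3, where print is a function, the printing loop above can be sketched as a small helper that turns an extract() result into a list of lines. Note the assumptions: format_result is a hypothetical helper, not part of jparser, and sample_result is made-up data that only mirrors the structure used above (a 'title' string plus a 'content' list whose items carry 'type' and 'data').

```python
def format_result(result):
    """Turn an extract()-style dict into a list of printable lines."""
    lines = ["==title==", result["title"], "==content=="]
    for item in result["content"]:
        if item["type"] == "text":
            # Text items carry the paragraph text directly in 'data'.
            lines.append(item["data"])
        elif item["type"] == "image":
            # Image items carry a dict in 'data'; 'src' holds the image URL.
            lines.append("[IMAGE] " + item["data"]["src"])
    return lines


# Hypothetical sample data shaped like the result used in this article.
sample_result = {
    "title": "Example page",
    "content": [
        {"type": "text", "data": "First paragraph."},
        {"type": "image", "data": {"src": "http://www.test-abc.com/a.png"}},
    ],
}

if __name__ == "__main__":
    print("\n".join(format_result(sample_result)))
```

Returning the lines instead of printing them directly keeps the formatting step separate from the download, so it can be tried out without network access.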