Python can be used to write a web crawler that downloads web pages. But the downloaded page content is massive and hard to use directly, so we need to filter out the useful data we need. This article will show you how to parse downloaded web page content and extract the information you need using the python lxml library's xpath method.
When it comes to string content filtering, we immediately think of regular expressions, but we won't talk about regular expressions today, because they are too complex for a crawler written by a novice. Moreover, the error tolerance of regular expressions is poor: if the web page changes even slightly, the matching expression has to be rewritten.
Fortunately, python provides many libraries for parsing HTML pages, such as BeautifulSoup (from the bs4 package) and etree in lxml (an xpath parser library). BeautifulSoup works like a jQuery selector: it looks for html elements by id, CSS selector, and tag. Etree's xpath method looks for elements primarily through the nested relationships of HTML nodes, similar to the path of a file. Below is an example of using xpath to find html nodes.
# Get all tr tags under the table tag with id "account".
path = '//table[@id="account"]//tr'
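To see this xpath in action, here is a minimal, self-contained sketch; the sample table HTML and its row values are invented for illustration.

```python
from lxml import etree

# Invented sample page containing a table with id "account".
html = '''
<table id="account">
  <tr><td>alice</td></tr>
  <tr><td>bob</td></tr>
</table>
'''

# Parse the html text into an _Element object.
dom = etree.HTML(html)

# Select every tr node anywhere under the table with id "account".
rows = dom.xpath('//table[@id="account"]//tr')
print(len(rows))  # 2

# A relative xpath (note the leading dot) searches within one row.
print(rows[0].xpath('.//td/text()'))  # ['alice']
```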
1. LXML Installation and Usage
1.1 Install the LXML library
pip install lxml
1.2 Lxml Xpath Usage
Before using xpath, you need to import the etree class and use it to process the original html page content into an _Element object. Then call the object's xpath method to get the related node values.
# Import the etree class.
from lxml import etree

# Example html content.
html = '''<div class="container">
    <p class="row">
        <a href="#123333" class="box">
            I love xpath
        </a>
    </p>
</div>'''

# Use etree to parse the html text and return an _Element object, which is a dom object.
dom = etree.HTML(html)

# Get the a tag's text. Please note: the _Element's xpath method always returns a list of html nodes.
a_tag_text = dom.xpath('//div/p/a/text()')

# Because there is only one a tag, we print the first list element, stripped of surrounding whitespace.
print(a_tag_text[0].strip())
Save the above code in a file get_html_element.py and run the command python3 get_html_element.py. Below is the execution result.
I love xpath
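Besides text(), the xpath method can also return attribute values with the @ syntax, again as a list. A short sketch, with the links and urls invented for illustration:

```python
from lxml import etree

# Invented sample html with two links.
html = '<div><a href="/home">Home</a><a href="/about">About</a></div>'
dom = etree.HTML(html)

# text() returns the text content of every matching a tag.
texts = dom.xpath('//div/a/text()')   # ['Home', 'About']

# @href returns the href attribute value of every matching a tag.
hrefs = dom.xpath('//div/a/@href')    # ['/home', '/about']

print(texts)
print(hrefs)
```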
2. Xpath Syntax
- a / b : The / represents a hierarchical relationship in xpath. a on the left is the parent node, b on the right is the child node, and here b must be a direct child of a.
- a // b : The double slash // selects all b nodes under the a node, whether they are direct children or not. So the xpath in the example above can also be written as //div//a/text().
- [@] : Selects html nodes by tag attributes. //div[@class] : Select div nodes that have a class attribute. //a[@x] : Select a nodes that have an x attribute. //div[@class="container"] : Select div nodes whose class attribute value is 'container'.
- //a[contains(text(), "love")] : Select a tags whose text content contains the string 'love'.
- //a[contains(@href, "user_name")] : Select a tags whose href attribute value contains 'user_name'.
- //div[contains(@y, "x")] : Select div tags that have a y attribute whose value contains 'x'.
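The syntax rules above can be exercised together in one small sketch; the sample html, class names, and urls are invented for illustration.

```python
from lxml import etree

# Invented sample html exercising the xpath syntax rules above.
html = '''<div class="container">
  <p class="row"><a href="/user_name/profile">I love xpath</a></p>
  <p class="row"><a href="/guest/profile">plain link</a></p>
</div>'''
dom = etree.HTML(html)

# // selects descendant nodes at any depth.
all_links = dom.xpath('//div//a')                                  # 2 nodes

# [@class="..."] filters nodes by an attribute value.
containers = dom.xpath('//div[@class="container"]')                # 1 node

# contains() on text content.
loved = dom.xpath('//a[contains(text(), "love")]/text()')          # ['I love xpath']

# contains() on an attribute value.
user_links = dom.xpath('//a[contains(@href, "user_name")]/@href')  # ['/user_name/profile']

print(len(all_links), len(containers), loved, user_links)
```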