Python can be used to write a web crawler that downloads web pages. But the downloaded page content is massive and hard to use directly, so we need to filter out the useful data. This article will show you how to parse the downloaded web page content and extract the information you need with Python.
When it comes to filtering string content, regular expressions immediately come to mind, but we won’t talk about regular expressions today, because they are too complex for a crawler written by a novice. Moreover, the error tolerance of regular expressions is poor: if the web page changes even slightly, the matching expression has to be rewritten.
Fortunately, Python provides many libraries for parsing HTML pages, such as BeautifulSoup (from the bs4 package) and etree in lxml (an XPath parser library). BeautifulSoup works like a jQuery selector: it looks for HTML elements by id, CSS selector, and tag name. etree’s xpath method looks for elements primarily through the nested relationships of HTML nodes, similar to the path of a file. Below is an example of using XPath to find HTML nodes.
# Gets all tr tags under the table tag with id "account"
path = '//table[@id="account"]//tr'
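As a minimal sketch of how such an expression is evaluated (assuming lxml is installed, and using a small made-up table fragment with id "account" rather than a real page):

```python
from lxml import etree

# A small, hypothetical HTML fragment with a table whose id is "account".
html = '''<table id="account">
  <tr><td>Alice</td></tr>
  <tr><td>Bob</td></tr>
</table>'''

# Parse the fragment into an _Element dom object.
dom = etree.HTML(html)

# Gets all tr tags under the table tag with id "account".
rows = dom.xpath('//table[@id="account"]//tr')
print(len(rows))  # → 2
```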
1. LXML Installation and Usage
1.1 Install the LXML library
pip install lxml
1.2 Lxml Xpath Usage
Before using XPath, you need to import the etree class and use it to process the original HTML page content to get an _Element object. Then use the object's xpath method to get the related node values.
# Import the etree class.
from lxml import etree

# Example html content.
html = '''<div class="container">
  <p class="row">
    <a href="#123333" class="box"> I love xpath </a>
  </p>
</div>'''

# Use etree to process the html text; it returns an _Element object, which is a dom object.
dom = etree.HTML(html)

# Get the a tag's text. Please note: the _Element's xpath method always returns a list of html nodes.
# Because there is only one a tag's text, we can do it like below.
a_tag_text = dom.xpath('//div/p/a/text()')
print(a_tag_text)
Save the above code in a file get_html_element.py and run the command python3 get_html_element.py. Below is the execution result.
[' I love xpath ']
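Besides text nodes, the xpath method can also return attribute values directly as plain strings by ending the expression with @attribute-name. A short sketch reusing the same example markup:

```python
from lxml import etree

html = '''<div class="container">
  <p class="row">
    <a href="#123333" class="box"> I love xpath </a>
  </p>
</div>'''

dom = etree.HTML(html)

# text() selects text nodes; @href selects the attribute value as a string.
href_list = dom.xpath('//div/p/a/@href')
print(href_list)  # → ['#123333']
```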
2. Xpath Syntax
- a / b : / represents the hierarchical relationship in XPath. a on the left is the parent node, b on the right is the child node, and b here is a direct child of a.
- a // b : The double / means all b nodes under the a node are selected, whether or not b is a direct child. So we can also write the above example XPath as //div//a/text().
- [@] : Select html nodes by a tag's attributes. //div[@class] : Select div nodes that have a class attribute. //a[@x] : Select a nodes that have an x attribute. //div[@class="container"] : Select div nodes whose class attribute value is 'container'.
- //a[contains(text(), "love")] : Select the a tags whose text content contains the string 'love'.
- //a[contains(@href, "user_name")] : Select the a tags whose href attribute value contains 'user_name'.
- //div[contains(@y, "x")] : Select div tags that have a y attribute whose value contains 'x'.
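The selectors above can be exercised with a short sketch (assuming lxml is installed; the HTML fragment and its href values are made up for illustration):

```python
from lxml import etree

# A made-up fragment to exercise the XPath selectors above.
html = '''<div class="container">
  <a href="/user_name/profile">I love xpath</a>
  <a href="/other">Something else</a>
</div>'''

# etree.HTML wraps the fragment in html/body tags.
dom = etree.HTML(html)

# Direct-child path from the document root.
print(len(dom.xpath('/html/body/div/a')))           # → 2
# Attribute with an exact value.
print(len(dom.xpath('//div[@class="container"]')))  # → 1
# contains() on the text content.
print(dom.xpath('//a[contains(text(), "love")]/text()'))  # → ['I love xpath']
# contains() on an attribute value.
print(len(dom.xpath('//a[contains(@href, "user_name")]')))  # → 1
```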
3. Question & Answer.
3.1 How to use the python lxml module to parse out URL addresses in a web page.
- In my python script, I use the requests module's get method to retrieve web content with the page URL. Then I use the python lxml html module to parse the web page content into a dom tree. My question is how to parse out the URL addresses from the dom tree. Below is my source code.
# Import the python requests module.
import requests

# Import the html module from the lxml library.
from lxml import html

# Define the web page url.
web_page_url = "https://www.abc.com"

# Get the web page content by its url with the requests module's get() method.
web_page = requests.get(web_page_url)

# Get the web page content data.
web_page_content = web_page.content

# Get the web page dom tree with the html module's fromstring() function.
dom_tree = html.fromstring(web_page_content)
- Below is the source code that reads a web page by its URL and parses out all the html a tag link URLs on the page.
# Import the python requests module.
import requests

# Import the etree module from the lxml library.
from lxml import etree

# Import the StringIO class from the io package.
from io import StringIO


def parse_html_url_link(page_url):
    # Create an instance of the etree.HTMLParser class.
    html_parser = etree.HTMLParser()

    # Use the python requests module's get method to get the web page object with the provided url.
    web_page = requests.get(page_url)

    # Convert the web page bytes content to a text string with the decode method.
    web_page_html_string = web_page.content.decode("utf-8")

    # Create a StringIO object with the above web page html string.
    str_io_obj = StringIO(web_page_html_string)

    # Create an etree dom object.
    dom_tree = etree.parse(str_io_obj, parser=html_parser)

    # Get all <a href...> tag elements in a list.
    a_tag_list = dom_tree.xpath("//a")

    # Loop through the html a tag list.
    for a in a_tag_list:
        # Get each a tag's href attribute value; the value holds the a tag's URL link.
        url = a.get('href')
        # Print out the parsed URL.
        print(url)


if __name__ == '__main__':
    parse_html_url_link("https://www.yahoo.com/")
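As a side note, the loop over the a tags can be shortened: an XPath expression ending in @href returns the attribute values directly as strings. A sketch of that variant, using a local made-up HTML string in place of the live request so it runs offline:

```python
from lxml import etree

# A hypothetical page body standing in for real downloaded content.
html = '''<body>
  <a href="https://www.yahoo.com/news">News</a>
  <a href="https://www.yahoo.com/mail">Mail</a>
</body>'''

dom_tree = etree.HTML(html)

# //a/@href selects every a tag's href attribute value as a plain string.
url_list = dom_tree.xpath('//a/@href')
print(url_list)  # → ['https://www.yahoo.com/news', 'https://www.yahoo.com/mail']
```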