Writing a Python Web Crawler, Part 4: Using lxml to Parse Page Content Selected by XPath

# coding=utf-8
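'''
This example uses lxml to read the links inside a page region selected by an XPath expression.
You can get the XPath from Firefox's Firebug. Note: the XPath Firebug reports for an element may not match
the HTML that urllib2 downloads, because Firebug shows the page as the browser has parsed and rendered it,
not the raw source.
Installing lxml is somewhat involved; it is easiest to install easy_install first and then use it to install lxml.
If lxml is installed but the Python file still fails to run, the cause may be lxml's reliance on BeautifulSoup 3
(lxml only uses BeautifulSoup through its optional lxml.html.soupparser module).
In that case, try BeautifulSoup 3 in place of BeautifulSoup 4.
'''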

from bs4 import BeautifulSoup  # import BeautifulSoup (used only by the commented-out example below)
from lxml import etree
import urllib2

url = 'http://blog.csdn.net/yuetiantian/'  # address of the page to crawl
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0')
response = urllib2.urlopen(req)
the_page = response.read()

#print the_page
# Extract page content with BeautifulSoup
#soup = BeautifulSoup(the_page)
#print soup.title.string  # page title

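xpathStr = '/html/body/div/div[3]/div[2]/div[1]/div[4]/ul[2]/li/a'  # the category list on the left side of the page
root = etree.HTML(the_page)   # parse the downloaded HTML into an element tree
links = root.xpath(xpathStr)  # list of matching <a> elements
for link in links:
    print link.attrib['href']  # raises KeyError if an <a> has no href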
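
An absolute XPath like the one above stops matching as soon as the page layout shifts by a single div. The following is a minimal sketch of a more tolerant variant, run against an inline HTML string so it needs no network access. The sample markup and the `panel_Category` id are invented for illustration and are not taken from the real blog page; only `etree.HTML`, `xpath()` and `get()` are actual lxml calls.

```python
from lxml import etree

# Hypothetical markup standing in for a downloaded page (not the real blog HTML).
SAMPLE_HTML = """
<html><body>
  <div id="panel_Category">
    <ul>
      <li><a href="/category/1">Python</a></li>
      <li><a href="/category/2">Linux</a></li>
      <li><a>no link here</a></li>
    </ul>
  </div>
</body></html>
"""

root = etree.HTML(SAMPLE_HTML)

# Absolute path, in the style of one copied from a browser tool.
absolute = root.xpath('/html/body/div/ul/li/a')

# Relative path anchored on an id. Selecting @href returns the attribute
# values directly and skips <a> elements that have no href.
hrefs = root.xpath('//div[@id="panel_Category"]//a/@href')

# get() returns None instead of raising KeyError when href is missing.
absolute_hrefs = [a.get('href') for a in absolute]

print(hrefs)
print(absolute_hrefs)
```

To apply the same idea to the live page, pass `the_page` to `etree.HTML` and replace the id with one that actually appears in the downloaded source, which you can check by printing `the_page`.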

About author

曾月天
