Writing a Python Web Crawler, Part 3: Disguising Requests as Coming from a Browser

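# coding=utf-8
'''
Sending a plain urllib2 request makes some sites respond with a 403 error.
To get around this, the crawler's request has to be disguised as a request
coming from a browser, which is done by adding a User-Agent field to the
request headers.
If you use Firefox, you can install the Live HTTP Headers add-on and use it
to inspect the HTTP headers your browser sends; the one you need is User-Agent.
'''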

from bs4 import BeautifulSoup  # HTML parser used below to extract page content
import urllib2

url = 'http://blog.csdn.net/yuetiantian/'  # address of the page you want to crawl
req = urllib2.Request(url)
# Send the same User-Agent string a Firefox browser would send
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0')
response = urllib2.urlopen(req)
the_page = response.read()

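# Extract the page content with BeautifulSoup
# (the parser is named explicitly so the result does not depend on which parsers are installed)
soup = BeautifulSoup(the_page, 'html.parser')
print(soup.title.string)  # page title
print(soup.find_all('a'))  # links on the page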
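The script above targets Python 2, where urllib2 is available. As a rough sketch of the same idea on Python 3, the snippet below uses urllib.request instead (a swapped-in module, not what the original script uses). The helper names build_request and fetch_status are hypothetical and exist only for illustration; the User-Agent string is the one from the script above.

```python
import urllib.error
import urllib.request

# Example value taken from the script above; any real browser's
# User-Agent string could be used instead.
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header.

    Hypothetical helper, not part of the original script.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent=BROWSER_UA, timeout=10):
    """Return the HTTP status code for url, including error codes such as 403.

    Hypothetical helper, not part of the original script.
    """
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses;
        # the status code is available on the exception.
        return err.code
```

Calling fetch_status(url) once with the default header and once with a non-browser string is one way to check whether a given site filters on User-Agent. Whether a site actually returns 403 depends on that site, so no particular result is assumed here.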

About the author

曾月天
