python - Web scraping with Python

Tags: python urllib2 web-scraping

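I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page from Python. I thought it was a user-agent issue, but changing the user agent did not help. Then I thought it might have something to do with cookies, but apparently loading the page through links with cookies turned off works fine. What could be blocking requests made through urllib?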

Best answer

For whatever reason, http://www.nseindia.com/ seems to require an Accept header. This should work:

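import urllib2

# Build the request and set the Accept header the server appears to require
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
# Identify the scraper with a descriptive User-Agent
r.add_header('User-Agent', 'My scraping program <author@example.com>')
opener = urllib2.build_opener()
content = opener.open(r).read()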

Rejecting requests that lack an Accept header is incorrect; RFC 2616 clearly states:

If no Accept header field is present, then it is assumed that the client accepts all media types.
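urllib2 exists only in Python 2, so here is a rough Python 3 sketch of the same idea using urllib.request. The names DEFAULT_HEADERS, build_request, and fetch are made up for illustration and are not part of any library. Whether the site still responds this way is an assumption I have not verified.

```python
import urllib.error
import urllib.request

# Headers from the answer above; adjust the User-Agent to identify your own program.
DEFAULT_HEADERS = {
    "Accept": "*/*",
    "User-Agent": "My scraping program <author@example.com>",
}


def build_request(url, headers=None):
    """Return a Request that carries explicit Accept and User-Agent headers."""
    merged = dict(DEFAULT_HEADERS)
    if headers:
        merged.update(headers)  # caller-supplied headers override the defaults
    return urllib.request.Request(url, headers=merged)


def fetch(url, timeout=10):
    """Return (status_code, body_bytes), including for HTTP error responses."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # A 403 ends up here; the error body sometimes hints at what the server wants.
        return err.code, err.read()
```

Calling fetch('http://www.nseindia.com/') would then return the status code along with the body, which makes it easier to see whether adding a header actually changed the server's response.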

Regarding python - Web scraping with Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6969567/
