python-2.7 - 403 Forbidden output when using BeautifulSoup

Tags: python-2.7 web-scraping beautifulsoup

I am trying to scrape articles from a website using BeautifulSoup, but I keep getting "HTTP Error 403: Forbidden" as output. Could someone explain how to get around this? Here is my code:

import datetime
import urllib2

from bs4 import BeautifulSoup

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"

timestamp = datetime.date.today()

# Parse HTML of article, aka making soup
# (this is the call that raises HTTP Error 403)
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Check if article is from Magharebia.com
# remaining issues: error 403: forbidden. Possible robots.txt?
# Can't scrape anything atm
if "magharebia.com" in url:

    # Create a new file to write content to
    txt = open('%s.txt' % timestamp, "wb")

    # Write the article title to the file
    try:
        title = soup.find("h2")
        txt.write('\n' + "Title: " + str(title) + '\n' + '\n')
    except:
        print "Could not find the title!"

    # Author/Location/Date
    try:
        artinfo = soup.find("h4").text
        txt.write("Author/Location/Date: " + str(artinfo) + '\n' + '\n')
    except:
        print "Could not find the article info!"

    # Retrieve all of the paragraphs
    tags = soup.find("div", {'class': 'body en_GB'}).find_all('p')
    for tag in tags:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')

    # Close txt file with new content added
    txt.close()



Please enter a valid URL: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03
Traceback (most recent call last):
  File "idle_test.py", line 18, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url).read())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

Best answer

I was able to reproduce the 403 Forbidden error with urllib2. I did not dig into the cause, but the following worked for me:

import requests
from bs4 import BeautifulSoup

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"

soup = BeautifulSoup(requests.get(url).text)

print soup # prints the HTML you are expecting
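
The answer above does not say why requests succeeds where urllib2 fails. One plausible explanation, which is an assumption and not something the answer confirms, is that the server rejects urllib2's default User-Agent header (Python-urllib/2.7). If that is the case, a minimal sketch along these lines, which sends a browser-like User-Agent, might let urllib2 through as well. The header value and the helper names build_request and fetch_html are hypothetical, chosen only for illustration; the import fallback is there so the same sketch also runs on Python 3.

```python
try:
    # Python 2, as used in the question
    from urllib2 import Request, urlopen
except ImportError:
    # Python 3 moved these names into urllib.request
    from urllib.request import Request, urlopen

# Hypothetical browser-like value; the site may or may not accept it.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_request(url, headers=None):
    """Return a Request carrying the given headers (nothing is sent yet)."""
    return Request(url, headers=headers or BROWSER_HEADERS)


def fetch_html(url, timeout=10):
    """Send the request with the headers above and return the body as bytes."""
    response = urlopen(build_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

With this in place, the parsing line from the question could become soup = BeautifulSoup(fetch_html(url)), leaving the rest of the script unchanged. If the 403 persists, the cause is something other than the User-Agent.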

Regarding "python-2.7 - 403 Forbidden output when using BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23073209/
