python - How to keep a web page from crashing BeautifulSoup?

Tags: python, exception, python-3.x, beautifulsoup, python-requests

Running Python 3.2.3 on Kubuntu Linux 12.10 with Requests 0.12.1 and BeautifulSoup 4.1.0, some web pages break during parsing:

try:
    response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
except Exception as error:
    return False

pprint(str(type(response)))
pprint(response)
pprint(str(type(response.content)))

soup = bs4.BeautifulSoup(response.content)

Note that hundreds of other web pages parse fine. What is it in this particular page that crashes Python, and how can I work around it? Here is the crash:

 - bruno:scraper$ ./test-broken-site.py 
"<class 'requests.models.Response'>"
<Response [200]>
"<class 'bytes'>"
Traceback (most recent call last):
  File "./test-broken-site.py", line 146, in <module>
    main(sys.argv)
  File "./test-broken-site.py", line 138, in main
    has_adsense('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
  File "./test-broken-site.py", line 67, in test_page_parse
    soup = bs4.BeautifulSoup(response.content)
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 175, in feed
    self.parser.close()
  File "parser.pxi", line 1171, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:79886)
  File "parsertarget.pxi", line 126, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:88932)
  File "lxml.etree.pyx", line 282, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7469)
  File "saxparser.pxi", line 288, in lxml.etree._handleSaxDoctype (src/lxml/lxml.etree.c:85572)
  File "parsertarget.pxi", line 84, in lxml.etree._PythonSaxParserTarget._handleSaxDoctype (src/lxml/lxml.etree.c:88469)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 150, in doctype
    doctype = Doctype.for_name_and_ids(name, pubid, system)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
    return Doctype(value)
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 653, in __new__
    return str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
TypeError: coercing to str: need bytes, bytearray or buffer-like object, NoneType found

Instead of bs4.BeautifulSoup(response.content) I had tried bs4.BeautifulSoup(response.text). That gave the same result (the same crash on this page). How can I work around broken pages like this so that I can parse them?

Best answer

The site shown in the output has this doctype:

<!DOCTYPE>

while a proper site would need something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

When the beautifulsoup parser tries to build the doctype here:

File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
return Doctype(value)

the value of the doctype is empty (None), and the parser fails when it then tries to use that value.

One solution is to fix the problem manually with a regular expression before handing the page to beautifulsoup.
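A minimal sketch of that idea is below. The helper name and the exact regex are illustrative, not from the original post: it rewrites a bare <!DOCTYPE> into a valid one before parsing, and also names the parser explicitly (html.parser is generally more forgiving of malformed doctypes than lxml):

```python
import re
import bs4

def parse_with_doctype_fix(markup: bytes) -> bs4.BeautifulSoup:
    """Repair an empty <!DOCTYPE> declaration, then parse.

    A bare ``<!DOCTYPE>`` (no name) is what made the bs4/lxml builds in
    the question hand None to Doctype and raise the TypeError above.
    """
    # Replace the first empty doctype with a valid HTML5 one.
    fixed = re.sub(rb'<!DOCTYPE\s*>', b'<!DOCTYPE html>', markup,
                   count=1, flags=re.IGNORECASE)
    # Pass the parser name explicitly instead of letting bs4 pick one.
    return bs4.BeautifulSoup(fixed, 'html.parser')

broken = b'<!DOCTYPE><html><body><p>hello</p></body></html>'
soup = parse_with_doctype_fix(broken)
print(soup.p.get_text())  # -> hello
```

Note this only covers the one breakage seen here; other malformed pages may need their own pre-processing, so wrapping the parse in a try/except is still a good idea.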

Regarding "python - How to keep a web page from crashing BeautifulSoup?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17134892/
