python - How to handle a BeautifulSoup recursion error (or parsing error)

Tags: python parsing web-scraping beautifulsoup lxml

I have a bunch of HTML files that I am trying to read with BeautifulSoup. For some of them I get an error. I have tried decoding, encoding... but I cannot find the problem. Thanks a lot in advance.

Here is an example.

import requests
from bs4 import BeautifulSoup
new_text = requests.get('https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt')
soup = BeautifulSoup(new_text.content.decode('utf-8','ignore').encode("utf-8"),'lxml')
print(soup)

In a Jupyter notebook I get a dead-kernel error. In PyCharm I get the following error (it repeats itself, so I removed parts of it, but it is very long):

Traceback (most recent call last):
  File "C:/Users/oe/.PyCharmCE2019.1/config/scratches/scratch_5.py", line 5, in <module>
    print(soup)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1099, in __unicode__
    return self.decode()
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\__init__.py", line 566, in decode
    indent_level, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
    formatter))
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
    formatter))
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
    formatter))
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1254, in decode_contents
    text = c.output_ready(formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 745, in output_ready
    output = self.format_string(self, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 220, in format_string
    if isinstance(formatter, Callable):
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\abc.py", line 190, in __instancecheck__
    subclass in cls._abc_negative_cache):
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison

Best answer

Frankly, I am not sure what the underlying problem with your code is (although I did not get a dead kernel in a Jupyter notebook), but this seems to work:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'
new_text = requests.get(url)

soup = BeautifulSoup(new_text.text,'lxml')
print(soup.text)

Note that in soup, new_text.content was replaced with new_text.text, I had to drop the decode/encode calls, and the print command had to be changed from print(soup) (which raises the error) to print(soup.text), which works fine. Perhaps someone smarter can explain why...
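
If you actually need to print the full markup rather than just the extracted text, here is a minimal workaround sketch, assuming the RecursionError simply means Python's default recursion limit (around 1000 frames) is too small for bs4 to serialize such a deeply nested document; the 10000 value below is an arbitrary guess, not something taken from the answer above:

import sys
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Serializing the whole tree recurses once per nesting level, so a very deep
# SEC filing can exhaust the default limit; give it more headroom first.
sys.setrecursionlimit(10000)
print(soup)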

Another option that works is:

import urllib.request
from bs4 import BeautifulSoup

# url is the same EDGAR filing URL defined above
response = urllib.request.urlopen(url)
new_text2 = response.read()
soup = BeautifulSoup(new_text2, 'lxml')
print(soup.text)
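
As a design note, this second option hands BeautifulSoup raw bytes and lets it detect the encoding itself, which is also why the manual decode/encode round trip in the original code is unnecessary. A minimal sketch of the same idea with requests, passing new_text.content directly, might look like this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'
new_text = requests.get(url)

# Bytes in: bs4/lxml detect the encoding themselves, no manual decode/encode needed.
soup = BeautifulSoup(new_text.content, 'lxml')
print(soup.text)  # extract the text instead of re-serializing the whole tree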

Regarding python - How to handle a BeautifulSoup recursion error (or parsing error), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55029894/
