python - table.decompose(): AttributeError: 'str' object has no attribute 'decompose'

Tags: python beautifulsoup attributeerror

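I am trying to use BeautifulSoup to parse an HTML document. I want to write code that parses the document, finds all tables, and deletes those whose ratio of numeric characters to alphanumeric characters is greater than 15%. I used the code given in an answer to an earlier question:

Delete HTML element if it contains a certain amount of numeric characters

But for some reason, the table.decompose() call is flagged as an error. I would appreciate any help. Please note that I am a beginner, so although I do try, I don't always understand more complex solutions!

Here is the code: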

import re
from bs4 import BeautifulSoup

test_file = 'locationoftestfile.html'


# Define a function to remove tables whose ratio of numeric characters
# to alphabetic plus numeric characters is greater than 15%
def remove_table(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: ' + str(ratio))
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.15:
        table.decompose()


# Define a function to create our Soup object and then extract text
def file_to_text(file):
    soup_file = open(file, 'r')
    soup = BeautifulSoup(soup_file, 'html.parser')
    for table in soup.find_all('table'):
        remove_table(table)
    text = soup.get_text()
    return text


file_to_text(test_file)

Here is the output/error I get:

numeric: 1
alpha: 55
ratio: 0.017857142857142856
numeric: 9
alpha: 88
ratio: 0.09278350515463918
numeric: 20
alpha: 84
ratio: 0.19230769230769232
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-c7e380df4fdc> in <module>
----> 1 file_to_text(test_file)

<ipython-input-27-9fb65cec1313> in file_to_text(file)
     16                 ratio = 1
     17         if ratio > 0.15:
---> 18             table.decompose()
     19     text = soup.get_text()
     20     return text

AttributeError: 'str' object has no attribute 'decompose'

Note that the table.decompose() call is not what the solution I linked to uses. That solution uses

   return True
else:
   return False

But, perhaps naively, I don't understand how that would delete the table.

Best answer

table = re.sub('<[^>]*>', ' ', str(table))

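This overwrites the parameter "table" with a string. By the time table.decompose() runs, "table" is no longer a BeautifulSoup tag but a plain str, which has no decompose method. You probably want to use a different name for the string here. For example: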

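import re


def remove_table(table):
    # Keep the tag in "table"; store the tag-stripped text under its own name
    table_as_str = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table_as_str)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table_as_str)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: ' + str(ratio))
    except ZeroDivisionError:
        # No letters or digits at all: treat the table as fully numeric
        ratio = 1
    if ratio > 0.15:
        # "table" is still the BeautifulSoup tag, so decompose() exists
        table.decompose()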
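
On the last part of the question: a function that only returns True or False does not delete anything by itself; the caller is expected to check the result and remove the table. Below is a minimal sketch of that pattern, not the linked answer's actual code. The names is_mostly_numeric, strip_numeric_tables, and sample_html are hypothetical and do not appear in the original posts, and the sketch uses get_text() instead of the regex to obtain the table's text.

```python
from bs4 import BeautifulSoup


def is_mostly_numeric(table, threshold=0.15):
    """Return True if digits exceed `threshold` of the table's letters plus digits."""
    # get_text() returns a new string; `table` itself stays a Tag
    text = table.get_text(" ")
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    if numeric + alphabetic == 0:
        # Same choice as the code above: an empty table counts as ratio 1
        return True
    return numeric / (numeric + alphabetic) > threshold


def strip_numeric_tables(html, threshold=0.15):
    """Parse `html`, drop the mostly numeric tables, and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        # The predicate only decides; the deletion happens here in the caller
        if is_mostly_numeric(table, threshold):
            table.decompose()
    return soup.get_text()


# Hypothetical sample input: one numeric table, one mostly textual table
sample_html = (
    "<p>Intro</p>"
    "<table><tr><td>Revenue</td><td>2019 1234567</td></tr></table>"
    "<table><tr><td>Mostly words here with one 1</td></tr></table>"
)
result = strip_numeric_tables(sample_html)
print(result)
```

Separating the decision (the predicate) from the action (decompose) also avoids the original problem, because nothing ever reassigns the table variable.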

Regarding python - table.decompose(): AttributeError: 'str' object has no attribute 'decompose', a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59653000/
