python - Extract all text from an HTML file while checking whether it is bold (Python)

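Input: any HTML file that contains bold and non-bold text, spread across different kinds of tags (e.g. <div>, <td>, and others).

Desired output: a data structure (e.g. a DataFrame or a dictionary) that lets me collect all text elements of the HTML file, together with whether the text in a given tag is bold or not. For example: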

import pandas as pd

data = {'Text': ['bold text (1)', 'text (2)', 'text (3)', 'bold text (4)'], 'Bold': ['yes', 'no', 'no', 'yes']}
df = pd.DataFrame(data)

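Notes: as far as I know, bold text can sit between <b> ... </b> tags, or in any tag that carries the attribute style="font-weight:700;" or style="font-weight:bold;", e.g. <span style="font-weight:700;"> ... </span>.

Reproducible example: this is my sample HTML file. It contains 15 text elements, 4 of which are bold: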

<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>

I figured out how to get all the text elements with Beautiful Soup...

from bs4 import BeautifulSoup
with open(html_file, 'r') as f:
    # create soup object of .html file
    soup = BeautifulSoup(f, 'html.parser')
    soup.findAll(text=True, recursive=True)

# output: ['text (1)', 'text (2)', 'text (3)', 'text (4)', 'text (5)', 'text (6)', 'text (7)', 'bold text (8)', 'text (9)', 'bold text (10)', 'text (11)', 'bold text (12)', 'text (13)', 'bold text (14)', 'text (15)']

...but I cannot work out how to get information about the tag attributes (the font weight), or how to check whether a tag is <b> ... </b> or not. Can you give me a hint?

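Best answer

You can inspect each text node's parent: check whether its name is b, or whether it has a style attribute that mentions font-weight. That gets you a step closer: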

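data = []

# string=True selects every text node; the search is recursive by default
for e in soup.find_all(string=True):
    style = e.parent.get('style') or ''   # '' when the parent has no style attribute
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',       # text sits directly inside <b>
        'isBoldStyle': 'font-weight' in style,   # parent declares some font-weight
    })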
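Note that the substring test above is true for any font-weight declaration, including font-weight:normal. If that matters for your files, a stricter check could parse the declared value. The following is only a sketch: the helper name is_bold_style is made up for illustration, and treating "bold", "bolder" and numeric weights of 700 or more as bold is an assumption based on the two values named in the question.

```python
def is_bold_style(style):
    """Return True if an inline style string declares a bold font-weight.

    Assumption: 'bold', 'bolder' and numeric weights >= 700 count as bold.
    """
    bold = False
    # An inline style is a ';'-separated list of 'name: value' declarations
    for declaration in (style or '').split(';'):
        name, _, value = declaration.partition(':')
        if name.strip().lower() != 'font-weight':
            continue
        value = value.strip().lower()
        # If font-weight is declared more than once, the last one wins
        if value.isdigit():
            bold = int(value) >= 700
        else:
            bold = value in ('bold', 'bolder')
    return bold
```

With this helper, the 'isBoldStyle' entry could be computed as is_bold_style(e.parent.get('style')) instead of the substring test.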

Example

from bs4 import BeautifulSoup

html='''<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>'''

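# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')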

data = []

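# string=True selects every text node; the search is recursive by default
for e in soup.find_all(string=True):
    style = e.parent.get('style') or ''   # '' when the parent has no style attribute
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',       # text sits directly inside <b>
        'isBoldStyle': 'font-weight' in style,   # parent declares some font-weight
    })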

data

Output

[{'text': 'text (1)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (2)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (3)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (4)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (5)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (6)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (7)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (8)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (9)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (10)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (11)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (12)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (13)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (14)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (15)', 'isBoldTag': False, 'isBoldStyle': False}]

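Or as a DataFrame -> pd.DataFrame(data)

              text  isBoldTag  isBoldStyle
0         text (1)      False        False
1         text (2)      False        False
2         text (3)      False        False
3         text (4)      False        False
4         text (5)      False        False
5         text (6)      False        False
6         text (7)      False        False
7    bold text (8)      False         True
8         text (9)      False        False
9   bold text (10)      False         True
10       text (11)      False        False
11  bold text (12)       True        False
12       text (13)      False        False
13  bold text (14)       True        False
14       text (15)      False        False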
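The approach above only looks at the direct parent of each text node, so text such as <b><a>...</a></b>, where the bold tag is further up the tree, would be missed. A possible extension, sketched here under a few assumptions, walks all ancestors via e.parents and folds the result into the Text/Bold shape from the question. The helper name tag_is_bold and the small HTML string are invented for illustration, and counting <strong> as bold goes beyond what the question states. The sketch does not handle a descendant that resets an inherited bold weight back to normal.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bold set on an ancestor, a non-bold style, and <strong>
html = ('<div><b><a href="#">nested bold</a></b>'
        '<span style="font-weight:normal;">plain</span>'
        '<strong>strong text</strong></div>')


def tag_is_bold(tag):
    """Return True if this single tag makes its content bold (assumed rules)."""
    if tag.name in ('b', 'strong'):
        return True
    # Normalise spacing and case before looking for the two bold declarations
    style = (tag.get('style') or '').replace(' ', '').lower()
    return 'font-weight:bold' in style or 'font-weight:700' in style


soup = BeautifulSoup(html, 'html.parser')

rows = []
for e in soup.find_all(string=True):
    # e.parents yields every ancestor, from the direct parent up to the document
    bold = any(tag_is_bold(parent) for parent in e.parents)
    rows.append({'Text': str(e), 'Bold': 'yes' if bold else 'no'})

print(rows)
```

The resulting list of dictionaries can be passed to pd.DataFrame(rows) in the same way as data above.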

The original question, "Extract all text from an HTML file while checking whether it is bold (Python)", can be found on Stack Overflow: https://stackoverflow.com/questions/72776078/
