python - How to get the <text> tag from an HTML document using Beautiful Soup

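How can I get the <text> tag from an HTML document (the Abbott Labs 10-K filing) using Beautiful Soup?

I want to extract the tag names of all children of the <text></text> tag, using the following code: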

from bs4 import BeautifulSoup
import urllib.request
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = urllib.request.urlopen(url)
soup = BeautifulSoup(htmlpage, "html.parser")
all_text = soup.find('text')
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)

But with the code above, the output I get is ['html'].

Expected output:
['p','p','p','p','p','p','div','div','font','font', etc......]

Best answer

You can use a CSS selector (this prints every element nested inside the text tag, at any depth):

for child in all_text.select('text *'):
    print(child.name, end=' ')

Prints:

br p font font b p font b br p font b div div ...
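
A possible explanation for the ['html'] result in the question, sketched on a made-up miniature document rather than the real filing: if the filing wraps a complete <html> document inside <text> (which the ['html'] output suggests, but which is an assumption here), then "html.parser" keeps that nested <html> element as the only tag child of <text>. A descendant selector walks past it. The `sample` string and the variable names below are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the filing: a full <html> document wrapped in <text>.
sample = (
    "<document><text><html><body>"
    "<p>a</p><div><font>b</font></div>"
    "</body></html></text></document>"
)

soup = BeautifulSoup(sample, "html.parser")
text_tag = soup.find("text")

# Same approach as in the question: names of the direct children only.
direct = [child.name for child in text_tag.contents if child.name is not None]

# Descendant selector: every tag nested anywhere below <text>.
nested = [tag.name for tag in soup.select("text *")]

print(direct)  # ['html']
print(nested)  # ['html', 'body', 'p', 'div', 'font']
```

Different parsers can build different trees from the same markup, so the result with "lxml" may not match what "html.parser" produces here.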

Edit: To print only the direct children of the text tag, you can use:

from bs4 import BeautifulSoup
import requests

url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'

htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")

for child in soup.select('text > *'):
    print(child.name, end=' ')
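
To make the difference between the two selectors concrete, here is a minimal sketch on a tiny made-up snippet (the `sample` string is hypothetical and far simpler than the filing). 'text *' uses the descendant combinator, while 'text > *' uses the child combinator.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document; the real filing is far larger.
sample = "<text><p>one <b>bold</b></p><div><font>two</font></div></text>"

soup = BeautifulSoup(sample, "html.parser")

# 'text *' matches every descendant of <text>, at any depth.
descendants = [tag.name for tag in soup.select("text *")]

# 'text > *' matches only the direct children of <text>.
children = [tag.name for tag in soup.select("text > *")]

print(descendants)  # ['p', 'b', 'div', 'font']
print(children)     # ['p', 'div']
```

The sketch uses "html.parser" only to avoid an extra dependency; the selectors themselves do not depend on the parser, although the tree they run against does.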

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/56765933/
