python - Extract a specific section of text between tags from HTML

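I want to extract a specific section of text (the "Item 1A" section) from an HTML file. I want to start at "Item 1A" in the body of the document (not in the table of contents) and stop at "Item 1B". However, the strings "Item 1A" and "Item 1B" each appear several times. How can I pin down the specific occurrences to start and stop at?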

import requests
from bs4 import BeautifulSoup
import re

url = "https://www.sec.gov/Archives/edgar/data/1606163/000114420416089184/v434424_10k.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
text = soup.get_text()

item1a = re.search(r"(item\s1A\.?)(.+)(item\s1B\.?)", text, re.DOTALL | re.IGNORECASE)

item1a.group(2)

The output captures text starting from the first "Item 1A", the one in the table of contents, rather than from the section heading.

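So I would like to know:

How can I capture the text starting from the "Item 1A" in the body (not the "Item 1A" in the table of contents)?
Why did it capture up to the last "Item 1B" instead of stopping at the "Item 1B" in the table of contents?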

Best answer

Since you already have a soup that understands the HTML structure, why not make use of it?
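On the second question, the regex most likely ran through to the last "Item 1B" because `.+` is greedy: it consumes as much as it can and only backtracks as far as needed, so the match ends at the final "Item 1B" in the text. The sketch below shows this on a made-up stand-in for the filing text (the `text` string and its contents are assumptions, not taken from the real document). A lazy `.+?` stops at the first "Item 1B" instead, which here is still the table-of-contents entry, so it does not solve the problem by itself.

```python
import re

# Hypothetical stand-in for the filing text: table of contents, then the body.
text = (
    "Item 1A. Risk Factors 12\n"
    "Item 1B. Unresolved Staff Comments 30\n"
    "Item 1A. Risk Factors\n"
    "We are a newly formed company.\n"
    "Item 1B. Unresolved Staff Comments\n"
    "None.\n"
)

flags = re.DOTALL | re.IGNORECASE

# Greedy: starts at the first "Item 1A" and runs to the LAST "Item 1B".
greedy = re.search(r"item\s1A\.?(.+)item\s1B\.?", text, flags)

# Lazy: starts at the first "Item 1A" and stops at the FIRST "Item 1B".
lazy = re.search(r"item\s1A\.?(.+?)item\s1B\.?", text, flags)

print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
```

Either way, the search begins at the first "Item 1A" it finds, which is why working from the tag structure is more reliable here than matching on the flattened text.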

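One way to phrase the task is "find the text between two tags that have particular properties" (the tags that represent the 1A and 1B headers). To do that, you can pass a callable (a function) to soup.find():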

import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
import re

url = "https://www.sec.gov/Archives/edgar/data/1606163/000114420416089184/v434424_10k.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

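def is_pstyle(tag: Tag) -> bool:
    # Only <p style="..."> tags qualify; the isinstance check also rejects plain text nodes.
    return isinstance(tag, Tag) and tag.name == "p" and tag.has_attr("style")

def is_i1a(tag: Tag) -> bool:
    return is_pstyle(tag) and re.match(r"Item 1A\..*", tag.text) is not None

def is_i1b(tag: Tag) -> bool:
    return is_pstyle(tag) and re.match(r"Item 1B\..*", tag.text) is not None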

def grab_1a_thru_1b(soup: BeautifulSoup) -> str:
    start = soup.find(is_i1a)
    def gen_t():
        for tag in start.next_siblings:
            if is_i1b(tag):
                break
            else:
                if hasattr(tag, "get_text"):
                    yield tag.get_text()  # get_text("\n")
                else:
                    yield str(tag)
    return "".join(gen_t())

if __name__ == "__main__":
    print(grab_1a_thru_1b(soup))

The first part of the output:

The risks and uncertainties described below
are those specific to the Company which we currently believe have the potential to be material, but they may not be the only ones
we face. If any of the following risks, or any other risks and uncertainties that we have not yet identified or that we currently
consider not to be material, actually occur or become material risks, our business, prospects, financial condition, results of
operations and cash flows could be materially and adversely affected. Investors are advised to consider these factors along with
the other information included in this Annual Report and to review any additional risks discussed in our filings with the SEC.
 
Risks Associated with Our Business
 
We are a newly formed company with no operating history and, accordingly, you have no basis on which to evaluate our ability to achieve our business
objective.

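You can think of the small functions is_pstyle, is_i1a and is_i1b as "filters": they are just different ways of pinpointing the start and end tags. The code then iterates over the sibling nodes between those two tags. (.get_text() works recursively within each sibling tag.)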
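To try the same idea without fetching the filing, here is a minimal sketch on an inline document. The HTML snippet, the helper names `is_heading` and `text_between`, and the assumption that the table of contents lives in a `<table>` while section headings are `<p style=...>` tags are all made up for illustration; the real filing's markup may differ.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature of the filing: a table of contents, then the body.
HTML = """
<body>
<table><tr><td>Item 1A. Risk Factors</td></tr>
<tr><td>Item 1B. Unresolved Staff Comments</td></tr></table>
<p style="font-weight:bold">Item 1A. Risk Factors</p>
<p style="margin:0">We are a newly formed company.</p>
<p style="margin:0">Our <b>prospects</b> are uncertain.</p>
<p style="font-weight:bold">Item 1B. Unresolved Staff Comments</p>
<p style="margin:0">None.</p>
</body>
"""

def is_heading(node, label: str) -> bool:
    # A heading is a styled <p> whose text starts with the label;
    # text nodes and table cells are rejected.
    return (
        isinstance(node, Tag)
        and node.name == "p"
        and node.has_attr("style")
        and node.get_text().strip().startswith(label)
    )

def text_between(soup: BeautifulSoup, start_label: str, stop_label: str) -> str:
    start = soup.find(lambda tag: is_heading(tag, start_label))
    if start is None:
        return ""  # no matching heading found
    parts = []
    for node in start.next_siblings:
        if is_heading(node, stop_label):
            break
        # Tags yield their nested text; bare strings are kept as they are.
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    return "".join(parts)

section = text_between(BeautifulSoup(HTML, "html.parser"), "Item 1A.", "Item 1B.")
print(section)
```

Because the table-of-contents entries are `<td>` cells rather than styled `<p>` tags, the filter skips them and the walk starts at the body heading.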

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/56046738/
