python - How to extract web page text from a paragraph element under a specific heading inside a div using BeautifulSoup

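Basically what the title says. I am trying to pull the paragraph text from the area under "GeneCards Summary for name_of_gene Gene" on https://www.genecards.org/cgi-bin/carddisp.pl?gene=IL6&keywords=il6, using the IL-6 gene as an example. The only text I want to pull is: "IL6 (Interleukin 6) is a Protein Coding gene. Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile. Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway. Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity."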

I have been trying to use BeautifulSoup 4 with Python. The specific problem is that I do not know how to tell it which text to extract from the page.

from bs4 import BeautifulSoup

from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

for tag in soup.find_all(['script', 'style']):
   tag.decompose()
soup.get_text(strip=True)
VALID_TAGS = ['div', 'p']

for tag in soup.findAll('GeneCards Summary for '+ GeneToSearch +    'Gene'):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print(soup.text)

This ends up giving me the text of every element on the page.

Best answer

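With a recent version of BeautifulSoup, you can use the pseudo-CSS selector :contains to search for a tag that contains specific text. From that tag you can then navigate to the next p tag and extract its text: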

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)

req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()

soup = BeautifulSoup(response, 'lxml')

text_find = 'GeneCards Summary for ' + GeneToSearch + ' Gene'

el = soup.select_one('h3:contains("' + text_find + '")')
summary = el.parent.find_next('p').text.strip()

print(summary)

Output:

IL6 (Interleukin 6) is a Protein Coding gene.
Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile.
Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway.
Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity.
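
A note on why the original attempt printed the whole page: find_all / findAll treats a plain string argument as a tag name, so searching for 'GeneCards Summary for ...' matches no tags, the loop body never runs, and soup.text returns everything. Also, in newer releases of Soup Sieve (the selector engine behind select_one), :contains is deprecated in favour of :-soup-contains. The sketch below avoids the selector altogether by matching the heading with a function, and returns None instead of raising when the heading is missing. SAMPLE_HTML and the extract_summary helper are hypothetical stand-ins for illustration; the real GeneCards markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the GeneCards page; the real markup may differ.
SAMPLE_HTML = """
<div class="gc-subsection">
  <div class="gc-subsection-header">
    <h3>GeneCards Summary for IL6 Gene</h3>
  </div>
  <p>IL6 (Interleukin 6) is a Protein Coding gene.</p>
</div>
<div>
  <h3>Other Section</h3>
  <p>Unrelated text.</p>
</div>
"""


def extract_summary(html, gene):
    """Return the text of the first <p> after the summary heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading_text = "GeneCards Summary for " + gene + " Gene"

    # A plain string passed to find/find_all is treated as a tag name,
    # so match on the heading's text content with a function instead.
    heading = soup.find(
        lambda tag: tag.name == "h3" and heading_text in tag.get_text()
    )
    if heading is None:
        return None

    # find_next walks forward in document order from the heading.
    paragraph = heading.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None


print(extract_summary(SAMPLE_HTML, "IL6"))   # summary paragraph
print(extract_summary(SAMPLE_HTML, "TP53"))  # heading absent, so None
```

To use it on the live page, pass the response bytes from urlopen(req).read() as the html argument in place of SAMPLE_HTML.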

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/57634304/
