python - Extract the text between two heading tags using BeautifulSoup in Python

Tags: python html web-scraping beautifulsoup

I am trying to extract a movie's plot from its Wikipedia page using BeautifulSoup in Python. I am new to both Python and BeautifulSoup, so I don't know how to approach this.

Here is the input HTML.

<h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>A small <a href="/wiki/Pounamu" title="Pounamu">pounamu</a> stone that is    the mystical heart of the island <a href="/wiki/Goddess" title="Goddess">goddess</a> Te Fiti is stolen by the <a href="/wiki/Demigod" title="Demigod">demigod</a> <a href="/wiki/M%C4%81ui_(mythology)" title="Māui (mythology)">Maui</a>, who was planning to give it to humanity as a gift. As Maui makes his escape, he is attacked by the lava <a href="/wiki/Demon" title="Demon">demon</a> Te Kā, causing the heart of Te Fiti as well as his power-granting magical fish hook to be lost in the ocean.</p><p>A millennium later, young Moana Waialiki, daughter and heir of the chief on the small <a href="/wiki/Polynesia" title="Polynesia">Polynesian</a> island of Motunui, is chosen by the ocean to receive the heart, but drops it when her father, Chief Tui, comes to get her. He insists the island provides everything the villagers need. But years later, fish become scarce and the island's vegetation begins dying. Moana proposes going beyond the reef to find more fish. Tui rejects her request, as sailing past the reef is forbidden.</p>
<p>Moana's grandmother Tala shows Moana a secret cave behind a waterfall, where she finds boats inside and discovers her ancestors were voyagers, sailing and discovering new islands across the world. Tala explains that they stopped voyaging because Maui stole the heart of Te Fiti, causing Te Kā and monsters to appear in the ocean. Tala then says Te Kā's darkness has been spreading from island to island, slowly killing them. Tala gives Moana the heart of Te Fiti, which she has kept safe for her granddaughter.</p>
<p>Tala falls ill and with her dying breaths tells Moana to set sail. Moana and her pet <a href="/wiki/Rooster" title="Rooster">rooster</a> Heihei depart in a <a href="/wiki/Drua" title="Drua">drua</a> to find Maui. A <a href="/wiki/Manta_ray" title="Manta ray">manta ray</a>, Tala's reincarnation, follows. After a <a href="/wiki/Typhoon" title="Typhoon">typhoon</a> wave flips her sailboat and knocks her unconscious, she awakens the next morning on an island inhabited by Maui, who traps her in a cave and takes her sailboat to search for his fishhook. After escaping and catching up to Maui, Moana tries to convince him to return the heart, but Maui refuses, fearing its power will attract dark creatures.</p>
<p>Sentient coconut pirates called Kakamora surround the boat and steal the heart, but Maui and Moana retrieve it. Maui agrees to help return the heart, but only after he reclaims his hook, which is hidden in Lalotai, the Realm of Monsters. At Lalotai, they retrieve it by tricking Tamatoa, a giant <a href="/wiki/Coconut_crab" title="Coconut crab">coconut crab</a>. Maui teaches Moana how to properly sail and navigate. They arrive at Te Fiti, where Te Kā attacks. Maui is overpowered and Te Kā severely damages his hook and repels their boat far out to sea. Fearful that returning to fight Te Kā will destroy his hook, Maui abandons Moana.</p>
<p>Distraught, Moana begs the ocean to take the heart and choose another person to return it to Te Fiti. The spirit of Tala comes to her and encourages to find her true calling within herself. Inspired, Moana retrieves the heart from the ocean and returns to Te Fiti alone. Maui, having had a change of heart, returns to distract the lava demon, and his hook is destroyed in the battle. Realizing that Te Kā is actually Te Fiti without her heart, Moana asks the ocean to clear a path for Te Kā to approach her. She sings a song, asking Te Kā to remember who she truly is, allowing Moana to restore her heart. Te Fiti returns and gives a new canoe to Moana and a new magical hook to Maui before returning to her island form.</p>
<p>In a <a href="/wiki/Post-credits_scene" title="Post-credits scene">post-credits scene</a>, Tamatoa, who has been stranded on his back during Moana and Maui's escape, grumbles to the audience that they would help him if he was a <a href="/wiki/Sebastian_(Disney)" title="Sebastian (Disney)">Jamaican crab named Sebastian</a>.</p>
<h2><span class="mw-headline" id="Cast">Cast</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=2" title="Edit section: Cast">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<div class="thumb tright">

So I want to extract only the text between the two h2 tags, that is, the plot. How can I extract it with BeautifulSoup?

Edit 1: Here is the code I have so far.

import urllib
from BeautifulSoup import *

movie = raw_input('Enter:')
main = "http://www.wikipedia.org"
url = "http://www.wikipedia.org/wiki/"+movie+"_(disambiguation)"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve a list of the anchor tags
# Each tag is like a dictionary of HTML attributes
tags = soup('a')
for tag in tags:
    chk = tag.get('href', None)
    chk = str(chk)
    if "film" in chk:
        final = chk

html = urllib.urlopen(main+final).read()
soup = BeautifulSoup(html)
new = []
spa = soup.findAll("span",id = "Plot")
spa_1 = soup.findAllNext("p")
for i in spa_1:
    print i

I tried to locate the element with id=Plot and then print all the p tags that follow it.

Best answer

The document is structured like this:

[h2] / [span id=Plot]
...
[h2]

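What we can do is find the span with the id "Plot", then walk through the siblings of its parent (the h2), collecting their text until we reach the next h2 heading.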

# collect plot in this list
plot = []

# find the node with id of "Plot"
mark = soup.find(id="Plot")

# walk through the siblings of the parent (H2) node 
# until we reach the next H2 node
for elt in mark.parent.nextSiblingGenerator():
    if elt.name == "h2":
        break
    if hasattr(elt, "text"):
        plot.append(elt.text)

# enjoy
print("".join(plot))
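
For Python 3 with the bs4 package (BeautifulSoup 4), the same idea might look like the sketch below. It assumes the page markup matches the snippet in the question, where the span with id "Plot" sits directly inside an h2. The shortened sample HTML, the function name text_between_headings, and its stop_tag parameter are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up stand-in for the Wikipedia markup in the question.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="Plot">Plot</span></h2>
<p>First <a href="/wiki/X">plot</a> paragraph.</p>
<p>Second plot paragraph.</p>
<h2><span class="mw-headline" id="Cast">Cast</span></h2>
<p>Cast paragraph.</p>
"""


def text_between_headings(soup, heading_id, stop_tag="h2"):
    """Return the text of each sibling tag that follows the heading
    containing `heading_id`, stopping at the next `stop_tag`."""
    mark = soup.find(id=heading_id)
    if mark is None:
        return []  # no element with that id on the page
    parts = []
    # mark.parent is the h2; next_siblings yields tags and bare strings
    for elt in mark.parent.next_siblings:
        if not isinstance(elt, Tag):
            continue  # skip whitespace strings and comments between tags
        if elt.name == stop_tag:
            break  # reached the next section heading
        parts.append(elt.get_text())
    return parts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
plot_paragraphs = text_between_headings(soup, "Plot")
print("\n\n".join(plot_paragraphs))
```

Checking isinstance(elt, Tag) before reading elt.name keeps the loop from depending on how plain strings between tags behave, and returning a list leaves the choice of paragraph separator to the caller.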

This Q&A is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/42450743/
