python - Extracting text that is not between tags using BeautifulSoup

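I'm practicing BeautifulSoup by scraping imdb.com. For a given actor, I want to:

get a list of all the movies they starred in as an actor;
filter out everything that is not a feature film, such as TV series, shorts, short documentaries, and so on.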

So far, for each movie, I can get a piece of soup like the following:

<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>

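As we can see, this movie should be filtered out because it is a short. We can also see that the (Short) information is not enclosed in any tag.
Hence my question:
How do I get this information from the soup, and how do I look for any text after </b>, if there is any?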

Best answer

You can use this:

from bs4 import BeautifulSoup as bs

HTML="""<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>
"""

soup=bs(HTML,"lxml")

# string=True matches text nodes (older bs4 releases name this argument text=);
# recursive=False restricts the search to direct children of the div
print(soup.find("div").find_all(string=True,recursive=False))
# ['\n', '\n', '\n     (Short)\n    ', '\n     Leo\n']

# If you use html5lib as the parser, the result is slightly different:
soup=bs(HTML,"html5lib")
print(soup.find("div").find_all(string=True,recursive=False))
# ['\n    ', '\n    ', '\n     (Short)\n    ', '\n     Leo\n']

# If you want all of the text from the div, including text inside child tags, try this:
print(soup.find("div").find_all(string=True,recursive=True))
# ['\n', '2021', '\n', 'Welcome Back Future', '\n     (Short)\n    ', '\n     Leo\n']
# Or simply use
print(soup.find("div").text)
"""
2021
Welcome Back Future
     (Short)

     Leo

"""

I think you can clean this up from here, and I believe that "get a list of all the movies they starred in as an actor" means you also need Leo.
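
As a rough sketch of that cleanup step, the direct text nodes can be stripped and split into a parenthesised label and a role, and rows carrying a label can then be dropped. Everything here beyond the first row is an assumption: the second row in SAMPLE is invented for illustration, parse_row is a hypothetical helper, and the rule "any parenthesised label means not a feature film" is a simplification that may not hold for every row on the real page.

```python
from bs4 import BeautifulSoup

# First row is the one from the question; the second row is invented
# to stand in for a feature film (no parenthesised label after </b>).
SAMPLE = """
<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>
<div class="filmo-row odd" id="actor-tt0000001">
    <span class="year_column">2019</span>
    <b><a href="/title/tt0000001/">Example Feature</a></b>
    <br/>
     Sam
</div>
"""


def parse_row(row):
    """Turn one filmography row into a dict (hypothetical helper)."""
    # Direct text children only, stripped, with whitespace-only nodes dropped.
    texts = [s.strip() for s in row.find_all(string=True, recursive=False) if s.strip()]
    # Assumption: a label is a text chunk fully wrapped in parentheses.
    labels = [t for t in texts if t.startswith("(") and t.endswith(")")]
    roles = [t for t in texts if t not in labels]
    return {
        "year": row.find("span", class_="year_column").get_text(strip=True),
        "title": row.find("b").get_text(strip=True),
        "label": labels[0] if labels else None,
        "role": roles[0] if roles else None,
    }


soup = BeautifulSoup(SAMPLE, "html.parser")
films = [parse_row(r) for r in soup.find_all("div", class_="filmo-row")]
# Keep only rows without a label such as (Short).
features = [f for f in films if f["label"] is None]
print([f["title"] for f in features])  # ['Example Feature']
```

Keeping the label in the dict rather than discarding it makes it easy to tighten the filter later, for example by excluding only specific labels.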

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/68549820/
