Python BeautifulSoup: getting content through parent/sibling relationships

Tags: python parsing web-scraping beautifulsoup

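Part of the HTML is structured as shown below. I want to extract the job "title" and the "time" from it. I can get each of them separately, for example: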

from bs4 import BeautifulSoup


pages = '<div class="content"> \
                <a href="Org"> \
                        <h3 class="title"> \
                            Dep. Manager</h3> \
                        </a> \
                <div class="contributor"></div> \
                <p>John</p> \
                <time class="time"> \
                        <span class="timestamp">May 02 2016</span> \
                    </time> \
                </div>'


soup = BeautifulSoup(pages, "lxml")

# First element with class "title" (the <h3>)
s = soup.find_all(class_="title")[0]

# Text of the first <span class="timestamp">
t = soup.find_all('span', class_="timestamp")[0].text.strip()

pp_title = s.text.strip()

print(pp_title)
print(t)

This returns what I want:

Dep. Manager
May 02 2016

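For the "time", I would like another way to get it, because the "time" always appears below the "title". I tried the following line to get the "time", but it does not work: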

print (s.parent.next_sibling.next_sibling)

What is the correct way to reach the "time" through its relationship to the "title"? Thanks.

Best answer

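You can call findParent (the older spelling of find_parent in bs4) with specific criteria to climb to the enclosing block, then search inside it: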

t = s.findParent("div", class_='content').find('span', class_='timestamp').text.strip()

Example:

titles = soup.find_all(class_="title")
for title in titles:
    timestamp = title.findParent("div", class_='content').find('span', class_='timestamp').text.strip()
    print(title.text.strip(), timestamp)
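
As a complement, here is a minimal sketch of why the sibling chain from the question misses the "time", along with two alternatives that navigate from the title directly. It assumes the same HTML as in the question. The variable names (`html`, `second`, `time_tag`, `also`) are illustrative, and `html.parser` is swapped in for `lxml` only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="content">
    <a href="Org">
        <h3 class="title">
            Dep. Manager</h3>
    </a>
    <div class="contributor"></div>
    <p>John</p>
    <time class="time">
        <span class="timestamp">May 02 2016</span>
    </time>
</div>
"""

# Assumption: "html.parser" (bundled with Python) instead of "lxml"
soup = BeautifulSoup(html, "html.parser")
title = soup.find(class_="title")

# .next_sibling also steps onto whitespace text nodes, so two hops from
# the <a> land on <div class="contributor">, not on <time>
second = title.parent.next_sibling.next_sibling

# Alternative 1: skip text nodes and jump to the next <time> sibling of <a>
time_tag = title.parent.find_next_sibling("time")
timestamp = time_tag.find("span", class_="timestamp").get_text(strip=True)

# Alternative 2: take the first matching element after the title in document order
also = title.find_next("span", class_="timestamp").get_text(strip=True)

print(title.get_text(strip=True), timestamp)
```

`find_next_sibling` depends on `<time>` being a sibling of the `<a>` element, while `find_next` only depends on document order. On a page with several `content` blocks, `find_next` could pick up the timestamp of a later block if the current one has none, so the `findParent` approach above is the more tightly scoped of the three.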

Regarding getting content through parent/sibling relationships with Python BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58679479/
