python - Using the find_parent result to get specific items from it

Tags: python parsing web-scraping beautifulsoup

I'm struggling to figure out how to make this work.

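I scraped some data from a website, but the content sits in tables that have no class. For this question, I run the following to find where the word I'm searching for is located: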

item = soup.find_all(text=re.compile('WORD'))

Then, since the rest of the content lives under the same parent, I do this:

parent = item.find_parent('tr')

Now I get something like this:

<tr>
<td class="someclass1">WORD</td>
<td class="someclass2">TIRE</td>
<td class="someclass3">GUN</td>
<td class="someclass4">CAR</td>
<td class="someclass5">BYCICLE</td>
</tr>

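Now that it has correctly found the row where WORD lives, how do I pull GUN or CAR out of it? As I said, the main problem is that there are multiple tables with the same td classes, but only one of them contains WORD. The content of that table is what I'm looking for.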

Best answer

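With bs4 4.7.1+, you can combine :contains and :has to isolate the one table in which an element of a given class contains WORD. In the case you describe, you can also apply :contains to the table directly, i.e. table = soup.select_one('table:contains("WORD")'). Note that :contains matches substrings, and that newer Soup Sieve releases deprecate it in favor of the equivalent :-soup-contains.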

from bs4 import BeautifulSoup as bs

html = '''
<html>
 <head></head>
 <body>
  <table> 
   <tbody>
    <tr> 
     <td class="someclass1">WORD</td> 
    </tr> 
   </tbody>
  </table>
  <table></table> 
  <table> 
   <tbody>
    <tr> 
     <td class="someclass1">NOT_WORD</td> 
    </tr> 
   </tbody>
  </table>
  <table></table>
 </body>
</html>
'''
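# Parse the document (the 'lxml' parser requires the lxml package to be installed).
soup = bs(html, 'lxml')
# Pick the first table that has a .someclass1 element whose text contains "WORD".
table = soup.select_one('table:has(.someclass1:contains("WORD"))')
print(table.text)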
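
If you would rather stay with the find_parent approach from the question, here is a minimal sketch of pulling GUN or CAR out of the matched row. It assumes the row looks like the one in the question; the sample HTML, the 'html.parser' choice, and the variable names are made up for illustration. Note that find_all() returns a list-like ResultSet with no find_parent() method, so the sketch uses find() to get a single text node, and that the regex matches substrings (it would also match something like NOT_WORD).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: two tables sharing the same td classes,
# only the second one contains WORD.
html = '''
<table>
 <tr>
  <td class="someclass1">OTHER</td>
  <td class="someclass3">KNIFE</td>
  <td class="someclass4">BUS</td>
 </tr>
</table>
<table>
 <tr>
  <td class="someclass1">WORD</td>
  <td class="someclass2">TIRE</td>
  <td class="someclass3">GUN</td>
  <td class="someclass4">CAR</td>
  <td class="someclass5">BYCICLE</td>
 </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching text node (or None if nothing matches).
item = soup.find(string=re.compile('WORD'))
row = item.find_parent('tr')

# Search inside the matched row only, so identically named classes
# in the other tables are never considered.
gun = row.find('td', class_='someclass3').get_text(strip=True)
car = row.find('td', class_='someclass4').get_text(strip=True)

# Alternatively, collect every cell of the row keyed by its first class name.
cells = {td['class'][0]: td.get_text(strip=True) for td in row.find_all('td')}

print(gun, car)
```

Because every lookup starts from row rather than from soup, it does not matter that other tables reuse the same class names. In real code you would want to check that item is not None before calling find_parent.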

Regarding "python - Using the find_parent result to get specific items from it", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58401731/
