python - BeautifulSoup : Get the Contents of Sub-Nodes

I have the following Python code:

import urllib2
from BeautifulSoup import BeautifulSoup

def scrapeSite(urlToCheck):
    # Download the page and parse it
    html = urllib2.urlopen(urlToCheck).read()
    soup = BeautifulSoup(html)
    # Collect every <td class="c"> cell
    tdtags = soup.findAll('td', {"class": "c"})
    for t in tdtags:
        print t.encode('latin1')

This returns the following HTML:

<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>

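I'd like to get the text inside the a node (e.g. FOO or BAR), which I expected to be t.contents.contents. Unfortunately, it isn't that easy :) Does anyone have an idea how to solve this?

Thanks a lot, any help is appreciated!

Cheers, Joseph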

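Best answer

In this case, you can use t.contents[1].contents[0] to get FOO and BAR.

The thing is that contents returns a list of all child elements (Tags and NavigableStrings). If you print contents, you can see it looks something like this: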

[u'\n', <a href="more.asp">FOO</a>, u'\n']

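So, to get to the actual tag you need to access contents[1] (assuming your markup is exactly the same; the index can vary depending on the source HTML). Once you've found the right index, you can use contents[0] on that tag to get the string inside the a tag.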
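
The index-based access described above can be sketched as follows. This is only an illustration under a few assumptions: it uses the newer bs4 package and Python 3 rather than the BeautifulSoup 3 import from the question, it names the html.parser backend explicitly, and it parses an inline copy of the question's markup instead of downloading a page.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; inlined here instead of fetched (assumption).
html = """<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", {"class": "c"})

# Each cell has three children: a newline string, the <a> tag, another newline.
first = cells[0]
print(len(first.contents))

# Index 1 is the <a> tag; its first child is the text node.
texts = [str(td.contents[1].contents[0]) for td in cells]
print(texts)
```

If the source HTML had no line breaks inside the td, the a tag would sit at contents[0] instead, which is why the index has to be checked against the real markup.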

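Now, since this depends on the exact contents of the HTML source, it's very fragile. A more generic and robust solution is to use find() again: locate the a tag with t.find('a'), then use its contents list to get the value inside it with t.find('a').contents[0], or just t.find('a').contents to get the whole list.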
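
A minimal sketch of the find()-based approach, again assuming bs4 on Python 3 instead of the question's BeautifulSoup 3. The sample markup and the link_texts helper are hypothetical, made up for illustration; the check for a missing a tag is an addition that the answer itself does not mention.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: cells like those in the question, but with irregular
# whitespace, so the <a> tag is not always at contents[1].
html = (
    '<td class="c"><a href="more.asp">FOO</a></td>'
    '<td class="c">\n\n<a href="alotmore.asp">BAR</a>\n</td>'
    '<td class="c">no link here</td>'
)

def link_texts(markup):
    """Return the first child string of the first <a> in each <td class="c">."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for td in soup.find_all("td", {"class": "c"}):
        link = td.find("a")
        # find() returns None when the cell has no <a>; skip such cells
        # (and empty links) rather than fail.
        if link is None or not link.contents:
            continue
        results.append(str(link.contents[0]))
    return results

print(link_texts(html))
```

Because find() searches the cell for the tag by name, the result does not depend on how many whitespace strings happen to surround the link.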

This question, python - BeautifulSoup : Get the Contents of Sub-Nodes, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/3987732/
