python - Parsing HTML with lxml

I have the following HTML code:

<table class="results">
  <tr>
    <td>
      <a href="..">link</a><span>2nd Mar 2011</span><br>XYZ Consultancy Ltd<br>
       <div>....</div>
    </td>
  </tr>
</table>

I am using lxml with Python to parse the HTML above. I want to retrieve "XYZ Consultancy Ltd", but I don't know how to do it. My code so far is:

import lxml.html

# root is assumed to be the already-parsed document, e.g. lxml.html.fromstring(html)
for el in root.cssselect("table.results"):
    for el2 in el:  # tr tags
        for el3 in el2:  # td tags
            for el4 in el3:  # children of td: a, span, br, div
                if el4.tag == 'a':
                    print("keyword:", el4.text_content())
                if el4.tag == 'span':
                    print("date:", el4.text_content())

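Best answer

You can use the CSS selector +, the direct adjacent combinator, to select the <br> that immediately precedes the target text. lxml stores text that follows an element in that element's tail attribute, so the text you want is the tail of that <br>.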

import lxml.html
root = lxml.html.fromstring('''
<table class="results">
  <tr>
    <td>
      <a href="..">link</a><span>2nd Mar 2011</span><br>XYZ Consultancy Ltd<br>
       <div>....</div>
    </td>
  </tr>
</table>
''')
for br_with_tail in root.cssselect('table.results > tr > td > a + span + br'):
    print(br_with_tail.tail)
    # => XYZ Consultancy Ltd
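
The same lookup can also be written with XPath, which avoids the separate cssselect dependency that newer lxml releases need for .cssselect(). The following is only a sketch under a few assumptions: the helper name company_names is made up for illustration, and the XPath expression assumes the target text always follows the first <br> after the <span>, as in the sample markup.

```python
import lxml.html

HTML = '''
<table class="results">
  <tr>
    <td>
      <a href="..">link</a><span>2nd Mar 2011</span><br>XYZ Consultancy Ltd<br>
       <div>....</div>
    </td>
  </tr>
</table>
'''


def company_names(root):
    """Hypothetical helper: collect the text that follows the first <br> after each <span>."""
    names = []
    # For every <span> inside a results-table cell, take its first following <br> sibling.
    brs = root.xpath(
        'descendant-or-self::table[@class="results"]//td/span/following-sibling::br[1]'
    )
    for br in brs:
        # .tail holds the text between this element's end and the next sibling element.
        if br.tail and br.tail.strip():
            names.append(br.tail.strip())
    return names


root = lxml.html.fromstring(HTML)
names = company_names(root)
print(names)
```

Stripping the tail is a deliberate choice here: in real pages the text after a <br> often carries surrounding newlines or indentation, and skipping empty tails keeps a trailing <br> with no text from producing blank entries.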

Regarding "python - Parsing HTML with lxml", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5646032/
