I have an HTML table like this:
<TABLE>
<TR>
<TD><P>Name</P></TD>
<TD><P>Fees</P></TD>
<TD><P>Awards</P></TD>
<TD><P>Total</P></TD>
</TR>
<TR>
<TD><P>Tony</P></TD>
<TD >7,800</TD>
<TD >7</TD>
<TD>15,400</TD>
</TR>
<TR>
<TD><P>Paul</FONT></P></TD>
<TD >7,800</TD>
<TD >7</TD>
<TD>15,400</TD>
</TR>
<TR>
<TD><P>Richard</P></TD>
<TD >7,800</TD>
<TD >7</TD>
<TD>15,400</TD>
</TR>
</TR>
</TABLE>
I want to extract the values from the table. I tried the following approach.
import lxml.html
html = lxml.html.parse('html_table')
text_value = html.xpath('//tr/td/text()')
packages = html.xpath('//tr/td/p')
p_content = [p.text_content() for p in packages]
Is there a way to extract both the <p> text and the <td> text into a single list?
Best answer
You can do something like this:
>>> doc = """<TABLE>
... <TR>
... <TD><P>Name</P></TD>
... <TD><P>Fees</P></TD>
... <TD><P>Awards</P></TD>
... <TD><P>Total</P></TD>
... </TR>
... <TR>
... <TD><P>Tony</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Paul</FONT></P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Richard</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
...
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>
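The flat list above loses the row boundaries. If you need one list per row, a minimal sketch (using a shortened, hypothetical two-column version of the table, and text_content() on each cell instead of text()) could look like this:

```python
import lxml.html

# Shortened, hypothetical version of the table from the question.
doc = """<table>
<tr><td><p>Name</p></td><td><p>Fees</p></td></tr>
<tr><td><p>Tony</p></td><td >7,800</td></tr>
</table>"""

root = lxml.html.fromstring(doc)

# One inner list per <tr>; text_content() returns the text of a cell
# whether it sits directly in the <td> or inside a nested <p>.
rows = [
    [td.text_content().strip() for td in tr.xpath('./td')]
    for tr in root.xpath('//tr')
]
print(rows)
```

Because text_content() is called once per cell, each cell yields exactly one string even when it contains nested markup.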
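If the document contains two tables, you can loop over the tables first and then, for each table, use a relative XPath expression (one with a leading .) to select its descendant text nodes: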
>>> doc = """<TABLE>
... <TR>
... <TD><P>Name</P></TD>
... <TD><P>Fees</P></TD>
... <TD><P>Awards</P></TD>
... <TD><P>Total</P></TD>
... </TR>
... <TR>
... <TD><P>Tony</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Paul</FONT></P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Richard</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
...
... </TR>
... </TABLE>
... <TABLE>
... <TR>
... <TD><P>Name</P></TD>
... <TD><P>Fees</P></TD>
... <TD><P>Awards</P></TD>
... <TD><P>Total</P></TD>
... </TR>
... <TR>
... <TD><P>Tony</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Paul</FONT></P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
... <TR>
... <TD><P>Richard</P></TD>
... <TD >7,800</TD>
... <TD >7</TD>
... <TD>15,400</TD>
... </TR>
...
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400', 'Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>> for tbl in root.xpath('//table'):
...     elements = tbl.xpath('.//tr/td//text()')
...     print(elements)
...
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>
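Building on the per-table loop, the first row of each table can be treated as a header so that every other row becomes a dictionary. This is only a sketch: it assumes the first row always holds the column names, and the two small tables below are hypothetical stand-ins for the real data.

```python
import lxml.html

# Two small hypothetical tables standing in for the real document.
doc = """<table>
<tr><td><p>Name</p></td><td><p>Total</p></td></tr>
<tr><td><p>Tony</p></td><td>15,400</td></tr>
</table>
<table>
<tr><td><p>Name</p></td><td><p>Total</p></td></tr>
<tr><td><p>Paul</p></td><td>15,400</td></tr>
</table>"""

root = lxml.html.fromstring(doc)

tables = []
for tbl in root.xpath('//table'):
    # Relative XPath (leading dot) keeps the search inside this table.
    rows = [
        [td.text_content().strip() for td in tr.xpath('./td')]
        for tr in tbl.xpath('.//tr')
    ]
    # Assumption: the first row is the header row.
    header, body = rows[0], rows[1:]
    tables.append([dict(zip(header, row)) for row in body])

print(tables)
```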
On parsing HTML tables with lxml in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20418807/