这是一个 HTML 表格:
<table width="100%" cellpadding="4" cellspacing="0" style="page-break-before: always">
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<tr valign="top">
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">A</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">B</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">C</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">D</font></font></font></p>
</td>
</tr>
<tr valign="top">
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">E</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">F</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">G</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">H</font></font></font></p>
</td>
</tr>
<tr valign="top">
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">I</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">J</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">K</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">L</font></font></font></p>
</td>
</tr>
<tr valign="top">
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M2</font></font></font></p>
</td>
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N2</font></font></font></p>
</td>
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O2</font></font></font></p>
</td>
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P2</font></font></font></p>
</td>
</tr>
</table>
此处最后一行的列数比其他行多 2 倍。当我尝试将其读入 Pandas 数据帧时,我得到以下结果:
table = pd.read_html('1111.html')
table[0]
0 1 2 3 4 5 6 7
0 A A B B C C D D
1 E E F F G G H H
2 I I J J K K L L
3 M M2 N N2 O O2 P P2
如何在不配音的情况下正确朗读?我不需要最后一行。
最佳答案
您可以使用BeautifulSoup
解析表格,然后将结果转换为数据帧:
import pandas as pd
from bs4 import BeautifulSoup as soup
df = pd.DataFrame([[k[1:-1] for i in b.find_all('td') if (k:=i.text) is not None] for b in soup(html, 'html.parser').table.find_all('tr')])
输出:
0 1 2 3 4 5 6 7
0 A B C D None None None None
1 E F G H None None None None
2 I J K L None None None None
3 M M2 N N2 O O2 P P2
编辑:没有赋值表达式的解决方案:
df = pd.DataFrame([[i.text[1:-1] if i else i for i in b.find_all('td')] for b in soup(html, 'html.parser').table.find_all('tr')])
输出:
0 1 2 3 4 5 6 7
0 A B C D None None None None
1 E F G H None None None None
2 I J K L None None None None
3 M M2 N N2 O O2 P P2
关于python - 正确读取html表格,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58486063/