I'm writing a Python script to parse an HTML file that contains a table. Here is a sample of the file I want to parse:
<table border="0" cellspacing="1" cellpadding="0" width="3080">
<tr>
<th width="50" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 1</font></small></th>
<th width="130" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 2</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 3</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 4</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 5</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 6</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 7</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 8</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 9</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 10</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 11</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 12</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 13</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 14</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 15</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 16</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 17</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 18</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 19</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 20</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 21</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 22</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 23</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 24</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 25</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 26</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 27</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 28</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 29</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 30</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 31</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 32</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 33</font></small></th>
</tr>
<tr bgcolor=#D5BCCD>
<td rowspan="5">1</td>
<td rowspan="5">01/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">5</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XC</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XD</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XE</td></tr>
<tr>
<td rowspan="1">2</td>
<td rowspan="1">02/02/2016</td>
<td rowspan="1">23</td>
<td rowspan="1">15</td>
<td rowspan="1">05</td>
<td rowspan="1">04</td>
<td rowspan="1">12</td>
<td rowspan="1">16</td>
<td rowspan="1">20</td>
<td rowspan="1">06</td>
<td rowspan="1">11</td>
<td rowspan="1">19</td>
<td rowspan="1">24</td>
<td rowspan="1">01</td>
<td rowspan="1">09</td>
<td rowspan="1">13</td>
<td rowspan="1">07</td>
<td rowspan="1">Next value indicates number of rows to skip</td>
<td rowspan="1">1</td>
<td></td>
<td>XA</td>
<td rowspan="1">184</td>
<td rowspan="1">6232</td>
<td rowspan="1">81252</td>
<td rowspan="1">478188</td>
<td rowspan="1">596.323,70</td>
<td rowspan="1">1.388,95</td>
<td rowspan="1">10,00</td>
<td rowspan="1">4,00</td>
<td rowspan="1">2,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
</tr>
<tr bgcolor=#D5BCCD>
<td rowspan="5">3</td>
<td rowspan="5">04/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">2</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
</table>
Here is the script I wrote to parse it:
import csv

from bs4 import BeautifulSoup

# result_file and csv_file are paths defined elsewhere in the script.

# Parse the data
with open(result_file) as f:
    soup = BeautifulSoup(f)
table = soup.find('table')
# The first tr contains the field names.
headings = [th.get_text() for th in table.find('tr').find_all('th')]
important_headings = headings[:19]
all_tr = table.find_all('tr')
count = 1
data_sets = []
while count < len(all_tr):
    date_results = all_tr[count].find_all('td')
    # The 19th cell holds the number of rows to skip to reach the next record.
    skip_rows = int(date_results[18].get_text())
    count += skip_rows
    data_set = list(zip(important_headings, (td.get_text() for td in date_results[:19])))
    data_sets.append(data_set)
# Write the csv file
with open(csv_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_sets)
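It works, but it takes about 30 ms to parse 7 rows. The table in the real HTML file has about 1,300 rows, so parsing it takes a while, if it finishes at all: the process usually crashes before completing.
How can I make it perform better?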
Update (profiling information):
Here is the time taken by each part of the algorithm:
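- Finished finding the table. Took 32.11 seconds to complete.
- Found all tr elements in the file. Took 103.414059 ms to complete.
While loop section
- Found all td elements inside the tr. Took 0.142097 ms to complete.
- Finished skipping rows. Took 0.020027 ms to complete.
- Finished zipping. Took 0.100851 ms to complete.
- Finished appending. Took 0.001907 ms to complete.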
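For anyone who wants to collect per-stage timings like the ones above, here is a minimal sketch using only the standard library. The `timed` helper, the `timings` dict, and the two stand-in workloads are hypothetical names made up for illustration; they are not taken from the original script, where the timed stages would be the BeautifulSoup calls.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage label (hypothetical helper state).
timings = {}

@contextmanager
def timed(label):
    """Add the elapsed time of the enclosed block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed

# Stand-in workloads; in the real script these would wrap the parse stages.
with timed("build list"):
    data = [str(n) for n in range(100_000)]
with timed("join"):
    joined = ",".join(data)

# Report the slowest stage first.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds * 1000:.3f} ms")
```

Because the helper accumulates per label, wrapping a statement inside the while loop gives the total for that stage across all iterations rather than one sample.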
Best answer
Try the Python bindings for a native C/C++ parsing library such as libxml (this obviously means giving up some of the convenience of BeautifulSoup).
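As a rough illustration of that suggestion, the sketch below uses lxml, a third-party Python binding for libxml2. It is an assumption about how the question's loop might be ported, not code from the answer: `SAMPLE`, `parse_rows`, `n_cols`, and `skip_col` are hypothetical names, and the sample table is cut down to three columns, with the third column standing in for the "rows to skip" value.

```python
from lxml import html  # third-party binding for libxml2: pip install lxml

# Cut-down stand-in for the real file: 3 columns instead of 33.
SAMPLE = """
<table>
<tr><th>Header 1</th><th>Header 2</th><th>Skip</th></tr>
<tr><td rowspan="2">1</td><td rowspan="2">01/02/2016</td><td rowspan="2">2</td><td>XA</td></tr>
<tr><td>XB</td></tr>
<tr><td>2</td><td>02/02/2016</td><td>1</td><td>XA</td></tr>
</table>
"""

def parse_rows(markup, n_cols=3, skip_col=2):
    """Return (headings, rows) using the question's skip-count logic."""
    root = html.fromstring(markup)
    table = root.xpath("//table")[0]
    trs = list(table.iter("tr"))
    # The first tr contains the field names.
    headings = [th.text_content().strip() for th in trs[0].findall("th")][:n_cols]
    rows = []
    i = 1
    while i < len(trs):
        cells = [td.text_content().strip() for td in trs[i].findall("td")]
        rows.append(cells[:n_cols])
        # Jump ahead by the row's own skip count; always advance at least one row.
        i += max(1, int(cells[skip_col]))
    return headings, rows

headings, rows = parse_rows(SAMPLE)
print(headings)
print(rows)
```

For the real file, `n_cols=19` and `skip_col=18` would match the indices in the question. Unlike the original script, this returns plain cell values rather than (heading, value) pairs. Whether it removes the 32-second stage depends on where that time is actually spent, so it is worth timing again after the switch.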
Source: "python - How to make HTML parsing more performant with Python" on Stack Overflow: https://stackoverflow.com/questions/35382959/