python - How to make HTML parsing more performant in Python

Tags: python html parsing

I am writing a Python script to parse an HTML file that contains a table. Here is a sample of the file I want to parse:

<table border="0" cellspacing="1" cellpadding="0" width="3080">
<tr>
<th width="50"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 1</font></small></th>
<th width="130" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 2</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 3</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 4</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 5</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 6</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 7</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 8</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 9</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 10</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 11</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 12</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 13</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 14</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 15</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 16</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 17</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 18</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 19</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 20</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 21</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 22</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 23</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 24</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 25</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 26</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 27</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 28</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 29</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 30</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 31</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 32</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 33</font></small></th>
</tr>

<tr bgcolor=#D5BCCD>
<td rowspan="5">1</td>
<td rowspan="5">01/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">5</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XC</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XD</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XE</td></tr>

<tr>
<td rowspan="1">2</td>
<td rowspan="1">02/02/2016</td>
<td rowspan="1">23</td>
<td rowspan="1">15</td>
<td rowspan="1">05</td>
<td rowspan="1">04</td>
<td rowspan="1">12</td>
<td rowspan="1">16</td>
<td rowspan="1">20</td>
<td rowspan="1">06</td>
<td rowspan="1">11</td>
<td rowspan="1">19</td>
<td rowspan="1">24</td>
<td rowspan="1">01</td>
<td rowspan="1">09</td>
<td rowspan="1">13</td>
<td rowspan="1">07</td>
<td rowspan="1">Next value indicates number of rows to skip</td>
<td rowspan="1">1</td>
<td></td>
<td>XA</td>
<td rowspan="1">184</td>
<td rowspan="1">6232</td>
<td rowspan="1">81252</td>
<td rowspan="1">478188</td>
<td rowspan="1">596.323,70</td>
<td rowspan="1">1.388,95</td>
<td rowspan="1">10,00</td>
<td rowspan="1">4,00</td>
<td rowspan="1">2,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
</tr>

<tr bgcolor=#D5BCCD>
<td rowspan="5">3</td>
<td rowspan="5">04/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">2</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
</table>

This is the script I wrote to parse it:

import csv

from bs4 import BeautifulSoup

# result_file and csv_file are paths defined elsewhere in the script.

# Parse the data. No parser is named, so bs4 picks the best one installed.
with open(result_file) as f:
    soup = BeautifulSoup(f)
table = soup.find('table')

# The first tr contains the field names.
headings = [th.get_text() for th in table.find('tr').find_all('th')]
important_headings = headings[:19]

all_tr = table.find_all('tr')
count = 1
data_sets = []
while count < len(all_tr):
    date_results = all_tr[count].find_all('td')
    # Column 19 holds the number of tr elements this record spans.
    skip_rows = int(date_results[18].get_text())
    count += skip_rows
    # Pair each heading with its cell text.
    data_set = list(zip(important_headings, (td.get_text() for td in date_results[:19])))
    data_sets.append(data_set)

# Write the csv file
with open(csv_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_sets)

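It works, but it takes about 30 ms to parse 7 rows. The table in the real HTML file has about 1300 rows, so parsing it takes a while. That is, when it finishes at all, because the process usually crashes before it completes.

How can I make it perform better?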

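Update (profiling information):

This is the time spent in each part of the algorithm:

  • Finished finding the table. Took 32.11 seconds.
  • Found all tr elements in the file. Took 103.414059 ms.

Inside the while loop:

  • Found all td elements inside the tr. Took 0.142097 ms.
  • Finished skipping rows. Took 0.020027 ms.
  • Finished zipping. Took 0.100851 ms.
  • Finished appending. Took 0.001907 ms.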
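
The question does not show how these figures were collected. One way to gather per-stage timings like the ones above is a small context manager around each stage. This is only a sketch: `timed` and `timings` are hypothetical names, and the workload is a stand-in for the real parsing steps.

```python
import time
from contextlib import contextmanager

timings = {}  # stage label -> elapsed time in milliseconds


@contextmanager
def timed(label):
    """Record how long the enclosed block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = (time.perf_counter() - start) * 1000.0


# Stand-in workload; in the real script each `with` would wrap one stage,
# for example building the soup or calling find_all('tr').
with timed('build list'):
    squares = [n * n for n in range(100000)]
with timed('sum list'):
    total = sum(squares)

for label, ms in timings.items():
    print(f'{label}: took {ms:.3f} ms')
```

Timing each stage separately is what makes it visible that locating the table dominates the run, while the work inside the loop is comparatively cheap.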

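Best answer

Try Python bindings to a native C/C++ parsing library, such as libxml (this obviously means giving up some of BeautifulSoup's convenience).

This question about making HTML parsing more performant in Python comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/35382959/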
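
As an illustration of that suggestion, here is a minimal sketch using lxml, a third-party package that binds libxml2. It assumes lxml is installed and that the table follows the layout in the question, where one column of each record's first row gives the number of tr elements to skip. The function name `parse_rows`, its parameters, and the small `SAMPLE` table are hypothetical and only for illustration.

```python
from lxml import html  # third-party: binds the C library libxml2


def parse_rows(html_text, n_cols=19, skip_col=18):
    """Return (headings, rows) for the first table in html_text.

    Mirrors the loop in the question: cell `skip_col` of each record's
    first row says how many tr elements the record spans.
    """
    doc = html.document_fromstring(html_text)
    table = doc.xpath('//table')[0]
    all_tr = table.xpath('.//tr')

    # The first tr contains the field names.
    headings = [th.text_content().strip() for th in all_tr[0].xpath('./th')]

    rows = []
    i = 1
    while i < len(all_tr):
        cells = [td.text_content().strip() for td in all_tr[i].xpath('./td')]
        rows.append(cells[:n_cols])
        # Advance by at least one row so a bad skip value cannot stall the loop.
        i += max(int(cells[skip_col]), 1)
    return headings[:n_cols], rows


# Hypothetical miniature of the question's table: 3 kept columns, skip in column 3.
SAMPLE = """
<table>
<tr><th>No</th><th>Date</th><th>Skip</th><th>Code</th></tr>
<tr><td rowspan="2">1</td><td rowspan="2">01/02/2016</td><td rowspan="2">2</td><td>XA</td></tr>
<tr><td>XB</td></tr>
<tr><td>2</td><td>02/02/2016</td><td>1</td><td>XA</td></tr>
</table>
"""

headings, rows = parse_rows(SAMPLE, n_cols=3, skip_col=2)
print(headings)
print(rows)
```

If dropping BeautifulSoup entirely is too costly, a middle ground is to keep its API and pass `'lxml'` as the parser name, as in `BeautifulSoup(f, 'lxml')`, so that the parsing itself is done by libxml2. How much either change helps on the real 1300-row file would need to be measured.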