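I'm trying to get this table into a pandas DataFrame. I have tried pandas read_html, and I have tried requests with bs4. I want to scrape the whole table as it appears on the page, but in the HTML code the table is split into 3 blocks, and I haven't figured out how to identify each of them.
Here is the starting code: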
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp'
response = requests.get(url, params={'Data': '08/01/2018', 'Mercadoria': 'DI1'})
soup = BeautifulSoup(response.text, 'html.parser')
Best answer
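One important thing that helps here: the response you get with requests does not actually contain the rendered table elements, but it does contain the data you want.
The problem is that the page needs JavaScript to render the table. You may notice that your data sits inside a script element: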
<script type="text/javascript">
var MercFut0 = "";
var MercFut1 = "";
var MercFut2 = "";
var MercFut3 = "";
MercFut0 = MercFut0 + '<table class="secondary">';
MercFut0 = MercFut0 + '<tr><td></td></tr>';
MercFut1 = MercFut1 + '<table class="secondary" id="teste">';
MercFut1 = MercFut1 + '<tr style="height: 120px;">';
MercFut2 = MercFut2 + '<table class="secondary" id="teste">';
MercFut2 = MercFut2 + '<tr style="height: 120px;">';
MercFut2 = MercFut2 + '<th class="text-center">Open Interest opening*</th>';
MercFut2 = MercFut2 + '<th class="text-center">Open Interest closing**</th>';
...
MercFut1 = MercFut1 + '</tr>';
MercFut0 = MercFut0 + '</table>';
MercFut1 = MercFut1 + '</table>';
MercFut2 = MercFut2 + '</table>';
MercFut3 = MercFut3 + '</table>';
MercadoFut0.innerHTML = MercFut3;
MercadoFut1.innerHTML = MercFut0;
MercadoFut2.innerHTML = MercFut1;
tableShow(2,false);
tableShow(9,false);
</script>
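A quick way to see this for yourself is to ask BeautifulSoup for table elements and compare that with a search for table markup inside script elements. The sketch below uses a small made-up HTML snippet (sample_html is an assumption standing in for response.text), not the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text: the table markup only exists
# as JavaScript string literals, never as real elements.
sample_html = """
<html><body>
<div id="MercadoFut0"></div>
<script type="text/javascript">
var MercFut0 = "";
MercFut0 = MercFut0 + '<table class="secondary">';
MercFut0 = MercFut0 + '<tr><td>442761</td></tr>';
MercFut0 = MercFut0 + '</table>';
MercadoFut0.innerHTML = MercFut0;
</script>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Real <table> elements in the parsed document.
tables = soup.find_all("table")

# <script> elements whose source text mentions table markup.
scripts = [s for s in soup.find_all("script") if s.string and "<table" in s.string]

print("table elements:", len(tables))
print("scripts containing table markup:", len(scripts))
```

If the first count is zero while the second is not, pd.read_html has nothing to parse in the raw response, which matches the behavior described above.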
At this point, the simplest approach is probably to use something like selenium just to render this page.
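A minimal sketch of the selenium route might look like the following. It assumes Chrome and a matching driver are available, and that the page has finished building its tables by the time page_source is read; build_url and fetch_rendered_tables are hypothetical helper names, not part of any library:

```python
from urllib.parse import urlencode

BASE_URL = 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp'


def build_url(base, params):
    """Append the query string, as requests does with params=."""
    return f"{base}?{urlencode(params)}"


def fetch_rendered_tables(url):
    """Load the page in a real browser and parse every table it renders."""
    # Imported here so build_url can be used without these packages installed.
    from io import StringIO
    import pandas as pd
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        html = driver.page_source  # the DOM after the page's scripts have run
    finally:
        driver.quit()
    return pd.read_html(StringIO(html))


url = build_url(BASE_URL, {'Data': '08/01/2018', 'Mercadoria': 'DI1'})
print(url)

# Opens a browser window, so the call is left commented out:
# for df in fetch_rendered_tables(url):
#     print(df.head())
```

This trades speed for simplicity: a whole browser is started only to run one inline script.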
Alternatively, you could grab this script and execute it yourself, for example with pyexecjs.
Something along these lines:
from io import StringIO
import execjs
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp'
response = requests.get(url, params={'Data': '08/01/2018', 'Mercadoria': 'DI1'})
soup = BeautifulSoup(response.text, 'html.parser')
# compile the desired table html from the data
# (string= is the current name of the older text= argument in BeautifulSoup)
script = soup.find("script", string=lambda text: text and 'tableShow' in text and "<table" in text).get_text()
# stub out the DOM objects and tableShow so the page script can run outside a browser
script = """
var MercadoFut0 = {},
MercadoFut1 = {},
MercadoFut2 = {};
var tableShow = function () {};
function getTables() {
%s
return [MercFut1, MercFut2, MercFut3];
}
""" % script
ctx = execjs.compile(script)
table1, table2, table3 = ctx.call("getTables")
# parse tables into dataframes
# (wrapped in StringIO because newer pandas versions deprecate passing literal HTML strings)
df1 = pd.read_html(StringIO(table1))[0]
df2 = pd.read_html(StringIO(table2))[0]
df3 = pd.read_html(StringIO(table3))[0]
print(df1)
print(df2)
print(df3)
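If installing a JavaScript runtime is not an option, a different technique may also work: the excerpt of the script shown earlier only ever appends single-quoted string literals to the MercFutN variables, so the HTML can be rebuilt with a regular expression. This is a sketch under that assumption, not the approach used in the code above; rebuild_tables and the sample script text are hypothetical, and escaped quotes or multi-line literals would not be handled:

```python
import re
from collections import defaultdict

# Matches statements such as:  MercFut1 = MercFut1 + '<tr>';
APPEND_RE = re.compile(r"(MercFut\d+)\s*=\s*\1\s*\+\s*'(.*?)';")


def rebuild_tables(script_text):
    """Join, in order, the string fragments appended to each MercFutN variable."""
    fragments = defaultdict(list)
    for name, piece in APPEND_RE.findall(script_text):
        fragments[name].append(piece)
    return {name: "".join(pieces) for name, pieces in fragments.items()}


# Made-up script text in the same shape as the page's script.
sample = """
var MercFut1 = "";
var MercFut2 = "";
MercFut1 = MercFut1 + '<table class="secondary" id="teste">';
MercFut1 = MercFut1 + '<tr><td>442761</td></tr>';
MercFut2 = MercFut2 + '<table class="secondary">';
MercFut2 = MercFut2 + '<tr><td>99999.99</td></tr>';
MercFut1 = MercFut1 + '</table>';
MercFut2 = MercFut2 + '</table>';
"""

tables = rebuild_tables(sample)
print(tables["MercFut1"])
print(tables["MercFut2"])
```

Each rebuilt string could then be passed to pd.read_html in the same way as the strings returned by getTables.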
You can then concatenate df1 and df2, which should give you the table you want:
df = pd.concat([df2, df1], axis=1)
pd.set_option('display.expand_frame_repr', False)
print(df)
Prints:
0 1 2 3 4 0 1 2 3 4 5 6 7 8 9 10
0 Open Interest opening* Open Interest closing** Number of Trades Trading Volume Financial Volume (R$) Previous Settlement*** Indexed Settlement**** Opening Price Minimum Price Maximum Price Average Price Last Price Settlement Price Change Last Bid Last Offer
1 442761 442761 0 0 0 99999.99 - 0.000 0.000 0.000 0.000 0.000 100000.00 0.01+ 0.000 0.000
2 760332 792464 147 114160 11351487370 99434.83 99434.83 6.404 6.403 6.410 6.406 6.407 99434.72 0.11- 6.404 6.408
3 2343218 2377609 183 99890 9885888562 98967.40 98967.40 6.429 6.421 6.435 6.423 6.425 98967.53 0.13+ 0.000 6.425
...
38 72923 74133 280 3920 126375887 32296.91 32296.91 11.510 11.450 11.600 11.521 11.580 32063.31 233.60- 11.590 11.610
39 5325 5325 0 0 0 28878.71 28878.71 0.000 0.000 0.000 0.000 0.000 28649.55 229.16- 0.000 11.680
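In the output above the column names end up as row 0 of the data, with integer column labels repeated across the two halves. One possible cleanup, sketched here on two tiny hand-built frames (left and right are stand-ins for df2 and df1, holding a few values copied from the printed rows), is to promote that first row to the header after concatenating:

```python
import pandas as pd

# Stand-ins for df2 and df1: the first row holds the header text,
# just like the frames that pd.read_html returned above.
left = pd.DataFrame([["Number of Trades", "Trading Volume"],
                     ["0", "0"],
                     ["147", "114160"]])
right = pd.DataFrame([["Last Price", "Change"],
                      ["0.000", "0.01+"],
                      ["6.407", "0.11-"]])

# Side-by-side concatenation, as in the answer.
df = pd.concat([left, right], axis=1)

# Promote row 0 to column names, then drop it from the data.
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns.name = None

print(df)
```

Passing header=0 to pd.read_html may achieve the same result at parse time; either way the values remain strings until converted explicitly.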
Regarding python - reading a complex HTML table with Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52064686/