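I have successfully used selenium to scrape an HTML table that requires clicking a button before it can be scraped.
So selenium works, and the table's HTML is stored in the variable "r".
However, I am having trouble parsing it into a pandas DataFrame.
As on the page at the URL, the DataFrame should have 5 columns and about 30 rows.
Can anyone see what is going wrong?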
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox(executable_path=r'/Users/computer_name/Documents/python/web_drivers/geckodriver')
browser.get('https://www.investing.com/equities/exxon-mobil-income-statement')
linkElem = browser.find_element_by_link_text('Annual')
linkElem.click()
r = browser.find_element_by_css_selector("#rrtable > table").get_attribute('innerHTML')
browser.quit()
soup = BeautifulSoup(r, 'html.parser')
df = pd.DataFrame(soup)
print(df)
Many thanks.
Best answer
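Once you have the soup element, parse it with pd.read_html() instead of passing it to pd.DataFrame().
You also need to use outerHTML rather than innerHTML: innerHTML returns only the contents of the table element, without the enclosing <table> tag that pd.read_html() searches for.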
r = browser.find_element_by_css_selector("#rrtable > table").get_attribute('outerHTML')
browser.quit()
soup = BeautifulSoup(r, 'html.parser')
df = pd.read_html(str(soup))[0]
print(df)
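The difference between the two attributes can be checked without a browser. The sketch below is only an illustration: the table markup, the "Year 1"/"Year 2" headers, the Gross Profit figures, and the helper name try_parse are all made up for the example, and it assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up miniature of the scraped table (NOT the real page markup).
OUTER_HTML = (
    "<table>"
    "<thead><tr><th>Period Ending:</th><th>Year 1</th><th>Year 2</th></tr></thead>"
    "<tbody>"
    "<tr><td>Total Revenue</td><td>255583</td><td>279332</td></tr>"
    "<tr><td>Gross Profit</td><td>100</td><td>200</td></tr>"
    "</tbody>"
    "</table>"
)

# innerHTML is the same markup minus the enclosing <table> ... </table> tags.
INNER_HTML = OUTER_HTML[len("<table>"):-len("</table>")]


def try_parse(html):
    """Hypothetical helper: return the first parsed table, or None if none is found."""
    try:
        # StringIO avoids the deprecation warning newer pandas versions emit
        # when a literal HTML string is passed to read_html.
        return pd.read_html(StringIO(html))[0]
    except ValueError:
        # read_html raises ValueError when the input contains no <table> element.
        return None


outer_df = try_parse(OUTER_HTML)
inner_df = try_parse(INNER_HTML)

print(outer_df)
print(inner_df)
```

Since pd.read_html() does its own HTML parsing, the BeautifulSoup round trip is probably optional here; passing the outerHTML string (wrapped in StringIO) should give the same result.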
Output:
Period Ending: ... 201631/12
0 Total Revenue ... 200628
1 Revenue 255583 279332 237162 200628 Other Rev... ... Revenue 255583 279332 237162 200628 Other Rev...
2 Revenue ... 200628
3 Other Revenue, Total ... -
4 Cost of Revenue, Total ... 136098
5 Gross Profit ... 64530
6 Total Operating Expenses ... 199692
7 Selling/General/Admin. Expenses, Total 41923 4... ... Selling/General/Admin. Expenses, Total 41923 4...
8 Selling/General/Admin. Expenses, Total ... 39819
9 Research & Development ... 1467
10 Depreciation / Amortization ... 22308
11 Interest Expense (Income) - Net Operating ... -
12 Unusual Expense (Income) ... -
13 Other Operating Expenses, Total ... -
14 Operating Income ... 936
15 Interest Income (Expense), Net Non-Operating ... 4353
16 Gain (Loss) on Sale of Assets ... -
17 Other, Net ... 2680
18 Net Income Before Taxes ... 7969
19 Provision for Income Taxes ... -406
20 Net Income After Taxes ... 8375
21 Minority Interest ... -535
22 Equity In Affiliates ... -
23 U.S GAAP Adjustment ... -
24 Net Income Before Extraordinary Items ... 7840
25 Total Extraordinary Items ... -
26 Net Income ... 7840
27 Total Adjustments to Net Income ... -
28 Income Available to Common Excluding Extraordi... ... 7840
29 Dilution Adjustment ... -
30 Diluted Net Income ... 7840
31 Diluted Weighted Average Shares ... 4177
32 Diluted EPS Excluding Extraordinary Items ... 1.88
33 DPS - Common Stock Primary Issue ... 2.98
34 Diluted Normalized EPS ... 1.88
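In the output above, rows 1 and 7 appear to repeat one long string in every column, and missing values are shown as "-". If a tidier frame is wanted, one possible cleanup is sketched below. It runs on a made-up four-row miniature rather than the real data, the column name "Last column" and the function name clean_statement are hypothetical, and it assumes that a row whose cells are all identical carries no data of its own.

```python
import pandas as pd

# Made-up miniature of the scraped frame: one label column and one value column.
df = pd.DataFrame(
    {
        "Period Ending:": [
            "Total Revenue",
            "Revenue 255583 279332 237162 200628 Other Rev...",
            "Other Revenue, Total",
            "Provision for Income Taxes",
        ],
        "Last column": [
            "200628",
            "Revenue 255583 279332 237162 200628 Other Rev...",
            "-",
            "-406",
        ],
    }
)


def clean_statement(frame):
    """Hypothetical cleanup: drop repeated-text rows, make value columns numeric."""
    # A row whose cells are all the same string has exactly one unique value.
    is_repeated = frame.nunique(axis=1) == 1
    cleaned = frame.loc[~is_repeated].reset_index(drop=True)

    # Every column after the first holds figures; "-" placeholders become NaN.
    value_cols = cleaned.columns[1:]
    cleaned[value_cols] = cleaned[value_cols].apply(pd.to_numeric, errors="coerce")
    return cleaned


cleaned = clean_statement(df)
print(cleaned)
```

Coercing with errors="coerce" turns anything non-numeric into NaN, so it is worth checking the result against the page before relying on it.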
Source: a similar question on Stack Overflow about using beautifulsoup to put a selenium HTML table into a pandas DataFrame: https://stackoverflow.com/questions/61018158/