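python - How to extract data in the correct order using Beautiful Soup

I am trying to extract the balance sheet from Yahoo Finance, for example for the ticker "MSFT" (Microsoft).

Before any scraping is done, Selenium is used to click the "Expand All" button. This part seems to work.

By the way, when the Chrome web driver starts, I manually click the button to accept or reject cookies. In a later step, I plan to add more code so that this part is automated as well. That is not what my question is about right now.

Here is what the current code looks like.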

# for scraping the balance sheet from Yahoo Finance
import pandas as pd
import requests
from datetime import datetime
from bs4 import BeautifulSoup
# importing selenium to click on the "Expand All" button before scraping the financial statements
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


def get_balance_sheet_from_yfinance(ticker):
    url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"

    options = Options()
    options.add_argument("start-maximized")
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)
    WebDriverWait(driver, 3600).until(EC.element_to_be_clickable((
        By.XPATH, "//section[@data-test='qsp-financial']//span[text()='Expand All']"))).click()

    #content whole page in html format
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # get the column headers (i.e. 'Breakdown' row)
    div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
    if len(div) < 1:
        print("Fail to retrieve table column header")
        exit(0)

    # get the list of columns from the column headers
    col = []
    for h in div[0].find_all('span'):
        text = h.get_text()
        if text != "Breakdown":
            col.append(datetime.strptime(text, "%m/%d/%Y"))

    df = pd.DataFrame(columns=col)


    # the following code returns an empty list for index (why?)
    # and values in a list that need actually be in a DataFrame
    idx = []
    for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
        for h in div.find_all('title'):
            text = h.get_text()
            idx.append(text)

    val = []
    for div in soup.find_all('div', attrs={'data-test': 'fin-col'}):
        for h in div.find_all('span'):
            num = int(h.get_text().replace(",", "")) * 1000
            val.append(num)

    # if the above part is commented out and this block is used instead
    # the following code manages to work well until the row "Cash Equivalents" 
    # that is because there are no entries for years 2020 and 2019 on this row
    """ for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
        i = 0
        idx = ""
        val = []
        for h in div.find_all('span'):
            if i % 5 == 0:
                idx = h.get_text()
            else:
                num = int(h.get_text().replace(",", "")) * 1000
                val.append(num)
            i += 1
        row = pd.DataFrame([val], columns=col, index=[idx])
        df = pd.concat([df, row], axis=0) """
    
    return idx, val


get_balance_sheet_from_yfinance("MSFT")

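I cannot scrape the data from the expanded table in a usable tabular format. Instead, the function above returns what I have managed to scrape from the web page so far. There are some additional comments in the code.

Could you give me some ideas on how to extract the data properly and put it into a DataFrame object whose index is the text under the "Breakdown" column? Basically, the DataFrame should look like the snapshot below, where the first column is the index.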

[Image: balance-sheet-df]

Best answer

I spent quite a while on this and hope it helps. Basically, your function now returns a DataFrame in the following format:


                                                  2022-06-29   2021-06-29   2020-06-29   2019-06-29
Total Assets                                     364,840,000  333,779,000  301,311,000  286,556,000
Current Assets                                   169,684,000  184,406,000  181,915,000  175,552,000
Cash, Cash Equivalents & Short Term Investments  104,749,000  130,334,000  136,527,000  133,819,000
Cash And Cash Equivalents                         13,931,000   14,224,000   13,576,000   11,356,000
Cash                                               8,258,000    7,272,000            -            -
...                                                      ...          ...          ...          ...
Tangible Book Value                               87,720,000   84,477,000   67,915,000   52,554,000
Total Debt                                        61,270,000   67,775,000   70,998,000   78,366,000
Net Debt                                          35,850,000   43,922,000   49,751,000   60,822,000
Share Issued                                       7,464,000    7,519,000    7,571,000    7,643,000
Ordinary Shares Number                             7,464,000    7,519,000    7,571,000    7,643,000
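
The cells in this DataFrame are still strings such as '8,258,000' or '-'. If numbers are needed, a small helper along these lines could convert them. This is only a sketch: `to_number` and its `scale` parameter are names made up for illustration, and the factor of 1000 follows the multiplication used in the question's code.

```python
def to_number(text, scale=1000):
    """Convert a scraped cell such as '8,258,000' to an int.

    Returns None for a missing entry ('-' or an empty string).
    `scale` is an assumption taken from the question's code, which
    multiplies every value by 1000.
    """
    cleaned = text.replace(",", "").strip()
    if cleaned in ("", "-"):
        return None  # missing entry
    return int(cleaned) * scale


# the "Cash" row from the table above
row = ["8,258,000", "7,272,000", "-", "-"]
values = [to_number(cell) for cell in row]
print(values)  # [8258000000, 7272000000, None, None]
```

Returning None instead of 0 keeps a missing entry distinguishable from a genuine zero.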

Here is the final code:

# for scraping the balance sheet from Yahoo Finance
from datetime import datetime

import pandas as pd
from bs4 import BeautifulSoup
# importing selenium to click on the "Expand All" button before scraping the financial statements
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


def get_balance_sheet_from_yfinance(ticker):
    url = f"https://finance.yahoo.com/quote/{ticker}/balance-sheet?p={ticker}"

    options = Options()
    options.add_argument("start-maximized")
    # pass the options with `options=`; `chrome_options=` is the deprecated spelling
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # the generous timeout also leaves time to handle the cookie banner by hand
        WebDriverWait(driver, 3600).until(EC.element_to_be_clickable((
            By.XPATH, "//section[@data-test='qsp-financial']//span[text()='Expand All']"))).click()

        # content of the whole page in html format
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        # close the browser even if the button never becomes clickable
        driver.quit()

    # get the column headers (i.e. 'Breakdown' row)
    div = soup.find_all('div', attrs={'class': 'D(tbhg)'})
    if len(div) < 1:
        raise RuntimeError("Failed to retrieve table column header")

    # get the list of columns from the column headers
    col = []
    for h in div[0].find_all('span'):
        text = h.get_text()
        if text != "Breakdown":
            col.append(datetime.strptime(text, "%m/%d/%Y"))

    # walk the spans of every row: a non-numeric span is a row label,
    # the numeric spans that follow it are that row's values
    row = {}
    for div in soup.find_all('div', attrs={'data-test': 'fin-row'}):
        head = None
        for h in div.find_all('span'):
            text = h.get_text()
            if not text:
                continue
            if head is not None and (text.replace(',', '').isdigit() or text[0] == '-'):
                row[head].append(text)
            else:
                head = text
                row[head] = []

    # rows with missing entries have fewer values than columns; pad them with '-'
    for v in row.values():
        while len(v) < len(col):
            v.append('-')

    df = pd.DataFrame(list(row.values()), index=list(row.keys()), columns=col)
    print(df)

    return df


get_balance_sheet_from_yfinance("MSFT")

I removed some unused code and added a new scraping method, but I kept the way all the column dates are obtained.
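
To make the idea behind the new scraping method easier to follow, here is a browser-free sketch of the grouping step. It assumes the span texts of the table have already been collected into a flat list. `group_rows`, `n_cols` and `filler` are hypothetical names, and the sample values are taken from the output table above.

```python
def group_rows(span_texts, n_cols, filler="-"):
    """Group a flat list of span texts into {row label: [values]}.

    A non-numeric text starts a new row; numeric texts (or texts that
    start with '-') are appended to the current row. Short rows are
    padded at the end with `filler` so that every row has n_cols values.
    """
    rows = {}
    head = None
    for text in span_texts:
        if not text:
            continue  # ignore empty spans
        is_value = text.replace(",", "").isdigit() or text.startswith("-")
        if is_value and head is not None:
            rows[head].append(text)
        else:
            head = text
            rows[head] = []
    for values in rows.values():
        values.extend([filler] * (n_cols - len(values)))
    return rows


spans = [
    "Cash And Cash Equivalents", "13,931,000", "14,224,000", "13,576,000", "11,356,000",
    "Cash", "8,258,000", "7,272,000",  # two entries are missing on this row
    "Total Debt", "61,270,000", "67,775,000", "70,998,000", "78,366,000",
]
rows = group_rows(spans, n_cols=4)
print(rows["Cash"])  # ['8,258,000', '7,272,000', '-', '-']
```

Padding at the end only puts the values in the right columns when the missing entries are the last (oldest) ones, which is the case for the "Cash" row here.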

If you have any questions, feel free to ask in the comments.
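
On the comment in the question's code asking why the index list came back empty: `find_all('title')` searches for tags named `<title>`, not for tags that carry a `title` attribute. The fragment below is a made-up example, and it is only my assumption that the row labels on the page sit in a `title` attribute of a `div`. It shows the difference between the two lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an assumption about the page structure,
# not markup copied from Yahoo Finance
html = """
<div data-test="fin-row">
  <div title="Total Assets"><span>Total Assets</span></div>
  <div data-test="fin-col"><span>364,840,000</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# looks for <title> elements: there are none inside the row
by_tag = soup.find_all("title")

# looks for <div> elements that have a title attribute
by_attr = [d["title"] for d in soup.find_all("div", attrs={"title": True})]

print(by_tag)   # []
print(by_attr)  # ['Total Assets']
```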

Regarding "python - How to extract data in the correct order using Beautiful Soup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73833605/
