Data needed:
I want to scrape two web pages, one here: https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL and the other: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL . From the first page, I need the values of the row named Total Assets. These will be the 5 values in that row: 365,725,000 375,319,000 321,686,000 290,479,000 231,839,000. Then, I need the 5 values of the row named Total Current Liabilities. These will be: 43,658,000 38,542,000 27,970,000 20,722,000 11,506,000. From the second link, I need the values of the row named Operating Income or Loss. These will be: 52,503,000 48,999,000 55,241,000 33,790,000 18,385,000.
Edit: I also need the TTM values, followed by the five yearly values mentioned above. Thanks. Here is the logic I want: when I run this module, I would like the output to be:
TTM array: 365725000, 116866000, 64423000
year1 array: 375319000, 100814000, 70898000
year2 array: 321686000, 79006000, 80610000
My code:
Here is what I have written so far. If I just put the markup in a variable, I can extract the value inside a div class, as shown below. But how do I efficiently loop through the div classes, given that there are thousands of them on the page? In other words, how do I find the values I am looking for?
# Import libraries
import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'
# Connect to the URL
response = requests.get(url)
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
soup1 = BeautifulSoup("""<div class="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg" data-test="fin-col"><span>321,686,000</span></div>""", "html.parser")
soup2 = BeautifulSoup("""<span data-reactid="1377">""", "html.parser")
# This works
print(soup1.find("div", class_="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg").text)
# How to loop through all the relevant div classes?
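As a side note, one way to avoid hard-coding the long generated class string is to match on the more stable `data-test` attribute with `find_all`. A minimal sketch against a static snippet (the snippet below is illustrative and simplified; Yahoo's live markup may differ):

```python
from bs4 import BeautifulSoup

# Illustrative snippet mimicking the structure of one row of the live page
html = """
<div data-test="fin-row">
  <div title="Total Assets"><span>Total Assets</span></div>
  <div data-test="fin-col"><span>365,725,000</span></div>
  <div data-test="fin-col"><span>375,319,000</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Loop over every value cell by attribute instead of by class string
values = [div.get_text(strip=True)
          for div in soup.find_all("div", attrs={"data-test": "fin-col"})]
print(values)  # ['365,725,000', '375,319,000']
```

Matching on `attrs` also survives the frequent churn in Yahoo's auto-generated class names.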
Best Answer
Edit - at the request of @Life is complex, edited to add the date headings.
Try this with lxml:
import requests
from lxml import html

url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'
url2 = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'

page = requests.get(url)
page2 = requests.get(url2)

tree = html.fromstring(page.content)
tree2 = html.fromstring(page2.content)

total_assets = []
Total_Current_Liabilities = []
Operating_Income_or_Loss = []
heads = []

path = '//div[@class="rw-expnded"][@data-test="fin-row"][@data-reactid]'
data_path = '../../div/span/text()'
heads_path = '//div[contains(@class,"D(ib) Fw(b) Ta(end)")]/span/text()'

dats = [tree.xpath(path), tree2.xpath(path)]
for entry in dats:
    heads.append(entry[0].xpath(heads_path))
    for d in entry[0]:
        for s in d.xpath('//div[@title]'):
            if s.attrib['title'] == 'Total Assets':
                total_assets.append(s.xpath(data_path))
            if s.attrib['title'] == 'Total Current Liabilities':
                Total_Current_Liabilities.append(s.xpath(data_path))
            if s.attrib['title'] == 'Operating Income or Loss':
                Operating_Income_or_Loss.append(s.xpath(data_path))

# The absolute '//div[@title]' search repeats matches across rows,
# so drop the duplicated first entry from each list
del total_assets[0]
del Total_Current_Liabilities[0]
del Operating_Income_or_Loss[0]

print('Date Total Assets Total_Current_Liabilities:')
for date, asset, current in zip(heads[0], total_assets[0], Total_Current_Liabilities[0]):
    print(date, asset, current)

print('Operating Income or Loss:')
for head, income in zip(heads[1], Operating_Income_or_Loss[0]):
    print(head, income)
Output:
Date Total Assets Total_Current_Liabilities:
9/29/2018 365,725,000 116,866,000
9/29/2017 375,319,000 100,814,000
9/29/2016 321,686,000 79,006,000
Operating Income or Loss:
ttm 64,423,000
9/29/2018 70,898,000
9/29/2017 61,344,000
9/29/2016 60,024,000
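The key idea in this answer is selecting rows by their `title` attribute rather than by class name, then stepping back up the tree with `../..` to collect the sibling value columns. The same XPath pattern can be exercised offline on a static fragment (the fragment below is a made-up stand-in for Yahoo's markup, which nests the title two levels below the row):

```python
from lxml import html

# Stand-in fragment: the title div sits two levels below the row,
# and the value spans live in sibling column divs
fragment = """
<div data-test="fin-row">
  <div><div title="Total Assets"><span>Total Assets</span></div></div>
  <div><span>365,725,000</span></div>
  <div><span>375,319,000</span></div>
</div>
"""
tree = html.fromstring(fragment)

for s in tree.xpath('//div[@title]'):
    if s.attrib['title'] == 'Total Assets':
        # '../..' climbs from the title div back to the row,
        # then 'div/span/text()' gathers each column's value
        row_values = s.xpath('../../div/span/text()')
        print(row_values)  # ['365,725,000', '375,319,000']
```

Matching on `title` is what makes the row lookup readable; the class strings on the live page are auto-generated and change often.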
Of course, this could easily be pulled into a pandas dataframe if needed.
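For example, the scraped lists could be combined along the date headings roughly like this (the column names and the numeric conversion are my own additions, not part of the answer above):

```python
import pandas as pd

# Sample values taken from the output above
heads = ['9/29/2018', '9/29/2017', '9/29/2016']
total_assets = ['365,725,000', '375,319,000', '321,686,000']
total_current_liabilities = ['116,866,000', '100,814,000', '79,006,000']

df = pd.DataFrame({
    'Date': heads,
    'Total Assets': total_assets,
    'Total Current Liabilities': total_current_liabilities,
})

# Strip the thousands separators and convert to integers for arithmetic
for col in ('Total Assets', 'Total Current Liabilities'):
    df[col] = df[col].str.replace(',', '').astype(int)

print(df)
```

With the columns as integers, derived quantities such as a current ratio become one-line column expressions.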
Regarding python - How to efficiently parse large HTML div-class and span data with Python BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58194952/