python - How can I efficiently parse large HTML div classes and span data with Python BeautifulSoup?

Tags: python html parsing beautifulsoup

Data needed:

I want to scrape two web pages, one here: https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL and the other here: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL. From the first page, I need the values of the row called Total Assets. That is 5 values in that row: 365,725,000 375,319,000 321,686,000 290,479,000 231,839,000. Then I need the 5 values of the row called Total Current Liabilities. These would be: 43,658,000 38,542,000 27,970,000 20,722,000 11,506,000. From the second link, I need the values of the row called Operating Income or Loss. These would be: 52,503,000 48,999,000 55,241,000 33,790,000 18,385,000.

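Edit: I also need the TTM value, followed by the five yearly values mentioned above. Thanks. Here is the logic I want: when I run this module, the output should be: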

TTM array: 365725000, 116866000, 64423000
year1 array: 375319000, 100814000, 70898000
year2 array: 321686000, 79006000, 80610000

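My code:

This is what I have written so far. If I put the markup directly into a variable, I can extract the value inside the div class, as shown below. But how do I loop through the div classes efficiently, given that the page contains thousands of them? In other words, how do I find the values I am looking for?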

# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
soup1 = BeautifulSoup("""<div class="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg" data-test="fin-col"><span>321,686,000</span></div>""", "html.parser")
soup2 = BeautifulSoup("""<span data-reactid="1377">""", "html.parser")

#This works
print(soup1.find("div", class_="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg").text)

#How to loop through all the relevant div classes? 

Best answer

Edit: at the request of @Life is complex, edited to add the date headers.

Try this using lxml:

import requests
from lxml import html

url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'
url2 = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'
page = requests.get(url)
page2 = requests.get(url2)


tree = html.fromstring(page.content)
tree2 = html.fromstring(page2.content)

total_assets = []
Total_Current_Liabilities = []
Operating_Income_or_Loss = []
heads = []


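# XPath for the row containers (class string kept exactly as given)
path = '//div[@class="rw-expnded"][@data-test="fin-row"][@data-reactid]'
# From a titled label div, go up two levels and read the span text of that element's child divs
data_path = '../../div/span/text()'
# Column headers (the dates)
heads_path = '//div[contains(@class,"D(ib) Fw(b) Ta(end)")]/span/text()'

# One list of matched rows per page: balance sheet first, then financials
dats = [tree.xpath(path),tree2.xpath(path)]

for entry in dats:
    # heads_path begins with '//', so it searches the whole document
    heads.append(entry[0].xpath(heads_path))
    for d in entry[0]:
        # '//div[@title]' is also document-wide, so each pass rescans the page
        # and the same row can be appended more than once
        for s in d.xpath('//div[@title]'):
            if s.attrib['title'] == 'Total Assets':
                total_assets.append(s.xpath(data_path))
            if s.attrib['title'] == 'Total Current Liabilities':
                Total_Current_Liabilities.append(s.xpath(data_path))
            if s.attrib['title'] == 'Operating Income or Loss':
                Operating_Income_or_Loss.append(s.xpath(data_path))

# Drop the first collected entry of each list
del total_assets[0]
del Total_Current_Liabilities[0]
del Operating_Income_or_Loss[0]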

print('Date   Total Assets Total_Current_Liabilities:')
for date,asset,current in zip(heads[0],total_assets[0],Total_Current_Liabilities[0]):
    print(date, asset, current)
print('Operating Income or Loss:')
for head,income in zip(heads[1],Operating_Income_or_Loss[0]):
    print(head,income)

Output:

Date      Total Assets Total_Current_Liabilities:
9/29/2018 365,725,000 116,866,000
9/29/2017 375,319,000 100,814,000
9/29/2016 321,686,000 79,006,000
Operating Income or Loss:
ttm       64,423,000
9/29/2018 70,898,000
9/29/2017 61,344,000
9/29/2016 60,024,000
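
The values in the output above are still comma-formatted strings, while the question asks for integer arrays grouped per period. A minimal sketch of that conversion in plain Python follows. The names `to_int`, `labels`, and `periods` are hypothetical, and the rows are paired purely by position, which mirrors the layout in the question but is an assumption: the balance-sheet columns carry no TTM header, so the first group mixes a dated balance-sheet column with the TTM income column.

```python
def to_int(text):
    """Convert a comma-formatted figure such as '365,725,000' to an int."""
    return int(text.replace(",", ""))


# Values copied from the output above (strings, as scraped)
total_assets = ["365,725,000", "375,319,000", "321,686,000"]
total_current_liabilities = ["116,866,000", "100,814,000", "79,006,000"]
operating_income = ["64,423,000", "70,898,000", "61,344,000", "60,024,000"]

# Hypothetical labels, following the naming used in the question
labels = ["TTM", "year1", "year2"]

# zip() stops at the shortest input, so the fourth operating-income value is dropped
periods = {
    label: [to_int(assets), to_int(liabilities), to_int(income)]
    for label, assets, liabilities, income in zip(
        labels, total_assets, total_current_liabilities, operating_income
    )
}

for label, values in periods.items():
    print(f"{label} array: {', '.join(str(v) for v in values)}")
```

Pairing on the date headers instead of on position would be safer if the two pages ever report different periods.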

Of course, this can easily be loaded into a pandas DataFrame if needed.
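
For readers who would rather stay with BeautifulSoup, as in the question, the same row lookup can be sketched by keying on the `title` and `data-test` attributes instead of the long generated class string. This is only a sketch under stated assumptions: `SAMPLE_HTML` is a simplified, hypothetical stand-in for the page, `row_values` is a hypothetical helper, and the real markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<div data-test="fin-row">
  <div>
    <div><div title="Total Assets"><span>Total Assets</span></div></div>
    <div data-test="fin-col"><span>365,725,000</span></div>
    <div data-test="fin-col"><span>375,319,000</span></div>
    <div data-test="fin-col"><span>321,686,000</span></div>
  </div>
</div>
<div data-test="fin-row">
  <div>
    <div><div title="Total Current Liabilities"><span>Total Current Liabilities</span></div></div>
    <div data-test="fin-col"><span>116,866,000</span></div>
    <div data-test="fin-col"><span>100,814,000</span></div>
    <div data-test="fin-col"><span>79,006,000</span></div>
  </div>
</div>
"""


def row_values(soup, title):
    """Return the cell texts of the row whose label div carries `title`."""
    label = soup.find("div", attrs={"title": title})
    if label is None:
        return []
    # Walk up to the enclosing row container, then collect its value cells
    row = label.find_parent("div", attrs={"data-test": "fin-row"})
    if row is None:
        return []
    cells = row.find_all("div", attrs={"data-test": "fin-col"})
    return [cell.get_text(strip=True) for cell in cells]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for wanted in ("Total Assets", "Total Current Liabilities"):
    print(wanted, row_values(soup, wanted))
```

Looking up one labelled row and reading only its cells avoids looping over every div on the page, which is the concern raised in the question.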

Regarding "python - How can I efficiently parse large HTML div classes and span data with Python BeautifulSoup?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58194952/
