python - Scraping the Yahoo Finance balance sheet with Python

Tags: python regex web-scraping beautifulsoup python-requests

My question is a follow-up to the question here.

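The function

periodic_figure_values()

seems to work fine, except when the name of the line item being searched for appears more than once on the page. The specific case I mean is trying to get the data for "Long Term Debt". The function from the link above returns the following error: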

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    LongTermDebt=(periodic_figure_values(soup, "Long Term Debt"))
  File "test.py", line 21, in periodic_figure_values
    value = int(str_value)
ValueError: invalid literal for int() with base 10: 'Short/Current Long Term Debt'

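It seems to be tripped up by "Short/Current Long Term Debt": the page has both "Short/Current Long Term Debt" and "Long Term Debt". You can see an example of the source page, Apple's balance sheet, here.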
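The collision can be reproduced without fetching the page. BeautifulSoup filters against a compiled regular expression with its search() method, so a pattern built from the bare label matches anywhere inside a string, and "Long Term Debt" also matches the longer label. The sketch below uses only the re module. The anchored pattern is one possible way to require a whole-label match; it is an assumption for illustration, not part of the original function.

```python
import re

labels = ["Short/Current Long Term Debt", "Long Term Debt"]

# Unanchored: search() finds the label anywhere in the string.
loose = re.compile("Long Term Debt")
loose_hits = [label for label in labels if loose.search(label)]

# Anchored (assumed fix): the whole string, ignoring surrounding
# whitespace, must be exactly the label.
strict = re.compile(r"^\s*" + re.escape("Long Term Debt") + r"\s*$")
strict_hits = [label for label in labels if strict.search(label)]

print(loose_hits)   # both labels match the unanchored pattern
print(strict_hits)  # only the exact label matches the anchored one
```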

I am trying to find a way for the function to return the data for "Long Term Debt" without being tripped up by "Short/Current Long Term Debt".

Here is the function with an example that fetches "Cash And Cash Equivalents", which works fine, and "Long Term Debt", which does not:

import sys    # needed for sys.exit() below
import requests, bs4, re

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)
    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")
    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
            value = int(str_value)
            values.append(value)
    return values

res = requests.get('https://ca.finance.yahoo.com/q/bs?s=AAPL')
res.raise_for_status()    # must be called; the bare attribute does nothing
soup = bs4.BeautifulSoup(res.text, 'html.parser')
Cash=(periodic_figure_values(soup, "Cash And Cash Equivalents"))
print(Cash)
LongTermDebt=(periodic_figure_values(soup, "Long Term Debt"))
print(LongTermDebt)

Best answer

The simplest approach is a try/except combination that catches the ValueError being raised:

import sys    # needed for sys.exit() below
import requests, bs4, re

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)
    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")
    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
### from here
            try:
                value = int(str_value)
                values.append(value)
            except ValueError:
                continue
### to here
    return values

res = requests.get('https://ca.finance.yahoo.com/q/bs?s=AAPL')
res.raise_for_status()    # must be called; the bare attribute does nothing
soup = bs4.BeautifulSoup(res.text, 'html.parser')
Cash=(periodic_figure_values(soup, "Cash And Cash Equivalents"))
print(Cash)
LongTermDebt=(periodic_figure_values(soup, "Long Term Debt"))
print(LongTermDebt)

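This prints your numbers fine.
Note that in this case you do not really need the re module, since you are only checking for a literal string (no wildcards, no boundaries, and so on).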
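Following on from that note, one way to drop the regular expression is to compare the stripped cell text for equality. The traceback in the question suggests that the row selected for "Long Term Debt" is actually the "Short/Current Long Term Debt" one, so skipping the label cell may still leave values from that row. An exact comparison avoids selecting it in the first place. The sketch below assumes that find() accepts a function for its text argument (BeautifulSoup documents functions as string filters) and that the function may receive None for tags without a single string. The name exact_label is a hypothetical helper, not part of the original code.

```python
def exact_label(label):
    """Return a predicate that is true only for text equal to label after stripping."""
    def matches(text):
        # text may be None when a tag has no single string child
        return text is not None and text.strip() == label
    return matches

is_long_term_debt = exact_label("Long Term Debt")
print(is_long_term_debt("Short/Current Long Term Debt"))
print(is_long_term_debt("  Long Term Debt  "))
print(is_long_term_debt(None))

# Assumed usage inside periodic_figure_values, replacing the compiled pattern:
#     title = soup.find("td", text=exact_label(yahoo_figure))
```

With this predicate the longer label is rejected and only a cell whose text is exactly the requested figure name is selected.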

For "python - Scraping the Yahoo Finance balance sheet with Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38604506/
