python - Scraping text between nondescript tags

Tags: python html web-scraping beautifulsoup

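In some cases my text sits between nondescript values and attributes that appear many times throughout the file (that is, they are reused).

Ultimately, I want to extract "Prev Close:" and "565.07" and put them into something like a string or a list (suggestions welcome).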


Relevant portion of the HTML source:

<div class="yui-u first yfi-start-content"><div class="yfi_quote_summary"><div id="yfi_quote_summary_data" class="rtq_table"><table id="table1"><tr><th scope="row" width="48%">Prev Close:</th><td class="yfnc_tabledata1">565.07</td></tr>

My code (Python 3.4.1):

soup = BeautifulSoup(data) # data contains the HTML source

FirstTable_tag = soup.find('div', attrs={'class': '"yui-u first yfi-start-content"'})
# Should the keys (attributes) in the "findNextSibling parameters below be filled in or left empty???
next_FirstTable_tag = FirstTable_tag.findNextSibling('div', attrs={'class': '"yfi_quote_summary"'})     
next_next_FirstTable_tag = next_FirstTable_tag.findNextSibling('div', attrs={'id': '"yfi_quote_sumary_data"', 'class': '"rtq_table"'})
next_next_next_FirstTable_tag = next_next_FirstTable_tag.findNextSibling('table', attrs={'id': '"table1"'})
data = next_next_next_FirstTable_tag.get_text()

SelectSoup = BeautifulSoup(data)
print("SelectSoup:" + SelectSoup + "(should be:  Prev Close)")

Error

Traceback (most recent call last):
    next_FirstTable_tag = FirstTable_tag.findNextSibling          
AttributeError: 'NoneType' object has no attribute 'findNextSibling'
<<< Process finished. (Exit code 1)

Edit

Here is the initial and full source as requested

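Although I have started using Yahoo's API, which is clearly the better approach, out of curiosity I am still trying to get the scraping to work with @scandinavian_'s help.

I updated the code above, but I still get the same error.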


Edit 2

From here on, this post focuses on the solution that @scandinavian_ is helping to develop:

import re
import urllib.request

from bs4 import BeautifulSoup

url = "http://finance.yahoo.com/q?s=GOOG"
data = urllib.request.urlopen(url).read()  # raw HTML bytes

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly

# Select every table whose id starts with "table"
tables = soup.find_all("table", id=re.compile("^table"))
result = {}
for table in tables:
    # Pair each header cell with the data cell in the same position
    for th, td in zip(table.find_all("th"), table.find_all("td")):
        result[th.text] = td.text
print(result)

Result:

{'52wk Range:': '502.80 - 604.83', 'Market Cap:': '381.04B', 'Next Earnings Date:': 'N/A', 'P/E (ttm):': '29.52', 'Avg Vol (3m):': '1,701,610', 'EPS (ttm):': '19.09', '1y Target Est:': 'N/A', 'Volume:': '561,384', 'Ask:': '563.98 x 100', 'Div & Yield:': 'N/A (N/A)', 'Bid:': '563.56 x 100', 'Beta:': ' 1.144', 'Open:': '568.00', "Day's Range:": '562.53 - 569.77', 'Prev Close:': '566.37'}

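Best answer

This is based on what I think you want, but without a proper data sample it is impossible to say for sure. I can't guess how the page is structured. From your description it sounds like the data is irregular, which can't be seen in your example.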

from bs4 import BeautifulSoup

html = """<div class="yui-u first yfi-start-content">
    <div class="yfi_quote_summary">
        <div id="yfi_quote_summary_data" class="rtq_table">
            <table id="table1">
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
            </table>
        </div>
    </div>
</div>"""

bs = BeautifulSoup(html, "html.parser")

# Collect header and data cells in document order
ths = bs.find_all("th")
tds = bs.find_all("td")

# Pair the n-th <th> with the n-th <td> (the built-in zip replaces
# itertools.izip, which does not exist in Python 3)
result = []
for x, y in zip(ths, tds):
    result.append((x.text, y.text))

print(result)
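
As a side note on the traceback in the question: the class values passed to `find` there contain literal double quotes (for example `'"yui-u first yfi-start-content"'`), so nothing matches and `find` returns `None`. The inner `div`s and the table are also descendants of the outer `div`, not its siblings, so `findNextSibling` would not reach them even with a correct match. The following is a minimal sketch of one way around both points, not a definitive fix. It assumes the markup is nested as in the question's snippet; the closing tags and the second row without a `<td>` are additions made up for illustration. Reading row by row keeps a label from being paired with the wrong value when a row is incomplete.

```python
from bs4 import BeautifulSoup

# Same nesting as the question's snippet, with closing tags added.
html = (
    '<div class="yui-u first yfi-start-content">'
    '<div class="yfi_quote_summary">'
    '<div id="yfi_quote_summary_data" class="rtq_table">'
    '<table id="table1">'
    '<tr><th scope="row" width="48%">Prev Close:</th>'
    '<td class="yfnc_tabledata1">565.07</td></tr>'
    # Hypothetical irregular row (no <td>), added for illustration only.
    '<tr><th scope="row" width="48%">Open:</th></tr>'
    '</table></div></div></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Literal double quotes inside the value match nothing, so find() returns None.
broken = soup.find("div", attrs={"class": '"yui-u first yfi-start-content"'})
print(broken)

# Match on a single class name, then search the descendants of that div.
outer = soup.find("div", class_="yfi-start-content")
table = outer.find("table", id="table1")

# Walk the table row by row and skip rows that lack a label or a value.
pairs = []
for row in table.find_all("tr"):
    th = row.find("th")
    td = row.find("td")
    if th is None or td is None:
        continue
    pairs.append((th.get_text(strip=True), td.get_text(strip=True)))

print(pairs)
```

Here `pairs` is a list of `(label, value)` tuples; `dict(pairs)` would turn it into a lookup table if the labels are unique.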

Edit:

Here is a solution using the Yahoo API; please consider using it instead:

import requests

URL = "https://query.yahooapis.com/v1/public/yql"

query = 'select * from yahoo.finance.quotes where symbol in ("GOOG")'

params = {
    "q": query,
    "format": "json",
    "env": "store://datatables.org/alltableswithkeys"
}

data = requests.get(URL, params=params).json()

print(data['query']['results']['quote']['PreviousClose'])
print(data['query']['results']['quote']['Open'])

This will print:

565.07
561.78
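
Chained indexing such as `data['query']['results']['quote']` raises an exception as soon as one level is missing or null. Below is a small sketch of a more defensive lookup. The helper name `get_quote_fields` is hypothetical, and the `sample` payload is hand-written to mirror only the nesting used above; it is not a captured API response.

```python
def get_quote_fields(payload, fields):
    """Return {field: value} for the requested fields, with None when absent."""
    # Step down one level at a time, treating a missing or null level as empty.
    query = payload.get("query") or {}
    results = query.get("results") or {}
    quote = results.get("quote") or {}
    return {name: quote.get(name) for name in fields}


# Hand-written stand-in for the parsed JSON (assumption, not real output).
sample = {
    "query": {
        "results": {
            "quote": {"PreviousClose": "565.07", "Open": "561.78"}
        }
    }
}

picked = get_quote_fields(sample, ["PreviousClose", "Open", "Volume"])
print(picked)

# A payload with a null "results" level yields None values instead of an error.
empty = get_quote_fields({"query": {"results": None}}, ["PreviousClose"])
print(empty)
```

The values stay as strings, as in the printed output above; converting them with `float()` is left to the caller.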

These are the data fields available for a stock:

AfterHoursChangeRealtime
AnnualizedGain
Ask
AskRealtime
AverageDailyVolume
Bid
BidRealtime
BookValue
Change
Change_PercentChange
ChangeFromFiftydayMovingAverage
ChangeFromTwoHundreddayMovingAverage
ChangeFromYearHigh
ChangeFromYearLow
ChangeinPercent
ChangePercentRealtime
ChangeRealtime
Commission
Currency
DaysHigh
DaysLow
DaysRange
DaysRangeRealtime
DaysValueChange
DaysValueChangeRealtime
DividendPayDate
DividendShare
DividendYield
EarningsShare
EBITDA
EPSEstimateCurrentYear
EPSEstimateNextQuarter
EPSEstimateNextYear
ErrorIndicationreturnedforsymbolchangedinvalid
ExDividendDate
FiftydayMovingAverage
HighLimit
HoldingsGain
HoldingsGainPercent
HoldingsGainPercentRealtime
HoldingsGainRealtime
HoldingsValue
HoldingsValueRealtime
LastTradeDate
LastTradePriceOnly
LastTradeRealtimeWithTime
LastTradeTime
LastTradeWithTime
LowLimit
MarketCapitalization
MarketCapRealtime
MoreInfo
Name
Notes
OneyrTargetPrice
Open
OrderBookRealtime
PEGRatio
PERatio
PERatioRealtime
PercebtChangeFromYearHigh
PercentChange
PercentChangeFromFiftydayMovingAverage
PercentChangeFromTwoHundreddayMovingAverage
PercentChangeFromYearLow
PreviousClose
PriceBook
PriceEPSEstimateCurrentYear
PriceEPSEstimateNextYear
PricePaid
PriceSales
SharesOwned
ShortRatio
StockExchange
symbol
Symbol
TickerTrend
TradeDate
TwoHundreddayMovingAverage
Volume
YearHigh
YearLow
YearRange

Source: the original question, "python - Scraping text between nondescript tags", on Stack Overflow: https://stackoverflow.com/questions/25148648/
