python - Extracting data using Python

I am interested in extracting historical prices from this link: https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=KEL

To do so, I am using the following code:

from io import StringIO

import requests
import pandas as pd
import time as t

t0 = t.time()

symbols = [
    'HMIM',
    'CWSM', 'DSIL', 'RAVT', 'PIBTL', 'PICT', 'PNSC', 'ASL',
    'DSL', 'ISL', 'CSAP', 'MUGHAL', 'DKL', 'ASTL', 'INIL']

for symbol in symbols:
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(symbol), headers=header)
    # Parse every HTML table on the page; the price table is the 7th one
    dfs = pd.read_html(StringIO(r.text))
    df = dfs[6]
    # Skip the first two rows (.ix was removed in pandas 1.0, so use .iloc)
    df = df.iloc[2:]
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbol),
              columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label=['Date'])

    print(symbol)


t1 = t.time()
print('exec time is ', t1 - t0, 'seconds')

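The code above extracts the data from the link, converts it into a pandas DataFrame and saves it.

The problem is that it takes a lot of time and is not efficient for a larger number of symbols. Can anyone suggest another way to achieve the above result efficiently?

Also, is there any other programming language that could do the same job in less time?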

Best answer

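A normal GET request with requests is "blocking": one request is sent, then one response is received and processed. At least part of your processing time is spent waiting for responses. With requests-futures we can instead send all the requests asynchronously and then collect the responses as they become ready.

That said, I think DSIL is timing out or something similar (I need to look into it further). While I was able to get a decent speedup by picking symbols at random, both methods took approximately the same amount of time whenever DSIL was in the list.

EDIT: It seems I lied; it was just an unfortunate coincidence with "DSIL" on multiple occasions. The more tickers you have in symbols, the faster the async method is relative to standard requests.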

import requests
from requests_futures.sessions import FuturesSession
import time

start_sync = time.time()

symbols = ['HMIM', 'CWSM', 'RAVT', 'ASTL', 'INIL']

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

# Baseline: blocking requests, one after another
for symbol in symbols:
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(symbol), headers=header)

end_sync = time.time()

start_async = time.time()
# Setup
session = FuturesSession(max_workers=10)
pooled_requests = []

# Gather request URLs
for symbol in symbols:
    request = 'https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(symbol)
    pooled_requests.append(request)

# Fire the requests; session.get returns a future immediately
fire_requests = [session.get(url, headers=header) for url in pooled_requests]
# Wait for each future in order, so responses line up with symbols
responses = [item.result() for item in fire_requests]

end_async = time.time()

print("Synchronous requests took: {}".format(end_sync - start_sync))
print("Async requests took:       {}".format(end_async - start_async))

With the code above I get the responses about 3x faster. You can iterate over the responses list and process each response as normal.
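To see why the gap grows with the number of symbols without hitting the real server, here is a minimal sketch that replaces the network call with `time.sleep`. The names `fake_request`, `time_sequential` and `time_concurrent` are hypothetical, the 0.05 second delay is an arbitrary assumption, and the thread pool from the standard library stands in for what requests-futures does with its worker threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(symbol, delay=0.05):
    """Stand-in for a network call: sleep to mimic waiting on the server."""
    time.sleep(delay)
    return symbol


def time_sequential(symbols):
    """Call fake_request once per symbol, one after another."""
    start = time.perf_counter()
    results = [fake_request(s) for s in symbols]
    return results, time.perf_counter() - start


def time_concurrent(symbols, max_workers=10):
    """Run the same calls on a thread pool so the waits overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs
        results = list(pool.map(fake_request, symbols))
    return results, time.perf_counter() - start
```

The sequential time grows roughly linearly with the number of symbols, while the concurrent time stays close to a single delay as long as the number of symbols does not exceed `max_workers`.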

EDIT 2: Going through the responses of the async requests and saving them as you did before:

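from io import StringIO
import pandas as pd

# responses is in the same order as symbols, so index i pairs them up
for i, r in enumerate(responses):
    dfs = pd.read_html(StringIO(r.text))
    df = dfs[6]
    # Skip the first two rows (.ix was removed in pandas 1.0, so use .iloc)
    df = df.iloc[2:]
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbols[i]),
              columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label=['Date'])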
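If a single request fails (for example with a timeout, when one is set), calling `result()` on its future re-raises the exception and the remaining responses are never collected. The following is a rough sketch of one way around that, using the standard library's `concurrent.futures` directly rather than requests-futures. `fetch_all` and the `fetch` callable it receives are hypothetical names introduced here, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(symbols, fetch, max_workers=10):
    """Run fetch(symbol) concurrently.

    Returns (results, errors): two dicts keyed by symbol, holding the
    return value of each successful call and the exception of each
    failed one.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the symbol it was submitted for
        futures = {pool.submit(fetch, s): s for s in symbols}
        # as_completed yields futures as they finish, in any order
        for future in as_completed(futures):
            symbol = futures[future]
            try:
                results[symbol] = future.result()
            except Exception as exc:
                # Record the failure and keep collecting the rest
                errors[symbol] = exc
    return results, errors
```

For the price pages, `fetch` could be a small function wrapping `requests.get(url, headers=header, timeout=10)`; the 10 second timeout is an arbitrary choice. Symbols that end up in `errors` can then be retried separately without repeating the ones that succeeded.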

Regarding python - Extracting data using Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43476692/
