I want to extract historical prices from this link: https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=KEL
To do that, I am using the following code:
import requests
import pandas as pd
import time as t

t0 = t.time()
symbols = [
    'HMIM',
    'CWSM', 'DSIL', 'RAVT', 'PIBTL', 'PICT', 'PNSC', 'ASL',
    'DSL', 'ISL', 'CSAP', 'MUGHAL', 'DKL', 'ASTL', 'INIL']
for symbol in symbols:
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    # One blocking request per symbol
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(str(symbol)), headers=header)
    dfs = pd.read_html(r.text)
    df = dfs[6]  # the price table is the 7th table on the page
    df = df.iloc[2:]  # drop the first two rows (.ix was removed in pandas 1.0)
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)), columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label=['Date'])
    print(symbol)
t1 = t.time()
print('exec time is ', t1 - t0, 'seconds')
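The code above extracts the data from the link, converts it to a pandas DataFrame and saves it.
The problem is that it takes a lot of time and does not scale to a larger number of symbols. Can anyone suggest another way to achieve the same result more efficiently?
Also, is there any other programming language that could do the same job in less time?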
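Best answer
A normal GET request with requests is "blocking": one request is sent, then one response is received and processed. At least part of your processing time is spent waiting for responses. Instead, we can use requests-futures to send all the requests asynchronously and then collect the responses as they become ready.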
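To illustrate why overlapping the waits helps, here is a minimal offline sketch using only the standard library. The slow_fetch function is a hypothetical stand-in for a blocking HTTP GET (it only sleeps), and the 0.2 second delay is an arbitrary assumption; as far as I know, requests-futures runs its requests on the same kind of thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(symbol, delay=0.2):
    # Hypothetical stand-in for a blocking HTTP GET: it only waits, then returns.
    time.sleep(delay)
    return symbol


def timed(run):
    # Call run() and return its result together with the elapsed seconds.
    start = time.time()
    result = run()
    return result, time.time() - start


def run_pool():
    # Submit every call to a thread pool so the waits overlap.
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(slow_fetch, symbols))


symbols = ["HMIM", "CWSM", "RAVT", "ASTL", "INIL"]

# One after another: total time is roughly the sum of the waits.
sequential, t_seq = timed(lambda: [slow_fetch(s) for s in symbols])

# In a thread pool: total time is roughly a single wait.
pooled, t_pool = timed(run_pool)

print("sequential: {:.2f}s, pooled: {:.2f}s".format(t_seq, t_pool))
```

Both runs return the symbols in the same order; only the elapsed time differs, and the gap grows with the number of symbols.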
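That said, I think DSIL is timing out or something similar (I would need to look into it further). While I was able to get a decent speedup by picking symbols at random from the list, both approaches took approximately the same time whenever DSIL was in the list.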
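Edit: It seems I lied; it was just an unfortunate coincidence involving 'DSIL' on several runs. The more tags you have in symbols, the faster the async method becomes relative to standard requests.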
import requests
from requests_futures.sessions import FuturesSession
import time

start_sync = time.time()
symbols = ['HMIM', 'CWSM', 'RAVT', 'ASTL', 'INIL']
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
# Synchronous baseline: each request waits for the previous one to finish
for symbol in symbols:
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(str(symbol)), headers=header)
end_sync = time.time()

start_async = time.time()
# Setup
session = FuturesSession(max_workers=10)
pooled_requests = []
# Gather request URLs
for symbol in symbols:
    request = 'https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(symbol)
    pooled_requests.append(request)
# Fire the requests
fire_requests = [session.get(url, headers=header) for url in pooled_requests]
# Block until every response has arrived (same order as symbols)
responses = [item.result() for item in fire_requests]
end_async = time.time()

print("Synchronous requests took: {}".format(end_sync - start_sync))
print("Async requests took: {}".format(end_async - start_async))
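With the code above, I received the responses about 3x faster. You can iterate over the responses list and process each response as normal.
Edit 2: Going through the responses of the async requests and saving them as you did before: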
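import pandas as pd

# responses is in the same order as symbols, so index i matches symbols[i]
for i, r in enumerate(responses):
    dfs = pd.read_html(r.text)
    df = dfs[6]
    df = df.iloc[2:]  # drop the first two rows (.ix was removed in pandas 1.0)
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbols[i]), columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label=['Date'])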
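If a single symbol fails or times out (as was suspected for DSIL), calling result() on its future re-raises the exception and aborts the whole list. A possible way around that, sketched here with the standard library's concurrent.futures rather than requests-futures, is to handle each future as it completes and record failures per symbol. The names fetch_all and fake_fetch are hypothetical helpers introduced for illustration; fake_fetch simulates the HTTP call so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(symbols, fetch, max_workers=10):
    """Run fetch(symbol) concurrently; return (results, errors) keyed by symbol."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which symbol.
        future_to_symbol = {pool.submit(fetch, s): s for s in symbols}
        # Handle each future as soon as it finishes, in completion order.
        for future in as_completed(future_to_symbol):
            symbol = future_to_symbol[future]
            try:
                results[symbol] = future.result()
            except Exception as exc:  # e.g. a timeout or connection error
                errors[symbol] = exc
    return results, errors


def fake_fetch(symbol):
    # Hypothetical stand-in for the real HTTP call; DSIL simulates a timeout.
    if symbol == "DSIL":
        raise TimeoutError("simulated timeout")
    return "<html>{}</html>".format(symbol)


results, errors = fetch_all(["HMIM", "DSIL", "INIL"], fake_fetch)
print(sorted(results))
print(sorted(errors))
```

In real use, fetch would wrap the requests.get call from the code above (ideally with a timeout argument) and return r.text, and the symbols left in errors could be retried separately.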
Regarding python - extracting data with Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43476692/