python - Open a CSV file from a website directly in pandas, without downloading it to a folder

Tags: python pandas csv selenium

This website includes an "Export Data" link that downloads the page's contents to a CSV file. The button doesn't link to a CSV file directly; instead it runs a JavaScript routine. I'd like to open the CSV directly with pandas, rather than downloading it, hunting down the downloads folder, and opening it from there. Is that possible?

My existing code uses Selenium to click the button, although if there's a better way to do this, I'd love to hear it.

from selenium import webdriver

# assign chrome driver path to variable
chrome_path = chromepath

# create browser object
driver = webdriver.Chrome(chrome_path)

# assign url variable    
url = 'http://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0&sort=24%2cd'

# navigate to web page    
driver.get(url)

# click export data button    
driver.find_element_by_link_text("Export Data").click()

#close driver
driver.quit()
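As a side note on the "better way": if the actual export endpoint's URL can be found (for example by watching the browser's network tab while clicking "Export Data"), the downloads folder can be skipped entirely by fetching the CSV with requests and feeding the response body straight into pandas. A minimal sketch; the URL here is a placeholder, not the real FanGraphs export endpoint:

```python
import io

import pandas as pd
import requests


# hypothetical export URL -- substitute the real one found in dev tools
csv_url = "https://example.com/export.csv"


def read_remote_csv(url: str) -> pd.DataFrame:
    """Fetch a CSV over HTTP and load it into pandas without touching disk."""
    response = requests.get(url)
    response.raise_for_status()
    # io.StringIO wraps the response text so read_csv treats it like a file
    return pd.read_csv(io.StringIO(response.text))
```

`pd.read_csv` also accepts a URL directly, but going through requests lets you set headers, cookies, or a session if the site requires them.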

Best Answer

Happened to come across this question, and I have a script that should work if you change the URL. Instead of using Selenium to download the CSV, it uses BeautifulSoup to scrape the table from the page and pandas to build the table for CSV export.

Just make sure the URL ends with "page=1_100000" so that you get all of the rows. Let me know if you have any questions.
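If you build the URL programmatically, the `page` parameter can be forced with the standard library rather than string surgery. A small sketch (the helper name `with_page` is mine, not from the answer):

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def with_page(url: str, page: str = "1_100000") -> str:
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    params["page"] = [page]
    # doseq=True expands the list values that parse_qs produces
    return urlunparse(parts._replace(query=urlencode(params, doseq=True)))


print(with_page("https://www.fangraphs.com/leaders.aspx?pos=np&page=1_30"))
# → https://www.fangraphs.com/leaders.aspx?pos=np&page=1_100000
```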

import requests
from random import choice
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import parse_qs

desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']

url = "https://www.fangraphs.com/leaders.aspx?pos=np&stats=bat&lg=all&qual=0&type=c,4,6,5,23,9,10,11,13,12,21,22,60,18,35,34,50,40,206,207,208,44,43,46,45,24,26,25,47,41,28,110,191,192,193,194,195,196,197,200&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000"

def random_headers():
    # pick a random desktop User-Agent so repeated requests look less uniform
    return {
        'User-Agent': choice(desktop_agents),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }

# fetch the page with a randomized User-Agent
page_request = requests.get(url, headers=random_headers())
soup = BeautifulSoup(page_request.text, "lxml")

# the stats grid is the 12th <table> element on the page
table = soup.find_all('table')[11]
data = []
# pulls headings from the fangraphs table
column_headers = []
headingrows = table.find_all('th')
for row in headingrows:
    column_headers.append(row.text.strip())

data.append(column_headers)
table_body = table.find('tbody')
rows = table_body.find_all('tr')

for row in rows:
    cols = [ele.text.strip() for ele in row.find_all('td')]
    data.append(cols[1:])  # drop the leading row-number column

ID = []

# each player row links to statss.aspx?playerid=...; collect the parsed query
for tag in soup.select('a[href^="statss.aspx?playerid="]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)

df1 = pd.DataFrame(data)
df1 = df1.rename(columns=df1.iloc[0])
df1 = df1.loc[1:].reset_index(drop=True)

df2 = pd.DataFrame(ID)
df2.drop(['position'], axis=1, inplace=True, errors='ignore')
# parse_qs keys the id under the literal string 'statss.aspx?playerid',
# with each value wrapped in a list, hence the .str[0]
df2['statss.aspx?playerid'] = df2['statss.aspx?playerid'].str[0]

df3 = pd.concat([df1, df2], axis=1)

df3.to_csv("HittingGA2018.csv")
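One subtlety in the ID-extraction step: the player links are relative (`statss.aspx?playerid=...`), and `parse_qs` treats the entire string as a query string, so the path fragment fuses with the first parameter name. That is why the playerid column in the answer's DataFrame carries the odd name `'statss.aspx?playerid'`. A minimal illustration with a made-up playerid:

```python
from urllib.parse import parse_qs

# a made-up relative link in the same shape as the FanGraphs player links
link = "statss.aspx?playerid=12345&position=OF"

query = parse_qs(link)
# the path fuses with the first key, and every value comes back as a list
print(query)
# → {'statss.aspx?playerid': ['12345'], 'position': ['OF']}
```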

Regarding python - Open a CSV file from a website directly in pandas without downloading it to a folder, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48218781/
