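python - Converting a Google Spreadsheet to a Pandas dataframe via PyDrive without downloading

How can I read the content of a Google Spreadsheet into a Pandas dataframe without downloading the file?

I think gspread or df2gspread might be good options, but I have been working with pydrive so far and have come close to a solution.

With PyDrive I managed to get the export link of my spreadsheet, either as a .csv or an .xlsx file. After the authentication process, this looks like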


    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive

    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()
    drive = GoogleDrive(gauth)

    # choose whether to export csv or xlsx
    data_type = 'csv'

    # get list of files in folder as dictionaries
    file_list = drive.ListFile(
        {'q': "'my-folder-ID' in parents and trashed=false"}
    ).GetList()

    export_key = 'exportLinks'

    excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    csv_key = 'text/csv'

    if data_type == 'xlsx':
        urls = [file[export_key][excel_key] for file in file_list]

    elif data_type == 'csv':
        urls = [file[export_key][csv_key] for file in file_list]
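
The list comprehensions above raise a KeyError as soon as the folder contains a file without an exportLinks entry, which is typically the case for anything that is not a native Google document. A minimal sketch of a more defensive variant, assuming each entry behaves like a dictionary with optional 'title' and 'exportLinks' keys; the helper name collect_export_urls is hypothetical and not part of PyDrive:

```python
CSV_MIME = "text/csv"
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def collect_export_urls(file_list, mime_type):
    """Return (title, url) pairs for every file that offers the given export format."""
    pairs = []
    for f in file_list:
        # .get() avoids a KeyError for entries that have no 'exportLinks' field
        link = f.get("exportLinks", {}).get(mime_type)
        if link is not None:
            pairs.append((f.get("title", "<untitled>"), link))
    return pairs
```

With the names from the question, this could be called as collect_export_urls(file_list, CSV_MIME).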

The URL I get for xlsx is of the form

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx

and similarly for csv

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv

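Now, if I click on these links (or visit them with webbrowser.open(url)), the file is downloaded, and I can then read it into a Pandas dataframe as usual with pandas.read_excel() or pandas.read_csv().

How can I skip the download and read the file directly from these links into a dataframe?

I tried several solutions:

The obvious pd.read_csv(url) gives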

    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

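Interestingly, these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, which hints that the script is trying to read something other than what was intended.

The analogous pd.read_excel(url) gives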

    ValueError: Excel file format cannot be determined, you must specify an engine manually.

and specifying e.g. engine='openpyxl' gives

    zipfile.BadZipFile: File is not a zip file

The BytesIO solution looked promising, but


    from io import BytesIO
    import pandas as pd
    import requests

    r = requests.get(url)
    data = r.content
    df = pd.read_csv(BytesIO(data))

still gives


    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

If I print(data), I get hundreds of lines of HTML code


    b'\n<!DOCTYPE html>\n<html lang="de">\n  <head>\n  <meta charset="utf-8">\n  <meta content="width=300, initial-scale=1" name="viewport">\n 
    ...
    ...
     </script>\n  </body>\n</html>\n'
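
The HTML in this output suggests that the unauthenticated request was answered with a web page (probably a sign-in page, though that is an assumption) rather than with the spreadsheet, which would explain why the reported numbers never change. A minimal sketch of a check that could run before the bytes are handed to pandas; the helper name looks_like_html is hypothetical:

```python
def looks_like_html(content: bytes, content_type: str = "") -> bool:
    """Heuristically decide whether a response body is an HTML page."""
    # The Content-Type header is the most direct hint, when it is available.
    if "text/html" in content_type.lower():
        return True
    # Otherwise inspect the first bytes of the body, ignoring leading whitespace.
    head = content.lstrip()[:100].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

With the response object from the snippet above, this could be called as looks_like_html(r.content, r.headers.get("Content-Type", "")).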

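Best answer

In your situation, how about the following modification? In this case, the access token is retrieved from gauth, the spreadsheet is exported as XLSX data, and the XLSX data is put into a dataframe.

Modified script: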

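    from io import BytesIO

    import pandas as pd
    import requests
    from pydrive.auth import GoogleAuth

    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()

    # replace {spreadsheetId} with the ID of your spreadsheet
    url = "https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=xlsx"
    res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
    values = pd.read_excel(BytesIO(res.content))
    print(values)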

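This script requires the requests library (import requests).
In this case, the first tab of the XLSX data is used.
When you want to use another tab, modify values = pd.read_excel(BytesIO(res.content)) as follows.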
```
sheet = "Sheet2"
values = pd.read_excel(BytesIO(res.content), sheet_name=sheet)
```
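
The same Authorization header should in principle also work for the CSV export link from the question; that is an assumption, not something the answer tested. A minimal sketch that covers both formats, with hypothetical helper names build_export_request and read_sheet:

```python
from io import BytesIO

EXPORT_URL = "https://docs.google.com/spreadsheets/export"


def build_export_request(spreadsheet_id, access_token, export_format="xlsx"):
    """Return the URL, query parameters and headers for an authenticated export."""
    params = {"id": spreadsheet_id, "exportFormat": export_format}
    headers = {"Authorization": "Bearer " + access_token}
    return EXPORT_URL, params, headers


def read_sheet(spreadsheet_id, access_token, export_format="xlsx"):
    """Fetch the export into memory and parse it with pandas."""
    # Third-party imports live here so build_export_request has no dependencies.
    import pandas as pd
    import requests

    url, params, headers = build_export_request(
        spreadsheet_id, access_token, export_format
    )
    res = requests.get(url, params=params, headers=headers)
    res.raise_for_status()  # raise on HTTP error status codes such as 401 or 404
    buffer = BytesIO(res.content)
    if export_format == "csv":
        return pd.read_csv(buffer)
    return pd.read_excel(buffer)
```

With the gauth object from the modified script, a call might look like read_sheet("my-id", gauth.attr['credentials'].access_token, "csv"). Passing the ID through params lets requests build the query string, so no placeholder needs to be edited into the URL by hand.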

This page is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/71278523/
