api - How to load one selected file of a too-large Kaggle dataset from Kaggle into Colab

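If I want to switch from a Kaggle notebook to a Colab notebook, I can download the notebook from Kaggle and open it in Google Colab. The problem is that you usually also need to download and re-upload the Kaggle dataset, which is quite a lot of work.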

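If you have a small dataset, or only need one smaller file of a dataset, you can put the data into the same folder structure that the Kaggle notebook expects. You therefore need to create that structure in Google Colab, for example kaggle/input/ or whatever the notebook uses, and upload the file there. That part is not the problem.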
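The folder layout described above can be recreated in a few lines. This is only a minimal sketch and not part of the original question: the helper name `make_kaggle_input_dir`, the root directory and the dataset folder name are assumptions for illustration, so adjust them to the paths your notebook actually reads.

```python
from pathlib import Path


def make_kaggle_input_dir(root, dataset_name):
    """Create <root>/kaggle/input/<dataset_name> and return it as a Path.

    Hypothetical helper: it mirrors the kaggle/input/ layout mentioned above
    so that paths in the notebook need little or no change after the move.
    """
    target = Path(root) / "kaggle" / "input" / dataset_name
    # parents=True builds the intermediate folders; exist_ok=True makes reruns safe.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

In Colab one might call `make_kaggle_input_dir("/content", "CORD-19-research-challenge")` and then upload the file into the returned folder.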

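If you have a large dataset, you can either:

mount your Google Drive and use the dataset/files from there

or download the Kaggle dataset from Kaggle into Colab by following the official Colab guide, Easiest way to download kaggle data in Google Colab; follow the link for more details: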

Please follow the steps below to download and use kaggle data within Google Colab:

Go to your Kaggle account, scroll to the API section and click Expire API Token to remove previous tokens.

Click on Create New API Token. It will download a kaggle.json file to your machine.

Go to your Google Colab project file and run the following commands:
! pip install -q kaggle
Choose the kaggle.json file that you downloaded:
from google.colab import files

files.upload()
Make a directory named .kaggle and copy the kaggle.json file there:
! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/
Change the permissions of the file:
! chmod 600 ~/.kaggle/kaggle.json
That's all! You can check if everything's okay by running this command:
! kaggle datasets list
Download the data:
! kaggle competitions download -c 'name-of-competition'

Or, if you want to download a dataset (taken from the comments):

! kaggle datasets download -d USERNAME/DATASET_NAME
You can get these dataset names (if unclear) from "copy API command" in the "three-dots drop down" next to "New Notebook" button on the Kaggle dataset page.
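The token setup from the quoted guide (create the directory, copy kaggle.json, restrict its permissions) can also be done from Python instead of shell commands. The following is a minimal sketch under assumptions: `install_kaggle_token` is a hypothetical helper name, and the optional `home` parameter exists only so the function can be tried against a scratch directory.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(token_file, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper; `home` defaults to the current user's home directory.
    """
    home = Path(home) if home is not None else Path.home()
    config_dir = home / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p ~/.kaggle`
    target = config_dir / "kaggle.json"
    shutil.copyfile(token_file, target)  # like `cp kaggle.json ~/.kaggle/`
    os.chmod(target, 0o600)  # like `chmod 600 ~/.kaggle/kaggle.json`
    return target
```

After `files.upload()` has placed kaggle.json in the working directory, `install_kaggle_token("kaggle.json")` would cover the remaining setup steps in one call.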

Here is the problem: this seems to work only for smaller datasets. I tried

kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge

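and the API does not find it, possibly because a 40 GB download is restricted: 404 - Not Found.

In that case, you can only download the file you need and use the mounted Google Drive, or you have to use Kaggle instead of Colab.

Is there a way to download just the 800 MB metadata.csv file of the 40 GB CORD-19 Kaggle dataset into Colab? Here is the link to the file's information page: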

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv

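I have now loaded the file into Google Drive, and I am curious whether that is already the best approach. It is quite laborious compared with Kaggle, where the whole dataset is already available without any download and loads quickly.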

PS: After downloading the zip file from Kaggle into Colab, it needs to be extracted. Quoting the guide further:

Use unzip command to unzip the data:

For example, create a directory named train,
   ! mkdir train
unzip train data there,
   ! unzip train.zip -d train
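The same extraction can be done without the shell by using Python's standard `zipfile` module. This is a sketch rather than part of the quoted guide; `extract_zip` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract zip_path into dest_dir (created if missing).

    Returns the list of member names contained in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # like `mkdir train`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # like `unzip train.zip -d train`
        return archive.namelist()
```

For the example from the guide, `extract_zip("train.zip", "train")` would unpack the archive into the train directory.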

Update: I recommend mounting Google Drive

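After trying both approaches (mounting Google Drive or loading directly from Kaggle), I recommend mounting Google Drive if your setup allows it. The advantage is that the file only needs to be uploaded once, since Google Colab and Google Drive are directly connected. Mounting Google Drive does require extra steps: downloading the file from Kaggle, unzipping it and uploading it to Google Drive, plus obtaining and activating a token in every Python session to mount the drive. Activating the token is quick, though. With Kaggle, you need to transfer the file from Kaggle to Google Colab in every session, which costs more time and traffic.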
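Mounting the drive could look like the sketch below. `drive.mount` is the Colab call for this; everything else is an assumption for illustration: the `drive_file` helper, the `cord19` folder name, and the check on `sys.modules`, which is a commonly used way to detect a Colab runtime so that the code does nothing elsewhere.

```python
import sys
from pathlib import Path

# Colab preloads the google.colab package, so finding it in sys.modules
# is a common (not official) way to detect the Colab runtime.
IN_COLAB = "google.colab" in sys.modules


def drive_file(mount_point, *relative_parts):
    """Build the path of a file stored under 'MyDrive' of a mounted Drive."""
    return Path(mount_point, "MyDrive", *relative_parts)


if IN_COLAB:
    from google.colab import drive

    # Asks for authorization; this is the per-session step mentioned above.
    drive.mount("/content/drive")
    # "cord19" is a hypothetical folder that the file was uploaded to.
    metadata_path = drive_file("/content/drive", "cord19", "metadata.csv")
    print(metadata_path, metadata_path.exists())
```

Once the drive is mounted, the file can be read from `metadata_path` like any local file.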

Best answer

You can write a script that downloads only certain files, or downloads the files one after another:

import os

os.environ['KAGGLE_USERNAME'] = "YOUR_USERNAME_HERE"
os.environ['KAGGLE_KEY'] = "YOUR_TOKEN_HERE"

!kaggle datasets files allen-institute-for-ai/CORD-19-research-challenge

!kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge -f metadata.csv
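The "script" idea can be sketched in plain Python by wrapping the same CLI call in a loop. This is a minimal sketch under assumptions: `build_download_command` and `download_files` are hypothetical helper names, the `-p` flag (download directory) is the CLI's path option, and the `runner` parameter exists only so the loop can be exercised without network access. It assumes the kaggle package is installed and the credentials are set as in the snippet above.

```python
import subprocess


def build_download_command(dataset, file_name, dest_dir="."):
    """Return the kaggle CLI call that fetches one file of a dataset."""
    return ["kaggle", "datasets", "download", dataset,
            "-f", file_name, "-p", dest_dir]


def download_files(dataset, file_names, dest_dir=".", runner=subprocess.run):
    """Download the given files one after another; return the commands used."""
    commands = []
    for name in file_names:
        command = build_download_command(dataset, name, dest_dir)
        runner(command, check=True)  # subprocess.run raises CalledProcessError on failure
        commands.append(command)
    return commands
```

For the file in question this would be `download_files("allen-institute-for-ai/CORD-19-research-challenge", ["metadata.csv"], "/content")`. The downloaded file may arrive as a zip archive, in which case it still has to be extracted.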

Regarding api - How to load one selected file of a too-large Kaggle dataset from Kaggle into Colab, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67713193/
