I want to download a public dataset from the NIMH Data Archive. After creating an account on their website and accepting their Data Usage Agreement, I can download a CSV file that contains the paths of all the files in the dataset I am interested in. Each path is of the form s3://NDAR_Central_1/...
1 Downloading on my personal computer
In the NDA Github repository, the nda-tools Python library exposes some useful Python code for downloading those files to one's own computer. Say I want to download the following file:
s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz
Given my username (USRNAME) and password (PASSWD) (the ones I used to create my account on the NIMH Data Archive), the following code allows me to download this file to TARGET_PATH on my personal computer:
from NDATools.clientscripts.downloadcmd import configure
from NDATools.Download import Download
config = configure(username=USRNAME, password=PASSWD)
s3Download = Download(TARGET_PATH, config)
target_fnames = ['s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz']
s3Download.get_links('paths', target_fnames, filters=None)
s3Download.get_tokens()
s3Download.start_workers(False, None, 1)
Under the hood, the get_tokens method of s3Download will use USRNAME and PASSWD to generate a temporary access key, secret key and security token. The start_workers method will then use the boto3 and s3transfer Python libraries to download the selected file.
Everything works fine!
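Under the same assumptions, the download step can also be sketched directly with boto3 once the temporary credentials exist. The `split_s3_uri` helper and `download_with_temp_creds` wrapper below are mine, not part of nda-tools; `boto3.client('s3')` and `download_file` are standard boto3 calls:

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip('/')

def download_with_temp_creds(uri, target, access_key, secret_key, session_token):
    """Download one S3 object using the temporary credentials from get_tokens."""
    import boto3  # deferred import so split_s3_uri works without boto3 installed
    bucket, key = split_s3_uri(uri)
    s3 = boto3.client(
        's3',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        aws_session_token=session_token,
    )
    s3.download_file(bucket, key, target)
```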
2 Downloading to a GCP bucket
Now, suppose I have created a project on GCP and would like to download this file directly into a GCP bucket.
Ideally, I would like to do something like this:
gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket
To do so, I execute the following Python code in Cloud Shell (after launching python3):
from NDATools.TokenGenerator import NDATokenGenerator
data_api_url = 'https://nda.nih.gov/DataManager/dataManager'
generator = NDATokenGenerator(data_api_url)
token = generator.generate_token(USRNAME, PASSWD)
This gives me an access key, a secret key and a session token. In what follows, ACCESS_KEY refers to the value of token.access_key, SECRET_KEY refers to the value of token.secret_key, and SECURITY_TOKEN refers to the value of token.session.
Then I set these credentials as environment variables in Cloud Shell:
export AWS_ACCESS_KEY_ID=[copy-paste ACCESS_KEY here]
export AWS_SECRET_ACCESS_KEY=[copy-paste SECRET_KEY here]
export AWS_SECURITY_TOKEN=[copy-paste SECURITY_TOKEN here]
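To avoid copy-paste slips, the three export lines can be generated from the token object inside the python3 session. This is a minimal sketch; `export_lines` is a hypothetical helper of mine, and the attribute names follow the token fields described above:

```python
def export_lines(access_key, secret_key, session_token):
    # Render the export statements with no spaces around '=':
    # bash requires 'VAR=value' for an assignment, so
    # 'export VAR = value' would fail.
    return [
        f'export AWS_ACCESS_KEY_ID={access_key}',
        f'export AWS_SECRET_ACCESS_KEY={secret_key}',
        f'export AWS_SECURITY_TOKEN={session_token}',
    ]

# Usage, assuming the token object from NDATokenGenerator above:
# print('\n'.join(export_lines(token.access_key, token.secret_key, token.session)))
```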
Finally, I also set up a .boto configuration file in my home directory. It looks like this:
[Credentials]
aws_access_key_id = $AWS_ACCESS_KEY_ID
aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
aws_session_token = $AWS_SECURITY_TOKEN
[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
use-sigv4=True
host=s3.us-east-1.amazonaws.com
When I run the following command:
gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket
I end up with:
AccessDeniedException: 403 AccessDenied
The full traceback is below:
Non-MD5 etag ("a21a0b2eba27a0a32a26a6b30f3cb060-6") present for key <Key: NDAR_Central_1,submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz>, data integrity checks are not possible.
Copying s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz [Content-Type=application/x-gzip]...
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/google/google-cloud-sdk/platform/gsutil/gslib/daisy_chain_wrapper.py", line 213, in PerformDownload
decryption_tuple=self.decryption_tuple)
File "/google/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 353, in GetObjectMedia
decryption_tuple=decryption_tuple)
File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 590, in GetObjectMedia
generation=generation)
File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 1723, in _TranslateExceptionAndRaise
raise translated_exception # pylint: disable=raising-bad-type
AccessDeniedException: AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>
AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>
I would like to be able to directly download this file from an S3 bucket to my GCP bucket (without having to create a VM, set up Python and run the code above [which works]). Why is it that the temporary generated credentials work on my computer but do not work in GCP Cloud Shell?
The full log of the debugging command
gsutil -DD cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket
can be found here.
Best Answer
The process you are trying to achieve is called a "Transfer Job".
To transfer a file from an Amazon S3 bucket to a Cloud Storage bucket:
A. Click the Burger Menu on the top left corner
B. Go to Storage > Transfer
C. Click Create Transfer
Under Select source, select Amazon S3 bucket.
In the Amazon S3 bucket text box, specify the source Amazon S3 bucket name. The bucket name is the name as it appears in the AWS Management Console.
In the respective text boxes, enter the Access key ID and Secret key associated with the Amazon S3 bucket.
To specify a subset of files in your source, click Specify file filters beneath the bucket field. You can include or exclude files based on file name prefix and file age.
Under Select destination, choose a sink bucket or create a new one.
- To choose an existing bucket, enter the name of the bucket (without the prefix gs://), or click Browse and browse to it.
- To transfer files to a new bucket, click Browse and then click the New bucket icon.
Enable overwrite/delete options if needed.
By default, your transfer job only overwrites an object when the source version is different from the sink version. No other objects are overwritten or deleted. Enable additional overwrite/delete options under Transfer options.
Under Configure transfer, schedule your transfer job to Run now (one time) or Run daily at the local time you specify.
Click Create.
Before setting up the transfer job, make sure you have assigned the necessary roles to your account, along with the required permissions described here.
Also note that the Storage Transfer Service is currently available only for certain Amazon S3 regions, as described in the AMAZON S3 tab of the transfer-job setup.
Transfer jobs can also be created programmatically. More information here.
Please let me know if this helps.
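For reference, the programmatic route goes through the Storage Transfer Service REST API (storagetransfer, v1). The sketch below is mine: build_transfer_job assembles a one-time job body using the field names from the REST API (awsS3DataSource, gcsDataSink), and submit requires google-api-python-client plus GCP credentials to actually run. Note that awsAccessKey accepts only an access key id and secret key; there is no field for a session token, which is consistent with temporary credentials not being supported:

```python
import datetime

def build_transfer_job(project_id, s3_bucket, gcs_bucket,
                       aws_access_key, aws_secret_key):
    """Assemble a one-time S3 -> GCS transfer job body for storagetransfer v1."""
    today = datetime.date.today()
    day = {'year': today.year, 'month': today.month, 'day': today.day}
    return {
        'description': 'One-time S3 -> GCS transfer',
        'status': 'ENABLED',
        'projectId': project_id,
        # Identical start and end dates make the job run once.
        'schedule': {'scheduleStartDate': day, 'scheduleEndDate': day},
        'transferSpec': {
            'awsS3DataSource': {
                'bucketName': s3_bucket,
                'awsAccessKey': {
                    'accessKeyId': aws_access_key,
                    'secretAccessKey': aws_secret_key,
                    # No session-token field exists in this message.
                },
            },
            'gcsDataSink': {'bucketName': gcs_bucket},
        },
    }

def submit(job):
    # Requires: pip install google-api-python-client, plus GCP credentials
    # (e.g. application default credentials in Cloud Shell).
    from googleapiclient import discovery
    client = discovery.build('storagetransfer', 'v1')
    return client.transferJobs().create(body=job).execute()
```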
EDIT
Neither the Transfer Service nor the gsutil command currently supports "temporary security credentials", even though AWS supports them. A workaround to do what you want would be to change the source code of the gsutil command.
I have also filed a Feature Request on your behalf; I suggest you star it in order to receive updates on its progress.
Regarding "python - Transferring data from an S3 bucket to a GCP bucket using temporary credentials", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59376583/