I want to download a public dataset from the NIMH Data Archive. After creating an account on their website and accepting their Data Usage Agreement, I can download a CSV file that contains the paths of all the files in the dataset I am interested in. Each path is of the form s3://NDAR_Central_1/...
1 Downloading on my personal computer
In the NDA Github repository, the nda-tools Python library exposes some useful Python code for downloading those files to one's own computer. Say I want to download the following file:
s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz
Given my username (USRNAME) and password (PASSWD) (the ones I used to create my account on the NIMH Data Archive), the following code allows me to download this file to TARGET_PATH on my personal computer:
from NDATools.clientscripts.downloadcmd import configure
from NDATools.Download import Download
config = configure(username=USRNAME, password=PASSWD)
s3Download = Download(TARGET_PATH, config)
target_fnames = ['s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz']
s3Download.get_links('paths', target_fnames, filters=None)
s3Download.get_tokens()
s3Download.start_workers(False, None, 1)
Under the hood, the get_tokens method of s3Download will use USRNAME and PASSWD to generate a temporary access key, secret key and security token. The start_workers method will then use the boto3 and s3transfer Python libraries to download the selected file.
Everything works fine!
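Under the same assumptions, the download step can also be sketched directly with boto3 once the temporary credentials exist. The `split_s3_uri` helper and `download_with_temp_creds` wrapper below are mine, not part of nda-tools; `boto3.client('s3')` and `download_file` are standard boto3 calls:

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip('/')

def download_with_temp_creds(uri, target, access_key, secret_key, session_token):
    """Download one S3 object using the temporary credentials from get_tokens."""
    import boto3  # deferred import so split_s3_uri works without boto3 installed
    bucket, key = split_s3_uri(uri)
    s3 = boto3.client(
        's3',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        aws_session_token=session_token,
    )
    s3.download_file(bucket, key, target)
```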
2 Downloading to a GCP bucket
Now, suppose I have created a project on GCP and would like to download this file directly into a GCP bucket.
Ideally, I would like to do something like this:
gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket
To do so, I execute the following Python code in Cloud Shell (after launching python3):
from NDATools.TokenGenerator import NDATokenGenerator
data_api_url = 'https://nda.nih.gov/DataManager/dataManager'
generator = NDATokenGenerator(data_api_url)
token = generator.generate_token(USRNAME, PASSWD)
This gives me an access key, a secret key and a session token. In what follows, ACCESS_KEY refers to the value of token.access_key, SECRET_KEY refers to the value of token.secret_key, and SECURITY_TOKEN refers to the value of token.session.
Then I set these credentials as environment variables in Cloud Shell:
export AWS_ACCESS_KEY_ID=[copy-paste ACCESS_KEY here]
export AWS_SECRET_ACCESS_KEY=[copy-paste SECRET_KEY here]
export AWS_SECURITY_TOKEN=[copy-paste SECURITY_TOKEN here]
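To avoid copy-paste slips, the three export lines can be generated from the token object inside the python3 session. This is a minimal sketch; `export_lines` is a hypothetical helper of mine, and the attribute names follow the token fields described above:

```python
def export_lines(access_key, secret_key, session_token):
    # Render the export statements with no spaces around '=':
    # bash requires 'VAR=value' for an assignment, so
    # 'export VAR = value' would fail.
    return [
        f'export AWS_ACCESS_KEY_ID={access_key}',
        f'export AWS_SECRET_ACCESS_KEY={secret_key}',
        f'export AWS_SECURITY_TOKEN={session_token}',
    ]

# Usage, assuming the token object from NDATokenGenerator above:
# print('\n'.join(export_lines(token.access_key, token.secret_key, token.session)))
```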
Finally, I also set up a .boto configuration file in my home directory. It looks like this:
[Credentials]
aws_access_key_id = $AWS_ACCESS_KEY_ID
aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
aws_session_token = $AWS_SECURITY_TOKEN
[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
use-sigv4=True
host=s3.us-east-1.amazonaws.com
When I run the following command:
gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket
I end up with:
AccessDeniedException: 403 AccessDenied
The full traceback is below:
Non-MD5 etag ("a21a0b2eba27a0a32a26a6b30f3cb060-6") present for key <Key: NDAR_Central_1,submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz>, data integrity checks are not possible.
Copying s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz [Content-Type=application/x-gzip]...
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/google/google-cloud-sdk/platform/gsutil/gslib/daisy_chain_wrapper.py", line 213, in PerformDownload
decryption_tuple=self.decryption_tuple)
File "/google/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 353, in GetObjectMedia
decryption_tuple=decryption_tuple)
File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 590, in GetObjectMedia
generation=generation)
File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 1723, in _TranslateExceptionAndRaise
raise translated_exception # pylint: disable=raising-bad-type
AccessDeniedException: AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>
AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>
I would like to be able to directly download this file from an S3 bucket to my GCP bucket (without having to create a VM, set up Python and run the code above [which works]). Why is it that the temporary generated credentials work on my computer but do not work in GCP Cloud Shell?
The full log of the debugging command
gsutil -DD cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket
can be found here.
Best Answer
The process you are trying to achieve is called a "Transfer Job".
To transfer a file from an Amazon S3 bucket to a Cloud Storage bucket:
A. Click the Burger Menu on the top left corner
B. Go to Storage > Transfer
C. Click Create Transfer
Under Select source, select Amazon S3 bucket.
In the Amazon S3 bucket text box, specify the source Amazon S3 bucket name. The bucket name is the name as it appears in the AWS Management Console.
In the respective text boxes, enter the Access key ID and Secret key associated with the Amazon S3 bucket.
To specify a subset of files in your source, click Specify file filters beneath the bucket field. You can include or exclude files based on file name prefix and file age.
Under Select destination, choose a sink bucket or create a new one.
- To choose an existing bucket, enter the name of the bucket (without the prefix gs://), or click Browse and browse to it.
- To transfer files to a new bucket, click Browse and then click the New bucket icon.
Enable overwrite/delete options if needed.
By default, your transfer job only overwrites an object when the source version is different from the sink version. No other objects are overwritten or deleted. Enable additional overwrite/delete options under Transfer options.
Under Configure transfer, schedule your transfer job to Run now (one time) or Run daily at the local time you specify.
Click Create.
Before setting up the transfer job, make sure you have assigned the necessary roles to your account, along with the required permissions described here.
Also note that the Storage Transfer Service is currently available only for certain Amazon S3 regions, as described in the AMAZON S3 tab of the transfer-job setup.
Transfer jobs can also be created programmatically. More information here.
Please let me know if this helps.
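For reference, the programmatic route goes through the Storage Transfer Service REST API (storagetransfer, v1). The sketch below is mine: build_transfer_job assembles a one-time job body using the field names from the REST API (awsS3DataSource, gcsDataSink), and submit requires google-api-python-client plus GCP credentials to actually run. Note that awsAccessKey accepts only an access key id and secret key; there is no field for a session token, which is consistent with temporary credentials not being supported:

```python
import datetime

def build_transfer_job(project_id, s3_bucket, gcs_bucket,
                       aws_access_key, aws_secret_key):
    """Assemble a one-time S3 -> GCS transfer job body for storagetransfer v1."""
    today = datetime.date.today()
    day = {'year': today.year, 'month': today.month, 'day': today.day}
    return {
        'description': 'One-time S3 -> GCS transfer',
        'status': 'ENABLED',
        'projectId': project_id,
        # Identical start and end dates make the job run once.
        'schedule': {'scheduleStartDate': day, 'scheduleEndDate': day},
        'transferSpec': {
            'awsS3DataSource': {
                'bucketName': s3_bucket,
                'awsAccessKey': {
                    'accessKeyId': aws_access_key,
                    'secretAccessKey': aws_secret_key,
                    # No session-token field exists in this message.
                },
            },
            'gcsDataSink': {'bucketName': gcs_bucket},
        },
    }

def submit(job):
    # Requires: pip install google-api-python-client, plus GCP credentials
    # (e.g. application default credentials in Cloud Shell).
    from googleapiclient import discovery
    client = discovery.build('storagetransfer', 'v1')
    return client.transferJobs().create(body=job).execute()
```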
EDIT
Neither the Transfer Service nor the gsutil command currently supports "temporary security credentials", even though AWS supports them. A workaround to do what you want would be to change the source code of the gsutil command.
I have also filed a Feature Request on your behalf; I suggest you star it in order to receive updates on its progress.
Regarding "python - Transferring data from an S3 bucket to a GCP bucket using temporary credentials", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59376583/