python - How to stream data line by line from a storage bucket into a Python script

Tags: python google-cloud-platform

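I'm working with large data files stored in Google Cloud. My Python script first downloads a blob containing JSON lines, then opens the local copy to analyze the data line by line. This is very slow, because nothing can be processed until the whole download has finished, and I'd like to know whether there is a faster way. From the command line I can use gsutil cat to stream the data to stdout. Is there something similar in Python?

This is how I currently read the data: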

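from google.cloud import storage

myClient = storage.Client()
bucket = myClient.get_bucket(bucketname)
blob = storage.blob.Blob(blobname, bucket)
# Download the whole blob to local disk before reading anything
blob.download_to_filename("filename.txt")

with open("filename.txt", "r") as file:
    data = file.readlines()  # loads every line into memory at once

for line in data:
    pass  # Do stuff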

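I'd like to read the blob line by line, without waiting for the download to finish.

Edit: I found this answer, but the function isn't clear to me. I don't know how to read the stream line by line.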

Best answer

In the answer you found, stream is a file-like object, so you should be able to use it instead of opening a specific filename. Something like this (untested):

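import io
from google.cloud import storage

myClient = storage.Client()
bucket = myClient.get_bucket(bucketname)
blob = storage.blob.Blob(blobname, bucket)
# An in-memory file-like object replaces the named file; a file opened
# with 'wb' cannot be read back, so io.BytesIO is used here instead
stream = io.BytesIO()
blob.download_to_file(stream)  # note: still fetches the whole blob first
stream.seek(0)  # rewind to the start before reading

for line in stream:
    pass  # Do stuff (each line is a bytes object)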
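If the goal is to start processing before the download completes, recent versions of google-cloud-storage (I believe 1.38 and later) provide Blob.open(), which returns a file-like reader that fetches the object in chunks as you iterate. The following is a minimal sketch along those lines, not a tested solution: the helper names iter_json_lines and stream_blob_records are made up for illustration, and it assumes each non-empty line of the blob holds one JSON document.

```python
import json
from typing import Iterable, Iterator


def iter_json_lines(stream: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def stream_blob_records(bucket_name: str, blob_name: str) -> Iterator[dict]:
    """Hypothetical helper: read a JSON-lines blob without a local copy.

    Assumes a google-cloud-storage version that provides Blob.open().
    """
    # Imported here so the parsing helper above works without the library
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # Text mode: the reader pulls the object in chunks as lines are consumed
    with blob.open("r") as stream:
        yield from iter_json_lines(stream)
```

Usage would look like `for record in stream_blob_records(bucketname, blobname): ...`. Because the parsing helper accepts any iterable of text lines, it can be exercised against an io.StringIO or a local file before pointing it at a real bucket.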

A similar question on this topic (python - How to stream data line by line from a storage bucket into a Python script) can be found on Stack Overflow: https://stackoverflow.com/questions/58171706/
