pandas - Loading a large amount of data from BigQuery into python/pandas/dask

Tags: pandas google-cloud-platform google-bigquery bigdata dask

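I have read other similar threads and searched Google for a better approach, but could not find any workable solution.

I have a large table in BigQuery (say 20 million rows are inserted per day). I want to get around 20 million rows with about 50 columns into python/pandas/dask to do some analysis. I have tried the bqclient, pandas-gbq and BigQuery Storage API methods, but it takes 30 minutes to get 5 million rows into Python. Is there any other way to do this? Is there even a Google service that can do a similar job?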

Best answer

Instead of querying, you can always export the table to Cloud Storage -> download it locally -> load it into your dask/pandas dataframe:

Export + download:

bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv &&  gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
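
If you would rather drive these two steps from a Python script, the same commands can be wrapped with `subprocess`. This is only a sketch: the helper names `build_commands` and `export_and_download` are made up for illustration, the table, bucket and directory are the placeholder values from the command above, and it assumes the `bq` and `gsutil` CLIs are installed and authenticated.

```python
import subprocess


def build_commands(table, gcs_pattern, local_dir, location="US"):
    """Return the bq extract and gsutil cp argument lists (hypothetical helper)."""
    extract = [
        "bq", f"--location={location}", "extract",
        "--destination_format=CSV", "--print_header=false",
        table, gcs_pattern,
    ]
    # No shell is involved, so gsutil itself expands the gs:// wildcard;
    # -m copies the exported shards in parallel.
    download = ["gsutil", "-m", "cp", gcs_pattern, local_dir]
    return extract, download


def export_and_download(table, gcs_pattern, local_dir, location="US"):
    """Run both steps in order; check=True stops if the export fails."""
    for cmd in build_commands(table, gcs_pattern, local_dir, location):
        subprocess.run(cmd, check=True)
```

A call would look like `export_and_download("dataset.tablename", "gs://mystoragebucket/data-*.csv", "/my/local/dir/")`.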

Load into Dask:

>>> import dask.dataframe as dd
>>> # The export above used --print_header=false, so the files have no header row.
>>> df = dd.read_csv("/my/local/dir/*.csv", header=None)
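
Because the export above uses `--print_header=false`, the CSV shards carry no column names. One way to recover them, sketched here under assumptions, is to save the table schema with `bq show --schema --format=prettyjson dataset.tablename > schema.json` and pass the names to `read_csv`. The helper name `column_names_from_schema`, the file name `schema.json` and the example columns are hypothetical.

```python
import json


def column_names_from_schema(schema_json):
    """Return the top-level column names from a BigQuery schema JSON string."""
    fields = json.loads(schema_json)
    return [field["name"] for field in fields]


# Example schema in the shape `bq show --schema` prints (made-up columns).
example = '[{"name": "user_id", "type": "INTEGER"}, {"name": "ts", "type": "TIMESTAMP"}]'
names = column_names_from_schema(example)

# With dask installed, the names can then be applied while loading:
#   import dask.dataframe as dd
#   df = dd.read_csv("/my/local/dir/*.csv", header=None, names=names)
```

In practice you would read the string from the saved schema file instead of the inline example.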

Hope that helps.

Regarding pandas - Loading a large amount of data from BigQuery into python/pandas/dask, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55033606/
