python - Read csv from Google Cloud Storage to pandas dataframe

Tags: python pandas csv google-cloud-platform google-cloud-storage

I am trying to read a csv file stored in a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

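What am I doing wrong? I cannot find any solution that does not involve Google Datalab.

Best answer

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the bucket like this: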

df = pd.read_csv('gs://bucket/your_path.csv')

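read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).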
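A minimal sketch of checking for that dependency up front, using only the standard library. The helper name has_module is hypothetical and not part of pandas or gcsfs; it only reports whether a module can be found, it does not verify credentials or access to the bucket.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    return find_spec(name) is not None


# Hypothetical pre-flight check before calling pd.read_csv('gs://...').
if not has_module("gcsfs"):
    print("gcsfs is not installed; install it before reading gs:// paths with pandas")
```

If the check fails, installing the gcsfs package (for example with pip) should resolve the missing-dependency exception.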

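For the sake of completeness, I am leaving three other options here.

  • Home-made code
  • gcsfs
  • dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.

It works equally on public and private data sets, assuming you are authorised. In this approach you don't need to download the data to your local drive first.

How to use it: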

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
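The helper takes the bucket and the object path as separate arguments, whereas the question starts from a single gs:// URL. A small sketch of splitting such a URL with the standard library follows. The function name split_gcs_url is hypothetical, and the sketch assumes object names without ? or # characters, which urlsplit would treat as query or fragment delimiters.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_gcs_url(url: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/path.csv' into ('bucket', 'some/path.csv')."""
    parts = urlsplit(url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a gs:// URL: {url!r}")
    # The bucket is the network location; the object path has a leading slash.
    return parts.netloc, parts.path.lstrip("/")


bucket_name, blob_path = split_gcs_url("gs://createbucket123/my.csv")
# bucket_name and blob_path can then be passed on to get_byte_fileobj.
```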

The code:

from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
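    # download_as_string is deprecated in recent google-cloud-storage releases;
    # on older releases that lack download_as_bytes, use download_as_string instead.
    s = blob.download_as_bytes()
    return s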


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
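The byte_stream.seek(0) call in get_byte_fileobj matters because writing leaves the stream position at the end of the buffer. The sketch below illustrates this without touching Google Cloud Storage: fake_download_to_file is a hypothetical stand-in for blob.download_to_file, the CSV payload is made up, and the standard csv module stands in for pd.read_csv as the consumer.

```python
import csv
from io import BytesIO, TextIOWrapper


def fake_download_to_file(fileobj: BytesIO) -> None:
    """Hypothetical stand-in for blob.download_to_file: writes CSV bytes."""
    fileobj.write(b"name,score\nalice,1\nbob,2\n")


byte_stream = BytesIO()
fake_download_to_file(byte_stream)

# The position is now at the end of the buffer, so reading yields nothing.
unread = byte_stream.read()

# Rewind so that the consumer starts at the first byte.
byte_stream.seek(0)

# Decode the bytes as text and parse them row by row.
text_stream = TextIOWrapper(byte_stream, encoding="utf-8", newline="")
rows = list(csv.reader(text_stream))
```

Without the rewind, a reader handed this object would see an empty file.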

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)

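dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It is great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, which makes it easy for newcomers to use.

Dask provides its own read_csv.

How to use it: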

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()

Regarding python - Read csv from Google Cloud Storage to pandas dataframe, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49357352/
