python - Pyarrow from_pandas crashes the interpreter when the DataFrame contains mixed data types

Tags: python pandas pyarrow

With pyarrow 0.6.0 (or lower), the following snippet crashes the Python interpreter:

import pandas as pd
import pyarrow as pa

data = pd.DataFrame({'a': [1, True]})
pa.Table.from_pandas(data)

"Python interpreter has stopped working" (on Windows)

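Best answer

After some investigation: according to this Jira issue, the problem was fixed in pyarrow 0.7.0, more precisely by this commit. With the same snippet as in the question, we now get the following error instead of an interpreter crash: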

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "table.pxi", line 755, in pyarrow.lib.Table.from_pandas
File "C:\Temp\tt\Tools\Anaconda3.4.3.1\envs\GMF_test3\lib\site-packages\pyarrow\pandas_compat.py", line 227, in dataframe_to_arrays
    col, type=type, timestamps_to_ms=timestamps_to_ms
File "array.pxi", line 225, in pyarrow.lib.Array.from_pandas
File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type bool but can only handle these types: integer
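
Since the hard crash only affects pyarrow 0.6.0 and lower, it can help to check the installed version before calling from_pandas. The following is a minimal sketch, not part of the original answer: the helper name has_crash_fix is hypothetical, and it assumes the version string starts with a numeric major.minor pair.

```python
from importlib.metadata import PackageNotFoundError, version


def has_crash_fix(version_string):
    """Return True for 0.7.0 or later (hypothetical helper).

    Assumes the string starts with numeric 'major.minor', e.g. '0.7.0'.
    """
    major, minor = (int(part) for part in version_string.split('.')[:2])
    return (major, minor) >= (0, 7)


# Look up the installed pyarrow version without importing the package.
try:
    installed = version('pyarrow')
except PackageNotFoundError:
    installed = None

if installed is None:
    print('pyarrow is not installed')
elif has_crash_fix(installed):
    print('pyarrow', installed, 'raises an exception on mixed-type columns')
else:
    print('pyarrow', installed, 'may crash on mixed-type columns; consider upgrading')
```

The version is read through importlib.metadata so the check itself cannot trigger the crash.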

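One way to work around this, when you control the data, is to convert the mixed-type column when the exception is raised, as shown below (and perhaps log the exception, since this is not a common error):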

import pandas as pd
import pyarrow as pa
import logging

logger = logging.getLogger(__name__)

data = pd.DataFrame({'a': [1, True], 'b': [1, 2]})


def convert_type_if_needed(type_to_select, df, col_name):
    # Collect the Python types actually present in the column.
    types = {type(value) for value in df[col_name]}
    if type_to_select in types:
        # Cast the whole column to the selected type (True becomes 1 for int).
        return df.astype({col_name: type_to_select})
    raise TypeError(str(type_to_select) + " is not in the dataframe, conversion impossible")


try:
    table = pa.Table.from_pandas(data)
except pa.lib.ArrowInvalid as e:
    logger.warning(e)
    data = convert_type_if_needed(int, data, 'a')
    table = pa.Table.from_pandas(data)

print(table)

Which finally produces:

pyarrow.Table
Error converting from Python objects to Int64: Got Python object of type bool but can only handle these types: integer
a: int32
b: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"columns": [{"name": "a", "numpy_type": "int32", "pandas_type":'
            b' "int32", "metadata": null}, {"name": "b", "numpy_type": "int64"'
            b', "pandas_type": "int64", "metadata": null}, {"name": "__index_l'
            b'evel_0__", "numpy_type": "int64", "pandas_type": "int64", "metad'
            b'ata": null}], "index_columns": ["__index_level_0__"], "pandas_ve'
            b'rsion": "0.20.3"}'}
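
The workaround above only reacts after the exception has been raised. As a complement, here is a minimal sketch of finding the mixed-type columns up front using pandas alone. The function name find_mixed_type_columns is hypothetical, and the check assumes that mixed Python values end up in an object-dtype column, as they do for the DataFrame in this question.

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return the names of object-dtype columns that hold more than one Python type."""
    mixed = []
    for col_name in df.columns:
        column = df[col_name]
        # Columns with a concrete dtype such as int64 cannot mix Python types.
        if column.dtype != object:
            continue
        # Count the distinct Python types among the stored values.
        if column.map(type).nunique() > 1:
            mixed.append(col_name)
    return mixed


data = pd.DataFrame({'a': [1, True], 'b': [1, 2]})
print(find_mixed_type_columns(data))
```

The returned names could then be passed to convert_type_if_needed before calling pa.Table.from_pandas, so that the conversion does not depend on catching the exception.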

Regarding "python - Pyarrow from_pandas crashes the interpreter when the DataFrame contains mixed data types", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47155431/
