python - OverflowError: size does not fit in an int

Tags: python pandas dataframe azure-machine-learning-service

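I am writing a Python script for use in AzureML. My dataset is quite large. It has columns named ID (int) and DataType (text). I want to concatenate these values into a single text column that contains ID and DataType separated by a comma.

How can I avoid the error when doing this? Is there a mistake in my code?

When I run this code, I get the following error: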

Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
data:text/plain,Caught exception while executing function: Traceback (most recent call last):
File "C:\server\invokepy.py", line 167, in batch
idfs.append(rutils.RUtils.RFileToDataFrame(infile))
File "C:\server\RReader\rutils.py", line 15, in RFileToDataFrame
rreader = RReaderFactory.construct_from_file(filename, compressed)
File "C:\server\RReader\rreaderfactory.py", line 25, in construct_from_file
return _RReaderFactory.construct_from_stream(stream)
File "C:\server\RReader\rreaderfactory.py", line 46, in construct_from_stream
return RReader(BinaryReader(RFactoryConstants.big_endian, stream.read()))
File "C:\pyhome\lib\gzip.py", line 254, in read
self._read(readsize)
File "C:\pyhome\lib\gzip.py", line 313, in _read
self._add_read_data( uncompress )
File "C:\pyhome\lib\gzip.py", line 329, in _add_read_data
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
OverflowError: size does not fit in an int

My code is as follows:

# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame

def azureml_main(dataframe1):
    import pandas as pd
    dataframe1['SignalID,DataType'] = dataframe1['ID'] + " , " + dataframe1['DataType']
    dataframe1 = dataframe1.drop('DataType')
    dataframe1 = dataframe1.drop('ID')
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1

When I run the default Python code in AzureML, I get the same error. So I am fairly sure my data does not fit in a data frame.

The default script is as follows:

# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):

    # Execution logic goes here
    print('Input pandas.DataFrame #1:\r\n\r\n{0}'.format(dataframe1))

    # If a zip file is connected to the third input port is connected,
    # it is unzipped under ".\Script Bundle". This directory is added
    # to sys.path. Therefore, if your zip file contains a Python file
    # mymodule.py you can import it using:
    # import mymodule

    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1,

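Best answer

If you need to concatenate the integer ID column and the string DataType column into a new column SignalID, cast ID to a string with astype. You can then drop the DataType and ID columns by passing axis=1: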

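import pandas as pd

def azureml_main(dataframe1):
    # Cast the integer ID column to str so it can be joined with text.
    # The parentheses allow the expression to span several lines.
    dataframe1['SignalID'] = (dataframe1['ID'].astype(str)
                              + " , "
                              + dataframe1['DataType'])

    # axis=1 drops columns rather than row labels.
    dataframe1 = dataframe1.drop(['DataType', 'ID'], axis=1)
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1,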
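
The same two steps can be tried outside AzureML on a small frame. The following is only a minimal sketch: the sample values are made up for illustration, and only the column names ID and DataType come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names match the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "DataType": ["temp", "pressure", "flow"],
})

# Cast the integer column to str before joining it with the text column.
df["SignalID"] = df["ID"].astype(str) + " , " + df["DataType"]

# Without axis=1, drop() looks for these labels in the row index instead.
df = df.drop(["DataType", "ID"], axis=1)

print(df)
```

After this runs, df should hold a single SignalID column of strings, one per input row.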

Regarding "python - OverflowError: size does not fit in an int", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35197965/
