python - How can I load a data frame saved in pandas as an HDF5 file into R without losing integers larger than 32 bits?

Tags: python r pandas dataframe hdfs

When I try to load a data frame saved in pandas as an HDF5 file into R, I get this warning message:

Warning message: In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, : NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.

For example, if I create an HDF5 file in pandas:

import pandas as pd

frame = pd.DataFrame({
    'time':[1234567001,1234515616515167005],
    'X2':[23.88,23.96]
},columns=['time','X2'])

store = pd.HDFStore('a.hdf5')
store['df'] =  frame
store.close()
print(frame)

This prints:

                  time     X2
0           1234567001  23.88
1  1234515616515167005  23.96

and then try to load it in R:

#source("http://bioconductor.org/biocLite.R")
#biocLite("rhdf5")
library(rhdf5)

loadhdf5data <- function(h5File) {
  # Function taken from [How can I load a data frame saved in pandas as an HDF5 file in R?](https://stackoverflow.com/a/45024089/395857)
  listing <- h5ls(h5File)
  # Find all data nodes, values are stored in *_values and corresponding column
  # titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)

  data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")

  columns = list()
  for (idx in seq(data_paths)) {
    print(idx)
    data <- data.frame(t(h5read(h5File, data_paths[idx])))
    names <- t(h5read(h5File, name_paths[idx],  bit64conversion='bit64'))
    #names <- t(h5read(h5File, name_paths[idx],  bit64conversion='double'))
    entry <- data.frame(data)
    colnames(entry) <- names
    columns <- append(columns, entry)
  }

  data <- data.frame(columns)

  return(data)
}

frame  = loadhdf5data("a.hdf5")

I get this warning message:

> frame = loadhdf5data("a.hdf5")
[1] 1
[1] 2
Warning message:
In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.

I can see that one of the time values has become NA:

> frame
     X2       time
1 23.88 1234567001
2 23.96         NA

How can I fix this? Choosing bit64conversion='bit64' or bit64conversion='double' in the call that reads the column names does not change anything.

> R.version
               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          4.0                         
year           2017                        
month          04                          
day            21                          
svn rev        72570                       
language       R                           
version.string R version 3.4.0 (2017-04-21)
nickname       You Stupid Darkness         

Best answer

The documentation of the rhdf5 HDF5 Dataset Interface says:

bit64conversion: Defines, how 64-bit integers are converted. Internally, R does not support 64-bit integers. All integers in R are 32-bit integers. By setting bit64conversion='int', a coercing to 32-bit integers is enforced, with the risc of data loss, but with the insurance that numbers are represented as integers. bit64conversion='double' coerces the 64-bit integers to floating point numbers. doubles can represent integers with up to 54-bits, but they are not represented as integer values anymore. For larger numbers there is again a data loss. bit64conversion='bit64' is recommended way of coercing. It represents the 64-bit integers as objects of class 'integer64' as defined in the package 'bit64'. Make sure that you have installed 'bit64'. The datatype 'integer64' is not part of base R, but defined in an external package. This can produce unexpected behaviour when working with the data.
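The three conversion modes described above can be illustrated with the two time values from the question. The following is a minimal Python sketch, not part of rhdf5: the helper names fits_in_r_integer and survives_double are hypothetical and exist only for this illustration. It assumes R's base integer is a 32-bit signed type with -2**31 reserved for NA_integer_, and that a double follows IEEE 754, where integers are exactly representable only up to a magnitude of 2**53.

```python
# Assumption: R's base integer is 32-bit signed, and -2**31 is reserved
# for NA_integer_, so the usable range is slightly asymmetric.
R_INT_MIN = -2**31 + 1
R_INT_MAX = 2**31 - 1


def fits_in_r_integer(value: int) -> bool:
    """True if the value would survive a conversion to a base R integer."""
    return R_INT_MIN <= value <= R_INT_MAX


def survives_double(value: int) -> bool:
    """True if converting to a 64-bit float and back returns the same integer."""
    return int(float(value)) == value


# The two values of the 'time' column from the question.
for t in (1234567001, 1234515616515167005):
    print(t, fits_in_r_integer(t), survives_double(t))
```

The first value passes both checks, while the second one neither fits into a 32-bit integer nor survives a round trip through a double. For values like this, only a true 64-bit integer representation, such as the one bit64conversion='bit64' asks for, keeps every digit.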

So you should install bit64 (install.packages("bit64")) and load it (library(bit64)). You can check that integer64 is available:

> integer64
function (length = 0) 
{
    ret <- double(length)
    oldClass(ret) <- "integer64"
    ret
}
<bytecode: 0x000000001a7a95f0>
<environment: namespace:bit64>
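The function body printed above allocates a double vector and only changes its class attribute. This reflects how bit64 stores each 64-bit integer: in the 8 bytes of a double, reinterpreted rather than converted. The Python sketch below imitates that idea with the standard struct module. It is an illustration under that assumption, not the package's actual implementation, and the function names are hypothetical.

```python
import struct


def int64_to_double_bits(value: int) -> float:
    """Reinterpret the 8 bytes of a signed 64-bit integer as a double."""
    return struct.unpack("<d", struct.pack("<q", value))[0]


def double_bits_to_int64(carrier: float) -> int:
    """Reinterpret the 8 bytes of a double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", carrier))[0]


n = 1234515616515167005
carrier = int64_to_double_bits(n)

# Read as a number, the carrier is meaningless; only its bit pattern matters.
print(carrier)
# Reinterpreting the same bytes recovers the original integer exactly.
print(double_bits_to_int64(carrier) == n)
```

This is also why the quoted documentation warns about unexpected behaviour: code that drops the integer64 class and treats the underlying doubles as ordinary numbers sees values like the meaningless carrier above.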

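Now you can run the loader with bit64conversion='bit64' passed to the h5read call that reads the data values as well: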

library(bit64)
library(rhdf5)
loadhdf5data <- function(h5File) {

  listing <- h5ls(h5File)
  # Find all data nodes, values are stored in *_values and corresponding column
  # titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)

  data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")

  columns = list()
  for (idx in seq(data_paths)) {
    print(idx)
    data <- data.frame(t(h5read(h5File, data_paths[idx],  bit64conversion='bit64')))
    names <- t(h5read(h5File, name_paths[idx],  bit64conversion='bit64'))
    entry <- data.frame(data)
    colnames(entry) <- names
    columns <- append(columns, entry)
  }

  data <- data.frame(columns)

  return(data)
}


frame = loadhdf5data("a.hdf5")

which gives:

> frame
     X2                time
1 23.88          1234567001
2 23.96 1234515616515167005

The original question is on Stack Overflow: https://stackoverflow.com/questions/45091991/
