python - Processing a long-format csv file with numpy or pandas

Tags: python csv file-io numpy pandas

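I am trying to write a simple script that converts a csv output file from a Fortran code into a pandas DataFrame so that I can analyze it further. The csv has two columns, but it consists of multiple appended data blocks, each of shape [n,2] (each sample name has the form RN_x). I came up with the code below, but the resulting DataFrame does not allow any analysis. I have also attached a sample file below (much shorter than the original). By the way, the first column in the data file is a date, but in the output it is a number corresponding to a day in the simulation. Any suggestions would be appreciated.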

import numpy as np
import pandas as pd
import csv as csv
readdata = csv.reader(open('C:/data/Test.csv', 'r'))
data = []
for row in readdata:
    data.append(row)
a = np.array(data).reshape(11,-1, order = 'F')
col = a[0,:4].reshape(4)
row = pd.Index(a[4:,0:1].reshape(7))
b = a[4:,5:]
df = pd.DataFrame(b, index = row, columns = col)

Example:

RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
182,0.000013
274,0.000013
366,0.000013
457,0.000013
548,0.000013
RN_9288,
1,Observed
1,0
259,Computed
1,0.000011
91,0.000011
182,0.000011
274,0.000011
366,0.000011
457,0.000011
548,0.000011
RN_10955,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014

Desired output:

Index,RN_48865,RN_7445,RN_9288,RN_10955
1,0.000014,0.000013,0.000011,0.000014
91,0.000014,0.000013,0.000011,0.000014
182,0.000014,0.000013,0.000011,0.000014
274,0.000014,0.000013,0.000011,0.000014
366,0.000014,0.000013,0.000011,0.000014
457,0.000014,0.000013,0.000011,0.000014
548,0.000014,0.000013,0.000011,0.000014

Best answer

You are actually asking several questions. Here is what I could work out from the desired output:

source="""RN_48865,
    1,Observed
    1,0
    259,Computed
    1,0.000014
    91,0.000014
    182,0.000014
    274,0.000014
    366,0.000014
    457,0.000014
    548,0.000014
    RN_7445,
    1,Observed
    1,0
    259,Computed
    1,0.000013
    91,0.000013
    182,0.000013
    274,0.000013
    366,0.000013
    457,0.000013
    548,0.000013
    RN_9288,
    1,Observed
    1,0
    259,Computed
    1,0.000011
    91,0.000011
    182,0.000011
    274,0.000011
    366,0.000011
    457,0.000011
    548,0.000011
    RN_10955,
    1,Observed
    1,0
    259,Computed
    1,0.000014
    91,0.000014
    182,0.000014
    274,0.000014
    366,0.000014
    457,0.000014
    548,0.000014
"""
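import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(source), header=None)
df[0] = df[0].str.strip()  # drop the indentation in front of each line
# Row positions where a new block starts
rns = np.where(df[0].str.startswith('RN_'))[0]
length = rns[1] - rns[0]  # rows per block; assumes every block has the same length
# The first 4 rows of a block are headers (name, Observed, value, Computed)
index = df[0].iloc[4:length].astype(int).values
cols = df[0].iloc[::length].values
result_df = pd.DataFrame(index=index)
for col_num, name in enumerate(cols):
    start = col_num * length
    # Rows after the 4 header rows hold the computed values of this block
    result_df[name] = df[1].iloc[start + 4:start + length].astype(float).values
print(result_df)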

Output:

     RN_48865   RN_7445   RN_9288  RN_10955
1    0.000014  0.000013  0.000011  0.000014
91   0.000014  0.000013  0.000011  0.000014
182  0.000014  0.000013  0.000011  0.000014
274  0.000014  0.000013  0.000011  0.000014
366  0.000014  0.000013  0.000011  0.000014
457  0.000014  0.000013  0.000011  0.000014
548  0.000014  0.000013  0.000011  0.000014
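
The reshape idea from the question can also be made to work. As far as I can tell, two things go wrong there: after the Fortran-order reshape the value columns start right after the day columns (column 4 for four blocks), so `a[4:,5:]` drops one of them and no longer matches the four column names, and the values stay strings, which blocks numeric analysis. Below is a minimal sketch of a corrected version. It assumes every block has exactly 11 rows, and the names `sample`, `ROWS_PER_BLOCK` and `N_HEADER_ROWS` are made up for illustration; the data is a two-block excerpt of the sample file.

```python
import csv
import io

import numpy as np
import pandas as pd

# Two-block excerpt of the sample file from the question.
sample = """RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
182,0.000013
274,0.000013
366,0.000013
457,0.000013
548,0.000013
"""

ROWS_PER_BLOCK = 11  # assumption: 1 name row + 3 header rows + 7 computed rows
N_HEADER_ROWS = 4

rows = list(csv.reader(io.StringIO(sample)))
n_blocks = len(rows) // ROWS_PER_BLOCK

# Shape (22, 2) -> (11, 4). In Fortran order the first n_blocks columns
# hold the day column of each block, the remaining ones the value columns.
a = np.array(rows).reshape(ROWS_PER_BLOCK, -1, order='F')

names = a[0, :n_blocks]                               # block names (RN_x)
days = a[N_HEADER_ROWS:, 0].astype(int)               # day numbers of the first block
values = a[N_HEADER_ROWS:, n_blocks:].astype(float)   # one value column per block

df = pd.DataFrame(values, index=days, columns=names)
print(df)
```

This only works when all blocks share the same length and the same day numbers, since the index is taken from the first block alone.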

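For the dates, convert the day numbers relative to the start date of the simulation:

import pandas as pd

start = pd.Timestamp('1995-01-01')  # placeholder; use the real simulation start date
df = pd.read_csv('file', header=None)
# Assumes column 0 holds only day numbers
df[0] = start + pd.to_timedelta(df[0], unit='D')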
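
The same conversion can be applied to the index of an already reshaped frame. The sketch below uses a small toy frame in place of the real result. The start date is the placeholder from the answer, not a known value, and treating day 1 as the start date itself is also an assumption; `SIM_START` and `frame` are names made up for illustration.

```python
import pandas as pd

# Placeholder start date; replace with the real first day of the simulation.
SIM_START = pd.Timestamp("1995-01-01")

# Toy frame standing in for the reshaped result, indexed by simulation day.
frame = pd.DataFrame(
    {"RN_48865": [0.000014, 0.000014, 0.000014]},
    index=[1, 91, 182],
)

# Assumption: day 1 is the start date itself, hence the "- 1".
# Drop the offset if day 0 marks the start of the simulation.
frame.index = SIM_START + pd.to_timedelta(frame.index - 1, unit="D")
print(frame)
```

With a DatetimeIndex in place, date-based selection and resampling become available on the frame.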

For python - Processing a long-format csv file with numpy or pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20277886/
