Python Pandas: read CSV with N columns where N is specified in another column

Suppose I have a dataset with the following header:

<id>  <timestamp>  <N>  <1>  <2> ... <N>

In this dataset, every row has a column "N" whose value determines how many numbered columns follow it. For example, I have a row like this:

5 142323151.14 800 5.3564 5.4534 ... 7.4839 (800 columns after the 3rd column)

All rows are guaranteed to have the same number of columns.

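How can I read this CSV file with Pandas read_csv and label the columns correctly? Can it be done in a single call? I am learning Pandas, so I would like to see how Pandas handles a task that takes several lines of plain Python.

Thanks for your help!

Edit: I tried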

pd.read_csv('file.csv', names=['id','timestamp','count',...],
                        delimiter=' ',
                        header=None)

I don't know what to put in the ... part.

Best answer

Update:

What if I have several trailing columns, such as <1> <2> ... , how do you use the value of n here?

In [320]: df
Out[320]:
   0             1   2   3   4   5   6   7   8   9   10    11    12
0   5  1.423232e+08   8   1   2   3   4   5   6   7   8  1000  1000
1   6  1.423236e+08   8  11  22  33  44  55  66  77  88  1000  1000

In [321]: ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + [11, 12]
Out[321]: ['id', 'timestamp', 'n', 1, 2, 3, 4, 5, 6, 7, 8, 11, 12]

In [322]: df.columns = ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + [11, 12]

In [323]: df
Out[323]:
   id     timestamp  n   1   2   3   4   5   6   7   8    11    12
0   5  1.423232e+08  8   1   2   3   4   5   6   7   8  1000  1000
1   6  1.423236e+08  8  11  22  33  44  55  66  77  88  1000  1000

If you can predefine the trailing column names, you can do the following:

In [328]: trailing_cols = ['max','min']

In [329]: ['id', 'timestamp', 'n'] + (df.columns[3:3+df.iat[0, 2]] - 2).tolist() + trailing_cols
Out[329]: ['id', 'timestamp', 'n', 1, 2, 3, 4, 5, 6, 7, 8, 'max', 'min']
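
As a rough sketch of the same idea, the label list can also be built with `range(1, n + 1)` instead of index arithmetic, with a couple of sanity checks added. The helper name `build_names` and the sample values are made up for illustration; `sep=r"\s+"` assumes whitespace-delimited input.

```python
import io

import pandas as pd

# Made-up sample rows: id, timestamp, n, then n values, then 2 trailing columns.
raw = io.StringIO(
    "5 142323151.14 8 1 2 3 4 5 6 7 8 1000 1000\n"
    "6 142323551.14 8 11 22 33 44 55 66 77 88 1000 1000\n"
)
df = pd.read_csv(raw, sep=r"\s+", header=None)


def build_names(df, trailing_cols):
    """Hypothetical helper: fixed labels, then 1..n, then the trailing labels."""
    counts = df.iloc[:, 2]
    # The renaming only makes sense if every row reports the same n.
    if counts.nunique() != 1:
        raise ValueError("rows disagree on n")
    n = int(counts.iat[0])
    names = ["id", "timestamp", "n"] + list(range(1, n + 1)) + list(trailing_cols)
    # Guard against n plus the trailing labels not covering every column.
    if len(names) != df.shape[1]:
        raise ValueError("n and trailing_cols do not account for every column")
    return names


df.columns = build_names(df, ["max", "min"])
```

The length check is there because assigning a list of the wrong length to `df.columns` fails anyway, and an explicit message is easier to act on.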

Old answer:

I would do it this way:

First, read the CSV without specifying column names:

import pandas as pd
df = pd.read_csv('file.csv', sep=r'\s+', header=None)  # sep=r'\s+' replaces the deprecated delim_whitespace=True

In [271]: df
Out[271]:
   0             1    2   3   4   5   6   7   8   9   10
0   5  1.423232e+08  800   1   2   3   4   5   6   7   8
1   5  1.423232e+08  800  11  22  33  44  55  66  77  88

Now we can rename the columns as follows:

In [272]: df.columns = ['id', 'timestamp', 'n'] + (df.columns[3:].values - 2).tolist()

In [273]: df
Out[273]:
   id     timestamp    n   1   2   3   4   5   6   7   8
0   5  1.423232e+08  800   1   2   3   4   5   6   7   8
1   5  1.423232e+08  800  11  22  33  44  55  66  77  88
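
If a single `read_csv` call with the final names is preferred, one possible sketch is to peek at the first line to get N and then pass `names=` directly. This is not part of the answer above; the helper name `read_counted_csv` and the sample rows are hypothetical, and it assumes whitespace-delimited data with no trailing columns after the N numbered ones.

```python
import io

import pandas as pd


def read_counted_csv(source):
    """Hypothetical helper; `source` is any seekable text file object."""
    first = source.readline().split()
    n = int(first[2])   # the third field holds the count N
    source.seek(0)      # rewind so read_csv also sees the first row
    names = ["id", "timestamp", "n"] + list(range(1, n + 1))
    return pd.read_csv(source, sep=r"\s+", header=None, names=names)


# Made-up sample with N = 3 to keep it short.
sample = io.StringIO(
    "5 142323151.14 3 5.3564 5.4534 7.4839\n"
    "6 142323551.14 3 1.5 2.5 3.5\n"
)
df = read_counted_csv(sample)
```

For a file on disk the same helper could be called as `with open('file.csv') as f: df = read_counted_csv(f)`.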

A similar question on this topic is on Stack Overflow: https://stackoverflow.com/questions/41473321/