I'm trying to reformat a CSV so that each month column becomes a separate row for each record (essentially pivoting it), i.e.:
into:
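To do this, I figured the best approach would be to:
- Loop over each row, loop over each month column (Jan-17, Feb-17, etc.), and duplicate the row.
- Then insert the month and its total into the Date and Totals columns.
- Then drop the row that was duplicated and continue from the index where the loop left off (i.e. after 5 duplications, one per date, the next starting index would be 5).
- Then, once all rows have been duplicated, drop the month columns (Jan-17, Feb-17, etc.).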
It does this for the first data row (i.e. brand1), but after the first outer loop iteration finishes, it breaks with:
the label [5] is not in the [index]
df['date'] = ''
df['totals'] = 0

months = ['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17']
dropRowIndex = 0
nextDuplicateRowStartIndex = 0
totalRows = df.shape[0]

for i in range(0, totalRows):
    print('--------------')
    print(df)
    for col in df:
        if col in months:
            # Insert a row above 0th index with 0th row's values
            # Duplicate the row at this index for each month
            # Then move on to the next "row", which would be the latest index count
            df.loc[nextDuplicateRowStartIndex-1] = df.loc[nextDuplicateRowStartIndex].values
            df.loc[nextDuplicateRowStartIndex-1, 'date'] = col
            df.loc[nextDuplicateRowStartIndex-1, 'totals'] = df.loc[nextDuplicateRowStartIndex-1][col]
            df.index = df.index + 1
            df = df.sort_index()
            dropRowIndex += 1
    # Drop duplicated row by index
    df.drop(dropRowIndex, inplace=True)
    nextDuplicateRowStartIndex = dropRowIndex

# Remove months columns
for col in df:
    if col in months:
        df = df.drop(col, 1)
Terminal output:
-------------- INITIAL DATA FRAME:
    brand  Jan-17  Feb-17  Mar-17  Apr-17  May-17    date  totals
0  brand1     222     333     444     555     666               0
1  brand2    7777    8888    9999    1010    1111               0
2  brand3   12121   13131   14141   15151   16161               0
-------------- DATA FRAME AFTER FIRST OUTER LOOP (ROW) ITERATION:
    brand  Jan-17  Feb-17  Mar-17  Apr-17  May-17    date  totals
0  brand1     222     333     444     555     666  May-17     666
1  brand1     222     333     444     555     666  Apr-17     555
2  brand1     222     333     444     555     666  Mar-17     444
3  brand1     222     333     444     555     666  Feb-17     333
4  brand1     222     333     444     555     666  Jan-17     222
6  brand2    7777    8888    9999    1010    1111               0
7  brand3   12121   13131   14141   15151   16161               0
Traceback (most recent call last):
  File "/Users/danielturcotte/Sites/project/env/lib/python3.6/site-packages/pandas/core/indexing.py", line 1506, in _has_valid_type
    error()
  File "/Users/danielturcotte/Sites/project/env/lib/python3.6/site-packages/pandas/core/indexing.py", line 1501, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [5] is not in the [index]'
Error:
KeyError: 'the label [5] is not in the [index]'
One idea I had is that, because I'm using .loc[index] where the index is an integer, maybe .loc doesn't work with integers but .iloc[] does. If I do
df.iloc[nextDuplicateRowStartIndex-1] = df.iloc[nextDuplicateRowStartIndex].values
I get the error:
ValueError: labels [10] not contained in axis
and the terminal output is full of NaNs:
    brand  Jan-17  Feb-17  Mar-17  Apr-17  May-17    date  totals
0     NaN     NaN     NaN     NaN     NaN     NaN  May-17     NaN
1     NaN     NaN     NaN     NaN     NaN     NaN  Apr-17     NaN
2     NaN     NaN     NaN     NaN     NaN     NaN  Mar-17     NaN
3     NaN     NaN     NaN     NaN     NaN     NaN  Feb-17     NaN
4     NaN     NaN     NaN     NaN     NaN     NaN  Jan-17     NaN
6  brand2  7777.0  8888.0  9999.0  1010.0  1111.0             0.0
7     NaN     NaN     NaN     NaN     NaN     NaN  Apr-17     NaN
I don't believe that is the problem, though, because print(df.iloc[0]) and print(df.loc[0]) produce the same result (even though I'm accessing loc[0] with an integer).
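The difference between the two indexers can be shown on a small frame. This is only a sketch, not part of the original post: the frame, its labels, and the value "brand_new" are made up for illustration. .loc looks rows up by label, so it works with integers as long as the label exists. In the loop above, label 5 is dropped at the end of the first outer iteration, and the next iteration then reads df.loc[5], which would explain the KeyError. .iloc looks rows up by position, cannot add rows, and treats -1 as the last row rather than as a new label.

```python
import pandas as pd

# Hypothetical frame whose integer labels skip 5, similar to the
# index left behind after the first outer loop iteration.
df = pd.DataFrame(
    {"brand": ["brand1", "brand1", "brand2", "brand3"]},
    index=[3, 4, 6, 7],
)

# .loc is label-based: reading a label that is not in the index fails.
try:
    df.loc[5]
    loc_error = None
except KeyError as exc:
    loc_error = exc

# .iloc is position-based: position 2 is the row labelled 6.
row_at_position_2 = df.iloc[2]

# Assigning through .loc with a new label appends a row ...
df.loc[5, "brand"] = "brand_new"

# ... whereas .iloc cannot enlarge the frame.
try:
    df.iloc[10] = ["brand_new"]
    iloc_error = None
except IndexError as exc:
    iloc_error = exc

print(type(loc_error).__name__)
print(row_at_position_2["brand"])
print(type(iloc_error).__name__)
print(list(df.index))
```

So integer labels are fine with .loc; the issue is more likely that the index is being shifted and dropped while the loop still relies on specific labels being present.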
Trying melt:
Best answer
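You can use melt for this. It lets you choose multiple ID columns and value columns. In your case the value columns are everything except 'brand', so we can leave that argument out. You can therefore do everything in one line: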
import pandas as pd

df = pd.DataFrame({
    'brand': ['brand1', 'brand2', 'brand3'],
    'Jan-17': [22, 232, 324],
    'Feb-17': [333, 424, 999]
    # ...
})
rearranged = pd.melt(df, id_vars=['brand'], var_name='Date',
                     value_name='Total')
print(rearranged)
This prints:
    brand    Date  Total
0  brand1  Feb-17    333
1  brand2  Feb-17    424
2  brand3  Feb-17    999
3  brand1  Jan-17     22
4  brand2  Jan-17    232
5  brand3  Jan-17    324
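As a fuller sketch, the same call can be applied to the frame from the question. The values are copied from the terminal output above. The explicit value_vars list, the lowercase date and totals column names (taken from the question's code), and the ordered categorical used for sorting are choices made here for illustration, not part of the original answer.

```python
import pandas as pd

months = ["Jan-17", "Feb-17", "Mar-17", "Apr-17", "May-17"]

# Wide frame rebuilt from the question's terminal output.
df = pd.DataFrame(
    [
        ["brand1", 222, 333, 444, 555, 666],
        ["brand2", 7777, 8888, 9999, 1010, 1111],
        ["brand3", 12121, 13131, 14141, 15151, 16161],
    ],
    columns=["brand"] + months,
)

# One row per (brand, month); value_vars is spelled out here only to make
# explicit which columns are being unpivoted.
long_df = pd.melt(
    df,
    id_vars=["brand"],
    value_vars=months,
    var_name="date",
    value_name="totals",
)

# An ordered categorical keeps the months in calendar order when sorting,
# instead of alphabetical order (Apr, Feb, Jan, ...).
long_df["date"] = pd.Categorical(long_df["date"], categories=months, ordered=True)
long_df = long_df.sort_values(["brand", "date"]).reset_index(drop=True)

print(long_df)
```

Sorting by brand and then date groups each brand's months together, which is the layout the row-duplication loop in the question was aiming for, and the month columns never need to be dropped by hand.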
Regarding python - "the label [5] is not in the [index]" when duplicating rows, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47354835/