Given a simple example of a data frame like this:
sample chrom start stop count psi5
sampleA chr1 100 200 75 0.75
sampleA chr1 100 250 25 0.25
sampleB chr1 100 200 50 1.0
sampleC chr1 100 250 50 1.0
sampleD chr1 100 300 1 NaN
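How can I add rows for each sample that does not have an observation for every unique value of column 3 (0-based, i.e. stop)? The desired result: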
sampleA chr1 100 200 75 0.75
sampleA chr1 100 250 25 0.25
sampleB chr1 100 200 50 1.0
sampleC chr1 100 250 50 1.0
sampleD chr1 100 300 1 NaN
sampleA chr1 100 300 0 0
sampleB chr1 100 250 0 0
sampleB chr1 100 300 0 0
sampleC chr1 100 200 0 0
sampleC chr1 100 300 0 0
sampleD chr1 100 200 NaN NaN
sampleD chr1 100 250 NaN NaN
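So sampleA has no observation for column 3 = 300, and a row with zeros in columns 4 and 5 is added for it. The tricky part is sampleD: its count is only 1, so it does not pass the criterion and its psi5 is NaN. For sampleD, the missing rows can either be skipped, since I will probably build a pivot table from this and fill the empty cells with NA, or be added as rows of NaNs.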
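The pivot table mentioned above might look like the following minimal sketch. It rebuilds the example table by hand and pivots only the count column; the variable names are illustrative and are not taken from the question.

```python
import numpy as np
import pandas as pd

# The example table from the question, rebuilt by hand.
df = pd.DataFrame({
    "sample": ["sampleA", "sampleA", "sampleB", "sampleC", "sampleD"],
    "chrom": ["chr1"] * 5,
    "start": [100] * 5,
    "stop": [200, 250, 200, 250, 300],
    "count": [75, 25, 50, 50, 1],
    "psi5": [0.75, 0.25, 1.0, 1.0, np.nan],
})

# One row per sample, one column per stop value. Combinations that were
# never observed appear as NaN in the pivoted table.
counts = df.pivot_table(index="sample", columns="stop", values="count")
print(counts)
```

Each NaN cell in `counts` corresponds to one of the rows that the question wants to add to the long-format table.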
The code in this gist does what I want for a small example: https://gist.github.com/olgabot/1b4234c28b245e52bfc0
But it is not well vectorized.
Best answer
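I would probably use stack and unstack to do this in a vectorized way. The NaNs for sampleD are a bit tricky, because the NaNs created by unstacking are what I use to fill in the missing stop values. But you can get rid of sampleD's values at the start and add the NaNs back to sampleD at the end (which is what I would do):
All at once: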
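import numpy as np
# Move every column except count and psi5 into the index
df = df.set_index(['sample','chrom','start','stop'])
# Spread all levels except stop into columns; unobserved combinations become NaN, then 0
df = df.unstack(['sample','chrom','start']).fillna(0)
# Return to long form: every group now has one row per stop value
df = df.stack(['sample','chrom','start']).reset_index()
# Use df['sample']: df.sample is a DataFrame method in current pandas
df.loc[df['sample'] == 'sampleD', ['count','psi5']] = np.nan
print(df)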
stop sample chrom start count psi5
0 200 sampleA chr1 100 75 0.75
1 200 sampleB chr1 100 50 1.00
2 200 sampleC chr1 100 0 0.00
3 200 sampleD chr1 100 NaN NaN
4 250 sampleA chr1 100 25 0.25
5 250 sampleB chr1 100 0 0.00
6 250 sampleC chr1 100 50 1.00
7 250 sampleD chr1 100 NaN NaN
8 300 sampleA chr1 100 0 0.00
9 300 sampleB chr1 100 0 0.00
10 300 sampleC chr1 100 0 0.00
11 300 sampleD chr1 100 NaN NaN
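For readers who want to reproduce the result above end to end, here is a self-contained sketch of the same stack/unstack steps. The DataFrame is rebuilt by hand from the question's table, and the names `keys`, `wide` and `full` are illustrative only. Row order may differ between pandas versions, so the check at the end does not depend on it.

```python
import numpy as np
import pandas as pd

# The example table from the question, rebuilt by hand.
df = pd.DataFrame({
    "sample": ["sampleA", "sampleA", "sampleB", "sampleC", "sampleD"],
    "chrom": ["chr1"] * 5,
    "start": [100] * 5,
    "stop": [200, 250, 200, 250, 300],
    "count": [75, 25, 50, 50, 1],
    "psi5": [0.75, 0.25, 1.0, 1.0, np.nan],
})

keys = ["sample", "chrom", "start"]

# Wide form: one row per stop, one column per (value, sample, chrom, start).
wide = df.set_index(keys + ["stop"]).unstack(keys).fillna(0)

# Back to long form: every group now has a row for every stop value.
full = wide.stack(keys).reset_index()

# As in the answer, blank out all values for sampleD.
full.loc[full["sample"] == "sampleD", ["count", "psi5"]] = np.nan

# Number of distinct stop values per sample (3 for each sample here).
print(full.groupby("sample")["stop"].nunique())
```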
Step by step
1) Set ['sample','chrom','start','stop'] as the index:
df = df.set_index(['sample','chrom','start','stop'])
print(df)
count psi5
sample chrom start stop
sampleA chr1 100 200 75 0.75
250 25 0.25
sampleB chr1 100 200 50 1.00
sampleC chr1 100 250 50 1.00
sampleD chr1 100 300 1 NaN
2) Unstack every index level except stop, then fill the missing values created by unstack with zeros:
df = df.unstack(['sample','chrom','start'])
print(df)
count psi5
sample sampleA sampleB sampleC sampleD sampleA sampleB sampleC sampleD
chrom chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1
start 100 100 100 100 100 100 100 100
stop
200 75 50 NaN NaN 0.75 1 NaN NaN
250 25 NaN 50 NaN 0.25 NaN 1 NaN
300 NaN NaN NaN 1 NaN NaN NaN NaN
df = df.fillna(0)
print(df)
count psi5
sample sampleA sampleB sampleC sampleD sampleA sampleB sampleC sampleD
chrom chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1
start 100 100 100 100 100 100 100 100
stop
200 75 50 0 0 0.75 1 0 0
250 25 0 50 0 0.25 0 1 0
300 0 0 0 1 0.00 0 0 0
3) Now stack back into the original long (panel) form, except that every group now has the stop values 200, 250 and 300:
df = df.stack(['sample','chrom','start']).reset_index()
print(df)
stop sample chrom start count psi5
0 200 sampleA chr1 100 75 0.75
1 200 sampleB chr1 100 50 1.00
2 200 sampleC chr1 100 0 0.00
3 200 sampleD chr1 100 0 0.00
4 250 sampleA chr1 100 25 0.25
5 250 sampleB chr1 100 0 0.00
6 250 sampleC chr1 100 50 1.00
7 250 sampleD chr1 100 0 0.00
8 300 sampleA chr1 100 0 0.00
9 300 sampleB chr1 100 0 0.00
10 300 sampleC chr1 100 0 0.00
11 300 sampleD chr1 100 1 0.00
4) Add the NaNs for sampleD:
df.loc[df['sample'] == 'sampleD', ['count','psi5']] = np.nan
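Note that step 4 also overwrites sampleD's observed row (stop 300, count 1), whereas the question's desired output keeps it. As an alternative that is not part of the answer above, the sketch below uses `reindex` on a full index instead of stack/unstack, zero-fills only the added rows, and leaves sampleD's added rows as NaN. Treating sampleD as the only sample that fails the criterion is an assumption hard-coded here, since the question does not state the criterion itself; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The example table from the question, rebuilt by hand.
df = pd.DataFrame({
    "sample": ["sampleA", "sampleA", "sampleB", "sampleC", "sampleD"],
    "chrom": ["chr1"] * 5,
    "start": [100] * 5,
    "stop": [200, 250, 200, 250, 300],
    "count": [75, 25, 50, 50, 1],
    "psi5": [0.75, 0.25, 1.0, 1.0, np.nan],
})

keys = ["sample", "chrom", "start"]
indexed = df.set_index(keys + ["stop"])

# Every observed (sample, chrom, start) group paired with every stop value.
groups = df[keys].drop_duplicates()
stops = sorted(df["stop"].unique())
full_index = pd.MultiIndex.from_tuples(
    [(*g, s) for g in groups.itertuples(index=False) for s in stops],
    names=keys + ["stop"],
)

# Rows missing from the original come back as NaN after reindexing.
result = indexed.reindex(full_index)

# Assumption: sampleD is the only sample that fails the count criterion.
added = ~full_index.isin(indexed.index)
failed = full_index.get_level_values("sample").isin(["sampleD"])

# Zero-fill added rows, except for samples that failed the criterion.
result.loc[added & ~failed, ["count", "psi5"]] = 0
result = result.reset_index()
print(result)
```

Under these assumptions the observed sampleD row keeps its count of 1 and its NaN psi5, and the two added sampleD rows hold NaN in both columns, as in the question's desired output.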
Regarding python - Pandas groupby adding rows with cross-references, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23436345/