python - Parse and create a new df with conditions

Tags: python pandas parsing

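I need some help with python and pandas.

I have a data frame whose seq1_id column holds the sequence IDs of species 1 (sp1) and whose second column holds the sp2 sequences.

I ran a filter on these sequences and got two data frames: one with all the sp1 sequences that passed the filter, and one with all the sp2 sequences that passed the filter.

So I now have 3 data frames.

Within a pair, one sequence can pass the filter while the other does not, so it is important to keep only the paired genes that survived both filters. What I need to do is parse my first df, such as this one: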

Seq_1.id    Seq_2.id
seq1_A     seq8_B
seq2_A     Seq9_B
seq3_A     Seq10_B
seq4_A     Seq11_B

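and check row by row whether (for the first row, for example) seq1_A is present in df2 and seq8_B is present in df3. If both are present, keep that row of df1 and add it to a new df4.

Here is an example of the desired output: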

first df: 

Seq_1.id   Seq_2.id
seq1_A     seq8_B
seq2_A     Seq9_B
seq3_A     Seq10_B
seq4_A     Seq11_B

df2 (sp1) (seq3_A is absent)

    Seq_1.id   
    seq1_A     
    seq2_A         
    seq4_A  


df3 (sp2) (Seq11_B is absent)

   Seq_2.id
   seq8_B
   Seq9_B
   Seq10_B

Then, since Seq11_B and seq3_A are absent, df4 (the output) would be:

Seq_1.id   Seq_2.id
    seq1_A     seq8_B
    seq2_A     Seq9_B


import pandas as pd

candidates_0035 = pd.read_csv("candidates_genes_filtering_0035", sep='\t')
candidates_0042 = pd.read_csv("candidates_genes_filtering_0042", sep='\t')
dN_dS = pd.read_csv("dn_ds.out_sorted", sep='\t')

# Keep only the rows where both members of the pair passed their filter
df4 = dN_dS[dN_dS['seq1_id'].isin(candidates_0042['gene']) & dN_dS['seq2_id'].isin(candidates_0035['gene'])]

I get an empty output with only the column names, but that should not be the case. In case you need data to test the code, here it is:

df1:

    Unnamed: 0  seq1_id seq2_id dN  dS  Dist_third_pos  Dist_brute  Length_seq_1    Length_seq_2    GC_content_seq1 GC_content_seq2 GC  Mean_length
0   0   g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104  0.23600000000000002 0.142   535.0   1024.0  49.1588785046729    51.171875   50.165376752336456  535.0
1   1   g45594.t1_0035_0035 g1464.t1_0042_0042  0.5208761055250978  5.430485421797574   0.7120000000000001  0.489   246.0   222.0   47.967479674796756  44.594594594594604  46.28103713469567   222.0
2   2   g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867  0.262   0.139   895.0   749.0   56.312849162011176  57.67690253671562   56.994875849363396  749.0
3   3   g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516  26.834927363887587  0.5760000000000001  0.433   597.0   633.0   37.85594639865997   39.810426540284354  38.83318646947217   597.0
4   4   g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165    0.1181241496387666  0.1 0.069   450.0   461.0   45.111111111111114  44.90238611713666   45.006748614123886  450.0
5   5   g1005.t1_0035_0035  g11524.t1_0042_0042 0.3528036631463747  19.32549458735676   0.71    0.512   3177.0  3804.0  39.06200818382121   52.944269190325976  46.0031386870736    3177.0
6   6   g28456.t1_0035_0035 g31669.t1_0042_0035 0.4608959702286786  26.823981621115166  0.6859999999999999  0.469   516.0   591.0   49.224806201550386  53.46869712351946   51.346751662534935  516.0
7   7   g6202.t1_0035_0035  g193.t1_0042_0042   0.4679458383555545  17.81312422445775   0.66    0.462   804.0   837.0   41.91542288557214   47.67025089605735   44.79283689081474   804.0
8   8   g60667.t1_0035_0035 g14327.t1_0042_0042 0.046056273155280165    0.13320612138898    0.122   0.067   348.0   408.0   56.89655172413793   55.392156862745104  56.1443542934415    348.0
9   9   g30148.t1_0035_0042 g37790.t1_0042_0035 0.05631607180881047 0.19747150378706246 0.12300000000000001 0.08800000000000001 405.0   320.0   59.012345679012356  58.4375 58.72492283950618   320.0
10  10  g24481.t1_0035_0035 g37405.t1_0042_0035 0.2151957757290965  0.15106487998618026 0.135   0.17600000000000002 270.0   276.0   51.111111111111114  51.44927536231884   51.28019323671497   270.0
11  11  g33270.t1_0035_0035 g21201.t1_0042_0035 0.2773062983971916  21.13839474189674   0.6940000000000001  0.401   297.0   357.0   54.882154882154886  50.42016806722689   52.65116147469089   297.0
12  12  EOG090X03YJ_0035_0035_1 EOG090X03YJ_0042_0042_1 0.5402471721616758  19.278839157918302  0.7070000000000001  0.488   1321.0  1719.0  38.53141559424678   43.92088423502036   41.22614991463357   1321.0
13  13  g13075.t1_0035_0042 g504.t1_0042_0035   0.3317504066721263  4.790120127840871   0.65    0.38799999999999996 372.0   408.0   59.40860215053763   51.470588235294116  55.43959519291587   372.0
14  14  g1026.t1_0035_0035  g7716.t1_0042_0042  0.21445770772761286 13.92799368027682   0.626   0.344   336.0   315.0   38.095238095238095  44.444444444444436  41.26984126984127   315.0
15  15  g18238.t1_0035_0042 g35401.t1_0042_0035 0.3889830456691637  20.33679494952895   0.6759999999999999  0.44799999999999995 320.0   366.0   50.9375 49.453551912568315  50.19552595628416   320.0

df2:

    Unnamed: 0 gene scaf_name   start   end cov_depth   GC
179806  g13600.t1_0042_0042 scaffold_6556   1   1149    2.42361684558216    0.528846153846154
315037  g34744.t1_0042_0035 scaffold_8076   17  765 3.49803921568627    0.386138613861386
317296  g35222.t1_0042_0035 scaffold_9018   1   614 93.071661237785 0.41
183513  g14327.t1_0042_0042 scaffold_9358   122 529 3.3184165232357996  0.36
328164  g37790.t1_0042_0035 scaffold_16356  1   320 2.73125 0.436241610738255
326617  g37405.t1_0042_0035 scaffold_14890  1   341 1.3061224489795902  0.36898395721925104
188515  g15510.t1_0042_0042 scaffold_20183  1   276 137.326086956522    0.669354838709677
184561  g14562.t1_0042_0042 scaffold_10427  1   494 157.993927125506    0.46145940390544704
290684  g30982.t1_0042_0035 scaffold_3800   440 940 174.499839537869    0.39823008849557506
179993  g13632.t1_0042_0042 scaffold_6654   29  1114    3.56506849315068    0.46153846153846206
181670  g13942.t1_0042_0042 scaffold_7830   1   811 5.307028360049321   0.529411764705882
196148  g20290.t1_0042_0035 scaffold_1145   2707    9712    78.84112231766741   0.367283950617284
313624  g34464.t1_0042_0035 scaffold_7610   1   480 7.740440324449589   0.549019607843137
303133  g32700.t1_0042_0035 scaffold_5119   1735    2373    118.436578171091    0.49074074074074103

df3:

    Unnamed: 0 gene scaf_name   start   end cov_depth   GC
428708  g66097.t1_0035_0035 scaffold_306390 1   695 32.2431654676259    0.389880952380952
342025  g50055.t1_0035_0035 scaffold_188566 15  954 7.062893081761009   0.351129363449692
214193  g28436.t1_0035_0042 scaffold_231066 1   842 25.9774346793349    0.348837209302326
400337  g60667.t1_0035_0035 scaffold_261197 309 656 15.873529411764698  0.353846153846154
224023  g30148.t1_0035_0042 scaffold_263686 10  414 23.2072538860104    0.34108527131782895
184987  g24481.t1_0035_0035 scaffold_65047  817 1593    27.7840552416824    0.533898305084746
249413  g34492.t1_0035_0035 scaffold_106432 1   511 3.2482544608223396  0.368318122555411
249418  g34493.t1_0035_0035 scaffold_106432 547 1230    3.2482544608223396  0.368318122555411
12667   g1120.t1_0035_0042  scaffold_2095   2294    2794    47.864745898359295  0.56203288490284
252797  g35042.t1_0035_0035 scaffold_108853 274 1276    20.269592476489 0.32735426008968604
255878  g36112.t1_0035_0042 scaffold_437464 1   540 74.8252551020408    0.27884615384615397
40058   g4082.t1_0035_0042  scaffold_11195  579 1535    33.4396168320219    0.48487467588591204
271053  g39343.t1_0035_0042 scaffold_590976 1   290 19.6666666666667    0.38636363636363596
89911   g10947.t1_0035_0035 scaffold_21433  1735    2373    32.4222503160556    0.408571428571429

Best answer

This should do it:

df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id'])&df1['Seq_2.id'].isin(df3['Seq_2.id'])]
df4
#  Seq_1.id Seq_2.id
#0   seq1_A   seq8_B
#1   seq2_A   Seq9_B
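For anyone who wants to run this on the toy example from the question, here is a minimal self-contained sketch. The three frames are rebuilt by hand from the tables in the question, and the intermediate names keep_sp1 and keep_sp2 are only illustrative, they do not come from the original post.

```python
import pandas as pd

# Toy frames rebuilt by hand from the example tables in the question
df1 = pd.DataFrame({
    "Seq_1.id": ["seq1_A", "seq2_A", "seq3_A", "seq4_A"],
    "Seq_2.id": ["seq8_B", "Seq9_B", "Seq10_B", "Seq11_B"],
})
df2 = pd.DataFrame({"Seq_1.id": ["seq1_A", "seq2_A", "seq4_A"]})    # sp1, seq3_A absent
df3 = pd.DataFrame({"Seq_2.id": ["seq8_B", "Seq9_B", "Seq10_B"]})  # sp2, Seq11_B absent

# One boolean mask per species (hypothetical helper names)
keep_sp1 = df1["Seq_1.id"].isin(df2["Seq_1.id"])
keep_sp2 = df1["Seq_2.id"].isin(df3["Seq_2.id"])

# A pair is kept only when both of its members passed their filter
df4 = df1[keep_sp1 & keep_sp2]
print(df4)
```

Splitting the condition into two named masks is just a readability choice. The one-line version above is equivalent.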

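Edit

You have to swap the two candidate tables: seq1_id must be checked against candidates_0035 and seq2_id against candidates_0042. This does not return an empty result: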

df4 = dN_dS[(dN_dS['seq1_id'].isin(candidates_0035['gene']))&(dN_dS['seq2_id'].isin(candidates_0042['gene']))]
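One way to spot this kind of mix-up is to count how many rows each pairing matches before filtering. The sketch below uses only a few IDs copied from the data posted in the question, not the full files, and the names wrong and right are illustrative.

```python
import pandas as pd

# A small sample of IDs copied from the data posted in the question
dN_dS = pd.DataFrame({
    "seq1_id": ["g66097.t1_0035_0035", "g45594.t1_0035_0035", "g50055.t1_0035_0035"],
    "seq2_id": ["g13600.t1_0042_0042", "g1464.t1_0042_0042", "g34744.t1_0042_0035"],
})
candidates_0035 = pd.DataFrame({"gene": ["g66097.t1_0035_0035", "g50055.t1_0035_0035"]})
candidates_0042 = pd.DataFrame({"gene": ["g13600.t1_0042_0042", "g34744.t1_0042_0035"]})

# Pairing from the question: each column is compared with the other species' table
wrong = (dN_dS["seq1_id"].isin(candidates_0042["gene"])
         & dN_dS["seq2_id"].isin(candidates_0035["gene"]))

# Swapped pairing: each column is compared with its own species' table
right = (dN_dS["seq1_id"].isin(candidates_0035["gene"])
         & dN_dS["seq2_id"].isin(candidates_0042["gene"]))

print(wrong.sum(), right.sum())  # number of rows each pairing would keep
df4 = dN_dS[right]
```

On this sample the first pairing keeps no rows and the swapped one keeps two. The extra parentheses around each isin call are harmless but not what fixes the result, since a method call already binds more tightly than &.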

Regarding python - Parse and create a new df with conditions, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50407509/
