python - 按列比较不同的 pandas 数据框与公差变化

标签 python pandas dataframe

我有 3 个这样的 pandas 数据框:

#0
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_1_50  47.0  47.0  54.0  52.0  101.829787  101.680851  99.092593   99.692308   5279.0  5256.0  4864.0  4953.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   47.0  47.0  54.0  52.0  100.212766  99.680851   100.870370  101.115385  5030.0  4952.0  5131.0  5169.0
    seq_3_50  46.0  47.0  53.0  54.0  100.173913  100.978723  100.924528  99.944444   5026.0  5148.0  5139.0  4990.0
    seq_4_0   45.0  47.0  54.0  54.0  99.044444   99.000000   101.407407  102.111111  4856.0  4851.0  5214.0  5323.0
    seq_4_50  47.0  47.0  53.0  53.0  101.872340  104.382979  97.849057   98.490566   5285.0  5686.0  4684.0  4776.0
    seq_5_0   54.0  34.0  37.0  75.0  90.462963   91.647059   90.756757   116.546667  3700.0  3848.0  3737.0  7915.0
    seq_5_50  48.0  33.0  37.0  82.0  94.937500   113.636364  113.162162  92.756098   4277.0  7337.0  7245.0  3990.0
    seq_6_0   60.0  50.0  48.0  42.0  98.500000   93.900000   106.125000  104.785714  4777.0  4139.0  5976.0  5752.0
    seq_6_50  59.0  46.0  52.0  43.0  98.338983   98.826087   102.615385  102.697674  4754.0  4825.0  5402.0  5415.0
#1
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_1_50  47.0  47.0  54.0  52.0  101.829787  101.680851  99.092593   99.692308   5279.0  5256.0  4864.0  4953.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   47.0  47.0  54.0  52.0  100.212766  99.680851   100.870370  101.115385  5030.0  4952.0  5131.0  5169.0
    seq_3_50  46.0  47.0  53.0  54.0  100.173913  100.978723  100.924528  99.944444   5026.0  5148.0  5139.0  4990.0
    seq_4_0   45.0  47.0  54.0  54.0  99.044444   99.000000   101.407407  102.111111  4856.0  4851.0  5214.0  5323.0
    seq_4_50  47.0  47.0  53.0  53.0  101.872340  104.382979  97.849057   98.490566   5285.0  5686.0  4684.0  4776.0
    seq_5_0   54.0  34.0  37.0  75.0  90.462963   91.647059   90.756757   116.546667  3700.0  3848.0  3737.0  7915.0
    seq_5_50  48.0  33.0  37.0  82.0  94.937500   113.636364  113.162162  92.756098   4277.0  7337.0  7245.0  3990.0
#2
                 A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
    seq_1_0   48.0  48.0  53.0  51.0  100.291667  99.208333   101.943396  100.411765  5042.0  4882.0  5297.0  5062.0
    seq_1_50  48.0  47.0  54.0  51.0  100.083333  101.680851  99.092593   101.294118  5012.0  5256.0  4864.0  5196.0
    seq_2_0   47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0
    seq_2_50  47.0  47.0  54.0  52.0  101.468085  101.425532  99.000000   100.346154  5223.0  5216.0  4850.0  5052.0
    seq_3_0   50.0  47.0  53.0  50.0  98.980000   99.680851   101.490566  101.740000  4847.0  4952.0  5226.0  5265.0
    seq_3_50  49.0  47.0  52.0  52.0  95.857143   100.978723  102.519231  102.423077  4403.0  5148.0  5387.0  5371.0

我想将第一个数据帧 (#0) 的所有列与其他 2 个数据帧(#1 和 #2)进行比较,以确定哪个索引具有不同的列值(例如索引 seq_6_0 和 seq_6_50 出现在数据帧 #0 中,但不存在于其他两个数据帧中。

但我也想设置每列的容差变化,以将不同数据帧的列视为相等,例如:

数据帧 #0 的索引 seq_1_0 具有以下值:

A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
47.0  47.0  54.0  52.0  100.978723  100.957447  100.370370  99.788462   5147.0  5144.0  5055.0  4968.0

而 daframe #2 的索引 seq_1_0 具有:

A     C     G     T          uA          uC          uG          uT     cmA     cmC     cmG     cmT
48.0  48.0  53.0  51.0  100.291667  99.208333   101.943396  100.411765  5042.0  4882.0  5297.0  5062.0

所以我想为每列设置差异容差值,例如对于列 ["A","C","T","G"] 我需要比较值之间的容差值为 90%,但对于其他列,我需要比较值之间的不同百分比。

我可以使用任何 pandas 函数来执行此操作吗?

最好,

最佳答案

使用np.isclose ,这使您可以精确控制比较的绝对和相对容差。

我假设您只想比较两个数据框中都存在标签的行。一个中存在但另一个中不存在的行将被忽略。此外,由于您对 A、C、G、T 使用相对标准,因此 compare(df0,df1)compare(df1,df0) 不同。它假设第二个参数是引用值。这与 np.isclose 的工作方式一致。

def compare(dfa, dfb):
    s = pd.Series(['A','C','G','T'])
    tmp = dfa.join(dfb, how='inner', lsuffix='_a', rsuffix='_b')

    # The A, C, G, T columns: within 90% of dfb
    lhs = tmp[s + '_a'].values
    rhs = tmp[s + '_b'].values
    compare1 = np.isclose(lhs, rhs, atol=0, rtol=0.9)

    # The uA, uC, uG, uT columns: within 1e-5
    lhs = tmp['u' + s + '_a'].values
    rhs = tmp['u' + s + '_b'].values
    compare2 = np.isclose(lhs, rhs, atol=1e-5, rtol=0)

    # The cmA, cmC, cmG, cmT columns: within 1e-3
    lhs = tmp['cm' + s + '_a'].values
    rhs = tmp['cm' + s + '_b'].values
    compare3 = np.isclose(lhs, rhs, atol=1e-3, rtol=0)

    # Assemble the result
    data = np.concatenate([compare1, compare2, compare3], axis=1)
    cols = pd.concat([s, 'u'+s, 'cm'+s])    
    result = pd.DataFrame(data, columns=cols, index=tmp.index)

    return result

compare(df0, df2)

为了轻松可视化结果:

def highlight_false(cell):
    return '' if cell else 'background-color: yellow'

result = compare(df0,df2)
result.style.applymap(highlight_false)

关于python - 按列比较不同的 pandas 数据框与公差变化,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/59733859/

相关文章:

python - 在 SWIG 编译中 : In header file in interface is unable to resolve other header files.

python - 在数据框中保留连续的天数

r - 如何将数据框中的列转换为行名

python - DataFrame float 到整数?

python - pandas.read_csv : how do I parse two columns as datetimes in a hierarchically-indexed CSV?

python - Dart 相当于 Python zip 和列表理解,用于从两个列表生成小部件列表

python - 将静态 Django 数据库加载到内存中

python - 如何按索引对 Pandas DataFrame 进行排序?

python - 获取非重复值计数大于指定值的列

python - 与 Flask 或其他 Python webframework 并行生成和处理 Websocket 输出数据的后台线程