python - Intersection of two columns of pandas DataFrames

I have 2 pandas DataFrames, dataframe1 and dataframe2, shown below:

mydataframe1
Out[15]: 
    Start   End  
    100     200
    300     450
    500     700


mydataframe2
Out[16]:
  Start   End       Value     
  0       400       0  
  401     499       -1  
  500     1000      1  
  1001    1698      1

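Each row corresponds to a segment (Start-End). For each segment in dataframe1, I want to assign a value based on the values assigned to the segments in dataframe2.

For example:

The first segment of dataframe1, 100 200, is contained in the first segment of dataframe2, 0 400, so I should assign the value 0.

The second segment of dataframe1, 300 450, overlaps both the first (0 400) and the second (401 499) segments of dataframe2. In that case I need to split the segment into 2 parts and assign the 2 corresponding values, i.e. 300 400 -> value 0 and 401 450 -> value -1.

The final dataframe1 should look like this: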

mydataframe1
Out[15]: 
    Start   End  Value
    100     200  0
    300     400  0
    401     450  -1
    500     700  1

I hope I made myself clear. Can you help me?

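Best answer

I doubt there is a pandas method that solves this directly. You have to compute the intersections manually to get the result you want. The intervaltree library at least makes the interval overlap calculation easier and more efficient.

IntervalTree.search() returns the (full) intervals that overlap with the supplied interval, but it does not compute their intersection (intervaltree 3.0 replaced search() with overlap() for range queries). That is why I also apply the intersect() function I defined.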
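To make the "find overlaps, then clip" idea concrete, here is a minimal pure-Python sketch with no tree at all. The helper name split_segment and the labelled list are made up for illustration; the segments are copied from mydataframe2 in the question and both ends are treated as inclusive. It scans every labelled segment linearly, which is the work an interval tree would speed up.

```python
def intersect(a, b):
    """Intersection of two closed segments (start, end), or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None

# Segments from mydataframe2 as (Start, End, Value) tuples.
labelled = [(0, 400, 0), (401, 499, -1), (500, 1000, 1), (1001, 1698, 1)]

def split_segment(query, labelled_segments):
    """Clip `query` against every labelled segment it overlaps (hypothetical helper)."""
    pieces = []
    for start, end, value in labelled_segments:
        clipped = intersect(query, (start, end))
        if clipped is not None:
            # Keep the clipped bounds together with the value of the labelled segment.
            pieces.append((clipped[0], clipped[1], value))
    return pieces

print(split_segment((300, 450), labelled))
# [(300, 400, 0), (401, 450, -1)]
```

The second segment of the question is split into two pieces, one per overlapping labelled segment, which is the behaviour asked for.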

import pandas as pd
from intervaltree import Interval, IntervalTree

def intersect(a, b):
    """Intersection of two intervals."""
    intersection = max(a[0], b[0]), min(a[1], b[1])
    if intersection[0] > intersection[1]:
        return None
    return intersection

def interval_df_intersection(df1, df2):
    """Calculate the intersection of two sets of intervals stored in DataFrames.
    The intervals are defined by the "Start" and "End" columns.
    The data in the rest of the columns of df1 is included with the resulting
    intervals."""
    tree = IntervalTree.from_tuples(zip(
            df1.Start.values,
            df1.End.values,
            df1.drop(["Start", "End"], axis=1).values.tolist()
        ))

    intersections = []
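    for row in df2.itertuples():
        i1 = Interval(row.Start, row.End)
        # Slice syntax queries every stored interval overlapping Start..End;
        # sorting makes the order of the pieces deterministic.
        overlapping = sorted(tree[row.Start:row.End])
        intersections += [list(intersect(i1, i2)) + i2.data for i2 in overlapping]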

    # Make sure the column names are in the correct order
    data_cols = list(df1.columns)
    data_cols.remove("Start")
    data_cols.remove("End")
    return pd.DataFrame(intersections, columns=["Start", "End"] + data_cols)

interval_df_intersection(mydataframe2, mydataframe1)

The result is identical to what you are after.
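If installing intervaltree is not an option, the same result can probably be obtained with pandas alone. The sketch below is an alternative, not the method used above: it pairs every row of one frame with every row of the other through a temporary constant key and then clips each pair. The function name split_by_segments and the _key, _seg and _lab names are invented for illustration, both ends are assumed to be inclusive, and the cross join grows quadratically, so it only suits small frames.

```python
import pandas as pd

# The two frames from the question.
mydataframe1 = pd.DataFrame({"Start": [100, 300, 500], "End": [200, 450, 700]})
mydataframe2 = pd.DataFrame({
    "Start": [0, 401, 500, 1001],
    "End": [400, 499, 1000, 1698],
    "Value": [0, -1, 1, 1],
})

def split_by_segments(segments, labelled):
    """Split each row of `segments` along the rows of `labelled` (hypothetical helper)."""
    # Pair every segment with every labelled segment via a constant join key.
    pairs = segments.assign(_key=1).merge(
        labelled.assign(_key=1), on="_key", suffixes=("_seg", "_lab")
    )
    # Clip each pair to its common part, treating both ends as inclusive.
    pairs["Start"] = pairs[["Start_seg", "Start_lab"]].max(axis=1)
    pairs["End"] = pairs[["End_seg", "End_lab"]].min(axis=1)
    # Keep only the pairs that actually overlap.
    overlapping = pairs[pairs["Start"] <= pairs["End"]]
    return (overlapping[["Start", "End", "Value"]]
            .sort_values(["Start", "End"])
            .reset_index(drop=True))

result = split_by_segments(mydataframe1, mydataframe2)
print(result)
```

For the frames in the question this should give the four rows listed as the desired output, sorted by Start.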

Regarding python - Intersection of two columns of pandas DataFrames, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42673656/
