python - Matching a sequence of strings across two DataFrames, keeping the index and order

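I have two DataFrames, df and df1. I need to find where the sequence of strings in df1 occurs in df and return only those matching rows, with their index numbers from df.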

df

idx id_0 user string
0 008457 02 hello
1 990037 05 I 
2 774426 10 am
3 564389 08 sleeping
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
10 294260 16 today 
11 908751 29 is 
12 558902 81 rainy
13 097856 19 with
14 110044 24 cold
15 775098 16 today 
16 665490 02 is
17 887099 07 sunday 
18 389011 18 ahhh
19 675510 11 weekend

df1

idx string
0 today
1 is
2 a
3 bright
4 sunny
5 day

Expected output:

idx id_0 user string
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day

I tried several approaches (pd.merge, pd.concat, pd.join, and also isin), but I get the wrong index numbers.

For example,

out = df1[df1['string'].isin(df['string'])]

Best answer

One possible approach is as follows:

import pandas as pd

df = pd.DataFrame([
    [0, "008457", "02", "hello"],
    [1, "990037", "05", "I"],
    [2, "774426", "10", "am"],
    [3, "564389", "08", "sleeping"],
    [4, "009124", "17", "today"],
    [5, "000029", "13", "is"],
    [6, "548751", "21", "a"],
    [7, "479903", "19", "bright"],
    [8, "897054", "08", "sunny"],
    [9, "336588", "7", "day"],
    [10, "294260", "16", "today"],
    [11, "908751", "29", "is"],
    [12, "558902", "81", "rainy"],
    [13, "097856", "19", "with"],
    [14, "110044", "24", "cold"],
    [15, "775098", "16", "today"],
    [16, "665490", "02", "is"],
    [17, "887099", "07", "sunday"],
    [18, "389011", "18", "ahhh"],
    [19, "675510", "11", "weekend"]
],
columns=["idx", "id_0", "user", "string"]
)
df = df.set_index('idx')

df1 = pd.DataFrame([
    [0, "today"],
    [1, "is"],
    [2, "a"],
    [3, "bright"],
    [4, "sunny"],
    [5, "day"]
],
columns=["idx", "string"]
)


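# Slide a window of len(df1) rows over df and collect the positions of
# every window whose strings equal df1's strings, in the same order.
matching_indices = []
for i in range(len(df) - len(df1) + 1):
    if (df.string.iloc[i:i + len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i, i + len(df1)))

# Select by position; the original idx labels of df are preserved.
df.iloc[matching_indices]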

Output:

     id_0    user  string
idx
4    009124  17    today
5    000029  13    is
6    548751  21    a
7    479903  19    bright
8    897054  08    sunny
9    336588  7     day

The code above returns every matching subsequence with its correct index, not just the first occurrence.
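For larger frames, the same all-matches search can probably be done without a Python-level loop. The following is only a minimal sketch, assuming NumPy 1.20 or later (for `sliding_window_view`) and using small made-up frames rather than the data from the question; the names `haystack`, `needle`, `starts`, and `positions` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Small stand-in frames (hypothetical data, not the frames from the question).
df = pd.DataFrame(
    {"string": ["hello", "today", "is", "sunny", "today", "is", "cold"]},
    index=pd.Index(range(10, 17), name="idx"),
)
df1 = pd.DataFrame({"string": ["today", "is"]})

# Convert to fixed-width string arrays so the windows can be compared elementwise.
haystack = df["string"].to_numpy(dtype=str)
needle = df1["string"].to_numpy(dtype=str)

# Each row of `windows` is one run of len(needle) consecutive values from df.
# Note: this raises ValueError if df1 has more rows than df.
windows = sliding_window_view(haystack, len(needle))

# Start positions where the whole window equals the needle, in order.
starts = np.flatnonzero((windows == needle).all(axis=1))

# Expand each start into the positions it covers, then select by position
# so that the original idx labels are kept.
positions = (starts[:, None] + np.arange(len(needle))).ravel()
result = df.iloc[positions]
print(result)
```

As with the loop, overlapping matches would repeat rows in the result, and taking only `starts[:1]` would restrict it to the first occurrence.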

If you only want the first match, break out of the loop as soon as a match is found:

matching_indices = []
for i in range(len(df) - len(df1) + 1):
    if (df.string.iloc[i:i + len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i, i + len(df1)))
        break  # stop after the first matching window

df.iloc[matching_indices]
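Since the two variants differ only in the `break`, they could be folded into one function. This is a sketch under assumptions: `find_sequence_rows`, its `column` and `first_only` parameters, and the small sample frame are hypothetical names and data chosen for illustration, not anything provided by pandas.

```python
import pandas as pd


def find_sequence_rows(frame, pattern, column="string", first_only=False):
    """Return the rows of `frame` where `column` contains `pattern` as a contiguous run."""
    values = frame[column].tolist()
    pattern = list(pattern)
    n = len(pattern)
    positions = []
    if n == 0:
        # An empty pattern matches nothing; return an empty selection.
        return frame.iloc[positions]
    for start in range(len(values) - n + 1):
        if values[start:start + n] == pattern:
            positions.extend(range(start, start + n))
            if first_only:
                break  # keep only the first occurrence
    # Positional selection keeps the original index labels.
    return frame.iloc[positions]


# Hypothetical sample data with a non-contiguous index.
df = pd.DataFrame(
    {"user": ["17", "13", "16", "29", "81"],
     "string": ["today", "is", "today", "is", "rainy"]},
    index=pd.Index([4, 5, 10, 11, 12], name="idx"),
)
df1 = pd.DataFrame({"string": ["today", "is"]})

all_rows = find_sequence_rows(df, df1["string"])
first_rows = find_sequence_rows(df, df1["string"], first_only=True)
print(all_rows)
print(first_rows)
```

Comparing plain Python lists keeps the function free of any NumPy dependency, at the cost of speed on very long columns.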

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/62875836/
