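I have two DataFrames, df and df1. I need to match the sequence of strings in df1 against df and get, as output, only the matching run of rows, keeping the index numbers from df.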
df
idx id_0 user string
0 008457 02 hello
1 990037 05 I
2 774426 10 am
3 564389 08 sleeping
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
10 294260 16 today
11 908751 29 is
12 558902 81 rainy
13 097856 19 with
14 110044 24 cold
15 775098 16 today
16 665490 02 is
17 887099 07 sunday
18 389011 18 ahhh
19 675510 11 weekend
df1
idx string
0 today
1 is
2 a
3 bright
4 sunny
5 day
Expected output:
idx id_0 user string
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
I have tried several approaches (pd.merge, pd.concat, pd.join, and also isin), but I get the wrong index numbers.
For example:
out = df1[df1['string'].isin(df.index().['string'])]
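To see why a membership test alone cannot recover the sequence, here is a minimal sketch. It assumes the attempt above was meant as an isin filter on df, and it rebuilds only the string column (with the default integer index standing in for idx). isin keeps every row whose value appears anywhere in df1, so the later, unrelated occurrences of "today" and "is" come back as well.

```python
import pandas as pd

# Only the "string" column is rebuilt here; the default RangeIndex plays the role of idx.
words = ["hello", "I", "am", "sleeping", "today", "is", "a", "bright", "sunny", "day",
         "today", "is", "rainy", "with", "cold", "today", "is", "sunday", "ahhh", "weekend"]
df = pd.DataFrame({"string": words})
df1 = pd.DataFrame({"string": ["today", "is", "a", "bright", "sunny", "day"]})

# isin tests each row on its own, so the order and adjacency of the words are ignored.
mask = df["string"].isin(df1["string"])
isin_indices = df.index[mask].tolist()
print(isin_indices)  # [4, 5, 6, 7, 8, 9, 10, 11, 15, 16]
```

Rows 10, 11, 15 and 16 are returned even though they are not part of the six-word run, which is why a window-by-window comparison is needed.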
Best answer
One possible approach is as follows:
import pandas as pd

df = pd.DataFrame(
    [
        [0, "008457", "02", "hello"],
        [1, "990037", "05", "I"],
        [2, "774426", "10", "am"],
        [3, "564389", "08", "sleeping"],
        [4, "009124", "17", "today"],
        [5, "000029", "13", "is"],
        [6, "548751", "21", "a"],
        [7, "479903", "19", "bright"],
        [8, "897054", "08", "sunny"],
        [9, "336588", "7", "day"],
        [10, "294260", "16", "today"],
        [11, "908751", "29", "is"],
        [12, "558902", "81", "rainy"],
        [13, "097856", "19", "with"],
        [14, "110044", "24", "cold"],
        [15, "775098", "16", "today"],
        [16, "665490", "02", "is"],
        [17, "887099", "07", "sunday"],
        [18, "389011", "18", "ahhh"],
        [19, "675510", "11", "weekend"],
    ],
    columns=["idx", "id_0", "user", "string"],
)
df = df.set_index("idx")

df1 = pd.DataFrame(
    [
        [0, "today"],
        [1, "is"],
        [2, "a"],
        [3, "bright"],
        [4, "sunny"],
        [5, "day"],
    ],
    columns=["idx", "string"],
)

# Slide a window of len(df1) rows over df and compare it element-wise with df1.
matching_indices = []
for i in range(len(df) - len(df1) + 1):
    if (df.string.iloc[i:i + len(df1)].values == df1.string.values).all():
        # Record the positions of every row in the matching window.
        matching_indices += list(range(i, i + len(df1)))

# iloc selects by position; the original idx labels are kept in the result.
df.iloc[matching_indices]
Output:
id_0 user string
idx
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
The code above returns every matching subsequence with its correct index, not just the first occurrence.
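As a small illustration of the "every match" behaviour, the sketch below runs the same loop with a shorter, hypothetical pattern ("today", "is") that occurs three times in the sample data. Only the string column is rebuilt, and the default integer index stands in for idx.

```python
import pandas as pd

words = ["hello", "I", "am", "sleeping", "today", "is", "a", "bright", "sunny", "day",
         "today", "is", "rainy", "with", "cold", "today", "is", "sunday", "ahhh", "weekend"]
df = pd.DataFrame({"string": words})

# Hypothetical two-word pattern, chosen because it repeats in the data.
pattern = pd.Series(["today", "is"])
n = len(pattern)

matching_indices = []
for i in range(len(df) - n + 1):
    # Compare the current window of n rows with the pattern, element by element.
    if (df["string"].iloc[i:i + n].values == pattern.values).all():
        matching_indices += list(range(i, i + n))

print(matching_indices)  # [4, 5, 10, 11, 15, 16]
```

All three occurrences are collected. Note that if a pattern could overlap with itself, the same position might be appended more than once, so deduplicating the list may be worthwhile in that case.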
If you only want the first match, you can break out of the loop as soon as a match is found, like this:
matching_indices = []
for i in range(len(df) - len(df1) + 1):
    if (df.string.iloc[i:i + len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i, i + len(df1)))
        break  # stop after the first matching window

df.iloc[matching_indices]
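For longer frames, the row-by-row loop can be replaced by a loop over the (usually much shorter) pattern. The following is a sketch of that variant, not part of the original answer; the helper name window_starts is made up for illustration. It compares the whole column against one pattern word per step, shifted by that word's offset, and keeps the window starts where every offset matched.

```python
import numpy as np
import pandas as pd

def window_starts(haystack, needle):
    """Return the start positions where needle occurs as a contiguous run in haystack."""
    h = np.asarray(haystack, dtype=object)
    n = np.asarray(needle, dtype=object)
    n_windows = len(h) - len(n) + 1
    if len(n) == 0 or n_windows <= 0:
        return np.array([], dtype=int)
    ok = np.ones(n_windows, dtype=bool)
    for k, token in enumerate(n):
        # A window starting at position s matches only if h[s + k] == needle[k] for every k.
        ok &= h[k:k + n_windows] == token
    return np.flatnonzero(ok)

words = ["hello", "I", "am", "sleeping", "today", "is", "a", "bright", "sunny", "day",
         "today", "is", "rainy", "with", "cold", "today", "is", "sunday", "ahhh", "weekend"]
df = pd.DataFrame({"string": words})
df1 = pd.DataFrame({"string": ["today", "is", "a", "bright", "sunny", "day"]})

starts = window_starts(df["string"], df1["string"])
# Expand each start into the positions of its window; np.unique drops overlaps.
positions = np.unique((starts[:, None] + np.arange(len(df1))).ravel())
matched = df.iloc[positions]
print(matched.index.tolist())  # [4, 5, 6, 7, 8, 9]
```

To keep only the first occurrence, use starts[:1] in place of starts. Because the rows are selected by position with iloc, whatever index df carries is preserved in the result.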
Source: "python - Matching a sequence of strings between two DataFrames while preserving index and order", a question on Stack Overflow: https://stackoverflow.com/questions/62875836/