python - Matching a sequence across two DataFrames with similar string patterns, keeping index and order

Tags: python pandas scikit-learn

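I have two DataFrames, df and df1. I need to find where the sequence of strings in df1 occurs in df and return only the matching rows of df, in order and with df's original index numbers.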

df

idx id_0 user string
0 008457 02 hello
1 990037 05 I 
2 774426 10 am
3 564389 08 sleeping
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
10 294260 16 today 
11 908751 29 is 
12 558902 81 rainy
13 097856 19 with
14 110044 24 cold
15 775098 16 today 
16 665490 02 is
17 887099 07 sunday 
18 389011 18 ahhh
19 675510 11 weekend 

df1

idx string
0 today
1 is
2 a
3 bright
4 sunny
5 day

Expected output:

idx id_0 user string
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day

I tried several approaches (pd.merge, pd.concat, DataFrame.join, and also isin), but I get the wrong index numbers.

For example:

out = df1[df1['string'].isin(df['string'])]

Best answer
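
A membership test such as isin checks each row on its own and ignores order, so it also keeps repeated words that are not part of the full sequence. The following is a minimal sketch of that effect; the shortened frames `df_demo` and `pattern` are illustrative stand-ins that carry only the string column from the question.

```python
import pandas as pd

# Illustrative stand-in for df: only the "string" column, default 0..19 index.
words = ["hello", "I", "am", "sleeping", "today", "is", "a", "bright",
         "sunny", "day", "today", "is", "rainy", "with", "cold",
         "today", "is", "sunday", "ahhh", "weekend"]
df_demo = pd.DataFrame({"string": words})

# Illustrative stand-in for df1.
pattern = pd.DataFrame({"string": ["today", "is", "a", "bright", "sunny", "day"]})

# isin tests each row independently, so every "today" and "is" is kept,
# not only the ones that start the complete six-word run.
mask = df_demo["string"].isin(pattern["string"])
filtered = df_demo[mask]

print(filtered.index.tolist())
# [4, 5, 6, 7, 8, 9, 10, 11, 15, 16]
```

Rows 10, 11, 15 and 16 appear only because their words occur somewhere in df1, which is why a comparison over consecutive rows is needed.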

One possible approach is as follows:

import pandas as pd

df = pd.DataFrame([
    [0, "008457", "02", "hello"],
    [1, "990037", "05", "I"],
    [2, "774426", "10", "am"],
    [3, "564389", "08", "sleeping"],
    [4, "009124", "17", "today"],
    [5, "000029", "13", "is"],
    [6, "548751", "21", "a"],
    [7, "479903", "19", "bright"],
    [8, "897054", "08", "sunny"],
    [9, "336588", "7", "day"],
    [10, "294260", "16", "today"],
    [11, "908751", "29", "is"],
    [12, "558902", "81", "rainy"],
    [13, "097856", "19", "with"],
    [14, "110044", "24", "cold"],
    [15, "775098", "16", "today"],
    [16, "665490", "02", "is"],
    [17, "887099", "07", "sunday"],
    [18, "389011", "18", "ahhh"],
    [19, "675510", "11", "weekend"]
],
columns=["idx", "id_0", "user", "string"]
)
df = df.set_index('idx')

df1 = pd.DataFrame([
    [0, "today"],
    [1, "is"],
    [2, "a"],
    [3, "bright"],
    [4, "sunny"],
    [5, "day"]
],
columns=["idx", "string"]
)


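# Slide a window of len(df1) rows over df and record the positions of every full match.
matching_indices = []
for i in range(len(df) - len(df1) + 1):
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i, i + len(df1)))

# The collected values are integer positions, so select with iloc; df keeps its own idx labels.
df.iloc[matching_indices]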

Output:

    id_0    user    string
idx         
4   009124  17  today
5   000029  13  is
6   548751  21  a
7   479903  19  bright
8   897054  08  sunny
9   336588  7   day

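The code above returns every matching subsequence with its correct index, not just the first occurrence.

If you only want the first match, break out of the loop as soon as a match is found: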

matching_indices = []
for i in range(len(df)-len(df1)+1):
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i,i+len(df1)))
        break

df.iloc[matching_indices]
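
The same windowed comparison can also be sketched as a reusable function that compares one pattern element at a time across all candidate start positions. This is only an illustration under stated assumptions: the names `find_subsequence_rows`, `column`, `first_only` and `demo` are hypothetical and do not come from the answer above, and the demo frame carries only the string column.

```python
import numpy as np
import pandas as pd


def find_subsequence_rows(frame, pattern, column="string", first_only=False):
    """Return rows of `frame` where `column` contains `pattern` as a contiguous run.

    Hypothetical helper for illustration; rows keep their original index labels.
    """
    values = frame[column].to_numpy(dtype=object)
    pattern = list(pattern)
    n = len(pattern)
    n_starts = len(values) - n + 1  # number of candidate start positions
    if n == 0 or n_starts <= 0:
        return frame.iloc[[]]

    # is_start[i] stays True only if values[i + k] == pattern[k] for every k.
    is_start = np.ones(n_starts, dtype=bool)
    for k, item in enumerate(pattern):
        is_start &= values[k:k + n_starts] == item

    starts = np.flatnonzero(is_start)
    if first_only:
        starts = starts[:1]

    # Expand each start into the n positions it covers; unique() drops
    # duplicates that would arise if two matches overlapped.
    positions = np.unique((starts[:, None] + np.arange(n)).ravel())
    return frame.iloc[positions]


# Illustrative data: the "string" column from the question, indexed 0..19.
words = ["hello", "I", "am", "sleeping", "today", "is", "a", "bright",
         "sunny", "day", "today", "is", "rainy", "with", "cold",
         "today", "is", "sunday", "ahhh", "weekend"]
demo = pd.DataFrame({"string": words})
demo.index.name = "idx"

pattern = ["today", "is", "a", "bright", "sunny", "day"]
result = find_subsequence_rows(demo, pattern)
print(result.index.tolist())
# [4, 5, 6, 7, 8, 9]
```

With the frames from the question, the equivalent call would be find_subsequence_rows(df, df1["string"]), and passing first_only=True mirrors the loop with break. The Python-level loop here runs over the length of the pattern rather than over the rows of df, which may help when df is much longer than df1.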

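This question on matching a sequence across two DataFrames with similar string patterns while keeping index and order corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/62875836/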
