python - How can I use difflib to create an artificial key column to merge two datasets when the column of interest has missing cells?

Tags: python regex pandas python-2.7 difflib

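Goal: if the name in row i of df2 is a substring of, or an exact match for, the name in some row N of df1, and the State and District columns of row N in df1 match the corresponding State and District columns of row i in df2, merge the two rows.

I was advised to use difflib to create an artificial key column to merge on.

This new column is called "Name". difflib.get_close_matches looks up similar strings in df2.

This works fine when every row in the "CandidateName" column has a value, but when a cell is missing I get IndexError: list index out of range.

I tried to work around this by filling the empty cells with the string "EMPTY", but I still get the same error.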

# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')


# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
# Data Frame Samples

import pandas as pd

# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara  Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)

print df1

#        CandidateName         District   Party          State
#0  Theodorick A. Bland           9                       VA
#1  Aedanus Rutherford Burke      2                       SC
#2  Jason Lewis                   2                       MN
#3  Barbara Comstock             10       Democrat        VA
#4  Theodorick Bland              9                       VA
#5  Aedanus Burke                 2                       SC
#6  Jason Initial Lewis           2         Democrat      MN
#7  ''                            1         Whig          NH
#8  ''                            1         Whig          NH

Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)

print df2

#   Name                 District   Party      State
#0  Theodorick Bland        9                   VA
#1  Aedanus Burke           2                   SC
#2  Jason Lewis             2       Democrat    MN
#3  Barbara Comstock        10      Democrat    VA

import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])

Expected

print(df1)
#              CandidateName State  District     Party              Name
#0       Theodorick A. Bland    VA         9            Theodorick Bland
#1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
#2               Jason Lewis    MN         2                 Jason Lewis
#3         Barbara  Comstock    VA        10  Democrat  Barbara Comstock
#4          Theodorick Bland    VA         9            Theodorick Bland
#5             Aedanus Burke    SC         2               Aedanus Burke
#6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
#7                              NH         1      Whig    
#8                              NH         1      Whig    

Actual result (error):

-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

IndexError: list index out of range

Best answer

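difflib.get_close_matches returns a list. For the rows with an empty name no close match is found, so the returned list is empty and has no index 0; that is why [0] raises the IndexError. Second, we need to convert these lists to strings so that the merge can be done, as in the code below.

Note: you do not need df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')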
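The behaviour can be checked in isolation. The sketch below uses a plain list copied from the df2['Name'] values in the question instead of the DataFrame column; it suggests that neither '' nor the 'EMPTY' placeholder has a close match at the default cutoff of 0.6, so both yield an empty list and the placeholder cannot prevent the error.

```python
import difflib

# Plain-list copy of the df2['Name'] values from the question.
names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']

# Compare a real name, an empty cell, and the 'EMPTY' placeholder.
for candidate in ['Theodorick A. Bland', '', 'EMPTY']:
    matches = difflib.get_close_matches(candidate, names)
    print('%r -> %r' % (candidate, matches))

real_matches = difflib.get_close_matches('Theodorick A. Bland', names)
empty_matches = difflib.get_close_matches('', names)
placeholder_matches = difflib.get_close_matches('EMPTY', names)
```

Indexing either of the two empty results with [0] is what raises IndexError: list index out of range.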

import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'])))

df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')

print(df_merge)
              CandidateName State  District     Party              Name
0       Theodorick A. Bland    VA         9            Theodorick Bland
1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
2               Jason Lewis    MN         2                 Jason Lewis
3         Barbara  Comstock    VA        10  Democrat  Barbara Comstock
4          Theodorick Bland    VA         9            Theodorick Bland
5             Aedanus Burke    SC         2               Aedanus Burke
6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
7                              NH         1      Whig                  
8                              NH         1      Whig                

Note that I added the how='left' argument to the merge, because you want to keep the shape of your original dataframe.
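To illustrate the difference, here is a minimal sketch on two small made-up frames (the names left and right and the reduced rows are assumptions for illustration, not the original data). The default inner merge drops the row whose key has no partner, while how='left' keeps it; indicator=True adds a _merge column showing where each row came from.

```python
import pandas as pd

# Reduced, illustrative stand-ins for df1 and df2.
left = pd.DataFrame({
    'Name': ['Jason Lewis', ''],
    'State': ['MN', 'NH'],
    'District': [2, 1],
    'Party': ['Democrat', 'Whig'],
})
right = pd.DataFrame({
    'Name': ['Jason Lewis'],
    'State': ['MN'],
    'District': [2],
})

keys = ['Name', 'State', 'District']

# Default (inner) merge: only rows whose keys appear in both frames survive.
inner = left.merge(right, on=keys)

# Left merge: every row of the left frame is kept, matched or not.
kept = left.merge(right, on=keys, how='left', indicator=True)

print(len(inner))
print(len(kept))
print(kept['_merge'].tolist())
```

A left merge only preserves the row count when the key combinations in the right frame are unique; duplicated keys on the right would still multiply rows.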

Explanation of ''.join()
We do this to convert the list into a string; see this example:

lst = ['hello', 'world']

print(' '.join(lst))
hello world
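One caveat with ''.join(): get_close_matches can return up to three matches by default, and joining them glues the names together into a key that matches nothing. The sketch below is an alternative, not part of the original answer; the helper name first_close_match and the extra candidate 'Jason Lewin' are hypothetical and only there to show the effect. It keeps the best match and falls back to an empty string.

```python
import difflib

def first_close_match(name, candidates, cutoff=0.6):
    """Return the best close match for name, or '' when there is none.

    Hypothetical helper for illustration; cutoff is passed straight
    through to difflib.get_close_matches.
    """
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else ''

# 'Jason Lewin' is a made-up near-duplicate that forces two matches.
candidates = ['Jason Lewis', 'Jason Lewin', 'Aedanus Burke']

joined = ''.join(difflib.get_close_matches('Jason Lewis', candidates))
best = first_close_match('Jason Lewis', candidates)
missing = first_close_match('', candidates)

print(repr(joined))   # both matches concatenated
print(repr(best))     # only the best match
print(repr(missing))  # empty string, no IndexError
```

Under these assumptions it could be used as df1['CandidateName'].apply(lambda x: first_close_match(x, df2['Name'])).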

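Regarding "python - How can I use difflib to create an artificial key column to merge two datasets when the column of interest has missing cells?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55445922/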
