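Goal: merge a row of df2 into a row of df1 when the name in row i of df2 is a substring of, or an exact match for, the name in row N of df1, and the State and District columns of the two rows also match.
It was suggested that I use difflib to create an artificial key column to merge on.
This new column is called 'Name'. difflib.get_close_matches looks for similar strings in df2.
This works well when every row of the 'CandidateName' column has a value, but I get IndexError: list index out of range when a cell is missing.
I tried to work around this by filling the empty cells with the string 'EMPTY', but the same error still occurs.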
# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
# Data Frame Samples
import pandas as pd

# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
#   CandidateName             District  Party     State
#0  Theodorick A. Bland       9                   VA
#1  Aedanus Rutherford Burke  2                   SC
#2  Jason Lewis               2                   MN
#3  Barbara Comstock          10        Democrat  VA
#4  Theodorick Bland          9                   VA
#5  Aedanus Burke             2                   SC
#6  Jason Initial Lewis       2         Democrat  MN
#7  ''                        1         Whig      NH
#8  ''                        1         Whig      NH
# Data Frame 2
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
#   Name              District  Party     State
#0  Theodorick Bland  9                   VA
#1  Aedanus Burke     2                   SC
#2  Jason Lewis       2         Democrat  MN
#3  Barbara Comstock  10        Democrat  VA
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
Expected:
print(df1)
#   CandidateName             State  District  Party     Name
#0  Theodorick A. Bland       VA     9                   Theodorick Bland
#1  Aedanus Rutherford Burke  SC     2                   Aedanus Burke
#2  Jason Lewis               MN     2                   Jason Lewis
#3  Barbara Comstock          VA     10        Democrat  Barbara Comstock
#4  Theodorick Bland          VA     9                   Theodorick Bland
#5  Aedanus Burke             SC     2                   Aedanus Burke
#6  Jason Initial Lewis       MN     2         Democrat  Jason Lewis
#7                            NH     1         Whig
#8                            NH     1         Whig
Actual result (error):
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
IndexError: list index out of range
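Best answer
difflib.get_close_matches returns a list. When it finds no close match, as happens for the empty cells, that list is empty and has no index 0, which is why you get this error. Filling the cells with 'EMPTY' does not help, because 'EMPTY' has no close match in df2['Name'] either. Secondly, we need to convert these lists to strings to be able to do the merge, as follows:
Note: you do not need df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')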
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'])))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')
print(df_merge)
   CandidateName             State  District  Party     Name
0  Theodorick A. Bland       VA     9                   Theodorick Bland
1  Aedanus Rutherford Burke  SC     2                   Aedanus Burke
2  Jason Lewis               MN     2                   Jason Lewis
3  Barbara Comstock          VA     10        Democrat  Barbara Comstock
4  Theodorick Bland          VA     9                   Theodorick Bland
5  Aedanus Burke             SC     2                   Aedanus Burke
6  Jason Initial Lewis       MN     2         Democrat  Jason Lewis
7                            NH     1         Whig
8                            NH     1         Whig
Note that I added the how='left' argument to the merge, because you want to keep the shape of the original dataframe.
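An alternative to joining the list is to take the first match explicitly and fall back to an empty string when there is none. The sketch below assumes the same names as in df2; first_close_match is a hypothetical helper name, not part of difflib, and the cutoff of 0.6 is simply the default of get_close_matches.

```python
import difflib

def first_close_match(name, candidates, cutoff=0.6):
    """Return the best close match for name, or '' when there is none.

    first_close_match is a hypothetical helper, not part of difflib.
    """
    # n=1 asks difflib for the single best match only.
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    # An empty list means no candidate reached the cutoff.
    return matches[0] if matches else ''

# Same names as df2['Name'] in the question.
names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']

print(difflib.get_close_matches('', names))       # []
print(difflib.get_close_matches('EMPTY', names))  # []
print(first_close_match('Theodorick A. Bland', names))  # Theodorick Bland
```

With the dataframes from the question, this could be applied as df1['Name'] = df1['CandidateName'].apply(lambda x: first_close_match(x, df2['Name'].tolist())) before the merge.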
Explanation of ''.join()
We do this to convert the list to a string; see the example:
lst = ['hello', 'world']
print(' '.join(lst))
# prints: hello world
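One caveat worth keeping in mind: ''.join() concatenates every element of the list, so it only yields a clean key when get_close_matches returns zero or one match. The sketch below illustrates the three cases; the list with two matches is a made-up example, not data from the question.

```python
import difflib

# Same names as df2['Name'] in the question.
names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']

# No match: joining an empty list gives an empty string instead of an error.
joined_empty = ''.join(difflib.get_close_matches('', names))

# Exactly one match: joining gives that match unchanged.
joined_one = ''.join(difflib.get_close_matches('Jason Initial Lewis', names))

# Several matches (hypothetical list): the strings are glued together,
# which would not equal any single name in df2.
joined_many = ''.join(['Jason Lewis', 'Jason Lewis Jr.'])

print(repr(joined_empty))  # ''
print(joined_one)          # Jason Lewis
print(joined_many)         # Jason LewisJason Lewis Jr.
```

If several close matches are possible in your data, passing n=1 to get_close_matches limits the result to the best match.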
Source: Stack Overflow question "How to use difflib to create an artificial key column to merge two datasets when the column of interest has missing cells?", https://stackoverflow.com/questions/55445922/