python - Adding new values to a DataFrame with apply() when there are multiple matches

Tags: python pandas dataframe

I have a DataFrame df as follows:

import pandas as pd

d = {'letter_num': ['Nr. 1', 'Nr. 2', 'Nr. 3', 'Nr. 3']}

df = pd.DataFrame(d)
print(df)

   letter_num
0         Nr. 1
1         Nr. 2
2         Nr. 3
3         Nr. 3

letters = pd.DataFrame(d, columns=['letter_num'])

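I want to add the keys and values of the following dictionary to the DataFrame above as new columns, on the condition that the number in each key matches the number in the existing value of the letter_num column in df.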

labels = {'[1]': 'budget', '[2]': 'budget', '[3 a]': 'expensive', '[3 b]': 'sport'}


import re

def apply_and_concat(dataframe, field, func, column_names):
    # Apply func to each cell of `field` and append the results as new columns.
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

def matcher(k):
    for i, j in labels.items():
        num = re.search(r'(\d+)', i).group()
        if num in k.split(' '):
            return i, j

apply_and_concat(df, 'letter_num', matcher, ['letters', 'content'])

The code above produces the following output:

 letter_num letters content
0   Nr. 1   [1]     budget
1   Nr. 2   [2]     budget
2   Nr. 3   [3 a]   expensive
3   Nr. 3   [3 a]   expensive


Expected Output:

 letter_num letters content
0   Nr. 1   [1]     budget
1   Nr. 2   [2]     budget
2   Nr. 3   [3 a]   expensive
3   Nr. 3   [3 b]   sport

Can someone help me?

Best answer

Use a slightly different approach: the idea is to create a new DataFrame from labels, extract the numbers into a new Series with Series.str.extract, and add a per-number counter with GroupBy.cumcount.
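As a small illustration of the two building blocks named above, the following sketch runs Series.str.extract and GroupBy.cumcount on the dictionary keys alone. The variable names s, num and counter are chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

# The dictionary keys from the question, as a plain Series.
s = pd.Series(['[1]', '[2]', '[3 a]', '[3 b]'])

# Pull the first run of digits out of each string; expand=False returns a Series.
num = s.str.extract(r'(\d+)', expand=False)

# Number the rows inside each group of equal digits: 0 for the first, 1 for the second, ...
counter = s.groupby(num).cumcount()

print(pd.DataFrame({'value': s, 'num': num, 'counter': counter}))
```

The pair (num, counter) is unique per row, which is what lets the second 'Nr. 3' row be matched with '[3 b]' instead of '[3 a]'.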

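In this solution, the counter and the number are joined together with Series.str.cat and set as the index of both DataFrames, so DataFrame.join can be used at the end: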

d = {'letter_num' :['Nr. 1', 'Nr. 2', 'Nr. 3', 'Nr. 3']}

letters = pd.DataFrame(d, columns=['letter_num'])

labels = {'[1]': 'budget', '[2]': 'budget', '[3 a]': 'expensive', '[3 b]': 'sport'}

# Build the frame from a list of (key, value) pairs so the dictionary order is kept.
df1 = pd.DataFrame(list(labels.items()), columns=['letters','content'])
num = df1['letters'].str.extract(r'(\d+)', expand=False)
df1.index = df1.groupby(num).cumcount().astype(str).str.cat(num, sep='|')
print (df1)
    letters    content
0|1     [1]     budget
0|2     [2]     budget
0|3   [3 a]  expensive
1|3   [3 b]      sport
df = pd.DataFrame(d)

num = df['letter_num'].str.extract(r'(\d+)', expand=False)
df.index = df.groupby(num).cumcount().astype(str).str.cat(num, sep='|')
print (df)

    letter_num
0|1      Nr. 1
0|2      Nr. 2
0|3      Nr. 3
1|3      Nr. 3
df = df.join(df1).reset_index(drop=True)
print (df)
  letter_num letters    content
0      Nr. 1     [1]     budget
1      Nr. 2     [2]     budget
2      Nr. 3   [3 a]  expensive
3      Nr. 3   [3 b]      sport

Alternatively, create new columns and use DataFrame.merge with a left join:

d = {'letter_num' :['Nr. 1', 'Nr. 2', 'Nr. 3', 'Nr. 3']}

letters = pd.DataFrame(d, columns=['letter_num'])

labels = {'[1]': 'budget', '[2]': 'budget', '[3 a]': 'expensive', '[3 b]': 'sport'}

# Build the frame from a list of (key, value) pairs so the dictionary order is kept.
df1 = pd.DataFrame(list(labels.items()), columns=['letters','content'])
df1['num'] = df1['letters'].str.extract(r'(\d+)', expand=False)
df1['g'] = df1.groupby('num').cumcount()
print (df1)
  letters    content num  g
0     [1]     budget   1  0
1     [2]     budget   2  0
2   [3 a]  expensive   3  0
3   [3 b]      sport   3  1
df = pd.DataFrame(d)
#print (df)

df['num'] = df['letter_num'].str.extract(r'(\d+)', expand=False)
df['g'] = df.groupby('num').cumcount()
print (df)
  letter_num num  g
0      Nr. 1   1  0
1      Nr. 2   2  0
2      Nr. 3   3  0
3      Nr. 3   3  1
df = df.merge(df1, on=['num','g'], how='left')
print (df)
  letter_num num  g letters    content
0      Nr. 1   1  0     [1]     budget
1      Nr. 2   2  0     [2]     budget
2      Nr. 3   3  0   [3 a]  expensive
3      Nr. 3   3  1   [3 b]      sport

df = df.drop(['num','g'], axis=1)
print (df)
  letter_num letters    content
0      Nr. 1     [1]     budget
1      Nr. 2     [2]     budget
2      Nr. 3   [3 a]  expensive
3      Nr. 3   [3 b]      sport
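
The merge steps above could be wrapped in one function. This is only a sketch: add_labels and its field parameter are hypothetical names introduced here, not part of the original answer. The sample data uses three 'Nr. 3' rows, an assumption made to show that a left join leaves NaN where no label remains.

```python
import pandas as pd

def add_labels(df, labels, field='letter_num'):
    """Hypothetical helper: pair the n-th row for a number with the n-th label for it."""
    # Labels side: one row per dictionary item, plus the number and a per-number counter.
    df1 = pd.DataFrame(list(labels.items()), columns=['letters', 'content'])
    df1['num'] = df1['letters'].str.extract(r'(\d+)', expand=False)
    df1['g'] = df1.groupby('num').cumcount()

    # Data side: the same two helper columns, built on a copy so df is not modified.
    out = df.copy()
    out['num'] = out[field].str.extract(r'(\d+)', expand=False)
    out['g'] = out.groupby('num').cumcount()

    # Left join keeps every row of df; rows without a matching label get NaN.
    return out.merge(df1, on=['num', 'g'], how='left').drop(columns=['num', 'g'])

labels = {'[1]': 'budget', '[2]': 'budget', '[3 a]': 'expensive', '[3 b]': 'sport'}
df = pd.DataFrame({'letter_num': ['Nr. 1', 'Nr. 3', 'Nr. 3', 'Nr. 3']})
result = add_labels(df, labels)
print(result)
```

Here the third 'Nr. 3' row has no third label for the number 3, so its letters and content values come back as NaN.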

Regarding python - adding new values to a DataFrame with apply() when there are multiple matches, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58444231/
