python - How do I detect the first word and include it in a string replacement in Python?

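I want to read a column in which each row starts with the quarter and year the survey was conducted, followed by the survey name. Initially I renamed the survey names while keeping the quarter and year unchanged throughout the column. However, my code matches the whole string, so if I run this script against a file from another quarter, no row will be matched and the script will not work.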

My example:

        Survey Name
0       Q321 Your Voice - Information Tech
1       Q321 Your Voice - Information Tech
2       Q321 Your Voice - Information Tech
3       Q321 Your Voice - Information Tech
4       Q321 Your Voice - Information Tech
                
9630    Q321 Your Voice - Business Group
9631    Q321 Your Voice - Business Group

(Q321 = Q3 of 2021)

What my code turns it into:

Survey Name
0       Q321 YV - IT
1       Q321 YV - IT
2       Q321 YV - IT
3       Q321 YV - IT
4       Q321 YV - IT
                
9630    Q321 YV - BG
9631    Q321 YV - BG

The code I used:

print(df.loc[:, "Survey.Name"])

# Isolate the column of interest and replace the commonly incorrect string with the correct output

df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Information Tech'), 'Survey.Name'] = \
    'Q321 YV - IT'

df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Business Group'), 'Survey.Name'] = \
    'Q321 YV - BG'

df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Study Group'), 'Survey.Name'] = \
    'Q321 YV - SG'
        
    
print(df.loc[:, "Survey.Name"])

But suppose I run this script against a file from a different quarter (for example, Q4 of 2021):

Survey Name
0       Q421 Your Voice - Information Tech
1       Q421 Your Voice - Information Tech
2       Q421 Your Voice - Information Tech
3       Q421 Your Voice - Information Tech
4       Q421 Your Voice - Information Tech

9630    Q421 Your Voice - Business Group
9631    Q421 Your Voice - Business Group

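I would have to change the script every time a new quarter comes along. Is there a way to "detect" the first word (which, luckily, is always the survey's quarter and year) and carry it into the converted value, while still replacing the strings in that column that need to change?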

Best answer

One approach, possibly overcomplicated, is to use a regular expression with capture groups, as follows:

res = df["Survey Name"].str.replace(r"(Q\d+)\s+(\w)\w+ (\w)\w+ - (\w)\w+ (\w)\w+", r"\1 \2\3 - \4\5", regex=True)
print(res)

Output

0    Q321 YV - IT
1    Q321 YV - IT
2    Q321 YV - IT
3    Q321 YV - IT
4    Q321 YV - IT
5    Q321 YV - BG
6    Q321 YV - BG
Name: Survey Name, dtype: object

Note that the regex pattern captures the first word and the first letter of each remaining word.
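To see why this no longer depends on the quarter, the same pattern can be checked outside pandas with the standard library's re.sub. This is only a sketch: the helper name abbreviate_regex and the sample strings are made up for illustration, and the pattern still assumes exactly two words on each side of the hyphen.

```python
import re

# Same pattern as in the answer: group 1 is the quarter tag (Q followed by
# digits), groups 2-5 are the first letters of the four remaining words.
PATTERN = re.compile(r"(Q\d+)\s+(\w)\w+ (\w)\w+ - (\w)\w+ (\w)\w+")


def abbreviate_regex(name):
    # Rebuild the string from the captured quarter and initials.
    return PATTERN.sub(r"\1 \2\3 - \4\5", name)


samples = [
    "Q321 Your Voice - Information Tech",
    "Q421 Your Voice - Business Group",
]
results = [abbreviate_regex(s) for s in samples]
print(results)  # ['Q321 YV - IT', 'Q421 YV - BG']
```

Because the quarter is captured rather than hard-coded, Q321 and Q421 rows go through the same code. A name with a different number of words would only be partly abbreviated, since anything after the matched part is left as it is.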

Another alternative is to use apply with a replacement function:

def repl(x):
    head, tail = x.split("-")
    quarter, *chunk = head.split()

    head_initials = "".join(c[0] for c in chunk)
    tail_initials = "".join(c[0] for c in tail.split())

    return f"{quarter} {head_initials} - {tail_initials}"


res = df["Survey Name"].apply(repl)

Output

0    Q321 YV - IT
1    Q321 YV - IT
2    Q321 YV - IT
3    Q321 YV - IT
4    Q321 YV - IT
5    Q321 YV - BG
6    Q321 YV - BG
Name: Survey Name, dtype: object
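The split-based function is not tied to a fixed word count, which can be checked on plain strings. The variant below is a sketch rather than the answer's exact code: it splits once on " - " (an assumption about the separator) so that a later hyphen in the name does not break the unpacking, and the name abbreviate_split is hypothetical.

```python
def abbreviate_split(name):
    # Split once on the " - " separator; any later hyphen stays in the tail.
    head, tail = name.split(" - ", 1)
    # The first word of the head is the quarter tag, the rest are name words.
    quarter, *words = head.split()

    head_initials = "".join(w[0] for w in words)
    tail_initials = "".join(w[0] for w in tail.split())

    return f"{quarter} {head_initials} - {tail_initials}"


print(abbreviate_split("Q421 Your Voice - Information Tech"))      # Q421 YV - IT
print(abbreviate_split("Q122 Your Voice - Human Resources Team"))  # Q122 YV - HRT
```

A value without the " - " separator raises ValueError here, so such rows would need to be filtered out or handled before calling it through apply.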

Update

A more general way to replace the parts is:

replacements = {
    "Your Voice - Information Tech": "YV - IT Group",
    "Your Voice - Business Group": "YV - BG",
    "Your Voice - Human Resources": "YV - LRECS"
}


def repl(match, repls=replacements):
    quarter = match.group(1)
    key = " ".join(match.group(2).strip().split())

    return f"{quarter} {repls.get(key, '')}"


res = df["Survey Name"].str.replace(r"(Q\d+)\s+(.+)", repl, regex=True)
print(res)

Output

0    Q321 YV - IT Group
1    Q321 YV - IT Group
2    Q321 YV - IT Group
3    Q321 YV - IT Group
4    Q321 YV - IT Group
5          Q321 YV - BG
6       Q321 YV - LRECS
Name: Survey Name, dtype: object

Note that replacements is a dictionary whose keys are the strings expected to be found and whose values are the corresponding replacements.

The data for the outputs above comes from:

{'Survey Name': {0: 'Q321 Your Voice - Information Tech',
                 1: 'Q321 Your Voice - Information Tech',
                 2: 'Q321 Your Voice - Information Tech',
                 3: 'Q321 Your Voice - Information Tech',
                 4: 'Q321 Your Voice - Information Tech',
                 5: 'Q321 Your Voice - Business Group',
                 6: 'Q321 Your Voice - Human Resources'}}
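One thing to watch with the dictionary approach is a survey name that has no entry in the mapping: with an empty-string default, that name is dropped and only the quarter remains. The sketch below shows one possible alternative, falling back to the original name. It uses re.sub on plain strings instead of pandas, and the mapping, the rename callback and the shorten helper are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical mapping; keys are survey names without the quarter tag.
REPLACEMENTS = {
    "Your Voice - Information Tech": "YV - IT",
    "Your Voice - Business Group": "YV - BG",
}


def rename(match, repls=REPLACEMENTS):
    quarter = match.group(1)
    # Collapse repeated whitespace so the key matches the dictionary entries.
    key = " ".join(match.group(2).split())
    # Assumption: keep the original name when no replacement is defined.
    return f"{quarter} {repls.get(key, key)}"


def shorten(name):
    # Group 1 is the quarter tag, group 2 is everything after it.
    return re.sub(r"(Q\d+)\s+(.+)", rename, name)


print(shorten("Q421 Your Voice - Business Group"))  # Q421 YV - BG
print(shorten("Q421 Your Voice - Study Group"))     # Q421 Your Voice - Study Group
```

The same callback can be passed to Series.str.replace with regex=True, as in the answer, since pandas accepts a callable replacement there.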

Regarding "python - How do I detect the first word and include it in a string replacement in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73137575/
