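I want to read a column in which the first word of each row is the quarter and year the survey was conducted, followed by the survey name. Initially I tried renaming the survey names while keeping the quarter and year unchanged throughout the column, but if I run this script against a file from another quarter, none of the rows are matched and my script does not work.
My example: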
Survey Name
0 Q321 Your Voice - Information Tech
1 Q321 Your Voice - Information Tech
2 Q321 Your Voice - Information Tech
3 Q321 Your Voice - Information Tech
4 Q321 Your Voice - Information Tech
9630 Q321 Your Voice - Business Group
9631 Q321 Your Voice - Business Group
(Q321 = Q3 2021)
What my code converts it to:
Survey Name
0 Q321 YV - IT
1 Q321 YV - IT
2 Q321 YV - IT
3 Q321 YV - IT
4 Q321 YV - IT
9630 Q321 YV - BG
9631 Q321 YV - BG
The code I used:
print(df.loc[:, "Survey.Name"])
# Isolate the column of interest and replace commonly incorrect strings with the correct output
df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Information Tech'), 'Survey.Name'] = \
'Q321 YV - IT'
df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Business Group'), 'Survey.Name'] = \
'Q321 YV - BG'
df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Study Group'), 'Survey.Name'] = \
'Q321 YV - SG'
print(df.loc[:, "Survey.Name"])
But suppose I run this script against a file for a different quarter (for example, Q4 2021):
Survey Name
0 Q421 Your Voice - Information Tech
1 Q421 Your Voice - Information Tech
2 Q421 Your Voice - Information Tech
3 Q421 Your Voice - Information Tech
4 Q421 Your Voice - Information Tech
9630 Q421 Your Voice - Business Group
9631 Q421 Your Voice - Business Group
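I would have to change the script every time a new quarter comes in. Is there a way to "detect" the first word (which, fortunately, always happens to be the quarter and year of the survey) and include it in the converted version, while still replacing the strings in that column that need to change?
Best answer
One approach, perhaps overly complicated, is to use a regular expression with capture groups, as follows: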
res = df["Survey Name"].str.replace(r"(Q\d+)\s+(\w)\w+ (\w)\w+ - (\w)\w+ (\w)\w+", r"\1 \2\3 - \4\5", regex=True)
print(res)
Output
0 Q321 YV - IT
1 Q321 YV - IT
2 Q321 YV - IT
3 Q321 YV - IT
4 Q321 YV - IT
5 Q321 YV - BG
6 Q321 YV - BG
Name: Survey Name, dtype: object
Note that the regex pattern captures the first word (the quarter token, such as Q321) and the first letter of each remaining word.
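To see that the capture groups carry the quarter through unchanged, here is a minimal sketch that applies the same pattern to plain strings with the standard `re` module instead of a DataFrame. The helper name `abbreviate` and the sample strings are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Same pattern as in the answer: group 1 is the quarter token,
# groups 2-5 are the first letters of the four remaining words.
PATTERN = re.compile(r"(Q\d+)\s+(\w)\w+ (\w)\w+ - (\w)\w+ (\w)\w+")


def abbreviate(name: str) -> str:
    """Hypothetical helper: keep the quarter, reduce every other word to its initial."""
    return PATTERN.sub(r"\1 \2\3 - \4\5", name)


samples = [
    "Q321 Your Voice - Information Tech",
    "Q421 Your Voice - Business Group",
]
for sample in samples:
    print(abbreviate(sample))
```

Because the quarter is matched by `Q\d+` rather than a literal `Q321`, the same call works for any quarter. A string that does not fit the two-words, hyphen, two-words shape is returned unchanged.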
An alternative is to use apply with a replacement function:
def repl(x):
    head, tail = x.split("-")
    quarter, *chunk = head.split()
    head_initials = "".join(c[0] for c in chunk)
    tail_initials = "".join(c[0] for c in tail.split())
    return f"{quarter} {head_initials} - {tail_initials}"

res = df["Survey Name"].apply(repl)
Output
0 Q321 YV - IT
1 Q321 YV - IT
2 Q321 YV - IT
3 Q321 YV - IT
4 Q321 YV - IT
5 Q321 YV - BG
6 Q321 YV - BG
Name: Survey Name, dtype: object
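The split-based function above assumes every name contains exactly one hyphen; with zero or several hyphens, the two-way unpacking of `x.split("-")` raises a `ValueError`. The following is a sketch of a more tolerant variant, under the assumption that names without a " - " separator should simply be left as they are. The name `abbreviate` and the sample inputs are hypothetical.

```python
def abbreviate(name: str) -> str:
    """Hypothetical tolerant variant of the split-based replacement."""
    # Split on the first " - " only, so hyphens inside words are preserved.
    head, sep, tail = name.partition(" - ")
    parts = head.split()
    if not sep or not parts:
        # No separator (or nothing before it): leave the name unchanged.
        return name
    quarter, *words = parts
    head_initials = "".join(word[0] for word in words)
    tail_initials = "".join(word[0] for word in tail.split())
    return f"{quarter} {head_initials} - {tail_initials}"


for sample in [
    "Q421 Your Voice - Information Tech",
    "Q421 Your Voice - Co-op Students",
    "Q421 Pulse",
]:
    print(abbreviate(sample))
```

It can be used the same way as the original function, for example `df["Survey Name"].apply(abbreviate)`.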
Update
A more general way to replace the parts is:
replacements = {
    "Your Voice - Information Tech": "YV - IT Group",
    "Your Voice - Business Group": "YV - BG",
    "Your Voice - Human Resources": "YV - LRECS",
}

def repl(match, repls=replacements):
    quarter = match.group(1)  # e.g. "Q321"
    # Normalize whitespace so the remainder can be looked up as a dictionary key
    key = " ".join(match.group(2).strip().split())
    # Names without a mapping are replaced by an empty string
    return f"{quarter} {repls.get(key, '')}"

res = df["Survey Name"].str.replace(r"(Q\d+)\s+(.+)", repl, regex=True)
print(res)
Output
0 Q321 YV - IT Group
1 Q321 YV - IT Group
2 Q321 YV - IT Group
3 Q321 YV - IT Group
4 Q321 YV - IT Group
5 Q321 YV - BG
6 Q321 YV - LRECS
Name: Survey Name, dtype: object
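Note that replacements is a dictionary whose keys are the strings you expect to find and whose values are the corresponding replacements.
The data used for the output above is: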
{'Survey Name': {0: 'Q321 Your Voice - Information Tech',
1: 'Q321 Your Voice - Information Tech',
2: 'Q321 Your Voice - Information Tech',
3: 'Q321 Your Voice - Information Tech',
4: 'Q321 Your Voice - Information Tech',
5: 'Q321 Your Voice - Business Group',
6: 'Q321 Your Voice - Human Resources'}}
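One consequence of the dictionary-based replacement in the update is that a survey name with no entry in replacements is reduced to just the quarter followed by a space, because the lookup falls back to an empty string. The sketch below shows one possible variation that keeps unmapped names as they are. It uses the standard `re` module on plain strings so it runs without a DataFrame; the function name `repl_keep_unknown` and the "Study Group" sample are assumptions for illustration.

```python
import re

replacements = {
    "Your Voice - Information Tech": "YV - IT Group",
    "Your Voice - Business Group": "YV - BG",
    "Your Voice - Human Resources": "YV - LRECS",
}


def repl_keep_unknown(match, repls=replacements):
    """Hypothetical variant: fall back to the original name when unmapped."""
    quarter = match.group(1)
    # Collapse repeated whitespace so the lookup key matches the dictionary.
    key = " ".join(match.group(2).split())
    return f"{quarter} {repls.get(key, key)}"


pattern = re.compile(r"(Q\d+)\s+(.+)")
for name in [
    "Q421 Your Voice - Information Tech",
    "Q421 Your Voice - Study Group",  # no entry in replacements
]:
    print(pattern.sub(repl_keep_unknown, name))
```

pandas passes a regex match object to a callable replacement, just as `re.sub` does, so the same function should also work as the second argument of `df["Survey Name"].str.replace(r"(Q\d+)\s+(.+)", repl_keep_unknown, regex=True)`.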
Regarding "python - How to detect the first word and include it in a string replacement line in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73137575/