I have the following df and function (see below). I may be overcomplicating this. A fresh pair of eyes would be much appreciated.
df:
Site Name  Plan Unique ID  Atlas Placement ID
Affectv    we11080301      11087207850894
Mashable   we14880202      11087208009031
Alphr      uk10790301      11087208005229
Alphr      uk19350201      11087208005228
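The goal is:
First, search df['Plan Unique ID'] for a specific value (we_match or uk_match). If there is a match, check whether the string value is greater than a threshold for that group (we12720203 or uk11350200). If it is greater, add the we or uk value to a new column, df['Consolidated ID']. If the value is smaller or there is no match, search df['Atlas Placement ID'] with new_id_search. If there is a match, add it to df['Consolidated ID']; if not, return 0 to df['Consolidated ID'].
The current problem is that it returns an empty column.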
def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):
    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match > "we12720203":
                return we_match.group(0)
            else:
                uk_match = re.search(uk_search, df['Plan Unique ID'])
                if uk_match:
                    if uk_match > "uk11350200":
                        return uk_match.group(0)
                    else:
                        match_new = re.search(new_id_search, df['Atlas Placement ID'])
                        if match_new:
                            return match_new.group(0)
                        return 0

mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)
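One likely contributor to the problem is that the function above compares the match object itself (we_match, uk_match) with a string, rather than the matched text. The short sketch below, using one sample ID from the table, shows the difference under Python 3; the variable names m and compared are only for illustration.

```python
import re

# Sample value taken from the 'Plan Unique ID' column above.
m = re.search(r"we\d{8}", "we14880202")

# The matched text is a plain string, so it compares lexicographically.
print(m.group(0) > "we12720203")  # True

# The match object itself has no ordering against a string in Python 3.
try:
    m > "we12720203"
    compared = True
except TypeError:
    compared = False
print(compared)  # False
```

Separately, any row that falls through every branch without reaching a return statement yields None, which shows up as an empty cell in the new column.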
Edit: cleaned up the formula
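I modified gzl's function in the following way (see below): first check whether there are 14 digits in df1. If there are, add them.
The next step, ideally, is to take the column MediaPlanUnique from df2 and turn it into a series, filtered_placements: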
we11080301
we12880304
we14880202
uk19350201
uk11560205
uk11560305
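and check whether any value in filtered_placements exists in df['Plan Unique ID']. If there is a match, add df['Plan Unique ID'] to our final column, df['ConsolidatedID'].
The current problem is that the result is all 0s. I think this is because the comparison is done one-to-one (first result of new_match vs first result of filtered_placements) rather than one-to-many (first result of new_match vs all results of filtered_placements).
Any ideas?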
def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search= "(\d{14})"):
    if type(df['PlacementID']) is str:
        old_match = re.search(old_id_search, df['PlacementID'])
        if old_match:
            return old_match.group(0)
        else:
            if type(df['Plan Unique ID']) is str:
                if type(filtered_placements) is str:
                    new_match = re.search(new_id_search, df['Plan Unique ID'])
                    if new_match:
                        if filtered_placements.str.contains(new_match.group(0)):
                            return new_match.group(0)
    return 0

mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)
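A possible way to get the one-to-many check described above is a plain membership test against a set built from filtered_placements. In the edited function, `type(filtered_placements) is str` is never true for a Series, so that branch is skipped, and using a Series directly in an `if` would raise a ValueError anyway. The sketch below is only an illustration: the parameter name row, the helper name allowed_ids, and the sample rows are assumptions, not part of the original data.

```python
import re

import pandas as pd

# Values copied from the filtered_placements listing above.
filtered_placements = pd.Series(
    ["we11080301", "we12880304", "we14880202",
     "uk19350201", "uk11560205", "uk11560305"]
)
# A set gives a one-to-many membership test for each row.
allowed_ids = set(filtered_placements)


def placement_extract(row, new_id_search=r"[a-zA-Z]{2}\d{8}", old_id_search=r"\d{14}"):
    # Prefer a 14-digit ID from PlacementID when one is present.
    old_match = re.search(old_id_search, str(row["PlacementID"]))
    if old_match:
        return old_match.group(0)
    # Otherwise accept the plan ID only if it appears in filtered_placements.
    new_match = re.search(new_id_search, str(row["Plan Unique ID"]))
    if new_match and new_match.group(0) in allowed_ids:
        return new_match.group(0)
    return 0


# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "Plan Unique ID": ["we11080301", "uk10790301", "we14880202"],
    "PlacementID": ["n/a", "n/a", "11087208009031"],
})
sample["ConsolidatedID"] = sample.apply(placement_extract, axis=1)
print(sample["ConsolidatedID"].tolist())
# ['we11080301', 0, '11087208009031']
```

Wrapping the cell values in str() is a design choice here so that missing values do not break re.search; they simply fail to match.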
Best answer
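I would recommend against using such complicated nested if statements. As Phil pointed out, each check is mutually exclusive. So you can check "we" and "uk" in if statements at the same indentation level, then fall back to the default flow.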
import re

def placement_extract(df="mediaplan_df", we_search=r"we\d{8}", uk_search=r"uk\d{8}", new_id_search=r"(\d{14})"):
    # With apply(..., axis=1), df is a single row of the DataFrame.
    if isinstance(df['Plan Unique ID'], str):
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            # Compare the matched text, not the match object.
            if we_match.group(0) > "we12720203":
                return we_match.group(0)
        uk_match = re.search(uk_search, df['Plan Unique ID'])
        if uk_match:
            if uk_match.group(0) > "uk11350200":
                return uk_match.group(0)
    # Default flow: fall back to the 14-digit Atlas Placement ID, then to 0.
    match_new = re.search(new_id_search, df['Atlas Placement ID'])
    if match_new:
        return match_new.group(0)
    return 0
Test:
In [37]: df.apply(placement_extract, axis=1)
Out[37]:
0    11087207850894
1        we14880202
2    11087208005229
3        uk19350201
dtype: object
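As an alternative to the row-wise apply, the same rules could probably be expressed with pandas string methods. The following is a minimal sketch, not part of the answer itself: it assumes both ID columns hold strings, rebuilds the sample df from the question, and uses intermediate names (plan, we, uk, atlas, consolidated) chosen only for illustration.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Site Name": ["Affectv", "Mashable", "Alphr", "Alphr"],
    "Plan Unique ID": ["we11080301", "we14880202", "uk10790301", "uk19350201"],
    "Atlas Placement ID": ["11087207850894", "11087208009031",
                           "11087208005229", "11087208005228"],
})

plan = df["Plan Unique ID"].astype(str)
we = plan.str.extract(r"(we\d{8})", expand=False)
uk = plan.str.extract(r"(uk\d{8})", expand=False)
atlas = df["Atlas Placement ID"].astype(str).str.extract(r"(\d{14})", expand=False)

# Keep an ID only when it sorts above its threshold; missing values
# are treated as "" so they never pass the comparison.
we = we.where(we.fillna("") > "we12720203")
uk = uk.where(uk.fillna("") > "uk11350200")

# Priority order: we, then uk, then the Atlas ID, then 0.
consolidated = we.fillna(uk).fillna(atlas)
df["Consolidated ID"] = [0 if pd.isna(v) else v for v in consolidated]
print(df["Consolidated ID"].tolist())
# ['11087207850894', 'we14880202', '11087208005229', 'uk19350201']
```

On the four sample rows this gives the same values as the test output above.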
Regarding python - Conditional regex function on a Pandas DataFrame, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43210179/