I have a sample CSV like this:
keys key_regex datatype detailed_datatype precedence val_regex val_regex_2 val_regex_3 max_words alpha_char_check
0 billingAddress original_billing_key_regex alphabetic address primary NaN NaN NaN NaN NaN
1 deliveryAddress original_delivery_key_regex alphabetic address primary NaN NaN NaN NaN NaN
2 notifyParty original_notify_party_regex alphabetic alphabetic primary NaN NaN NaN NaN NaN
3 originAddress original_seller_address_regex alphabetic address primary NaN NaN NaN NaN NaN
4 billingAddressAlt alternative_billing_key_regex alphabetic address tertiary NaN NaN NaN NaN NaN
5 deliveryAddressAlt alternative_delivery_key_regex alphabetic address tertiary NaN NaN NaN 5.0 1.0
6 originAddressAlt alternative_seller_key_regex alphabetic address tertiary NaN sample_val_re1 NaN NaN 0.0
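For every row whose keys value is a key in tertiary_row_replacement_dict, I am trying to replace that row with the row whose keys value is the corresponding dictionary value, and then rename the precedence column value from 'tertiary' to 'primary', while keeping the index positions the same as before.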
The expected output looks like this:
keys key_regex datatype detailed_datatype precedence val_regex val_regex_2 val_regex_3 max_words alpha_char_check
0 billingAddress alternative_billing_key_regex alphabetic address primary NaN NaN NaN NaN NaN
1 deliveryAddress alternative_delivery_key_regex alphabetic address primary NaN NaN NaN 5.0 1.0
2 notifyParty original_notify_party_regex alphabetic alphabetic primary NaN NaN NaN NaN NaN
3 originAddress alternative_seller_key_regex alphabetic address primary NaN sample_val_re1 NaN NaN 0.0
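There are 3 original CSVs. Each one is large and contains many similar cases, i.e. a key with primary precedence and an alternative key with tertiary precedence. My dictionary looks like this: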
tertiary_row_replacement_dict = {
    "originAddress": "originAddressAlt",
    "deliveryAddress": "deliveryAddressAlt",
    # "totalAmount": "totalAmountAlt",
    "billingAddress": "billingAddressAlt",
    # ....
}
Assuming the dictionary's keys and their corresponding values are always present in the CSV, I have the following code:
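for k, new_k in tertiary_row_replacement_dict.items():
    # index label of the tertiary (alternative) row
    t2 = df.loc[df['keys'] == new_k].index[0]
    # overwrite the primary row with the alternative row's values, relabelling the precedence
    df.loc[df.loc[df['keys'] == k].index[0]] = [i if i != 'tertiary' else 'primary' for i in df.loc[t2]]
    # restore the primary key name and drop the now-redundant alternative row
    df = df.replace([new_k, 'tertiary'], [k, 'primary']).drop([t2])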
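It does what I want. On the test CSV alone it takes about 0.034 seconds, and it is probably not the best or most optimized way to replace whole rows and then patch cell values. Is there a faster alternative, given that we know in advance which row replaces which? (The dictionary is not mandatory; a list of tuples or a list of lists would be fine if that trades off better for speed.)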
Best answer
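You can use replace to map the tertiary keys onto their primary keys, and groupby().first() to fill in the information (first() takes the first non-null value of each column within a group):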
inverse_dict = {v:k for k,v in tertiary_row_replacement_dict.items()}
(df.groupby(df['keys'].replace(inverse_dict))
.first()
.reset_index(drop=True)
)
Output:
keys key_regex datatype detailed_datatype precedence val_regex val_regex_2 val_regex_3 max_words alpha_char_check
-- --------------- ----------------------------- ---------- ------------------- ------------ ----------- -------------- ------------- ----------- ------------------
0 billingAddress original_billing_key_regex alphabetic address primary nan nan nan nan nan
1 deliveryAddress original_delivery_key_regex alphabetic address primary nan nan nan 5 1
2 notifyParty original_notify_party_regex alphabetic alphabetic primary nan nan nan nan nan
3 originAddress original_seller_address_regex alphabetic address primary nan sample_val_re1 nan nan 0
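Note that the output above keeps the original key_regex values, because first() returns the first non-null value per column and the primary row comes first in each group; the question's expected output has the alternative row's values instead. A minimal sketch of one way to get that result without a Python-level loop follows. The helper name promote_alternatives is hypothetical, the sample frame is a trimmed copy of the question's data, and the code assumes every alternative row appears after its primary row and that each key has at most one alternative.

```python
import numpy as np
import pandas as pd

def promote_alternatives(df, replacement):
    """Replace each primary row with its alternative row (hypothetical helper).

    Assumes alternative rows come after their primary rows in df.
    """
    alt_to_primary = {alt: key for key, alt in replacement.items()}
    out = df.copy()
    is_alt = out["keys"].isin(list(alt_to_primary))
    # Relabel the alternative rows so they look like primary rows.
    out.loc[is_alt, "precedence"] = "primary"
    out.loc[is_alt, "keys"] = out.loc[is_alt, "keys"].map(alt_to_primary)
    # First occurrence of each key fixes the row order and the index labels.
    order = out["keys"].drop_duplicates()
    # Last occurrence of each key is the alternative row when one exists.
    winners = out.drop_duplicates("keys", keep="last")
    result = winners.set_index("keys").loc[order.tolist()].reset_index()[df.columns]
    result.index = order.index
    return result

nan = np.nan
df = pd.DataFrame({
    "keys": ["billingAddress", "deliveryAddress", "notifyParty", "originAddress",
             "billingAddressAlt", "deliveryAddressAlt", "originAddressAlt"],
    "key_regex": ["original_billing_key_regex", "original_delivery_key_regex",
                  "original_notify_party_regex", "original_seller_address_regex",
                  "alternative_billing_key_regex", "alternative_delivery_key_regex",
                  "alternative_seller_key_regex"],
    "precedence": ["primary"] * 4 + ["tertiary"] * 3,
    "val_regex_2": [nan] * 6 + ["sample_val_re1"],
    "max_words": [nan] * 5 + [5.0, nan],
})
tertiary_row_replacement_dict = {
    "originAddress": "originAddressAlt",
    "deliveryAddress": "deliveryAddressAlt",
    "billingAddress": "billingAddressAlt",
}

result = promote_alternatives(df, tertiary_row_replacement_dict)
print(result)
```

Unlike groupby().first(), this keeps each alternative row as a whole, including its NaN cells, which is what the loop in the question does. Whether it is actually faster on the real CSVs would need to be timed.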
Regarding python - replacing a row with another row at a given index of a DataFrame and changing cell values, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62105742/