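I'm struggling with how to take a dataset and produce a result that finds rows where one column is duplicated while another column is not. If column 0 and column 2 are both exact duplicates, I don't care about that set of rows; I only care when a column 0 value has entries with more than one distinct value in column 2. When that is the case, I want all of the rows that match that column 0 value.
I first narrowed the dataset down to the rows that have duplicates by using concat. My problem now is getting only the rows where column 2 differs.
My sample dataset is: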
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF0723AFE8,device1
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF862FAF74,device2
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFF2A8AA38,device3
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFD2C0A2C6,device4
"22334",,Prod_P,Device,"22334",Prod_P,,,,SEPFFFFCF87AB31,device5
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8
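In this set, I want the last three rows, "33333", as the result, because they have more than one kind of value in column 2. "11111" only matches Prod_P, so I don't care about it.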
import pandas as pd
ignorelist = []
inputfile = "pandas-problem-data.txt"
data = pd.read_csv(inputfile)
data.columns = data.columns.str.replace(' ','_')
data = pd.concat(g for _, g in data.groupby("Pattern_or_URI") if len(g) > 1)
data = data.loc[(data["Pattern_Usage"]=="Device"), ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]
new_rows = []
tempdup = pd.DataFrame()
for i, row in data.iterrows():
    if row["Pattern_or_URI"] in ignorelist:
        continue
    ignorelist.append(row["Pattern_or_URI"])
    # testdup = pd.concat(h for _, h in (data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]).groupby("Partition") if len(h) > 1)
    # print(data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])])
    newrow = data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]
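If I uncomment the line that tries to use the same concat to find entries with more than one "Partition", I get the error ValueError: No objects to concatenate. I know it gets through the first iteration, with the print statement uncommented.
Is there an easier or better way to do this? I'm new to Pandas and keep thinking there is probably a way to do this that I haven't figured out yet. Thanks.
Desired output: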
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8
Best answer
I think it is a bit misleading to say that you are looking for duplicates. This is really a grouping problem.
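Seen as a grouping problem, the ValueError from the question has a plausible explanation: the commented-out line keeps partition groups that contain more than one row, not patterns that have more than one partition. A minimal sketch of that failure mode, assuming a hypothetical pattern "44444" (not in the sample data) with exactly one row in each of two partitions:

```python
import pandas as pd

# Hypothetical two-row pattern, one row per partition (not from the sample data).
sub = pd.DataFrame({
    "Pattern_or_URI": ["44444", "44444"],
    "Partition": ["Prod_P", "Dummy_P"],
})

# Same test as the commented-out line: keep partition groups with more than one row.
# Each partition group here has a single row, so nothing survives the test.
kept = [h for _, h in sub.groupby("Partition") if len(h) > 1]

try:
    pd.concat(kept)
    error_message = None
except ValueError as exc:
    # pd.concat refuses an empty collection of objects.
    error_message = str(exc)

print(len(kept), error_message)
```

So the rows that matter most (one pattern spread over several partitions) are exactly the ones that can leave concat with nothing to combine.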
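You want to find the groups of identical values in Pattern or URI that correspond to more than one unique value in the Partition series.
transform + nunique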
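To make the grouping concrete, here is a small sketch that counts the distinct Partition values per pattern. It assumes a trimmed copy of the sample data (two columns only, with the patterns kept as strings), built by hand rather than read from the file:

```python
import pandas as pd

# Trimmed, hand-built copy of the sample data (an assumption for illustration).
df = pd.DataFrame({
    "Pattern or URI": ["11111"] * 4 + ["22334"] + ["33333"] * 3,
    "Partition": ["Prod_P"] * 5 + ["Prod_P", "Dummy_P", "Prod_P"],
})

# One number per pattern: how many distinct partitions it appears in.
counts = df.groupby("Pattern or URI")["Partition"].nunique()
print(counts)

# Patterns that appear in more than one partition.
mixed = counts[counts > 1].index.tolist()
print(mixed)
```

Only "33333" appears in more than one partition, which matches the rows the question asks for.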
s = df.groupby('Pattern or URI')['Partition'].transform('nunique').gt(1)
df.loc[s]
Pattern or URI Route Filter Clause Partition Pattern Usage Owning Object Owning Object Partition Cluster ID Catalog Name Route String Device Name Device Description
5 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCF87AAEA device6
6 33333 NaN Dummy_P Device 33333 Dummy_P NaN NaN NaN SEPFFFF18FF65A0 device7
7 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCFCCAABB device8
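For anyone who wants to run this without the input file, the following is a self-contained sketch of the same transform + nunique selection, next to an equivalent groupby(...).filter(...) formulation. The filter variant is an alternative, not part of the answer above, and the DataFrame is a hand-built, trimmed copy of the sample data:

```python
import pandas as pd

# Hand-built, trimmed copy of the sample data (an assumption for illustration).
df = pd.DataFrame({
    "Pattern or URI": ["11111"] * 4 + ["22334"] + ["33333"] * 3,
    "Partition": ["Prod_P"] * 5 + ["Prod_P", "Dummy_P", "Prod_P"],
    "Device Description": [f"device{i}" for i in range(1, 9)],
})

# transform broadcasts each group's distinct-partition count back onto its rows,
# so the resulting boolean mask lines up with df row by row.
mask = df.groupby("Pattern or URI")["Partition"].transform("nunique").gt(1)
via_transform = df.loc[mask]

# Alternative: filter keeps or drops whole groups based on a per-group test.
via_filter = df.groupby("Pattern or URI").filter(
    lambda g: g["Partition"].nunique() > 1
)

print(via_transform)
print(via_transform.equals(via_filter))
```

Both forms keep the original row order and index. The transform version avoids calling a Python function once per group, which may matter on data with many distinct patterns.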
Regarding python - finding duplicates in one column alongside non-duplicates in another, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53345345/