python - 从字典中删除重复的 Pandas 数据框

标签 python pandas dictionary duplicates

我有一个包含具有相同列名的 pandas 数据框的字典,我想删除具有相同值和行 ID 的重复数据框。

假设这是我的数据框字典:

>>> dd[0]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[1]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[2]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[3]
              Origin           Destination                Time
1           New York           Los Angeles 2016-03-28 04:00:00
2           Los Angeles             Boston 2016-03-28 06:00:00
3             Boston              New York 2016-03-28 08:00:00
>>> dd[4]
              Origin           Destination                Time
1           New York           Los Angeles 2016-03-28 04:00:00
2           Los Angeles             Boston 2016-03-28 06:00:00
3             Boston              New York 2016-03-28 08:00:00
>>> dd[5]
              Origin           Destination                Time
3             Boston              New York 2016-03-28 08:00:00
4           New York           Los Angeles 2016-03-28 12:00:00
>>> dd[6]
              Origin           Destination                Time
3             Boston              New York 2016-03-28 08:00:00
4           New York           Los Angeles 2016-03-28 12:00:00

我希望结果看起来像这样:

>>> dd[0]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[3]
              Origin           Destination                Time
1           New York           Los Angeles 2016-03-28 04:00:00
2           Los Angeles             Boston 2016-03-28 06:00:00
3             Boston              New York 2016-03-28 08:00:00
>>> dd[5]
              Origin           Destination                Time
3             Boston              New York 2016-03-28 08:00:00
4           New York           Los Angeles 2016-03-28 12:00:00

这是导致上述示例的代码:

# Load data as pandas data frame
data = pd.read_csv("website.txt", names = ["Time", "Origin", `"Destination"])`
data["Time"] = pd.to_datetime(data["Time"], infer_datetime_format=True)
# Reverse data frame by index to loop backwards
data = data.reindex(index=df.index[::-1]) 
dd = {} 
for i, e in reverse.iterrows(): 
    dd[i] = data[ (data['Time'] > e['Time']-pd.Timedelta('4 hours')) & (data['Time'] < e['Time'] + pd.Timedelta('4 hours'))]

原文:

{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}

最佳答案

一个类轮

{k: v.unstack() for k, v in pd.DataFrame({k: v.stack() for k, v in dd.iteritems()}).T.drop_duplicates().iterrows()}

解释版

# iterate through key, value pairs of dictionary,
# stacking each dataframe into a series so that we
# can pass the resulting dataframe into the pd.DataFrame constructor.
df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
# Each column is now one key, value pair from the original dictionary
# Transpose and drop duplicates
df2 = df1.T.drop_duplicates()
# reverse the original stacking and convert back to dictionary
# we could have used df2.T.iteritems() but df2.iterrows() took
# one fewer operations and fewer characters to type.
dd_ = {k: v.unstack() for k, v in df2.iterrows()}

for k, v in dd_.iteritems():
    print 'key {}:'.format(k)
    print v
    print '-' * 10

key 0:
   a  b
0  1  2
1  3  4
----------
key 2:
   a  b
0  2  3
1  4  5
----------

设置以获得与我相同的结果(复制并粘贴)

from StringIO import StringIO
import pandas as pd

text0 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2             Boston           Los Angeles  2016-03-28 06:00:00"""


text1 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2             Boston           Los Angeles  2016-03-28 06:00:00"""

text2 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2           Los Angeles             Boston  2016-03-28 06:00:00"""

dd = {}

dd[0] = pd.read_csv(StringIO(text0), sep='\s{2,}', index_col=0, engine='python')
dd[0].Time = pd.to_datetime(dd[0].Time)

dd[1] = pd.read_csv(StringIO(text1), sep='\s{2,}', index_col=0, engine='python')
dd[1].Time = pd.to_datetime(dd[1].Time)

dd[2] = pd.read_csv(StringIO(text2), sep='\s{2,}', index_col=0, engine='python')
dd[2].Time = pd.to_datetime(dd[2].Time)

# Then run solutions above:

df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
df2 = df1.T.drop_duplicates()
dd_ = {k: v.unstack() for k, v in df2.iterrows()}

for k, v in dd_.iteritems():
    print 'key {}:'.format(k)
    print v
    print '-' * 10

你应该得到这个:

key 0:
     Origin  Destination                 Time
0  New York       Boston  2016-03-28 02:00:00
1  New York  Los Angeles  2016-03-28 04:00:00
2    Boston  Los Angeles  2016-03-28 06:00:00
----------
key 2:
        Origin  Destination                 Time
0     New York       Boston  2016-03-28 02:00:00
1     New York  Los Angeles  2016-03-28 04:00:00
2  Los Angeles       Boston  2016-03-28 06:00:00
----------

版本

import sys
import pandas as pd
print sys.version
print pd.__version__

2.7.11 |Anaconda custom (x86_64)| (default, Dec  6 2015, 18:57:58) 
[GCC 4.2.1 (Apple Inc. build 5577)]
0.18.1

关于python - 从字典中删除重复的 Pandas 数据框,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/38290470/

相关文章:

python - 使用现有实例初始化 super?

python - 为列表的最后一项运行不同的函数

python - 如何在 Python 2 中下载大文件

python - 如何根据其他列中的一个而不是两个条件删除重复项

python - 将字典(键和值)写入 csv 文件

loops - 快速循环遍历字典

python - 根据端点递归划分列表

python - 如何获取包含列表或值的列 pandas 的唯一值?

python - 将多系列词典保存到 Excel

python - 如何将字典转换成csv文件