c++ - 在集合中查找重复元素并将其分组的快速算法是什么？

假设您有一个元素集合，那么如何才能挑选出重复的元素并将它们放入比较最少的每个组中？最好在C++中使用，但是算法比语言更重要。
例如
给定{E1，E2，E3，E4，E4，E2，E6，E4，E3}，我希望提取出{E2，E2}，{E3，E3}，{E4，E4，E4}。
您将选择哪种数据结构和算法？还请包括设置数据结构的成本，例如，是否是像std::multimap这样的预先排序的数据结构

更新

按照建议使事情更清晰。有一个约束条件:元素必须自己比较，以确保它们是重复项。

因此散列不适用，因为实际上它们将比较从重元素(例如，数据块)转移到轻元素(整数)，并减少了一些比较，但并没有消除它们，最后，我们回到了我们原来的问题，当在一个碰撞桶中时。

假设您每个都有一堆潜在的GB重复文件，那么每个人类都知道它们具有相同的哈希值。现在，您将发现真实的重复项。

不，这不是一个现实问题(即使MD5足以为现实文件生成唯一的哈希)。但是只是假装，以便我们可以专注于查找涉及比较最少的数据结构+算法。

我正在做的是

在STL std::list数据结构中表示(其中1)其元素删除比说一个 vector 便宜(2)其插入更便宜，不需要排序。)

弹出一个元素并将其与其余元素进行比较，如果找到重复元素，则将其从列表中拉出。一旦到达列表的末尾，就会找到一组重复项(如果有)。

重复上述两个步骤，直到列表为空。

在最佳情况下，它需要N-1，但是(N-1)!在最坏的情况下。

有什么更好的选择？

我的代码使用上面说明的方法:

// algorithm to consume the std::list container,
// supports: list<path_type>,list< pair<std::string, paths_type::const_iterater>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrinked list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            list<T>::iterator it     = l.begin(); 
            list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 


};

感谢您提供以下答案，但是，在我的示例中，它们似乎误导了整数。实际上元素是不可知类型的(不一定是整数，字符串或任何其他POD)，并且相等谓词是自定义的，即，比较起来可能是很重的。

因此，也许我的问题应该是:使用哪种数据结构+算法需要较少的比较。

根据我的测试，使用像multiset这样的预先排序的容器，multimap并不更好，因为

插入时的排序已经做了比较，

以下相邻发现再次进行比较，

这些数据结构喜欢小于操作而不是相等操作，它们执行小于2的操作(

我看不到它如何保存比较。

以下是一些答案所忽略的另一件事，我需要将重复的组彼此区分开，而不仅仅是将它们保存在容器中。

结论

经过所有讨论，似乎有3种方法

我的原始天真方法，如上所述

从像std::vector这样的线性容器开始，对其进行排序，然后找到相等范围的

以相关的容器(如std::map<Type, vector<duplicates>>)开始，如Charles Bailey所建议的，在建立相关容器期间挑选出重复项。

我已经编写了一个示例，以测试下面发布的所有方法。

重复项的数量以及何时分发可能会影响最佳选择。

方法1在它们掉落在前面时效果最好，而在末端掉落时效果最差。排序不会改变分布，而是尾数。

方法3的平均性能最高

方法2永远不是最佳选择

感谢所有参与讨论的人。

一个输出，包含来自以下代码的20个示例项目。

Test with [ 20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] respectively

using std::vector -> sort() -> adjacent_find():

comparisons: [ '<' = 139, '==' = 23 ]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector -> std::multiset -> adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

Code

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        bool reserve_order = false;

        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort();////std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.empty()) local_map.erase(itr);

            itr++;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef set<Type> SET;
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(20);
    List(true).run(20);
    List(false).run(20);
    Map().run(20);
    Multiset().run(20);
}

最佳答案

您可以使用从代表性元素到其他元素的列表/vector/双端队列的映射。这要求插入容器中的比较相对较少，这意味着您可以遍历结果组，而不必执行任何比较。

此示例始终将第一个代表性元素插入映射的双端队列存储中，因为这从逻辑上简化了通过组的后续迭代，但是如果此重复证明存在问题，则仅执行push_back if (!ins_pair.second)会很容易。

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

这样，遍历各个组就变得相对简单且便宜:

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

我进行了一些比较和对象计数的实验。在以100000个对象以随机顺序形成50000组(即每组平均2个对象)的测试中，上述方法需要进行以下比较和复制:

1630674 comparisons, 443290 copies

(我尝试减少副本数量，但实际上只能以比较为代价，这似乎是您的方案中成本较高的操作。)

使用多图，并在最终迭代中保留前一个元素以检测组转换会为此花费:

1756208 comparisons, 100000 copies

使用单个列表并弹出front元素并对其他组成员执行线性搜索的代价是:

1885879088 comparisons, 100000 copies

是的，与我最好的方法中的〜1.6m相比，这约为1.93b。为了使list方法能够在最佳数量的比较结果附近执行任何操作，必须对它进行排序，这将花费与首先建立固有排序的容器类似的比较次数。

编辑

我拿走了您发布的代码，并在与我之前使用的相同测试数据集上运行了隐式算法(我必须对代码进行一些假设(因为有一些假设的定义))，然后计算:

1885879088 comparisons, 420939 copies

即与我的哑表算法完全相同的比较次数，但副本数量更多。我认为这意味着我们在这种情况下使用了基本相似的算法。我看不到任何其他排序顺序的证据，但是您似乎想要一个包含多个等效元素的组列表。这可以通过添加Iterate子句在我的if (i->size > 1)函数中简单地实现。

我仍然看不到任何证据表明构建排序后的容器(例如此双端队列映射)不是一种好的(即使不是最佳)策略。

关于c++ - 在集合中查找重复元素并将其分组的快速算法是什么？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/1332527/

c++ - 在集合中查找重复元素并将其分组的快速算法是什么？

Code

上一篇：c++ - 静态库的全局变量的静态初始化和销毁不会在 g++ 中发生

下一篇：c++ - "atomic object"的定义

c++ - 在集合中查找重复元素并将其分组的快速算法是什么？

Code

上一篇：c++ - 静态库的全局变量的静态初始化和销毁​​不会在 g++ 中发生

下一篇：c++ - "atomic object"的定义

上一篇：c++ - 静态库的全局变量的静态初始化和销毁不会在 g++ 中发生