algorithm - 用于对时间序列事件进行聚类的不同聚类算法

我有一个非常大的输入文件，格式如下:

ID \t time \t duration \t Description \t status

状态列仅限于包含小写 a,s,i 或大写 A,S,I 或两者的混合(状态列中的示例元素:a,si, I, asi, ASI, aSI ，阿西...)

最终目的是对在“足够接近”时间开始和结束的事件进行聚类，以便认识到这些事件会促成更大的事件。此处足够接近可以由 θ 确定，假设现在是 1小时(或者可能是 2 小时，或更长时间等)。如果两个事件的开始时间在 1 小时内且结束时间在 1 小时内，我们会将它们聚集在一起并说它们是大事件的一部分。

另一件事是我只关心状态中全部小写字母的事件

下面是我的初始输入处理:

I filter the input file so that it only contains rows that have all lower case letters
This task is already done in Python using MapReduce and Hadoop

Then I take the output file and split it into "boxes" where each box represents a time range (ex: 11am-12pm - box1, 12pm-1pm - box2, 1pm-2pm - box 3, etc...)

Then use MapReduce again to sort each box based on start time (ascending order)

The final output is an ordered list of start time

现在我想开发一种算法，将上面的输出中的“类似事件”分组在一起。类似事件由开始和结束时间确定。

我当前的算法是:

pick the first item in the list

find any event in the list that has start time is within 1 hour with first event start time and duration is +/- 1 hour with first item duration (duration determines the end time). 

Then cluster them together (basically I want to cluster events that happen at the same time frame)

If no other event found, move to the next available event (the one that has not been clustered).

Keep doing this until no more events to be clustered.

我不确定我的算法是否有效或者是否有效。我正在尝试做一个比 O (n^2) 更好的算法，因此层次聚类不起作用。 K 均值可能也不起作用，因为我提前不知道需要多少个集群。

可能还有一些其他聚类算法可能更适合我的情况。我想我的算法中可能需要一些方程来计算集群中的距离以确定相似性。我对这个领域还很陌生，因此非常感谢任何指导我走上正确道路的帮助。

提前非常感谢。

最佳答案

你可以尝试DBSCAN基于密度的聚类算法，其时间复杂度为 O(n log n)(仅在使用 kd-tree、ball-tree 等索引数据结构的情况下才保证，否则为 O(n^2))。作为大型事件一部分的事件应该在高密度区域举行。这似乎非常适合您的问题。

您可能需要实现自己的距离函数才能执行邻域查询(以查找最近的事件)。此外，最好以 POSIX 时间格式表示事件时间。

Here就是一个例子。

根据您使用的环境，DBSCAN 实现采用 Java (ELKI)、Python (scikit-learn)、R (fpc) )。

关于algorithm - 用于对时间序列事件进行聚类的不同聚类算法，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/28294530/

algorithm - 用于对时间序列事件进行聚类的不同聚类算法

上一篇：algorithm - 寻找区间图中两个节点之间的最有效路径

下一篇：arrays - 两两相加之和