查找高频率模式后跟一组文本消息的算法

我有大量短信。我想找到这些消息后跟的常见模式(比如 20 种最常见的模式)。示例消息:

msg1 = "Rahul, Your New Delhi (NDLS) - Agra Cantt (AGC) train booking is confirmed.\nPNR: 1234567890\nBooking ID: ABCDE123456789012\nView your Trip Here: https://xyz.app.link/A0b1cDEFGH\nFor any queries please write to some_url.com.\n\nHappy with our service? Rate us 5 stars: https://xyz.app.link/e/5stars"
msg2 = "Shyamu, Your Tenali Jn (TEL) - Secunderabad Jn (SC) train booking is confirmed.\nPNR: 2345678901\nBooking ID: ABCDE123456789011\nView your Trip Here: https://xyz.app.link/Ab0cdEFGHI\nFor any queries please write to some_url.com.\n\nHappy with our service? Rate us 5 stars: https://xyz.app.link/e/5stars"
msg3 = "Ramu, Sorry! Booking for your Jammu Tawi (JAT) - Kurukshetra Jn (KKDE) train with Booking ID: ABCDE123456789013 could not be confirmed due to payment failure.If money has been deducted from your account, your money will automatically be refunded to your account within 5 days.\nRe-book your ticket https://xyz.app.link/a0B1cDEFGH"

您可以看到 msg1 和 msg2 共享相同的模式/模板(见下文)，而 msg3 不同(可能有其他消息与 msg3 共享模式)。我的要求是在我的数据中找到这种高频出现的模板。对于上面的例子，模式/模板是这样的:

"<> Your <> - <> train booking is confirmed.\nPNR: <> ID: <> your Trip Here: <> any queries please write to some_url.com.\n\nHappy with our service? Rate us 5 stars: https://xyz.app.link/e/5stars"

我试过以下:

使用 CountVectorizer 对文本数据进行矢量化。
使用 DBSCAN 聚类查找所有聚类并根据聚类大小进行排序。
对于前 20 个集群:

i) 选择 10 条随机消息。

ii) 使用一些字符串操作找到他们遵循的模式。

上述方法有效，但聚类似乎是瓶颈，需要花费大量时间。 (我的系统上 100000 条消息大约需要 10 分钟)

查找聚类的 Python 函数:

from sklearn.cluster import DBSCAN
import collections

def find_cluster(M_vector):
    # M_vector: Vectorized messages
    dbscan = DBSCAN(eps = 2, min_samples = 5)
    cls = dbscan.fit_predict(M_vector)
    count_dict = collections.Counter(cls)
    count_dict = sorted(count_dict.items(), key = lambda kv:(kv[1], kv[0]), reverse = True)
    return cls, count_dict

我觉得不使用机器学习也可以解决问题，但我不知道如何在更短的时间内取得成果。 DBSCAN 的最坏情况时间复杂度似乎是 O(n^2)(平均 O(nlog(n)))。

我假设使用 Wagner-Fischer 算法会导致更长的时间，因为它会对每条消息和每条其他消息进行计算(O(n^2) 时间复杂度)。

最佳答案

对于此类数据，sklearn 中的索引无济于事，您确实会获得 O(n²) 运行时间。您应该对文本使用倒排索引!您可以更高效地找到类似文档。

然而，为了模板生成的目的，我会使用不同的方法。使用 minhash，一种用于查找几乎重复的网页的技术，但将其参数调整为更加宽容。然后检查非常频繁的哈希值。这些对应于出现在许多文本中的共享词。

这将只有 O(n) 的运行时间。

关于查找高频率模式后跟一组文本消息的算法，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/56682986/

查找高频率模式后跟一组文本消息的算法

上一篇：java - 使用父级对链式对象进行排序

下一篇：algorithm - 如何平衡多个消息队列