algorithm - 如何提取具有一定数量的公共(public)子节点的节点组

我正在解决一个测验，需要一些建议。

测验总结如下:

Analayze data of bookmarking service (like delicious, digg...) and extract the group of urls which has more than two common tags.

each bookmark data contains 1)user-id, 2)url, and 3)an array of tags.

size of all tags are relatively small compared to all urls. that is, people bookmark sites with limited set

All tags assigned to a URL are different

If different users bookmarked the same URL, you should not make group out of them.(However, this is a optional condition. You can just ignore user_id and suppose that all URLs are different.)

示例:

siteA - [tag1, tag2, tag3]
siteB - [tag1, tag2, tag4]
siteC - [tag1, tag3, tag5]
siteD - [tag1, tag2, tag6]

下面两组URL就是结果

(siteA, siteB, siteD), (siteA, siteC)

因为 (siteA, siteB, siteD) 共享两个公共(public)标签 (tag1, tag2) 并且 (siteA, siteC) 也共享两个公共(public)标签 (tag1, tag3)。

-- 条件 3,4 和添加的示例。谢谢@btilly。

我的问题是

如何快速解决(或应用哪种算法)？
是否有任何类型的代表性问题可以用与此问题类似的算法来解决？

最佳答案

我会创建一个新的数据结构，它是按标签，具有该标签的 URL 的散列。

然后对于每一对标签，您可以选择 URL 较少的标签，遍历它们，然后查找它是否在另一个标签中，生成共享该对标签的组。

如果您有 n 标签，每个标签平均有 m 个 url，则需要 O(n * m) 来生成新的数据结构，O(n * n * m) 生成组。

关于algorithm - 如何提取具有一定数量的公共(public)子节点的节点组，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/10949693/

algorithm - 如何提取具有一定数量的公共(public)子节点的节点组

上一篇：algorithm - 模型检查 Paxos

下一篇：c++ - 当我在平面上嵌入平面图时，如何找到包含预定义点的面