python - Creating a hierarchy tree from a dictionary of page contents

The following key:value pairs are "page" and "page contents".

{
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

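For any given "item", how can I find the path(s) to that item? With my very limited knowledge of data structures, I'm assuming this would be a hierarchy tree. Please correct me if I'm wrong!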

Update: Apologies, I should have been clearer about the data and my expected outcome.

Assume "page-a" is an index, each "page" is literally a page that appears on a website, and each "item" is something like a product page on Amazon, Newegg, etc.

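Thus, my expected output for "item-d" would be the path (or paths) to that item. For example (the delimiter is arbitrary and used here only for illustration), item-d has the following paths: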

page-a > page-b > page-e > item-d
page-a > page-c > item-d

Update 2: Updated my original dict to provide more accurate and realistic data. ".html" added for clarification.

Best answer

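Here's a simple approach. It's O(N squared) in the number of pages, so it's not all that scalable, but it will serve you well for a reasonable book size (if you have, say, millions of pages, you need to be thinking about a very different and less simple approach ;-).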

First, make a more usable dict that maps each page to the set of its contents: e.g., if the original dict is d, make another dict mud as:

mud = dict((p, set(d[p]['contents'].split())) for p in d)

Then, make the dict that maps each page to its parent pages:

parent = dict((p, [k for k in mud if p in mud[k]]) for p in mud)
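To make the two steps above concrete, here is a minimal sketch that runs them on the dict from the question (bound to the name d, as the answer assumes). The single-pass `parent_fast` construction at the end is my own illustrative alternative, not part of the original answer; it assumes every page named in a 'contents' string may or may not also appear as a key.

```python
# The dictionary from the question, bound to the name d used in the answer.
d = {
    'section-a.html': {'contents': 'section-b.html section-c.html section-d.html'},
    'section-b.html': {'contents': 'section-d.html section-e.html'},
    'section-c.html': {'contents': 'product-a.html product-b.html product-c.html product-d.html'},
    'section-d.html': {'contents': 'product-a.html product-c.html'},
    'section-e.html': {'contents': 'product-b.html product-d.html'},
    'product-a.html': {'contents': ''},
    'product-b.html': {'contents': ''},
    'product-c.html': {'contents': ''},
    'product-d.html': {'contents': ''},
}

# Page -> set of pages listed in its contents.
mud = dict((p, set(d[p]['contents'].split())) for p in d)

# Page -> list of pages that list it (its parents).
# Every page is checked against every other page, hence the quadratic cost.
parent = dict((p, [k for k in mud if p in mud[k]]) for p in mud)

# Hypothetical alternative (not from the answer): invert mud in one pass,
# visiting each page/content pair exactly once.
parent_fast = {p: [] for p in mud}
for page, children in mud.items():
    for child in children:
        parent_fast.setdefault(child, []).append(page)

print(sorted(parent['product-d.html']))  # ['section-c.html', 'section-e.html']
print(parent['section-a.html'])          # [] (no parents, so it is a root)
```

On this data both constructions give the same parents for every page; the one-pass version only matters once the page count makes the quadratic scan noticeable.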

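Here, I'm using lists of parent pages (sets would be fine too). That also works for pages with 0 or 1 parents, as in your example: an empty list means "no parent", and otherwise the list holds the parent as its one and only item. This should be a directed acyclic graph (if you're in doubt you can check, of course, but I'm skipping that check).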
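The answer skips the acyclicity check. If you do want one, the following is one possible sketch, assuming the page -> list-of-parents mapping described above. The function name `is_acyclic` is hypothetical, and the technique (repeatedly removing pages that have no remaining parents, i.e. Kahn's algorithm) is my choice rather than anything the answer prescribes.

```python
def is_acyclic(parent):
    """Return True if the page -> parents mapping contains no cycle."""
    remaining = {}  # page -> number of parents not yet removed
    children = {}   # page -> pages that name it as a parent
    for page, parents in parent.items():
        remaining.setdefault(page, 0)
        children.setdefault(page, [])
        for p in set(parents):
            remaining[page] += 1
            remaining.setdefault(p, 0)
            children.setdefault(p, []).append(page)

    # Start from the roots and peel the graph one layer at a time.
    ready = [page for page, count in remaining.items() if count == 0]
    removed = 0
    while ready:
        page = ready.pop()
        removed += 1
        for child in children[page]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Pages on a cycle never reach a count of zero, so they are never removed.
    return removed == len(remaining)


print(is_acyclic({'a': [], 'b': ['a'], 'c': ['a', 'b']}))  # True
print(is_acyclic({'a': ['b'], 'b': ['a']}))                # False
```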

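Now, given a page, finding the path up through its parent(s) to a parentless page (a "root page") simply requires "walking" the parent dict. The resulting list starts at the page and ends at the root. E.g., in the 0/1 parent case: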

path = [page]
while parent[path[-1]]:
  path.append(parent[path[-1]][0])
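As a usage sketch, the same loop can be wrapped in a function and its result reversed to print the path root-first, as in the question. The function name `path_to_root` and the three-page `parent` mapping below are made up for illustration, and the sketch assumes every page has at most one parent.

```python
def path_to_root(parent, page):
    """Walk from page up to its root; assumes at most one parent per page."""
    path = [page]
    while parent[path[-1]]:
        # Follow the single parent of the last page reached so far.
        path.append(parent[path[-1]][0])
    return path


# Hypothetical single-parent data, not taken from the question.
parent = {
    'index.html': [],
    'chapter.html': ['index.html'],
    'leaf.html': ['chapter.html'],
}

path = path_to_root(parent, 'leaf.html')
print(' > '.join(reversed(path)))  # index.html > chapter.html > leaf.html
```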

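If you can clarify your specs better (the range of pages per book, the number of parents per page, and so on), this code can no doubt be refined, but I hope it helps as a start.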

Edit: as the OP clarified that cases with > 1 parent (and thus multiple paths) are indeed of interest, let me show how to deal with that:

partial_paths = [ [page] ]
while partial_paths:
  path = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      partial_paths.append(path + [p])
  else:
    # we've reached a root (parentless node)
    print(path)

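Of course, instead of printing, you can yield each path when it reaches a root (turning the function whose body this is into a generator), or otherwise treat it in whatever way you require.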
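A minimal sketch of that generator variant follows. The name `iter_paths` is hypothetical, and the `parent` mapping is written out by hand from the question's data so the block runs on its own; each path is reversed before being yielded so that it reads root-first, which is my choice to match the question's expected output.

```python
def iter_paths(parent, page):
    """Yield every path from a root down to page, root first."""
    partial_paths = [[page]]
    while partial_paths:
        path = partial_paths.pop()
        parents = parent[path[-1]]
        if parents:
            # Open one new partial path per parent of the last page.
            for p in parents:
                partial_paths.append(path + [p])
        else:
            # Reached a root: the path was built page-first, so reverse it.
            yield path[::-1]


# Parent mapping derived by hand from the question's dictionary.
parent = {
    'section-a.html': [],
    'section-b.html': ['section-a.html'],
    'section-c.html': ['section-a.html'],
    'section-d.html': ['section-a.html', 'section-b.html'],
    'section-e.html': ['section-b.html'],
    'product-a.html': ['section-c.html', 'section-d.html'],
    'product-b.html': ['section-c.html', 'section-e.html'],
    'product-c.html': ['section-c.html', 'section-d.html'],
    'product-d.html': ['section-c.html', 'section-e.html'],
}

for found in iter_paths(parent, 'product-d.html'):
    print(' > '.join(found))
```

For product-d.html this yields two paths, one through section-b.html and section-e.html and one through section-c.html, which correspond to the two paths the question expects for item-d.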

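Edit again: a commenter is worried about cycles in the graph. If that worry is warranted, it's not hard to keep track of the nodes already seen in a path and to detect and warn about any cycles. The fastest way is to keep a set alongside each list representing a partial path (we need the list for ordering, but a membership check is O(1) in a set versus O(N) in a list):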

partial_paths = [ ([page], set([page])) ]
while partial_paths:
  path, pset = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      if p in pset:
        print('Cycle: %s (%s)' % (path, p))
        continue
      partial_paths.append((path + [p], pset.union([p])))
  else:
    # we've reached a root (parentless node)
    print('Path: %s' % (path,))

For clarity, it's probably worthwhile to pack the list and set representing a partial path into a small utility class Path with suitable methods.
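The answer only names the class, so the following is one possible sketch of such a Path, not the author's implementation. The method names (`last`, `extended`), the helper `find_paths`, and the small cyclic `parent` mapping are all assumptions made for illustration; instead of printing, the helper collects complete paths and detected cycles into two lists.

```python
class Path:
    """A partial path: an ordered list of nodes plus a set for O(1) lookups."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.seen = set(self.nodes)

    @property
    def last(self):
        return self.nodes[-1]

    def __contains__(self, node):
        return node in self.seen

    def extended(self, node):
        """Return a new Path with node appended; self is left unchanged."""
        return Path(self.nodes + [node])


def find_paths(parent, page):
    """Return (paths, cycles) found while walking from page up to the roots."""
    paths, cycles = [], []
    partial_paths = [Path([page])]
    while partial_paths:
        path = partial_paths.pop()
        parents = parent[path.last]
        if not parents:
            paths.append(path.nodes)  # reached a root (parentless node)
            continue
        for p in parents:
            if p in path:
                cycles.append((path.nodes, p))  # p would repeat: a cycle
            else:
                partial_paths.append(path.extended(p))
    return paths, cycles


# Hypothetical data with a deliberate cycle between 'a' and 'b'.
parent = {'root': [], 'a': ['root', 'b'], 'b': ['a'], 'leaf': ['b']}
paths, cycles = find_paths(parent, 'leaf')
print(paths)   # [['leaf', 'b', 'a', 'root']]
print(cycles)  # [(['leaf', 'b', 'a'], 'b')]
```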

Regarding "python - Creating a hierarchy tree from a dictionary of page contents", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/1809758/
