python - Parsing the DMOZ dump for category queries in Python

I am working on a project that involves finding the "knowledge domains" related to a keyword, and I plan to use DMOZ for this. For example, "Brad Pitt" gives:

Arts: People: P: Pitt, Brad: Fan Pages (10)

Arts: People: P: Pitt, Brad: Articles and Interviews (5)

Arts: People: P: Pitt, Brad (4)

Arts: People: P: Pitt, Brad: Image Galleries (2)

Arts: People: P: Pitt, Brad: Movies (2)

and so on...

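I have the structure.rdf.u8 dump from the DMOZ website. I was told that this file is enough if I don't need URLs (I don't need the websites, only the categories related to a keyword). Or do I also need the content file?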

Also, I would like to know the best way to parse the structure file with Python (any library). I have no knowledge of XML, but I am comfortable with Python.

Best answer

I started from https://github.com/kremso/dmoz-parser and wrote a simple topic filter: https://github.com/lawrencecreates/dmoz-parser/blob/master/sample.py#L6

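class LawrenceFilter:
    """Writes the URL of every page filed under the Lawrence, Kansas topic."""

    TOPIC = 'United_States/Kansas/Localities/L/Lawrence'

    def __init__(self):
        self._file = open("seeds.txt", 'w')

    def page(self, page, content):
        # Skip entries with no URL.
        if page:
            topic = content['topic']
            # "in" also matches at position 0, which find(...) > 0 would miss.
            if self.TOPIC in topic:
                self._file.write(page + "\n")
                print("found page %s in topic %s" % (page, topic))

    def finish(self):
        # Close the output file.
        self._file.close()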
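
If you only need category names and not page URLs, a rough sketch that reads the structure file directly with the standard library might look like the following. It assumes that each category is a Topic element whose id attribute holds the full path (for example Top/Arts/People/P/Pitt,_Brad); check this against your copy of the dump. The helper names local_name and find_categories are made up for this example and are not part of dmoz-parser.

```python
import xml.etree.ElementTree as ET


def local_name(qualified):
    """Strip an ElementTree '{namespace}' prefix from a tag or attribute name."""
    return qualified.rsplit("}", 1)[-1]


def find_categories(source, keyword):
    """Yield Topic ids that contain every word of keyword (case-insensitive).

    source is a file path or a binary file object.
    Assumption: category paths live in the id attribute of Topic elements.
    """
    words = keyword.lower().split()
    # iterparse reads the file incrementally instead of loading it in one go.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "Topic":
            for name, value in elem.attrib.items():
                if local_name(name) == "id":
                    path = value.lower()
                    if all(word in path for word in words):
                        yield value
        # Drop the element's text, attributes and children once it is closed.
        elem.clear()
```

Something like `for path in find_categories("structure.rdf.u8", "Brad Pitt"): print(path)` would then list the matching category paths. Matching on local names avoids hard-coding the namespace URIs used by the dump. This is plain substring matching on the path, so it will not find categories where the keyword only appears in a description.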

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18044438/
