python - Trying to flatten Reddit JSON into many "conversations"

Tags: python list recursion reddit flatten

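I am trying to use the comments of a Reddit thread as a training set for a machine learning program. An example input is https://old.reddit.com/r/bayarea/comments/cxxl9y/billionaires_yacht_docked_in_embarcadero.json .

I am filtering the data down to the body, id, and parent_id fields, and I hope to turn the nested JSON into many conversations.

For example, if the input is ["A", ["B", ["C", "D"]]], the output should be ["A", "B", "C"], ["A", "B", "D"].

Here is my current code: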

import requests

json_url = "https://old.reddit.com/r/bayarea/comments/cxxl9y/billionaires_yacht_docked_in_embarcadero.json"
r = requests.get(json_url, headers={"user-agent": "PostmanRuntime/7.15.2"})

# fltr is my own helper (not shown here); it presumably keeps only the listed keys
comments_tree_raw = fltr(r.json(), ["ups", "body", "id", "parent_id"])[1]["data"]

comments_tree_raw = flatten([], comments_tree_raw["children"])

def remove_all_after(node, index):
    target = node.index(index)
    return node[:target]


training_threads = []
# input the children list
def flatten(output, children):
    global training_threads


    for child in children:
        try:
            child_obj = child["data"] if "body" in child["data"] else child
            child_comment = {
                "body": child_obj["body"],
                "id": child_obj["id"],
                "parent": child_obj["parent_id"]
            }
            output.append(child_comment)
        except KeyError:
            continue

        if "replies" not in child["data"]:

            training_threads.append(output.copy())

            parent_id = child_comment["parent"].split("_")[1]
            for i in output:
                if i["id"] == parent_id:
                    output = remove_all_after(output, i)
                    break


            continue

        flatten(output, child["data"]["replies"]["data"]["children"])

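Here I try to solve the problem recursively, but it does not produce the output I need. This is the output I get: https://pastebin.com/GkpwGUtK .

Any help is greatly appreciated. Thanks!

Best answer

You can use simple recursion with a generator: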

data = ["A", ["B",["C", "D"]]]
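def group(d, c=[]):
    # d is expected to be a pair: [item, list_of_children]
    a, b = d
    if all(not isinstance(i, list) for i in b):
        # b holds only leaves: emit one path per leaf
        yield from [c + [a, i] for i in b]
    else:
        # b is itself a pair: descend with a added to the path
        yield from group(b, c + [a])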

print(list(group(data)))

Output:

[['A', 'B', 'C'], ['A', 'B', 'D']]

Edit: a more complete version using itertools.groupby:

from itertools import groupby
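
def group(d, c=[]):
    # Split d into consecutive runs of non-list items and list items
    _d = [list(b) for _, b in groupby(d, key=lambda x: isinstance(x, list))]
    if len(_d) == 1:
        for i in _d[0]:
            if not isinstance(i, list):
                # A leaf: the accumulated path plus this item is one result
                yield c + [i]
            else:
                yield from group(i, c)
    else:
        # Pair each run of plain items with the run of sublists that follows it
        for i in range(0, len(_d), 2):
            for k in _d[i]:
                yield from group(_d[i + 1], c + [k])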

print(list(group([["C", ["D", "E"], ["C", ["D", "E"], ["C", ["D", "E"]]]]])))

Output:

[['C', 'D'], ['C', 'E'], ['C', 'C', 'D'], ['C', 'C', 'E'], ['C', 'C', 'C', 'D'], ['C', 'C', 'C', 'E']]
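
The same root-to-leaf idea can also be sketched directly on the comment dictionaries from the question, without first converting them to nested lists. This is only a sketch under assumptions: the function name iter_conversations is hypothetical, each child is assumed to look like {"data": {"body", "id", "parent_id", "replies"}} as in the question's code, and a comment is treated as a leaf when "replies" is missing or is not a dict. The sample tree is made up for illustration.

```python
def iter_conversations(children, path=()):
    """Yield every root-to-leaf chain of comments as a list of dicts."""
    for child in children:
        data = child.get("data", {})
        if "body" not in data:
            # Assumption: entries without a body are placeholders, so skip them
            continue
        comment = {
            "body": data["body"],
            "id": data["id"],
            "parent": data["parent_id"],
        }
        new_path = path + (comment,)  # a new tuple, so siblings never share state
        replies = data.get("replies")
        # Assumption: a leaf has no "replies" key or a non-dict value such as ""
        sub = replies["data"]["children"] if isinstance(replies, dict) else []
        produced = False
        for conversation in iter_conversations(sub, new_path):
            produced = True
            yield conversation
        if not produced:
            # No usable replies below this comment: the path ends here
            yield list(new_path)


# Made-up sample mirroring ["A", ["B", ["C", "D"]]]
sample = [{"data": {"body": "A", "id": "a", "parent_id": "t3_post", "replies": {"data": {"children": [
    {"data": {"body": "B", "id": "b", "parent_id": "t1_a", "replies": {"data": {"children": [
        {"data": {"body": "C", "id": "c", "parent_id": "t1_b"}},
        {"data": {"body": "D", "id": "d", "parent_id": "t1_b", "replies": ""}},
    ]}}}},
]}}}}]

bodies = [[c["body"] for c in conv] for conv in iter_conversations(sample)]
print(bodies)  # [['A', 'B', 'C'], ['A', 'B', 'D']]
```

Because each recursive call receives its own path tuple, no global list or trimming step such as remove_all_after is needed.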

Source: the Stack Overflow question "Trying to flatten Reddit JSON into many conversations": https://stackoverflow.com/questions/57742477/
