java - Iterating over a dataset with the GSON parser

Tags: java gson

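I am writing a GSON (Java) parser for the CORD19 dataset (https://pages.semanticscholar.org/coronavirus-research): about 40,000 scientific papers that have been made open to everyone. I want to use GSON to iterate over the JSON trees and convert them to HTML. In particular, I want to iterate over the entries of a JsonObject element.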

Q1: If anyone has already written a F/OSS CORD19 parser with GSON or another Java parser, I would be delighted to hear about it.

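My specific problem is iterating over the fields (entries) of a JsonObject.

The data (heavily snipped, but hopefully parseable once the SNIPPED markers are removed):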

{
    "paper_id": "b801b7f92cff2155d98f0e3404229c67b60e2f9f",
    "metadata": {
        "title": "Realtime 2-5A kinetics suggests interferons \u03b2 and \uf06c evade global arrest of translation by RNase L",
        "authors": [
            {
                "first": "Alisha",
                "middle": [],
                "last": "Chitrakar",
                "suffix": "",
                "affiliation": {},
                "email": ""
            },
            ... SNIPPED
            {
                "first": "Alexei",
                "middle": [],
                "last": "Korennykh",
                "suffix": "",
                "affiliation": {},
                "email": "akorenny@princeton.edu"
            }
        ]
    },
    "abstract": [
        {
            "text": "Cells of all mammals recognize double-stranded RNA (dsRNA) as a foreign material. ...",
            "cite_spans": [],
            "ref_spans": [],
            "section": "Abstract"
        },
... SNIPPED
        {
            "text": "The 2-5A system is also a surveillance pathway for ...",
            "cite_spans": [],
            "ref_spans": [],
            "section": "Abstract"
        }
    ],
    "body_text": [
        {
            "text": "Interferons IFNs of type I (\uf061 and \u03b2) and type III ...",
            "cite_spans": [],
            "ref_spans": [],
            "section": "Introduction"
        },
        {
            "text": "To evaluate how the nuclear envelope ...",
            "cite_spans": [
                {
                    "start": 382,
                    "end": 384,
                    "text": "50",
                    "ref_id": null
                }
            ],
            "ref_spans": [],
            "section": "Diffusion calculations"
        }
    ],
    "bib_entries": {
        "BIBREF0": {
            "ref_id": "b0",
            "title": "Higher-order substrate recognition of eIF2alpha by the RNA-dependent protein kinase PKR",
            "authors": [
                {
                    "first": "A",
                    "middle": [
                        "C"
                    ],
                    "last": "Dar",
                    "suffix": ""
                },
... SNIPPED
                {
                    "first": "F",
                    "middle": [],
                    "last": "Sicheri",
                    "suffix": ""
                }
            ],
            "year": 2005,
            "venue": "Cell",
            "volume": "122",
            "issn": "",
            "pages": "887--900",
            "other_ids": {}
        },
        "BIBREF1": {
            "ref_id": "b1",
            "title": "Increased nuclease activity in cells treated with pppA2'p5'A2'p5' A",
            "authors": [
                {
                    "first": "A",
                    "middle": [
                        "G"
                    ],
                    "last": "Hovanessian",
                    "suffix": ""
                },
                ... SNIPPED
                {
                    "first": "L",
                    "middle": [],
                    "last": "Montagnier",
                    "suffix": ""
                }
            ],
            "year": 1979,
            "venue": "Proc Natl Acad Sci U S A",
            "volume": "76",
            "issn": "",
            "pages": "3261--3266",
            "other_ids": {}
        },
        "BIBREF2": {
            "ref_id": "b2",
            "title": "Interferon action--sequence specificity of the ppp(A2'p)nA-dependent ribonuclease",
            "authors": [
                {
                    "first": "D",
                    "middle": [
                        "H"
                    ],
                    "last": "Wreschner",
                    "suffix": ""
                },
                ... SNIPPED
                {
                    "first": "I",
                    "middle": [
                        "M"
                    ],
                    "last": "Kerr",
                    "suffix": ""
                }
            ],
            "year": 1981,
            "venue": "Nature",
            "volume": "289",
            "issn": "",
            "pages": "414--421",
            "other_ids": {}
        },
        ... SNIPPED
        "BIBREF47": {
            "ref_id": "b47",
            "title": "Size-dependent DNA mobility in cytoplasm and nucleus",
            "authors": [
                {
                    "first": "G",
                    "middle": [
                        "L"
                    ],
                    "last": "Lukacs",
                    "suffix": ""
                }
            ],
            "year": 2000,
            "venue": "J Biol Chem",
            "volume": "275",
            "issn": "",
            "pages": "1625--1634",
            "other_ids": {}
        },
        "BIBREF48": {
            "ref_id": "b48",
            "title": "Modeling transmembrane transport through cell membrane wounds created by acoustic cavitation",
            "authors": [
                {
                    "first": "V",
                    "middle": [],
                    "last": "Zarnitsyn",
                    "suffix": ""
                },
                ... SNIPPED
                {
                    "first": "M",
                    "middle": [
                        "R"
                    ],
                    "last": "Prausnitz",
                    "suffix": ""
                }
            ],
            "year": 2008,
            "venue": "Biophys J",
            "volume": "95",
            "issn": "",
            "pages": "4124--4162",
            "other_ids": {}
        }
    },
    ... SNIPPED
    "back_matter": [
        {
            "text": "We are grateful to Prof. Bonnie Bassler (Princeton University) for All NS All NS NS ** All ****",
            "cite_spans": [],
            "ref_spans": [],
            "section": "Acknowledgments:"
        }
    ]
}

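There is a schema on the CORD-19 site, but the number of entries such as BIBREF1 ... BIBREF48 varies from one dataset to the next. (Question: what is the exact name for the BIBREF objects? Entries? Children?)

My current code is: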

    @Test
    public void testReadJSON() {

        File jsonFile = new File(BIORXIV_MEDRXIV, "b801b7f92cff2155d98f0e3404229c67b60e2f9f.json");
        JsonObject oo = null;
        try {
            String resultsJsonString = IOUtils.toString(new FileInputStream(jsonFile), "UTF-8");
            JsonParser parser = new JsonParser();
            oo = (JsonObject) parser.parse(resultsJsonString);

        } catch (Exception e) {
            throw new RuntimeException("Cannot read CORD19 file: "+jsonFile, e);
        }

        String paperId = oo.get("paper_id").getAsString();
        System.out.println("id: "+paperId);

        JsonElement metadata = oo.get("metadata");
        JsonObject metadataObject = metadata.getAsJsonObject();
        String title = metadataObject.get("title").getAsString();
        System.out.println("title: "+title);

        JsonElement authorsObject = metadataObject.get("authors");
        System.out.println("Auth: "+authorsObject);
        JsonArray authors = authorsObject.getAsJsonArray();
        for (int i = 0; i < authors.size(); i++) {
            System.out.println(authors.get(i));
        }

        JsonElement abstrakt = oo.get("abstract");
        System.out.println("abstract: "+abstrakt);
        JsonArray texts = abstrakt.getAsJsonArray();
        for (int i = 0; i < texts.size(); i++) {
            System.out.println(texts.get(i));
        }

        JsonElement bodyText = oo.get("body_text");
        System.out.println("bodyText: "+bodyText);
        texts = bodyText.getAsJsonArray();
        for (int i = 0; i < texts.size(); i++) {
            System.out.println(texts.get(i));
        }

        JsonElement bibEntries = oo.get("bib_entries");
        System.out.println("bibEntries: "+bibEntries.getClass()+bibEntries);
        JsonObject obj = bibEntries.getAsJsonObject();
        // WHAT TO WRITE HERE?

    }

}

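(Additional question: I am learning Java 8, so I would welcome answers that use Java 8 streams as well as Java 7.)

(Additional question. [I don't normally "advertise" on Stack Overflow, but these are not normal times, and I think this will help save lives and give Stack Overflow members a chance to contribute their skills.] I have set up a volunteer project to hack on this dataset. I have spent many years extracting knowledge from scientific papers, and I believe the existing literature may contain valuable pointers to new scientific knowledge.)

Also, is there a way for Stack Overflow to collect specialist expertise for reuse on COVID-19?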

Best answer

GSON's JsonObject provides an entrySet() method for iterating over its contents.

for(Map.Entry<String,JsonElement> entry : obj.entrySet()) {
    String key = entry.getKey();     // BIBREF0
    JsonElement value = entry.getValue();   // details, can be cast to JsonObject
    processBibRef((JsonObject)value);       // For example
}

Streams don't help much here, but you should structure the code better by using separate methods for the child elements, as in the example.
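Since the question asked for both Java 7 and Java 8 versions, here is a minimal sketch of the same entrySet() iteration written both ways. The class name BibEntriesDemo and the helper describeBibRef are made up for illustration, and the inline JSON is a cut-down stand-in for a real CORD-19 file. It assumes Gson 2.8.6 or later for JsonParser.parseString; on older versions, new JsonParser().parse(...) from the question does the same job.

```java
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BibEntriesDemo {

    // Hypothetical helper: formats one bibliography entry as "key: title (year)".
    // Assumes every entry has a "title"; a missing or null "year" becomes "n.d.".
    static String describeBibRef(String key, JsonObject ref) {
        String title = ref.get("title").getAsString();
        JsonElement year = ref.get("year");
        String yearText = (year == null || year.isJsonNull()) ? "n.d." : year.getAsString();
        return key + ": " + title + " (" + yearText + ")";
    }

    // Java 7 style: a plain for-each loop over the entry set.
    static List<String> describeAllJava7(JsonObject bibEntries) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, JsonElement> entry : bibEntries.entrySet()) {
            lines.add(describeBibRef(entry.getKey(), entry.getValue().getAsJsonObject()));
        }
        return lines;
    }

    // Java 8 style: the same traversal expressed as a stream pipeline.
    static List<String> describeAllJava8(JsonObject bibEntries) {
        return bibEntries.entrySet().stream()
                .map(e -> describeBibRef(e.getKey(), e.getValue().getAsJsonObject()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Cut-down sample; a real file would be read from disk as in the question.
        String json = "{\"bib_entries\": {"
                + "\"BIBREF47\": {\"title\": \"Size-dependent DNA mobility in cytoplasm and nucleus\", \"year\": 2000},"
                + "\"BIBREF48\": {\"title\": \"Modeling transmembrane transport through cell membrane wounds created by acoustic cavitation\", \"year\": 2008}"
                + "}}";
        JsonObject root = JsonParser.parseString(json).getAsJsonObject();
        JsonObject bibEntries = root.getAsJsonObject("bib_entries");

        for (String line : describeAllJava7(bibEntries)) {
            System.out.println(line);
        }
        describeAllJava8(bibEntries).forEach(System.out::println);
    }
}
```

Both methods visit the entries in the order they appear in the file, so the number of BIBREF keys does not need to be known in advance.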

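As you can see, parsing JSON by hand in Java is cumbersome. When you are converting JSON to HTML and the like, the extra step of converting to objects makes other, less type-safe languages (such as JavaScript) more attractive.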
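For comparison, the "convert to objects" step might look roughly like the sketch below, which lets Gson bind the variable BIBREF keys into a Map and then emits a small HTML list. The classes Paper and BibEntry, the method toHtmlList, and the choice of fields are assumptions for illustration only; they cover just a fraction of the CORD-19 schema and are not taken from it.

```java
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;

import java.util.Map;

class TypedBibDemo {

    // Hypothetical model classes covering only a few fields of the sample data.
    static class Paper {
        @SerializedName("paper_id") String paperId;
        // Keys such as BIBREF0, BIBREF1, ... vary per paper, so a Map fits.
        @SerializedName("bib_entries") Map<String, BibEntry> bibEntries;
    }

    static class BibEntry {
        String title;
        Integer year;   // boxed so that a JSON null stays null instead of failing
        String venue;
    }

    // Minimal escaping of the characters that matter inside HTML text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Renders the bibliography as an unordered HTML list, one item per entry.
    static String toHtmlList(Paper paper) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, BibEntry> e : paper.bibEntries.entrySet()) {
            BibEntry b = e.getValue();
            html.append("  <li id=\"").append(e.getKey()).append("\">")
                .append(escape(b.title))
                .append(" (").append(b.year).append(")</li>\n");
        }
        return html.append("</ul>").toString();
    }

    public static void main(String[] args) {
        // Cut-down sample; a real file would be read from disk as in the question.
        String json = "{\"paper_id\": \"b801b7f92cff2155d98f0e3404229c67b60e2f9f\","
                + "\"bib_entries\": {\"BIBREF47\": {\"title\": "
                + "\"Size-dependent DNA mobility in cytoplasm and nucleus\","
                + "\"year\": 2000, \"venue\": \"J Biol Chem\"}}}";
        Paper paper = new Gson().fromJson(json, Paper.class);
        System.out.println(toHtmlList(paper));
    }
}
```

The trade-off is the one described above: the model classes are extra code to write and maintain, but field access becomes type-checked instead of going through get("...") calls.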

Regarding "java - Iterating over a dataset with the GSON parser", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60810617/
