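I am writing a GSON (Java) parser for the CORD19 dataset (https://pages.semanticscholar.org/coronavirus-research), about 40,000 scientific papers that have been made open to everyone. I want to use GSON to iterate over the JSON trees and convert them to HTML. In particular, I want to iterate over the entries of a JsonObject element.
Q1: If anyone has already written a F/OSS CORD19 parser with GSON or another Java parser, I would be delighted to hear about it.
My specific problem is iterating over the fields (entries) of a JsonObject.
The data (heavily snipped, but hopefully still parseable once the snip markers are removed):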
{
"paper_id": "b801b7f92cff2155d98f0e3404229c67b60e2f9f",
"metadata": {
"title": "Realtime 2-5A kinetics suggests interferons \u03b2 and \uf06c evade global arrest of translation by RNase L",
"authors": [
{
"first": "Alisha",
"middle": [],
"last": "Chitrakar",
"suffix": "",
"affiliation": {},
"email": ""
},
... SNIPPED
{
"first": "Alexei",
"middle": [],
"last": "Korennykh",
"suffix": "",
"affiliation": {},
"email": "akorenny@princeton.edu"
}
]
},
"abstract": [
{
"text": "Cells of all mammals recognize double-stranded RNA (dsRNA) as a foreign material. ...",
"cite_spans": [],
"ref_spans": [],
"section": "Abstract"
},
... SNIPPED
{
"text": "The 2-5A system is also a surveillance pathway for ...",
"cite_spans": [],
"ref_spans": [],
"section": "Abstract"
}
],
"body_text": [
{
"text": "Interferons IFNs of type I (\uf061 and \u03b2) and type III ...",
"cite_spans": [],
"ref_spans": [],
"section": "Introduction"
},
{
"text": "To evaluate how the nuclear envelope ...",
"cite_spans": [
{
"start": 382,
"end": 384,
"text": "50",
"ref_id": null
}
],
"ref_spans": [],
"section": "Diffusion calculations"
}
],
"bib_entries": {
"BIBREF0": {
"ref_id": "b0",
"title": "Higher-order substrate recognition of eIF2alpha by the RNA-dependent protein kinase PKR",
"authors": [
{
"first": "A",
"middle": [
"C"
],
"last": "Dar",
"suffix": ""
},
... SNIPPED
{
"first": "F",
"middle": [],
"last": "Sicheri",
"suffix": ""
}
],
"year": 2005,
"venue": "Cell",
"volume": "122",
"issn": "",
"pages": "887--900",
"other_ids": {}
},
"BIBREF1": {
"ref_id": "b1",
"title": "Increased nuclease activity in cells treated with pppA2'p5'A2'p5' A",
"authors": [
{
"first": "A",
"middle": [
"G"
],
"last": "Hovanessian",
"suffix": ""
},
... SNIPPED
{
"first": "L",
"middle": [],
"last": "Montagnier",
"suffix": ""
}
],
"year": 1979,
"venue": "Proc Natl Acad Sci U S A",
"volume": "76",
"issn": "",
"pages": "3261--3266",
"other_ids": {}
},
"BIBREF2": {
"ref_id": "b2",
"title": "Interferon action--sequence specificity of the ppp(A2'p)nA-dependent ribonuclease",
"authors": [
{
"first": "D",
"middle": [
"H"
],
"last": "Wreschner",
"suffix": ""
},
... SNIPPED
{
"first": "I",
"middle": [
"M"
],
"last": "Kerr",
"suffix": ""
}
],
"year": 1981,
"venue": "Nature",
"volume": "289",
"issn": "",
"pages": "414--421",
"other_ids": {}
},
... SNIPPED
"BIBREF47": {
"ref_id": "b47",
"title": "Size-dependent DNA mobility in cytoplasm and nucleus",
"authors": [
{
"first": "G",
"middle": [
"L"
],
"last": "Lukacs",
"suffix": ""
}
],
"year": 2000,
"venue": "J Biol Chem",
"volume": "275",
"issn": "",
"pages": "1625--1634",
"other_ids": {}
},
"BIBREF48": {
"ref_id": "b48",
"title": "Modeling transmembrane transport through cell membrane wounds created by acoustic cavitation",
"authors": [
{
"first": "V",
"middle": [],
"last": "Zarnitsyn",
"suffix": ""
},
... SNIPPED
{
"first": "M",
"middle": [
"R"
],
"last": "Prausnitz",
"suffix": ""
}
],
"year": 2008,
"venue": "Biophys J",
"volume": "95",
"issn": "",
"pages": "4124--4162",
"other_ids": {}
}
},
... SNIPPED
"back_matter": [
{
"text": "We are grateful to Prof. Bonnie Bassler (Princeton University) for All NS All NS NS ** All ****",
"cite_spans": [],
"ref_spans": [],
"section": "Acknowledgments:"
}
]
}
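There is a schema on the CORD-19 site, but entries such as BIBREF1 ... BIBREF48 differ in number from one data file to another.
(Q: what is the exact name for the BIBREF objects: entries? children?)
My current code is: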
    @Test
    public void testReadJSON() {
        File jsonFile = new File(BIORXIV_MEDRXIV, "b801b7f92cff2155d98f0e3404229c67b60e2f9f.json");
        JsonObject oo = null;
        try {
            String resultsJsonString = IOUtils.toString(new FileInputStream(jsonFile), "UTF-8");
            JsonParser parser = new JsonParser();
            oo = (JsonObject) parser.parse(resultsJsonString);
        } catch (Exception e) {
            throw new RuntimeException("Cannot read CORD19 file: " + jsonFile, e);
        }
        String paperId = oo.get("paper_id").getAsString();
        System.out.println("id: " + paperId);
        JsonElement metadata = oo.get("metadata");
        JsonObject metadataObject = metadata.getAsJsonObject();
        String title = metadataObject.get("title").getAsString();
        System.out.println("title: " + title);
        JsonElement authorsObject = metadataObject.get("authors");
        System.out.println("Auth: " + authorsObject);
        JsonArray authors = authorsObject.getAsJsonArray();
        for (int i = 0; i < authors.size(); i++) {
            System.out.println(authors.get(i));
        }
        JsonElement abstrakt = oo.get("abstract");
        System.out.println("abstract: " + abstrakt);
        JsonArray texts = abstrakt.getAsJsonArray();
        for (int i = 0; i < texts.size(); i++) {
            System.out.println(texts.get(i));
        }
        JsonElement bodyText = oo.get("body_text");
        System.out.println("bodyText: " + bodyText);
        texts = bodyText.getAsJsonArray();
        for (int i = 0; i < texts.size(); i++) {
            System.out.println(texts.get(i));
        }
        JsonElement bibEntries = oo.get("bib_entries");
        System.out.println("bibEntries: " + bibEntries.getClass() + bibEntries);
        JsonObject obj = bibEntries.getAsJsonObject();
        // WHAT TO WRITE HERE?
    }
}
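(Additional question: I am learning Java 8, so I would appreciate answers using Java 8 streams as well as Java 7.)
Additional note. [I don't normally "advertise" on Stack Overflow, but these are not normal times, and I think this will help save lives and give Stack Overflow members a chance to contribute their skills.] I have set up a volunteer project to hack on this dataset. I have spent many years extracting knowledge from scientific papers and believe the existing literature may contain valuable pointers to new scientific knowledge.
- GitHub project: https://github.com/petermr/openVirus
- Wikimedia project, which uses Wikidata to annotate the articles: https://en.wikiversity.org/wiki/WikiJournal_Preprints/Aggregation_of_scholarly_publications_and_extracted_knowledge_on_COVID19_and_epidemics
Also, is there a way for Stack Overflow to gather expertise specifically for reuse on COVID-19?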
Best answer
GSON's JsonObject provides an entrySet() method for iterating over its contents.
for (Map.Entry<String, JsonElement> entry : obj.entrySet()) {
    String key = entry.getKey();           // e.g. BIBREF0
    JsonElement value = entry.getValue();  // the details; can be cast to JsonObject
    processBibRef((JsonObject) value);     // for example
}
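Streams don't help much here, but you should use a separate method for each kind of child element to structure the code better, as in the example.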
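Since the question asks for both styles, here is a minimal sketch of the same entrySet() iteration written once as a Java 7 loop and once as a Java 8 stream. It assumes Gson 2.8.6 or later on the classpath (for JsonParser.parseString). The class name BibEntriesDemo, the helpers describeBibRef and bibEntriesOf, and the inline SAMPLE string (two entries cut down from the data in the question) are made up for illustration.

```java
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BibEntriesDemo {

    // Cut-down stand-in for one CORD-19 file; only the fields used below are kept.
    static final String SAMPLE =
        "{\"bib_entries\": {"
      + "\"BIBREF0\": {\"ref_id\": \"b0\", \"title\": \"Higher-order substrate recognition"
      + " of eIF2alpha by the RNA-dependent protein kinase PKR\", \"year\": 2005, \"venue\": \"Cell\"},"
      + "\"BIBREF47\": {\"ref_id\": \"b47\", \"title\": \"Size-dependent DNA mobility"
      + " in cytoplasm and nucleus\", \"year\": 2000, \"venue\": \"J Biol Chem\"}"
      + "}}";

    // Parses a whole document and returns its "bib_entries" object.
    // JsonParser.parseString needs Gson 2.8.6+; older versions use new JsonParser().parse(json).
    static JsonObject bibEntriesOf(String json) {
        return JsonParser.parseString(json).getAsJsonObject().getAsJsonObject("bib_entries");
    }

    // Hypothetical per-entry method: one bibliography entry becomes one line of text.
    // Assumes "title", "venue" and "year" are present and non-null, as in SAMPLE.
    static String describeBibRef(String key, JsonObject ref) {
        return key + ": " + ref.get("title").getAsString()
            + " (" + ref.get("venue").getAsString() + ", " + ref.get("year").getAsInt() + ")";
    }

    // Java 7 style: explicit loop over the entry set.
    static List<String> describeAllLoop(JsonObject bibEntries) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, JsonElement> entry : bibEntries.entrySet()) {
            lines.add(describeBibRef(entry.getKey(), entry.getValue().getAsJsonObject()));
        }
        return lines;
    }

    // Java 8 style: the same traversal expressed as a stream pipeline.
    static List<String> describeAllStream(JsonObject bibEntries) {
        return bibEntries.entrySet().stream()
            .map(e -> describeBibRef(e.getKey(), e.getValue().getAsJsonObject()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        JsonObject bib = bibEntriesOf(SAMPLE);
        describeAllLoop(bib).forEach(System.out::println);
        describeAllStream(bib).forEach(System.out::println);
    }
}
```

Both methods return the same list, so the choice between them is mostly a matter of taste; the per-entry method is what keeps either version readable.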
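As you can see, parsing JSON by hand in Java is cumbersome. When you are converting JSON to HTML or similar, the extra step of converting to objects makes other, less type-safe languages (such as JavaScript) more attractive.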
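For comparison, here is a hedged sketch of what that "convert to objects" step could look like with Gson's data binding. Declaring bib_entries as a Map<String, BibEntry> lets Gson cope with the variable number of BIBREF keys, and JSON members that have no matching field are skipped. The classes Paper, BibEntry and Author, the helpers escape and toHtml, and the HTML layout are assumptions for illustration; they are not part of the CORD-19 schema or of the answer above.

```java
import com.google.gson.Gson;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BibBindingDemo {

    // Field names mirror the JSON keys so Gson can bind them without annotations.
    static class Author {
        String first;
        List<String> middle;
        String last;
    }

    static class BibEntry {
        String title;
        List<Author> authors;
        Integer year;
        String venue;
    }

    static class Paper {
        String paper_id;
        Map<String, BibEntry> bib_entries; // keys are BIBREF0, BIBREF1, ...
    }

    // Cut-down stand-in for one CORD-19 file, taken from the data in the question.
    static final String SAMPLE =
        "{\"paper_id\": \"b801b7f92cff2155d98f0e3404229c67b60e2f9f\","
      + " \"bib_entries\": {"
      + "  \"BIBREF47\": {\"ref_id\": \"b47\","
      + "   \"title\": \"Size-dependent DNA mobility in cytoplasm and nucleus\","
      + "   \"authors\": [{\"first\": \"G\", \"middle\": [\"L\"], \"last\": \"Lukacs\", \"suffix\": \"\"}],"
      + "   \"year\": 2000, \"venue\": \"J Biol Chem\"}}}";

    static Paper parse(String json) {
        return new Gson().fromJson(json, Paper.class);
    }

    // Minimal escaping for text and double-quoted attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Renders the bibliography as an ordered list.
    // Assumes title, venue and authors are present, as they are in SAMPLE.
    static String toHtml(Paper paper) {
        StringBuilder html = new StringBuilder("<ol>\n");
        for (Map.Entry<String, BibEntry> e : paper.bib_entries.entrySet()) {
            BibEntry ref = e.getValue();
            String authors = ref.authors.stream()
                .map(a -> a.last)
                .collect(Collectors.joining(", "));
            html.append("  <li id=\"").append(escape(e.getKey())).append("\">")
                .append(escape(authors)).append(": ")
                .append(escape(ref.title))
                .append(" (").append(escape(ref.venue)).append(", ").append(ref.year).append(")")
                .append("</li>\n");
        }
        return html.append("</ol>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml(parse(SAMPLE)));
    }
}
```

The trade-off is the one described above: the typed classes have to be written and kept in step with the schema, but once they exist the rendering code no longer deals with JsonElement casts.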
Source: Stack Overflow, https://stackoverflow.com/questions/60810617/