python - Parsing a JSON Lines file

Tags: python json csv

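I need to find a way to parse the data in a JSON file into CSV or XLSX. However, every online JSON validator I have tried reports an error saying the JSON file is invalid.

An example of the JSON file follows: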

{"id": "someID1.docx",
 "language": {"detected": "cs"},
 "title": "Name - Title - FileName",
 "text": "Long string of text",
 "entities": [
 {"standardForm": "Svářečský průkaz", "type": "car"},
 {"standardForm": "email1@gmail.com", "type": "email"},
 {"standardForm": "english", "type": "languages"},
 {"standardForm": "Práce na PC", "type": "abilities"},
 {"standardForm": "MS Office", "type": "abilities"},
 {"standardForm": "Automechanik", "type": "education"},
 {"standardForm": "Střední průmyslová škola", "type": "education"},
 {"standardForm": "Angličtina-Němčina", "type": "languages"},
 {"standardForm": "mechanic", "type": "position"},
 {"standardForm": "Praha", "type": "region"},
 {"standardForm": "B2 - středně pokročilý", "type": "en_level"},
 {"standardForm": "Skupina B", "type": "drivinglicense"}
 ]}
{"id": "someID2.pdf",
 "language": {"detected": "cs"},
 "title": "Name - Title - FileName2",
 "text": "Long string of text2",
 "entities": [
 {"standardForm": "german", "type": "languages"},
 {"standardForm": "high school", "type": "education"},
 {"standardForm": "Angličtina-Němčina", "type": "languages"},
 {"standardForm": "driver", "type": "position"},
 {"standardForm": "english", "type": "languages"},
 {"standardForm": "university", "type": "education"},
 {"standardForm": "email2@seznam.cz", "type": "email"},
 {"standardForm": "Středočeský", "type": "region"},
 {"standardForm": "Střední", "type": "edulevel"},
 {"standardForm": "manager", "type": "lastposition"},
 {"standardForm": "? – nerozpoznáno", "type": "de_level"},
 {"standardForm": "? – nerozpoznáno", "type": "en_level"},
 {"standardForm": "Skupina C", "type": "drivinglicense"}
 ]}
 ...

I can load this JSON with Python:

import json
import pandas as pd
with open('jsonfile.json', 'r', encoding='utf-8') as f:
    jsonfile = [json.loads(line) for line in f]  # one JSON object per line
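A note on the format: a file with one JSON object per line is JSON Lines, not a single JSON document, which is why validators that expect exactly one top-level value reject it while line-by-line loading works. The sketch below illustrates this with two shortened, hypothetical records held in a string instead of a file.

```python
import json

# Two records in JSON Lines form: one complete JSON object per line.
# These are shortened, hypothetical versions of the sample records.
jsonl_text = (
    '{"id": "someID1.docx", "title": "Name - Title - FileName"}\n'
    '{"id": "someID2.pdf", "title": "Name - Title - FileName2"}\n'
)

# Parsing the whole text as a single JSON document fails, because a
# second top-level object follows the first one.
try:
    json.loads(jsonl_text)
    whole_file_is_valid = True
except json.JSONDecodeError:
    whole_file_is_valid = False

# Parsing each non-blank line on its own succeeds.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(whole_file_is_valid)
print([record["id"] for record in records])
```

So the file itself does not need to change, as long as each record stays on its own line.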

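But I cannot find any way to convert it to CSV. I need to be able to store all the entities associated with each id, preferably in a CSV. Is there a way to do this? Do I need a different JSON?

Thanks

Edit: For the example above, I need the CSV output to look like this: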

ID;title;languages;education
someID1.docx;Name-Title-FileName;english,Angličtina-Němčina;Automechanik;Střední Prům. škola
someID2.pdf;Name-Title-FileName2; german,Angličtina-Němčina,english;high school, university

Best answer

Since you have already imported pandas, you can use its pandas.DataFrame:

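# One row per record; the 'entities' column holds each record's list of dicts.
df = pd.DataFrame(jsonfile)
# Collect the standardForm of every 'languages' entity in each row.
df['languages'] = df.apply(lambda x: [item['standardForm']
                                      for item in x.entities
                                      if item['type'] == 'languages'],
                           axis=1)
# Do the same for the 'education' entities.
df['education'] = df.apply(lambda x: [item['standardForm']
                                      for item in x.entities
                                      if item['type'] == 'education'],
                           axis=1)

# 'output.csv' is a placeholder filename. The two list columns are
# written as their Python list representation, e.g. "['a', 'b']".
df.to_csv('output.csv', columns=['id', 'title', 'languages', 'education'])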
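If the output should match the layout requested in the question more closely (semicolon-separated columns, comma-joined values inside a column), one possible approach is to build the rows with the standard csv module. This is only a sketch, not part of the original answer: the helper names entities_of_type and records_to_csv are made up for illustration, and the records are shortened versions of the sample data.

```python
import csv
import io


def entities_of_type(record, entity_type):
    """Return the standardForm values of one entity type, in file order."""
    return [entity["standardForm"]
            for entity in record.get("entities", [])
            if entity.get("type") == entity_type]


def records_to_csv(records, entity_types=("languages", "education")):
    """Render records as ';'-separated text, one column per entity type."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(["ID", "title", *entity_types])
    for record in records:
        row = [record["id"], record["title"]]
        # Values of the same type share one cell, joined by commas.
        row += [",".join(entities_of_type(record, t)) for t in entity_types]
        writer.writerow(row)
    return buffer.getvalue()


# Shortened sample records; in practice these would come from
# json.loads applied to each line of the file.
records = [
    {"id": "someID1.docx", "title": "Name - Title - FileName",
     "entities": [
         {"standardForm": "english", "type": "languages"},
         {"standardForm": "Automechanik", "type": "education"},
         {"standardForm": "Střední průmyslová škola", "type": "education"},
         {"standardForm": "Angličtina-Němčina", "type": "languages"},
         {"standardForm": "Praha", "type": "region"},
     ]},
    {"id": "someID2.pdf", "title": "Name - Title - FileName2",
     "entities": [
         {"standardForm": "german", "type": "languages"},
         {"standardForm": "high school", "type": "education"},
         {"standardForm": "Angličtina-Němčina", "type": "languages"},
         {"standardForm": "english", "type": "languages"},
         {"standardForm": "university", "type": "education"},
     ]},
]

csv_text = records_to_csv(records)
print(csv_text)
```

To save the result, the returned text can be written to a file opened with encoding='utf-8' and newline='' so that the Czech characters and the line endings are preserved.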

Regarding python - parsing a JSON Lines file, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55296372/
