I am trying to get nested values from the JSON data below.
{
"region_id": 60763,
"phone": "",
"address": {
"region": "NY",
"street-address": "147 West 43rd Street",
"postal-code": "10036",
"locality": "New York City"
},
"id": 113317,
"name": "Casablanca Hotel Times Square"
}
{
"region_id": 32655,
"phone": "",
"address": {
"region": "CA",
"street-address": "300 S Doheny Dr",
"postal-code": "90048",
"locality": "Los Angeles"
},
"id": 76049,
"name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}
I just loaded the data above into a pandas DataFrame using:
import json
import pandas as pd
with open("file path") as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Now my DataFrame looks like this:
address Phone
0 {u'region': u'NY', u'street-address': u'147 We...
1 {u'region': u'CA', u'street-address': u'300 S ...
id name region_id
0 113317 Casablanca Hotel Times Square 60763
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
I can get a subset of columns with data = df[['id', 'name']], but I am not sure how to get the values of region and street-address along with id and name. My output DataFrame should have id, name, region, street-address.
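Note: I tried popping the nested address column and concatenating it back onto my DataFrame. But because my data is large (348 MB), the concatenation uses a lot of memory when I do it column-wise (axis=1).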
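I am also looking for an efficient way to handle this: should I use NumPy, which uses C extensions directly, or write the data to a database such as MongoDB? I am considering this because, after subsetting this data, I need to join it with other datasets on the id column to pick up several more fields.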
Best answer
The approach below works (however, I have added a more efficient solution further down; just scroll to EDIT):
import pandas as pd
# read the updated json file
df = pd.read_json('data.json')
# convert column with the nested json structure
tempdf = pd.concat([pd.DataFrame.from_dict(item, orient='index').T for item in df.address])
# get rid of the converted column (keyword form; a positional axis is rejected by pandas 2.x)
df.drop(columns='address', inplace=True)
# prepare concat
tempdf.index = df.index
# merge the two dataframes back together
df = pd.concat([df, tempdf], axis=1)
Output:
id name phone region_id \
0 113317 Casablanca Hotel Times Square 60763
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
region street-address postal-code locality
0 NY 147 West 43rd Street 10036 New York City
1 CA 300 S Doheny Dr 90048 Los Angeles
Now you can remove the columns you do not need with the drop command.
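Dropping the unneeded columns could look like the sketch below. It rebuilds a small stand-in for the concatenated frame from the sample values shown above; the names `unwanted` and `result` are only illustrative.

```python
import pandas as pd

# Stand-in for the concatenated frame shown in the output above.
df = pd.DataFrame({
    "id": [113317, 76049],
    "name": ["Casablanca Hotel Times Square",
             "Four Seasons Hotel Los Angeles at Beverly Hills"],
    "phone": ["", ""],
    "region_id": [60763, 32655],
    "region": ["NY", "CA"],
    "street-address": ["147 West 43rd Street", "300 S Doheny Dr"],
    "postal-code": ["10036", "90048"],
    "locality": ["New York City", "Los Angeles"],
})

# Columns that are not part of the desired output (illustrative name).
unwanted = ["phone", "region_id", "postal-code", "locality"]

# drop(columns=...) returns a new frame and leaves df untouched.
result = df.drop(columns=unwanted)
print(result.columns.tolist())
```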
I modified your JSON file, which was actually invalid; you can check it with, for example, JSONLint:
[{
"region_id": 60763,
"phone": "",
"address": {
"region": "NY",
"street-address": "147 West 43rd Street",
"postal-code": "10036",
"locality": "New York City"
},
"id": 113317,
"name": "Casablanca Hotel Times Square"
}, {
"region_id": 32655,
"phone": "",
"address": {
"region": "CA",
"street-address": "300 S Doheny Dr",
"postal-code": "90048",
"locality": "Los Angeles"
},
"id": 76049,
"name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}]
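If the original file really holds one complete JSON object per line, which the question's loading loop suggests but the post does not confirm, it may not need to be rewritten as an array at all. A minimal sketch under that assumption, with an in-memory buffer standing in for the real file:

```python
import io
import json
import pandas as pd

# The two sample records from the question.
records = [
    {"region_id": 60763, "phone": "",
     "address": {"region": "NY", "street-address": "147 West 43rd Street",
                 "postal-code": "10036", "locality": "New York City"},
     "id": 113317, "name": "Casablanca Hotel Times Square"},
    {"region_id": 32655, "phone": "",
     "address": {"region": "CA", "street-address": "300 S Doheny Dr",
                 "postal-code": "90048", "locality": "Los Angeles"},
     "id": 76049, "name": "Four Seasons Hotel Los Angeles at Beverly Hills"},
]

# Assumption: the file is JSON Lines (one object per line).
# StringIO stands in for the file handle here.
raw = io.StringIO("\n".join(json.dumps(r) for r in records))

# lines=True tells read_json to parse each line as a separate record.
df = pd.read_json(raw, lines=True)
print(df[["id", "name"]])
```

The address column still holds one dict per row after this step, so the flattening described in the rest of the answer is still needed.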
EDIT
Building on @MaxU's answer (which did not work for me), you can also do the following:
import pandas as pd
import ujson
# top-level import (pandas >= 1.0); the older pandas.io.json path is deprecated
from pandas import json_normalize
# this is the json file from above
with open('data.json') as f:
    data = ujson.load(f)
Now, as @MaxU suggested, you can use json_normalize to get rid of the nested structure:
df3 = json_normalize(data)
This gives you:
address.locality address.postal-code address.region address.street-address id name phone region_id
0 New York City 10036 NY 147 West 43rd Street 113317 Casablanca Hotel Times Square 60763
1 Los Angeles 90048 CA 300 S Doheny Dr 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
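To see where the dotted column names come from, the flattening can be tried on the two sample records in memory. This is only a sketch: it uses the top-level `pd.json_normalize` and its `sep` parameter, neither of which appears in the original answer.

```python
import pandas as pd

# The two sample records from the question, as plain Python dicts.
records = [
    {"region_id": 60763, "phone": "",
     "address": {"region": "NY", "street-address": "147 West 43rd Street",
                 "postal-code": "10036", "locality": "New York City"},
     "id": 113317, "name": "Casablanca Hotel Times Square"},
    {"region_id": 32655, "phone": "",
     "address": {"region": "CA", "street-address": "300 S Doheny Dr",
                 "postal-code": "90048", "locality": "Los Angeles"},
     "id": 76049, "name": "Four Seasons Hotel Los Angeles at Beverly Hills"},
]

# Nested keys become "<parent><sep><child>"; the default separator is ".".
flat = pd.json_normalize(records)

# A different separator can be chosen if dots in column names are awkward.
flat_underscore = pd.json_normalize(records, sep="_")

print(sorted(flat.columns))
print(sorted(flat_underscore.columns))
```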
You can rename the columns you want to keep like this:
df3.rename(columns={'address.region': 'region', 'address.street-address': 'street-address'}, inplace=True)
Then select the columns you want to keep:
df3 = df3[['id', 'name', 'region', 'street-address']]
This gives you the desired output:
id name region street-address
0 113317 Casablanca Hotel Times Square NY 147 West 43rd Street
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills CA 300 S Doheny Dr
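For the memory concern raised in the question, one option is to never build the wide frame: pick the four needed fields while reading each record, then join on id afterwards. The sketch below assumes a one-object-per-line file (replaced here by an in-memory buffer). The helper `iter_rows`, the frame `other` and its `extra_field` column are hypothetical and only illustrate the idea; this has not been measured on the 348 MB data set.

```python
import io
import json
import pandas as pd

def iter_rows(lines):
    """Yield one flat dict per JSON line, keeping only the needed fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        addr = rec.get("address") or {}
        yield {
            "id": rec.get("id"),
            "name": rec.get("name"),
            "region": addr.get("region"),
            "street-address": addr.get("street-address"),
        }

# Stand-in for the real file: the two sample records, one per line.
sample = io.StringIO(
    '{"region_id": 60763, "phone": "", "address": {"region": "NY", '
    '"street-address": "147 West 43rd Street", "postal-code": "10036", '
    '"locality": "New York City"}, "id": 113317, '
    '"name": "Casablanca Hotel Times Square"}\n'
    '{"region_id": 32655, "phone": "", "address": {"region": "CA", '
    '"street-address": "300 S Doheny Dr", "postal-code": "90048", '
    '"locality": "Los Angeles"}, "id": 76049, '
    '"name": "Four Seasons Hotel Los Angeles at Beverly Hills"}\n'
)

# Only the small flat dicts are materialized, not the full nested records.
df = pd.DataFrame(list(iter_rows(sample)))

# Hypothetical second data set keyed by id (made-up column and values).
other = pd.DataFrame({"id": [113317, 76049], "extra_field": ["a", "b"]})

# Join on the id column, as described in the question.
merged = df.merge(other, on="id", how="left")
print(merged)
```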
Regarding "python - Pandas: slicing nested values along with other columns", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35522653/