Parsing JSON with Hive

Tags: json hadoop hive

My JSON file looks like this:

{
    "total": 3666,
    "offset": 1,
    "len": 2,
    "workflows": [
        {
            "appName": "xxxx1",
            "externalId": null,
            "conf": null,
            "run": 0,
            "acl": null,
            "appPath": null,
            "parentId": null,
            "lastModTime": "Tue, 08 Aug 2017 22:15:11 GMT",
            "consoleUrl": "http://**************:11000/oozie?job=0000130-170807105041043-oozie-oozi-W",
            "createdTime": "Tue, 08 Aug 2017 22:02:13 GMT",
            "startTime": "Tue, 08 Aug 2017 22:02:13 GMT",
            "toString": "Workflow id[0000130-170807105041043-oozie-oozi-W] status[SUCCEEDED]",
            "id": "0000130-170807105041043-oozie-oozi-W",
            "endTime": "Tue, 08 Aug 2017 22:15:11 GMT",
            "user": "user1",
            "actions": [],
            "status": "SUCCEEDED",
            "group": null
        },
        {
            "appName": "xxxx2",
            "externalId": null,
            "conf": null,
            "run": 0,
            "acl": null,
            "appPath": null,
            "parentId": null,
            "lastModTime": "Mon, 07 Aug 2017 20:16:20 GMT",
            "consoleUrl": "http://**************:11000/oozie?job=0000031-170807105041043-oozie-oozi-W",
            "createdTime": "Mon, 07 Aug 2017 20:15:02 GMT",
            "startTime": "Mon, 07 Aug 2017 20:15:02 GMT",
            "toString": "Workflow id[0000031-170807105041043-oozie-oozi-W] status[SUCCEEDED]",
            "id": "0000031-170807105041043-oozie-oozi-W",
            "endTime": "Mon, 07 Aug 2017 20:16:20 GMT",
            "user": "user1",
            "actions": [],
            "status": "SUCCEEDED",
            "group": null
        }
    ]
}

I am trying to parse it and load it into a Hive table. I have tried the following two approaches.

Approach 1:

select 
get_json_object(json_data,'$.workflows[5].id') as id,
get_json_object(json_data,'$.workflows[5].appName') as app_name,
get_json_object(json_data,'$.workflows[5].createdTime') as created_time,
get_json_object(json_data,'$.workflows[5].startTime') as start_time,
get_json_object(json_data,'$.workflows[5].endTime') as end_time,
get_json_object(json_data,'$.workflows[5].user') as user,
get_json_object(json_data,'$.workflows[5].status') as status
from 
leap_frog_audit.oozie_json_file

This gives me the following error:

Only a single expression in the SELECT clause is supported with UDTF's

Approach 2:

CREATE TABLE default.oozie_metrics AS
SELECT m.col AS id,
       k.col AS appName,
       c.col AS createdTime,
       s.col AS start_time,
       e.col AS end_time,
       u.col AS user_name,
       st.col AS status
FROM leap_frog_audit.oozie_json_file
LATERAL VIEW explode(split(regexp_replace(get_json_object(json_data,'$.workflows[*].id'),'\\[\\"|\\"\\]',''),'\\"\\,\\"')) m
LATERAL VIEW explode(split(regexp_replace(get_json_object(json_data,'$.workflows[*].appName'),'\\[\\"|\\"\\]',''),'\\"\\,\\"')) k
LATERAL VIEW explode(split(regexp_replace(get_json_object(json_data,'$.workflows[*].createdTime'),'\\[\\"|\\"\\]',''),'\\"\\,\\"')) c
LATERAL VIEW explode(split(regexp_replace(get_json_object(json_data,'$.workflows[*].startTime'),'\\[\\"|\\"\\]',''),'\\"\\,\\"')) s
LATERAL VIEW explode(split(regexp_replace(get_json_object(json_data,'$.workflows[*].endTime'),'\\[\\"|\\"\\]',''),'\\"\\,\\"')) e
LATERAL VIEW explode(split(regexp_replace(get_json_object(json_data,'$.workflows[*].user'),'\\[\\"|\\"\\]',''),'\\"\\,\\"')) u
LATERAL VIEW explode(split(regexp_replace(get_json_object(json_data,'$.workflows[*].status'),'\\[\\"|\\"\\]',''),'\\"\\,\\"')) st

This takes a very long time to run. Is there an efficient way to get the output below?

+--------------------------------------+----------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+
| id                                   | app_name | created_time                  | start_time                    | end_time                      | user  | status    |
+--------------------------------------+----------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+
| 0000130-170807105041043-oozie-oozi-W | xxxx1    | Tue, 08 Aug 2017 22:02:13 GMT | Tue, 08 Aug 2017 22:02:13 GMT | Tue, 08 Aug 2017 22:15:11 GMT | user1 | SUCCEEDED |
+--------------------------------------+----------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+
| 0000031-170807105041043-oozie-oozi-W | xxxx2    | Mon, 07 Aug 2017 20:15:02 GMT | Mon, 07 Aug 2017 20:15:02 GMT | Mon, 07 Aug 2017 20:16:20 GMT | user1 | SUCCEEDED |
+--------------------------------------+----------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+

Best answer

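-- Explode the workflows array once: get_json_object returns it as a JSON array string,
-- substr(..., 2) drops the leading '[', and split breaks it at each comma between '}' and '{'.
-- json_tuple then extracts all seven fields from each array element in one call.
select  jt.*
from    oozie_json_file ojf
        lateral view  explode (split(substr(get_json_object(ojf.json_data,'$.workflows[*]'),2),'(?<=\\}),(?=\\{)')) e as app
        lateral view  json_tuple (e.app,'id','appName','createdTime','startTime','endTime','user','status') jt as `id`,`appName`,`createdTime`,`startTime`,`endTime`,`user`,`status`
;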

+--------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+
| id                                   | appname | createdtime                   | starttime                     | endtime                       | user  | status    |
+--------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+
| 0000130-170807105041043-oozie-oozi-W | xxxx1   | Tue, 08 Aug 2017 22:02:13 GMT | Tue, 08 Aug 2017 22:02:13 GMT | Tue, 08 Aug 2017 22:15:11 GMT | user1 | SUCCEEDED |
+--------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+
| 0000031-170807105041043-oozie-oozi-W | xxxx2   | Mon, 07 Aug 2017 20:15:02 GMT | Mon, 07 Aug 2017 20:15:02 GMT | Mon, 07 Aug 2017 20:16:20 GMT | user1 | SUCCEEDED |
+--------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------+-----------+
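
To load this result into a Hive table with the column names from the desired output, the same query can be wrapped in a CREATE TABLE ... AS SELECT. What follows is a minimal, untested sketch rather than a definitive statement. The table names default.oozie_metrics and leap_frog_audit.oozie_json_file are taken from the question. The alias user_name is used instead of user to sidestep possible keyword conflicts, which is an assumption about the Hive version in use. The split pattern also assumes that no workflow element contains nested objects separated by "},{" (the actions arrays in the sample are empty).

```sql
-- Sketch only: table names come from the question; the column aliases are illustrative.
CREATE TABLE default.oozie_metrics AS
SELECT  jt.id,
        jt.app_name,
        jt.created_time,
        jt.start_time,
        jt.end_time,
        jt.user_name,
        jt.status
FROM    leap_frog_audit.oozie_json_file ojf
        LATERAL VIEW explode(split(substr(get_json_object(ojf.json_data, '$.workflows[*]'), 2), '(?<=\\}),(?=\\{)')) e AS app
        LATERAL VIEW json_tuple(e.app, 'id', 'appName', 'createdTime', 'startTime', 'endTime', 'user', 'status') jt
            AS id, app_name, created_time, start_time, end_time, user_name, status;
```

Approach 2 in the question chains seven independent explode calls, so it produces every combination of the exploded values (2^7 = 128 rows for the two-workflow sample). Exploding the array once and reading the fields with json_tuple yields one row per workflow.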

This question about parsing JSON with Hive comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/45580821/
