I have an XML blob (shown below) stored in a Hive log table.
<user>
<uid>1424324325</uid>
<attribs>
<field>
...
</field>
<field>
<name>first</name>
<value>Joh,n</value>
</field>
<field>
...
</field>
<field>
<name>last</name>
<value>D,oe</value>
</field>
<field>
...
</field>
</attribs>
</user>
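Each row in the Hive table holds information about a different user, and I want to extract the uid, first name, and last name (removing any commas from the names).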
1424324325 John Doe
1424435463 Jane Smith
I am able to extract the values from the XML:
SELECT uid, fn, ln
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;
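However, I am having trouble removing the unwanted commas (if present) from the first and last names.
When I try to extract the first name with either of the approaches shown below, the result is empty.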
LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/replace(text(),",","")')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/translate(text(),",","")')) fns as fn
When I try the following instead, replace is reported as an invalid function, while translate extracts the data without removing the extra commas.
LATERAL VIEW explode(xpath(logs['users_updates'], replace('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], translate('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn
How can I extract the name values without the commas?
1424324325 John Doe
1424435463 Jane Smith
Final solution: this is the final working query, following Jens's suggestion.
SELECT uid, regexp_replace(fn,",","") as fname, regexp_replace(ln,",","") as lname
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;
Best answer
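Hive does not support XPath 2.0. This affects your problem in two ways:
- Calling functions in axis steps is not allowed. While //value/translate(text(), ',', '') (which calls translate for each <value/> element) is valid XPath 2.0, you cannot do this in XPath 1.0. translate(//value, ',', ''), on the other hand, is evaluated once over the whole node-set; XPath 1.0 converts a node-set to a string using only its first node, so you do not get one cleaned value per <value/> element.
- There is no replace function in XPath 1.0.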
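To illustrate the first point: XPath 1.0 does accept translate at the top level of an expression, as long as the path selects the one node you want. The following is a minimal sketch, assuming Hive's built-in xpath_string UDF (not used in the question) and a Hive version that allows SELECT without FROM. The inline XML is a trimmed copy of the sample above, and the query has not been tested against any particular Hive release.

```sql
-- Sketch only: translate() wraps the whole path instead of appearing as a path step.
-- The predicate selects a single <field>, so the node-set to string conversion
-- (first node only in XPath 1.0) picks exactly the value we want.
SELECT xpath_string(
         '<user><attribs><field><name>first</name><value>Joh,n</value></field></attribs></user>',
         'translate(/user/attribs/field[name = "first"]/value, ",", "")'
       ) AS fname;
```

The empty third argument tells translate to drop every comma rather than map it to another character.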
It is probably easier to pass the value through with its commas and do the string manipulation in Hive.
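A minimal sketch of that approach, assuming Hive's built-in xpath_string UDF (which returns the text of the first matching node and is not used in the question). The table, column, and field names are taken from the question's queries. If each row holds one uid, one first name, and one last name, this form also avoids the three LATERAL VIEW explode clauses:

```sql
-- Sketch only: log_table, logs['users_updates'] and the "first_name"/"last_name"
-- field names come from the question's queries, not from a verified schema.
SELECT
  xpath_string(logs['users_updates'], '/user/uid') AS uid,
  -- Extract the raw value with XPath 1.0, then strip commas on the Hive side.
  regexp_replace(
    xpath_string(logs['users_updates'],
                 '/user/attribs/field[name = "first_name"]/value'),
    ',', '') AS fname,
  regexp_replace(
    xpath_string(logs['users_updates'],
                 '/user/attribs/field[name = "last_name"]/value'),
    ',', '') AS lname
FROM log_table;
```

If a row could contain several matching fields, the xpath plus LATERAL VIEW explode form from the question is the one to keep, with regexp_replace applied to the exploded columns.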
An additional remark for when you do have XPath 2.0: translate expects a single string as its first argument. You would need string-join.
Regarding "xml - HiveQL & XPath - how to extract values and replace some characters", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22084184/