hadoop - Loading multi-line XML data from a text file into a Hive table

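My table in MSSQL has the following schema: (id bigint, xmldata xml, other bigint). I want to create a Hive table with the same schema as the SQL table. Using Sqoop, I got the raw data file onto HDFS, for example: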

1  5678|<root><l1><l2><productid>1234</productid><description>Just to show</description></l2></l1></root>|1002020
2  5679|<root><l1><l2><productid>1239</productid><description>Just to show</description></l2></l1></root>|4213212
3  5680||112345
4  ....
8  5688|<root><l1><l2><productid>1248</productid><description>Just
9  to 
10 show
11 </description></l2></l1></root>|12391023
12 5689|<root><l1><l2><productid>1259</productid><description>Just to
13 show</description></l2></l1></root>|12391021

The leading numbers 1, 2, 3 are line numbers. I use | as the column delimiter. As you can see, some XML fields span multiple lines.

My question is: how do I create a Hive table and load this raw data?

I have read the related questions on SO, but none of them covers my case.
I tried:

CREATE TABLE test (id Bigint, xmldata String, other Bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA INPATH 'path/raw' INTO TABLE test;

The resulting Hive table is incorrect, for example:

5678    <root><l1><l2><productid>1234</productid><description>Just to show</description></l2></l1></root>   1002020
5679    <root><l1><l2><productid>1239</productid><description>Just to show</description></l2></l1></root>   4213212
5680        112345
5688    <root><l1><l2><productid>1248</productid><description>Just  NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    12391023    NULL
5689    <root><l1><l2><productid>1259</productid><description>Just to   NULL
NULL    12391021    NULL

Update:

I also tried:

CREATE TABLE ts_test (id String, xmldata String, other String) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES (
  "input.regex" = "^([^\|]*)[\|]([^\|]*)[\|]([^\|]*)$",
  "output.format.string" = "%1$s %2$s %3$s"
);

Load data inpath 'path/test.xml' into table ts_test;
The output table is incorrect, for example:

5678    <root><l1><l2><productid>1234</productid><description>Just to show</description></l2></l1></root>   1002020
5679    <root><l1><l2><productid>1239</productid><description>Just to show</description></l2></l1></root>   4213212
5680        112345
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL

Best answer

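You need a custom InputFormat. Hive's default TextInputFormat splits the file into records at every newline before any SerDe sees the data, so neither the delimited row format nor RegexSerDe can reassemble a record that spans several lines. Rather than repeating the details here, take a look at
Using FileFormat v Serde to read custom text files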
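
Whatever form the custom reader takes, the core step is to keep gluing physical lines together until a record contains all three fields. The block below is a minimal sketch of that logic in Python, written as a preprocessing pass over the raw file rather than as an actual Hadoop InputFormat. It rests on several assumptions that are not stated in the question: the leading line numbers in the sample are display only and are not in the file, the | character never occurs inside the XML, every record has exactly three fields, and the embedded line breaks can be replaced by single spaces. The name merge_records is hypothetical.

```python
def merge_records(lines, delimiter="|", expected_fields=3):
    """Yield one logical record per output string.

    Physical lines are buffered until the joined text contains
    expected_fields - 1 delimiters, i.e. a complete record.
    Assumption: the delimiter never appears inside the XML payload.
    """
    buffer = []
    for line in lines:
        piece = line.strip()
        if not piece:
            # Skip blank physical lines; they carry no field data.
            continue
        buffer.append(piece)
        # Assumption: a line break inside the XML becomes one space.
        candidate = " ".join(buffer)
        if candidate.count(delimiter) >= expected_fields - 1:
            yield candidate
            buffer = []
    if buffer:
        # Incomplete trailing record: emit it as-is so nothing is lost.
        yield " ".join(buffer)


if __name__ == "__main__":
    sample = [
        "5680||112345\n",
        "5688|<root><description>Just\n",
        "to \n",
        "show\n",
        "</description></root>|12391023\n",
    ]
    for record in merge_records(sample):
        print(record)
```

If those assumptions hold, the merged output has one record per line, so it should load into the original ROW FORMAT DELIMITED table from the question. A real InputFormat would apply the same buffering inside its RecordReader and would also have to handle records that straddle split boundaries, which this sketch does not address.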

Regarding this question, a similar one can be found on Stack Overflow: https://stackoverflow.com/questions/24815464/
