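I have an input dataset like this:
"UserID"|"State","City","Country"|"Area Code"
"203448"|"aylesbury, n/a, united kingdom"|\N
Here both , and | act as delimiters.
How can I use both delimiters when creating a table in Hive?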
Best answer
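I suggest loading each line of the input file, in its entirety, into a staging table with a single string column, and then splitting each line with a regular expression that matches both commas and pipes. For example: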
DROP TABLE IF EXISTS staging;
CREATE TABLE staging (rawdata STRING);
LOAD DATA LOCAL INPATH 'test.data' INTO TABLE staging;
-- I put your data into a local file called "test.data" - change your path accordingly
With your data, the staging table now looks like this:
hive> SELECT * FROM staging;
OK
"UserID"|"State","City","Country"|"Area Code"
"203448"|"aylesbury, n/a, united kingdom"|\N
Time taken: 0.452 seconds, Fetched: 2 row(s)
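Because the next step splits every row on commas and pipes, a row with an extra or missing comma would shift its values into the wrong columns. A quick sanity check before building the final table might look like the sketch below. It assumes five fields per row, as in the two sample rows, and uses the same split pattern as the rest of this answer; for the sample data it should return no rows.

```sql
-- Sketch only: 5 is the field count assumed from the sample rows.
-- Lists any staged row that does not split into exactly five fields.
SELECT rawdata,
       size(split(rawdata, ",|\\|")) AS field_count
FROM staging
WHERE size(split(rawdata, ",|\\|")) <> 5;
```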
You can then create your final table (I arbitrarily named it "target"; replace it with your own name):
DROP TABLE IF EXISTS target;
CREATE TABLE target AS SELECT
i[0] AS columnNameA,
i[1] AS columnNameB,
i[2] AS columnNameC,
i[3] AS columnNameD,
i[4] AS columnNameE
FROM (SELECT split(rawdata, ",|\\|") AS i FROM staging) t;
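Replace the column names with the headers you want. In any case, these are the contents of the target table after it has been created (I piped the output through sed so that fields are separated by :: rather than by tabs, which I find unreadable):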
# hive -e "select * from target" 2>/dev/null | sed 's/\t/ :: /g'
"UserID" :: "State" :: "City" :: "Country" :: "Area Code"
"203448" :: "aylesbury :: n/a :: united kingdom" :: NULL
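Note that the values still carry the double quotes from the input and that the header line is stored as an ordinary row. Since the split pattern matches only the comma itself, any space that follows a comma in the input also stays at the start of the next field. If none of that is wanted, one possible clean-up is sketched below. It reads from the same staging table; the table name target_clean and the filter that recognizes the header by its leading "UserID" are illustrative assumptions, not part of the approach above.

```sql
-- Sketch only: "target_clean" is a hypothetical table name.
DROP TABLE IF EXISTS target_clean;
CREATE TABLE target_clean AS SELECT
  -- Remove double quotes, then strip leading/trailing spaces.
  trim(regexp_replace(i[0], '"', '')) AS columnNameA,
  trim(regexp_replace(i[1], '"', '')) AS columnNameB,
  trim(regexp_replace(i[2], '"', '')) AS columnNameC,
  trim(regexp_replace(i[3], '"', '')) AS columnNameD,
  trim(regexp_replace(i[4], '"', '')) AS columnNameE
FROM (
  SELECT split(rawdata, ",|\\|") AS i
  FROM staging
  -- Assumption: the header is the only row that starts with "UserID".
  WHERE rawdata NOT LIKE '"UserID"%'
) t;
```

This removes every double quote in a field, so it is only suitable when the data never contains quotes that should be kept.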
For more on using multiple delimiters in Hive, see the similar question on Stack Overflow: https://stackoverflow.com/questions/28260554/