csv - Pig CSVExcelStorage commas inside double quotes

Tags: csv hadoop apache-pig delimiter

I am ingesting csv-formatted files (fields comma-delimited and enclosed in double quotes) into HDFS, and have developed a Pig script that removes the header row and strips the double quotes before I use an HQL script to insert the data into Hive.

This process has been working well; however, today I discovered a data problem with one of the tables. That table's file has a string field that can contain multiple commas inside the double quotes, which causes data from some records to be loaded into the wrong columns in Hive.

I cannot change the format of the source files.

Currently I am using the PiggyBank CSVExcelStorage to handle the csv format, as shown below. Can it be modified to produce the correct result? What other options do I have? I noticed there is now also a CSVLoader, but I haven't found any examples of how to use or implement it: Pig CSVLoader

USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')
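
For context, that clause sits inside a full LOAD statement along these lines (a minimal sketch; the jar path, alias, file path, and truncated schema are placeholders, not from the original script):

REGISTER /path/to/piggybank.jar;  -- hypothetical location of the PiggyBank jar
data = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (p_name:chararray, p_id:chararray, c_id:chararray);  -- remaining columns omitted for brevity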

Edited to add additional sample data and test results:

Sample input file data:

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"    
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

Using CSVExcelStorage with the format provided above:

SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

Using CSVLoader as CSVLoader(): Note - I did not see any parameter options to pass to the constructor

P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO 
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

The only real differences I see are that CSVLoader did not remove the header row (I did not see an option for that) and that it dropped some of the header names.
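
For reference, the CSVLoader test above used an invocation along these lines (a sketch; the alias and path are illustrative, and as noted no constructor options were found):

REGISTER /path/to/piggybank.jar;  -- hypothetical location of the PiggyBank jar
raw = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVLoader();
DUMP raw;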

Am I doing something wrong? A working solution would be greatly appreciated.

Best Answer

To get around the commas within fields, you can try this workaround:

1. Load each record as a single line.
2. Treating "," as the field separator, replace it with the pipe character "|".
3. Replace the leading and trailing double quotes (") with an empty string.
4. Load the rows into Hive using "|" as the delimiter.

A = LOAD 'test1.csv' AS (lines:chararray);      -- read each record as one whole line
ranked = rank A;                                -- prepend a sequential rank to every line
B = FILTER ranked BY (rank_A > 1);              -- drop the header row (rank 1)
C = FOREACH B GENERATE REPLACE($1,'","','|');   -- turn the "," field separators into pipes
D = FOREACH C GENERATE REPLACE($0,'"','');      -- strip the remaining leading/trailing quotes
DUMP D;

A = LOAD 'test1.csv' AS (lines:chararray);

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

ranked = rank A;

(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

B = FILTER ranked BY (rank_A > 1);

(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

C = FOREACH B GENERATE REPLACE($1,'","','|');

("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q")

D = FOREACH C GENERATE REPLACE($0,'"','');

(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This Sample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name|3235|c_name2|Q)

You can now load this data into Hive using "|" as the delimiter.
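
To persist the result instead of dumping it, a closing STORE step along these lines should work (a sketch; the output path is hypothetical). Since each tuple in D is a single chararray that already contains the pipes, the default storage writes each line out unchanged, and the Hive table over that location can declare ROW FORMAT DELIMITED FIELDS TERMINATED BY '|':

STORE D INTO '/user/hive/staging/test1_clean';  -- hypothetical HDFS output path
-- Each tuple holds one chararray already containing the '|' delimiters,
-- so the default PigStorage writes the assembled line as-is.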


Regarding csv - Pig CSVExcelStorage commas inside double quotes, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38315557/
