csv - Pig CSVExcelStorage commas inside double quotes

Tags: csv hadoop apache-pig delimiter

I am ingesting csv-formatted files (fields comma-delimited and enclosed in double quotes) into HDFS, and have developed a Pig script that removes the header row and strips the double quotes before I use an HQL script to insert the data into Hive.

This process has been working well; however, today I discovered a data problem with one of the tables. That table's file has a string field that can contain multiple commas inside the double quotes, which causes data from some records to be loaded into the wrong columns in Hive.

I cannot change the format of the source files.

Currently I am using the PiggyBank CSVExcelStorage to handle the csv format, as shown below. Can it be modified to produce the correct result? What other options do I have? I noticed there is now also a CSVLoader, but I haven't found any examples of how to use or implement it: Pig CSVLoader

USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')
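
For context, that clause sits inside a full LOAD statement along these lines (a minimal sketch; the jar path, alias, file path, and truncated schema are placeholders, not from the original script):

REGISTER /path/to/piggybank.jar;  -- hypothetical location of the PiggyBank jar
data = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (p_name:chararray, p_id:chararray, c_id:chararray);  -- remaining columns omitted for brevity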

Edited to add additional sample data and test results:

Sample input file data:

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"    
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

Using CSVExcelStorage with the format provided above:

SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

Using CSVLoader as CSVLoader(): Note - I did not see any parameter options to pass to the constructor

P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO 
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

The only real differences I see are that CSVLoader did not remove the header row (I did not see an option for that) and that it dropped some of the header names.
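
For reference, the CSVLoader test above used an invocation along these lines (a sketch; the alias and path are illustrative, and as noted no constructor options were found):

REGISTER /path/to/piggybank.jar;  -- hypothetical location of the PiggyBank jar
raw = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVLoader();
DUMP raw;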

Am I doing something wrong? A working solution would be greatly appreciated.

Best Answer

To get around the commas within fields, you can try this workaround:

1. Load each record as a single line.
2. Treating "," as the field separator, replace it with the pipe character "|".
3. Replace the leading and trailing double quotes (") with an empty string.
4. Load the rows into Hive using "|" as the delimiter.

A = LOAD 'test1.csv' AS (lines:chararray);      -- read each record as one whole line
ranked = rank A;                                -- prepend a sequential rank to every line
B = FILTER ranked BY (rank_A > 1);              -- drop the header row (rank 1)
C = FOREACH B GENERATE REPLACE($1,'","','|');   -- turn the "," field separators into pipes
D = FOREACH C GENERATE REPLACE($0,'"','');      -- strip the remaining leading/trailing quotes
DUMP D;

A = LOAD 'test1.csv' AS (lines:chararray);

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

ranked = rank A;

(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

B = FILTER ranked BY (rank_A > 1);

(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

C = FOREACH B GENERATE REPLACE($1,'","','|');

("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q")

D = FOREACH C GENERATE REPLACE($0,'"','');

(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This Sample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name|3235|c_name2|Q)

You can now load this data into Hive using "|" as the delimiter.
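
To persist the result instead of dumping it, a closing STORE step along these lines should work (a sketch; the output path is hypothetical). Since each tuple in D is a single chararray that already contains the pipes, the default storage writes each line out unchanged, and the Hive table over that location can declare ROW FORMAT DELIMITED FIELDS TERMINATED BY '|':

STORE D INTO '/user/hive/staging/test1_clean';  -- hypothetical HDFS output path
-- Each tuple holds one chararray already containing the '|' delimiters,
-- so the default PigStorage writes the assembled line as-is.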


Regarding csv - Pig CSVExcelStorage commas inside double quotes, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38315557/
