regex - Pig - Removing newlines, carriage returns, and tabs

Tags: regex hadoop apache-pig

I'm trying to remove the characters \n, \t, and \r from a column in Pig, but I'm getting the wrong output.

This is what I'm doing:

qr_1 = LOAD 'hdfs://localhost:9000/sample.csv' USING PigStorage(',') as (Id:int,PostTypeId:int,AcceptedAnswerId:int,ParentId:int,CreationDate:chararray,DeletionDate:chararray,Score:int,ViewCount:int,Body:chararray,OwnerUserId:int,OwnerDisplayName:chararray,LastEditorUserId:int,LastEditorDisplayName:chararray,LastEditDate:chararray,LastActivityDate:chararray,Title:chararray,Tags:chararray,AnswerCount:int,CommentCount:int,FavoriteCount:int,ClosedDate:chararray,CommunityOwnedDate:chararray);
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\n','') as Body, OwnerUserId .. ;
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\r','') as Body, OwnerUserId .. ;   
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\t','') as Body, OwnerUserId .. ;   

Input:

5585779,1,5585800,,2011-04-07 18:27:54,,1432,3090250,"<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p>

<p>My String contains only numbers and I want to return the number it represents.</p>

<p>For example, given the string <code>""""1234""""</code> the result should be the number <code>1234</code>.</p>",537967,,2756409,user166390,2015-09-10 21:30:42,2016-03-07 00:42:49,Converting String to Int in Java?,<java><string><type-conversion>,12,0,239

Output:

(5585779,1,5585800,,2011-04-07 18:27:54,,1432,3090250,"<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p>,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,,)
(,,537967,,2756409,user166390,,,Converting String to Int in Java?,,12,0,239,,,,,,,,,)

What am I doing wrong?

Thanks.

Using '\\n' makes no difference either.

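Best answer

Your data contains commas inside the quoted Body field, which is why the fields no longer line up with the schema. Load the file with CSVLoader instead of PigStorage, then use REPLACE to strip '\\t', '\\n', and '\\r'.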
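A minimal sketch of that approach follows. It rests on a few assumptions: the piggybank.jar path is a placeholder, and it swaps in piggybank's CSVExcelStorage with the 'YES_MULTILINE' option rather than CSVLoader, because the sample Body also contains line breaks inside the quotes and, as far as I know, that option is what lets a quoted field span several lines. The alias qr_2 is only illustrative.

```pig
-- Sketch only: the jar location is an assumption; point it at your own piggybank.jar.
REGISTER /path/to/piggybank.jar;

-- CSVExcelStorage respects double-quoted fields, so commas inside Body do not
-- split the column; 'YES_MULTILINE' allows line breaks inside a quoted field.
qr_1 = LOAD 'hdfs://localhost:9000/sample.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE')
       AS (Id:int, PostTypeId:int, AcceptedAnswerId:int, ParentId:int,
           CreationDate:chararray, DeletionDate:chararray, Score:int, ViewCount:int,
           Body:chararray, OwnerUserId:int, OwnerDisplayName:chararray,
           LastEditorUserId:int, LastEditorDisplayName:chararray,
           LastEditDate:chararray, LastActivityDate:chararray, Title:chararray,
           Tags:chararray, AnswerCount:int, CommentCount:int, FavoriteCount:int,
           ClosedDate:chararray, CommunityOwnedDate:chararray);

-- REPLACE treats its second argument as a Java regular expression, so a single
-- character class removes newlines, carriage returns, and tabs in one pass.
qr_2 = FOREACH qr_1 GENERATE Id .. ViewCount,
       REPLACE(Body, '[\\n\\r\\t]', '') AS Body,
       OwnerUserId .. ;
```

Doing the replacement in one FOREACH avoids three passes over the relation. The REPLACE calls in the question were not the root problem: with PigStorage the record had already been split on the embedded commas and line breaks before REPLACE ever ran.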

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/36212024/
