I need to count letters rather than words, but I'm having trouble implementing this with Apache Pig version 0.8.1-cdh3u1.
Given the following input:
989;850;abcccc
29;395;aabbcc
the output should be:
989;850;a;1
989;850;b;1
989;850;c;4
29;395;a;2
29;395;b;2
29;395;c;2
Here is what I have tried:
A = LOAD 'input' using PigStorage(';') as (x:int, y:int, content:chararray);
B = foreach A generate x, y, FLATTEN(STRSPLIT(content, '(?<=.)(?=.)', 6)) as letters;
C = foreach B generate x, y, FLATTEN(TOBAG(*)) as letters;
D = foreach C generate x, y, letters.letters as letter;
E = GROUP D BY (x,y,letter);
F = foreach E generate group.x as x, group.y as y, group.letter as letter, COUNT(D.letter) as count;
A, B, and C can be dumped, but `dump D` fails with: `ERROR 2997: Unable to recreate exception from backed error: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple`
Dumping C shows (although the third value is a strange tuple):
(989,850,a)
(989,850,b)
(989,850,c)
(989,850,c)
(989,850,c)
(989,850,c)
(29,395,a)
(29,395,a)
(29,395,b)
(29,395,b)
(29,395,c)
(29,395,c)
Here are the schemas:
grunt> describe A; describe B; describe C; describe D; describe E; describe F;
A: {x: int,y: int,content: chararray}
B: {x: int,y: int,letters: bytearray}
C: {x: int,y: int,letters: (x: int,y: int,letters: bytearray)}
D: {x: int,y: int,letter: bytearray}
E: {group: (x: int,y: int,letter: bytearray),D: {x: int,y: int,letter: bytearray}}
F: {x: int,y: int,letter: bytearray,count: long}
This version of Pig does not seem to support TOBAG($2..$8), so TOBAG(*) also picks up x and y, but that could be sorted out syntactically later... I want to avoid writing a UDF, otherwise I would have used the Java API directly.
But I don't really understand the cast error. Can anyone explain it?
Best answer
I would suggest writing a custom UDF instead. A quick, crude implementation could look like this:
package com.example;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CharacterCount extends EvalFunc<DataBag> {

    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            Map<Character, Integer> charMap = new HashMap<Character, Integer>();
            DataBag result = bagFactory.newDefaultBag();

            int x = (Integer) input.get(0);
            int y = (Integer) input.get(1);
            String content = (String) input.get(2);

            // Tally the occurrences of each character in the content field.
            for (int i = 0; i < content.length(); i++) {
                char c = content.charAt(i);
                Integer count = charMap.get(c);
                count = (count == null) ? 1 : count + 1;
                charMap.put(c, count);
            }

            // Emit one (x, y, letter, count) tuple per distinct character.
            for (Map.Entry<Character, Integer> entry : charMap.entrySet()) {
                Tuple res = tupleFactory.newTuple(4);
                res.set(0, x);
                res.set(1, y);
                res.set(2, String.valueOf(entry.getKey()));
                res.set(3, entry.getValue());
                result.add(res);
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException("CharacterCount error", e);
        }
    }
}
Package it into a jar and run:
register '/home/user/test/myjar.jar';
A = LOAD '/user/hadoop/store/sample/charcount.txt' using PigStorage(';')
as (x:int, y:int, content:chararray);
B = foreach A generate flatten(com.example.CharacterCount(x,y,content))
as (x:int, y:int, letter:chararray, count:int);
dump B;
(989,850,b,1)
(989,850,c,4)
(989,850,a,1)
(29,395,b,2)
(29,395,c,2)
(29,395,a,2)
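The HashMap tally inside `exec` can be exercised on its own in plain Java, without any Pig dependencies. Below is a standalone sketch; `countChars` is a hypothetical helper that mirrors the UDF's counting loop, not part of the answer's jar:

```java
import java.util.HashMap;
import java.util.Map;

public class CountCharsSketch {
    // Mirrors the counting loop in CharacterCount.exec:
    // builds a per-character tally for one content string.
    static Map<Character, Integer> countChars(String content) {
        Map<Character, Integer> charMap = new HashMap<Character, Integer>();
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            Integer count = charMap.get(c);
            charMap.put(c, count == null ? 1 : count + 1);
        }
        return charMap;
    }

    public static void main(String[] args) {
        // "abcccc" is the content field of the first sample row.
        System.out.println(countChars("abcccc")); // a=1, b=1, c=4
    }
}
```

Running this against both sample content fields reproduces the letter counts seen in the dump above.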
Regarding "hadoop - Pig Mapreduce count consecutive letters", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13248845/