I'm trying to solve the following RecordReader problem. Sample input file:
1,1
2,2
3,3
4,4
5,5
6,6
7,7
.......
.......
I want my RecordReader to return
key | Value
0 |1,1:2,2:3,3:4,4:5,5
4 |2,2:3,3:......6,6
6 |3,3:4,4......6,6,7,7
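(The first value holds the first five lines, the second value holds five lines starting from the second line, the third value holds five lines starting from the third line, and so on.)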
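To make the desired output concrete, here is a minimal stand-alone sketch of the windowing in plain Java, with no Hadoop involved. The class name `SlidingWindowDemo` and the method `windows` are hypothetical and exist only for this illustration; it shows the values only, not the byte-offset keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: builds overlapping windows of
// `size` consecutive lines, joined with ':'.
public class SlidingWindowDemo {

    public static List<String> windows(List<String> lines, int size) {
        List<String> out = new ArrayList<>();
        // One window per starting line, as long as `size` lines remain.
        for (int start = 0; start + size <= lines.size(); start++) {
            out.add(String.join(":", lines.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines =
            Arrays.asList("1,1", "2,2", "3,3", "4,4", "5,5", "6,6", "7,7");
        // Prints three windows, each starting one line later than the previous one.
        for (String w : windows(lines, 5)) {
            System.out.println(w);
        }
    }
}
```

Each window starts one line after the previous one, so consecutive values share four lines.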
Here is my current nextKeyValue implementation:
public class MyRecordReader extends RecordReader<LongWritable, Text> {
    // Fields used below (pos, end, in, key, value, fileSeek, maxLineLength, LOG)
    // and the other overridden methods are omitted from this excerpt.

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        while (pos < end) {
            key.set(pos);
            // five line logic
            Text nextLine = new Text();
            // Read the first line of the group into value.
            int newSize = in.readLine(value, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                            maxLineLength));
            fileSeek += newSize;
            // Read the next four lines and append them, separated by ':'.
            for (int n = 0; n < 4; n++) {
                fileSeek += in.readLine(nextLine, maxLineLength,
                        Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                                maxLineLength));
                value.append(":".getBytes(), 0, 1);
                value.append(nextLine.getBytes(), 0, nextLine.getLength());
            }
            if (newSize == 0) {
                return false;
            }
            pos += newSize;
            if (newSize < maxLineLength) {
                return true;
            }
            // line too long. try again
            LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
        }
        return false;
    }
}
But this returns the values as
key | Value
0 |1,1:2,2:3,3:4,4:5,5
4 |6,6:7,7.......10,10
6 |11,11:12,12:......14,14
Can someone help me with this code? New code for the RecordReader would also be fine. Requirement of the problem (may help you understand the use case). Thanks.
Best answer
I think I understand the question. Here's what I would do: wrap another RecordReader and buffer its keys/values in local queues.
import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MyRecordReader extends RecordReader<LongWritable, Text> {

    private static final int BUFFER_SIZE = 5;
    private static final String DELIMITER = ":";

    // Sliding window over the last BUFFER_SIZE records of the wrapped reader.
    private Queue<String> valueBuffer = new LinkedList<String>();
    private Queue<Long> keyBuffer = new LinkedList<Long>();

    private LongWritable key = new LongWritable();
    private Text value = new Text();

    // The wrapped reader that supplies one line per record.
    private RecordReader<LongWritable, Text> rr;

    public MyRecordReader(RecordReader<LongWritable, Text> rr) {
        this.rr = rr;
    }

    @Override
    public void close() throws IOException {
        rr.close();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return rr.getProgress();
    }

    @Override
    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
            throws IOException, InterruptedException {
        rr.initialize(arg0, arg1);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (valueBuffer.isEmpty()) {
            // First call: fill the window with the first BUFFER_SIZE records.
            while (valueBuffer.size() < BUFFER_SIZE) {
                if (rr.nextKeyValue()) {
                    keyBuffer.add(rr.getCurrentKey().get());
                    valueBuffer.add(rr.getCurrentValue().toString());
                } else {
                    return false;
                }
            }
        } else {
            // Later calls: slide the window forward by one record.
            if (rr.nextKeyValue()) {
                keyBuffer.add(rr.getCurrentKey().get());
                valueBuffer.add(rr.getCurrentValue().toString());
                keyBuffer.remove();
                valueBuffer.remove();
            } else {
                return false;
            }
        }
        // The key is the key of the oldest record in the window.
        key.set(keyBuffer.peek());
        value.set(getValue());
        return true;
    }

    // Joins the buffered values with DELIMITER, oldest first.
    private String getValue() {
        StringBuilder sb = new StringBuilder();
        Iterator<String> iter = valueBuffer.iterator();
        while (iter.hasNext()) {
            sb.append(iter.next());
            if (iter.hasNext()) sb.append(DELIMITER);
        }
        return sb.toString();
    }
}
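The buffering logic can also be tried without a Hadoop cluster. The following is a rough stand-alone simulation, not the reader itself: an `Iterator<String>` stands in for the wrapped RecordReader, and the keys are computed as byte offsets under the assumption of ASCII lines terminated by a single `\n`. The class `WindowBufferDemo` and all of its members are hypothetical names used only for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

// Stand-alone simulation of the queue-based sliding window.
// An Iterator<String> plays the role of the wrapped RecordReader.
public class WindowBufferDemo {
    private final Iterator<String> source;
    private final int size;
    private final Deque<String> values = new ArrayDeque<>();
    private final Deque<Long> offsets = new ArrayDeque<>();
    private long nextOffset = 0; // assumes ASCII text and a 1-byte '\n' per line
    private long key;
    private String value;

    public WindowBufferDemo(Iterator<String> source, int size) {
        this.source = source;
        this.size = size;
    }

    // Pulls one line from the source into the buffers; false when exhausted.
    private boolean readOne() {
        if (!source.hasNext()) return false;
        String line = source.next();
        offsets.add(nextOffset);
        values.add(line);
        nextOffset += line.length() + 1;
        return true;
    }

    public boolean next() {
        if (values.isEmpty()) {
            // First call: fill the whole window.
            while (values.size() < size) {
                if (!readOne()) return false;
            }
        } else {
            // Later calls: add one line, drop the oldest one.
            if (!readOne()) return false;
            offsets.remove();
            values.remove();
        }
        key = offsets.peek();
        value = String.join(":", values);
        return true;
    }

    public long getKey() { return key; }

    public String getValue() { return value; }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList(
            "1,1", "2,2", "3,3", "4,4", "5,5", "6,6", "7,7").iterator();
        WindowBufferDemo demo = new WindowBufferDemo(lines, 5);
        while (demo.next()) {
            System.out.println(demo.getKey() + " | " + demo.getValue());
        }
    }
}
```

Under these assumptions the seven sample lines yield three records with keys 0, 4 and 8, since each sample line occupies four bytes including the newline. A window is emitted only while a full five lines are available, so the last four lines of the input never start a window of their own.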
For example, you could have a custom InputFormat that extends TextInputFormat and overrides the createRecordReader method to call super.createRecordReader and return the result wrapped in a MyRecordReader, like this:
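import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyTextInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit arg0, TaskAttemptContext arg1) {
        // Wrap the standard line reader so each record becomes a five-line window.
        return new MyRecordReader(super.createRecordReader(arg0, arg1));
    }
}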
Regarding java - Hadoop Map-Reduce RecordReader, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12418847/