hadoop - Implementation of an ArrayWritable for a custom Hadoop type

How do I define an ArrayWritable for a custom Hadoop type? I am trying to implement an inverted index in Hadoop, using custom Hadoop types to store the data.

I have an IndividualPosting class, which stores the term frequency, the document ID, and the list of byte offsets of the term within the document.

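I have a Posting class, which holds the document frequency (the number of documents the term appears in) and a list of IndividualPosting objects.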

For the list of byte offsets in IndividualPosting, I have defined a LongArrayWritable that extends the ArrayWritable class.

When I defined a custom ArrayWritable for IndividualPosting, I ran into problems after deploying locally (using Karmasphere and Eclipse).

All the IndividualPosting instances in the Posting class's list are identical, even though I receive different values in the reduce method.

Best answer

From the documentation for ArrayWritable:

A Writable for arrays containing instances of a class. The elements of this writable must all be instances of the same class. If this writable will be the input for a Reducer, you will need to create a subclass that sets the value to be of the proper type. For example:

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}

You have already mentioned doing this with a WritableComparable type defined by Hadoop. Here is what I assume your implementation looks like for LongWritable:

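import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;

// Declared "static" because it is assumed to be nested inside the job class.
public static class LongArrayWritable extends ArrayWritable
{
    // No-arg constructor: Hadoop needs it to instantiate the class when deserializing.
    public LongArrayWritable() {
        super(LongWritable.class);
    }
    // Convenience constructor that wraps an existing array of values.
    public LongArrayWritable(LongWritable[] values) {
        super(LongWritable.class, values);
    }
}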

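You should be able to do the same for any type that implements Writable, as described in the documentation; WritableComparable is only required if the type is also used as a key. For a custom type, the subclass passes that type's class to the ArrayWritable constructor, just as LongArrayWritable passes LongWritable.class. Using the documentation's example of a custom type: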

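import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements
        WritableComparable<MyWritableComparable> {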

    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public int compareTo(MyWritableComparable other) {
        int thisValue = this.counter;
        int thatValue = other.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}
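Applied to the question's own types, the same write/readFields pattern might look like the sketch below. It is only an illustration under stated assumptions: the field names (docId, termFrequency, offsets) are guesses based on the question's description, the offsets are serialized as a length-prefixed run of longs rather than through LongArrayWritable, and the class deliberately does not declare `implements org.apache.hadoop.io.Writable` so that it runs with nothing but java.io. PostingRoundTrip is a hypothetical helper used only to show that a record survives serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical posting record. In a real job it would also declare
// "implements org.apache.hadoop.io.Writable" (omitted here so this runs without Hadoop).
class IndividualPosting {
    long docId;                     // assumed field name
    int termFrequency;              // assumed: one byte offset per occurrence
    long[] offsets = new long[0];   // byte offsets of the term in the document

    // No-arg constructor, because Hadoop creates Writable instances reflectively.
    IndividualPosting() { }

    IndividualPosting(long docId, long[] offsets) {
        this.docId = docId;
        this.termFrequency = offsets.length;
        this.offsets = offsets.clone();
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(docId);
        out.writeInt(termFrequency);
        out.writeInt(offsets.length);       // length prefix for the offset list
        for (long offset : offsets) {
            out.writeLong(offset);
        }
    }

    public void readFields(DataInput in) throws IOException {
        docId = in.readLong();
        termFrequency = in.readInt();
        // Allocate a fresh array so nothing from a previously read record survives.
        offsets = new long[in.readInt()];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = in.readLong();
        }
    }
}

class PostingRoundTrip {
    // Serializes a posting to bytes and reads it back into a new instance.
    static IndividualPosting roundTrip(IndividualPosting original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            IndividualPosting copy = new IndividualPosting();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        IndividualPosting copy = roundTrip(new IndividualPosting(42L, new long[] {3L, 17L, 256L}));
        // prints: 42 3 [3, 17, 256]
        System.out.println(copy.docId + " " + copy.termFrequency + " " + Arrays.toString(copy.offsets));
    }
}
```

Once such a class actually implements Writable, an array wrapper for it would follow the LongArrayWritable shape, with the no-arg constructor passing IndividualPosting.class to the ArrayWritable constructor.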

That should be all you need. This assumes you are using revision 0.20.2 or 0.21.0 of the Hadoop API.
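The symptom described in the question, where every IndividualPosting kept in the list ends up identical, is consistent with a separate Hadoop behavior: the reducer's value iterator reuses a single object and overwrites its fields on each step, so storing the reference instead of a copy leaves the list full of aliases to the last value. Whether that is the actual cause here is an assumption, since the question does not show the reducer code. The sketch below simulates the reuse in plain Java; DocValue and ReuseDemo are hypothetical names, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical mutable value, standing in for a custom Writable.
class DocValue {
    long docId;

    DocValue() { }

    // Copy constructor used to take a snapshot of the current state.
    DocValue(DocValue other) {
        this.docId = other.docId;
    }
}

class ReuseDemo {
    // Mimics a reducer's value iterator: one shared instance, overwritten on each next().
    static Iterable<DocValue> reusingValues(long... docIds) {
        DocValue shared = new DocValue();
        return () -> new Iterator<DocValue>() {
            int i = 0;

            public boolean hasNext() {
                return i < docIds.length;
            }

            public DocValue next() {
                shared.docId = docIds[i++];
                return shared;
            }
        };
    }

    // Collects the values, either by reference (copy == false) or as copies (copy == true),
    // and returns the document IDs seen in the resulting list.
    static List<Long> collect(boolean copy, long... docIds) {
        List<DocValue> kept = new ArrayList<>();
        for (DocValue value : reusingValues(docIds)) {
            kept.add(copy ? new DocValue(value) : value);
        }
        List<Long> ids = new ArrayList<>();
        for (DocValue value : kept) {
            ids.add(value.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(collect(false, 1L, 2L, 3L)); // prints [3, 3, 3]
        System.out.println(collect(true, 1L, 2L, 3L));  // prints [1, 2, 3]
    }
}
```

In a real reducer the copy would typically be made with a copy constructor like the one above or with Hadoop's WritableUtils.clone before the value is added to the list.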

A similar question about implementing an ArrayWritable for a custom Hadoop type can be found on Stack Overflow: https://stackoverflow.com/questions/4386781/