hadoop - Serializing a long string in Hadoop

Tags: hadoop mapreduce

I have a class that implements WritableComparable in Hadoop. The class holds two String variables, one short and one very long. I write these variables with writeChars and read them back with readLine, but I seem to be running into some kind of error. What is the best way to serialize such a long string in Hadoop?

Best Answer

I think you can use BytesWritable for efficiency. Take a look at the following custom key, which uses a BytesWritable field for callId.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class CustomMRKey implements WritableComparable<CustomMRKey> {
private BytesWritable callId;
private IntWritable mapperType;

/**
 * Default constructor.
 */
public CustomMRKey() {
    set(new BytesWritable(), new IntWritable());
}

/**
 * Constructor
 * 
 * @param callId
 * @param mapperType
 */
public CustomMRKey(BytesWritable callId, IntWritable mapperType) {
    set(callId, mapperType);
}

/**
 * sets the call id and mapper type
 * 
 * @param callId
 * @param mapperType
 */
public void set(BytesWritable callId, IntWritable mapperType) {
    this.callId = callId;
    this.mapperType = mapperType;
}

/**
 * This method returns the callId
 * 
 * @return callId
 */
public BytesWritable getCallId() {
    return callId;
}

/**
 * This method sets the callId given a callId
 * 
 * @param callId
 */
public void setCallId(BytesWritable callId) {
    this.callId = callId;
}

/**
 * This method returns the mapper type
 * 
 * @return mapperType
 */
public IntWritable getMapperType() {
    return mapperType;
}

/**
 * This method is set to store the mapper type
 * 
 * @param mapperType
 */
public void setMapperType(IntWritable mapperType) {
    this.mapperType = mapperType;
}

@Override
public void readFields(DataInput in) throws IOException {
    callId.readFields(in);
    mapperType.readFields(in);
}

@Override
public void write(DataOutput out) throws IOException {
    callId.write(out);
    mapperType.write(out);
}

@Override
public boolean equals(Object obj) {
    if (obj instanceof CustomMRKey) {
        CustomMRKey key = (CustomMRKey) obj;
        return callId.equals(key.callId)
                && mapperType.equals(key.mapperType);
    }
    return false;
}

@Override
public int compareTo(CustomMRKey key) {
    int cmp = callId.compareTo(key.getCallId());
    if (cmp != 0) {
        return cmp;
    }
    return mapperType.compareTo(key.getMapperType());
}

}
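
One caveat worth noting: the class above overrides equals but not hashCode, and Hadoop's default HashPartitioner routes keys to reducers via hashCode(), so equal keys could land on different reducers. Here is a minimal, self-contained sketch of a hashCode consistent with that equals, using plain-Java stand-ins (the SimpleKey class and its fields are illustrative, not part of the original answer):

```java
import java.util.Arrays;

public class KeyHashDemo {
    // Simplified stand-in for CustomMRKey: a byte[] callId plus an int mapperType.
    static final class SimpleKey {
        final byte[] callId;
        final int mapperType;

        SimpleKey(byte[] callId, int mapperType) {
            this.callId = callId;
            this.mapperType = mapperType;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SimpleKey)) return false;
            SimpleKey k = (SimpleKey) o;
            return Arrays.equals(callId, k.callId) && mapperType == k.mapperType;
        }

        // equals and hashCode must agree, or a hash-based partitioner may
        // send keys that compare equal to different reducers.
        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(callId) + mapperType;
        }
    }

    public static void main(String[] args) {
        SimpleKey a = new SimpleKey("abc".getBytes(), 1);
        SimpleKey b = new SimpleKey("abc".getBytes(), 1);
        int partitions = 10;
        // Mirrors HashPartitioner's (hashCode() & Integer.MAX_VALUE) % numReduceTasks:
        // equal keys now map to the same partition.
        System.out.println((a.hashCode() & Integer.MAX_VALUE) % partitions
                == (b.hashCode() & Integer.MAX_VALUE) % partitions); // prints "true"
    }
}
```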

To use this in, say, your mapper code, you can build the key in BytesWritable form as follows:

CustomMRKey customKey = new CustomMRKey(new BytesWritable(), new IntWritable());
customKey.setCallId(makeKey(value, this.resultKey));
customKey.setMapperType(this.mapTypeIndicator);

The makeKey method then looks like this:

public BytesWritable makeKey(Text value, BytesWritable key) throws IOException {
    try {
        ByteArrayOutputStream byteKey = new ByteArrayOutputStream(Constants.MR_DEFAULT_KEY_SIZE);
        for (String field : keyFields) {
            // keyFields and getString(...) belong to the surrounding class in the
            // original answer; getString is assumed to pull the named field out of
            // the record (it is not part of the Hadoop Text API).
            byte[] bytes = value.getString(field).getBytes("UTF-8");
            byteKey.write(bytes, 0, bytes.length);
        }
        if (key == null) {
            return new BytesWritable(byteKey.toByteArray());
        } else {
            key.set(byteKey.toByteArray(), 0, byteKey.size());
            return key;
        }
    } catch (Exception ex) {
        throw new IOException("Could not generate key", ex);
    }
}
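
For completeness, the likely root cause in the question is that writeChars and readLine are not inverses: writeChars emits two bytes per char, while the deprecated readLine reads one byte per char until a newline, so the pair corrupts any round trip (and the writeUTF/readUTF alternative caps strings at 65,535 encoded bytes). Hadoop's Text avoids both problems by writing a length prefix followed by UTF-8 bytes; the same idea can be sketched in plain java.io with no Hadoop dependency (class and method names here are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LongStringDemo {
    // Write a String as a 4-byte length prefix followed by its UTF-8 bytes
    // (Text uses a variable-length prefix, but the principle is the same).
    static void writeLongString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    static String readLongString(DataInput in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a string far beyond writeUTF's 65,535-byte limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) sb.append('x');
        String longStr = sb.toString();

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLongString(new DataOutputStream(buf), longStr);
        String back = readLongString(
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.equals(longStr)); // prints "true"
    }
}
```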

Hope this helps.

On hadoop - serializing a long string in Hadoop, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20670404/
