I need to modify the Hadoop WordCount example to count the number of words beginning with the prefix "cons", and then sort the results in descending order. Can anyone tell me how to write the mapper and reducer code for this?
Code:
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Replace all digits and punctuation with an empty string
        String line = value.toString().replaceAll("\\p{Punct}|\\d", "").toLowerCase();
        // Extract the words
        StringTokenizer record = new StringTokenizer(line);
        // Emit each word as a key, with one as its value
        while (record.hasMoreTokens())
            context.write(new Text(record.nextToken()), new IntWritable(1));
    }
}
Best answer
To count the number of words starting with "cons", you can discard all other words when emitting from the mapper.
public void map(Object key, Text value, Context context) throws IOException,
        InterruptedException {
    IntWritable one = new IntWritable(1);
    String[] words = value.toString().split(" ");
    for (String word : words) {
        if (word.startsWith("cons"))
            context.write(new Text("cons_count"), one);
    }
}
Now the reducer will receive only a single key, cons_count, and you can sum its values to get the count.
To sort the words starting with "cons" by frequency, all such words must be routed to the same reducer, which then aggregates and sorts them. To do that:
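A minimal sketch of that summing reducer, assuming the mapper above emits ("cons_count", 1) for every matching word (the class name ConsCountReducer is illustrative, not from the original post):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ConsCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Add up the 1s the mapper emitted for the single key "cons_count"
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit one record: ("cons_count", total number of matching words)
        context.write(key, new IntWritable(sum));
    }
}
```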
public class MyMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context) throws IOException,
            InterruptedException {
        String[] words = value.toString().split(" ");
        for (String word : words) {
            if (word.startsWith("cons"))
                context.write(new Text("cons"), new Text(word));
        }
    }
}
Reducer:
public class MyReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> wordCountMap = new HashMap<String, Integer>();
        // Tally how many times each "cons" word occurs
        for (Text value : values) {
            String word = value.toString();
            if (wordCountMap.containsKey(word)) {
                wordCountMap.put(word, wordCountMap.get(word) + 1);
            } else {
                wordCountMap.put(word, 1);
            }
        }
        // Use some sorting mechanism to sort the map based on values.
        // ...
        for (Map.Entry<String, Integer> entry : wordCountMap.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}
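One way to fill in the "sorting mechanism" placeholder is to copy the map's entries into a list and sort it by value in descending order. A sketch in plain Java (ConsSortSketch is a hypothetical name; only the sorting idea comes from the answer above):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConsSortSketch {
    // Returns the (word, count) entries ordered by count, highest first
    static List<Map.Entry<String, Integer>> sortByCountDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("consider", 3);
        counts.put("constant", 5);
        counts.put("console", 1);
        for (Map.Entry<String, Integer> e : sortByCountDesc(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the reducer, the final loop would then iterate over the sorted list instead of wordCountMap.entrySet().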
Regarding java - bigdata hadoop java code for wordcount modified, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26170827/