java - MapReduce stop words not found

Tags: java hadoop stop-words

I am new to MapReduce and am trying to write a program that counts the number of stop words in a file. I pass the stopword.txt file in from the command line, but every time I run it the result is Stop Words = 0 and Good Words = 30 (it should be 5 and 25). I don't get any exceptions; it compiles and runs fine. I'm stuck after trying everything else I could think of.
Below is my code. The Hadoop version is 2.0.

StopWord.java

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class StopWord {

    public enum COUNTERS {
        STOPWORDS, GOODWORDS
    }

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "StopWord");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setJarByClass(StopWord.class);
        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        List<String> other_args = new ArrayList<String>();
        for (int i = 0; i < args.length; i++) {
            if ("-skip".equals(args[i])) {
                DistributedCache.addCacheFile(new Path(args[++i]).toUri(),
                        job.getConfiguration());
                if (i + 1 < args.length) {
                    i++;
                } else {
                    break;
                }
            }
            other_args.add(args[i]);
        }

        FileInputFormat.setInputPaths(job, new Path(other_args.get(0)));
        FileOutputFormat.setOutputPath(job, new Path(other_args.get(1)));
        job.waitForCompletion(true);

        Counters counters = job.getCounters();
        System.out.printf("Good Words: %d, Stop Words: %d\n",
                counters.findCounter(COUNTERS.GOODWORDS).getValue(),
                counters.findCounter(COUNTERS.STOPWORDS).getValue());
    }
}

MyMapper.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Text word = new Text();
    private Set<String> stopWordList = new HashSet<String>();
    private BufferedReader fis;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // Local paths of the files registered in the distributed
            // cache, or null if nothing was registered.
            Path[] stopWordFiles = context.getLocalCacheFiles();
            System.out.println(Arrays.toString(stopWordFiles));
            if (stopWordFiles != null && stopWordFiles.length > 0) {
                for (Path stopWordFile : stopWordFiles) {
                    readStopWordFile(stopWordFile);
                }
            }
        } catch (IOException e) {
            System.err.println("Exception reading stop word file: " + e);
        }
    }

    // Load one stop-word file into the in-memory set, one word per line.
    private void readStopWordFile(Path stopWordFile) {
        try {
            fis = new BufferedReader(new FileReader(stopWordFile.toString()));
            String stopWord = null;
            while ((stopWord = fis.readLine()) != null) {
                stopWordList.add(stopWord);
            }
            fis.close();
        } catch (IOException e) {
            System.err.println("Exception while reading stop word file '"
                    + stopWordFile + "' : " + e.toString());
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (stopWordList.contains(token)) {
                context.getCounter(StopWord.COUNTERS.STOPWORDS).increment(1);
            } else {
                context.getCounter(StopWord.COUNTERS.GOODWORDS).increment(1);
                word.set(token);
                context.write(word, NullWritable.get());
            }
        }
    }
}

Best Answer

From what I can see, your stopWordFiles is probably empty:
you are adding the file to the distributed cache after the job has been initialized.
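
Following that diagnosis, one way to restructure the driver is to register the cache file on the Configuration before the Job is constructed, since new Job(conf, ...) takes its own copy of the Configuration. Below is a minimal sketch of that ordering; the argument handling is illustrative, and the Hadoop 2.x job.addCacheFile() alternative is shown as a comment:

Configuration conf = new Configuration();
String[] remainingArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

// Register the stop-word file *before* constructing the Job, so the
// Job's own copy of the Configuration already contains the cache entry.
for (int i = 0; i < remainingArgs.length; i++) {
    if ("-skip".equals(remainingArgs[i])) {
        DistributedCache.addCacheFile(new Path(remainingArgs[++i]).toUri(), conf);
    }
}

Job job = new Job(conf, "StopWord");
// On Hadoop 2.x the non-deprecated equivalent (before submission) is:
// job.addCacheFile(new Path("stopword.txt").toUri());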

See this post for more information:
Accessing files in hadoop distributed cache
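
For completeness, on Hadoop 2.x the non-deprecated way for the mapper to find the cached files is context.getCacheFiles(), which returns the URIs exactly as they were registered. The sketch below is an alternative setup(), assuming YARN's default behaviour of symlinking each cached file into the task's working directory under its base name; it also needs a java.net.URI import:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles(); // may be null if nothing was cached
    if (cacheFiles == null) {
        return;
    }
    for (URI cacheFile : cacheFiles) {
        // The localized copy is linked into the working directory under the
        // file's base name, so plain java.io can open it directly.
        String localName = new Path(cacheFile.getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String stopWord;
            while ((stopWord = reader.readLine()) != null) {
                stopWordList.add(stopWord);
            }
        }
    }
}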

Regarding "java - MapReduce stop words not found", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37144140/
