Java String variable gets corrupted at runtime

Tags: java hadoop mapreduce

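Hey, I'm running into a strange bug in a small Java program I'm writing for a school project. I'm well aware of how sloppy the code is (it's still a work in progress), but anyway, my String variable "year" somehow gets corrupted after breaking out of a loop. I'm using Java with MapReduce and Hadoop to count unigrams and bigrams and sort them by year/author. Using print statements, I've determined that "year" is indeed set when I assign temp to it, but at any point after the loop that sets it, the variable is somehow corrupted. The year number is replaced by a large amount of whitespace (at least that's how it shows up in the console). I've tried year = year.trim() and the regex year = year.replaceAll("[^0-9]", ""), but neither works. Does anyone have any ideas?

I've only included the map class, since that's where the problem is. It should also be noted that the text files being parsed come from Project Gutenberg. I'm working with a small sample of about 40 random texts from that project.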

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text(); 
    public synchronized void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        line = line.toLowerCase();
        line = line.replaceAll("[^0-9a-z\\s-*]", "").replaceAll("\\s+", " "); 
        String year=""; // variable to hold date -- somehow this gets cleared out before I need it
        String temp=""; // variable to hold each token
        StringTokenizer tokenizer = new StringTokenizer(line); // Splits document into individual words for parsing
        while (tokenizer.hasMoreTokens()) {

            temp = tokenizer.nextToken(); // grab first token of document
            if (temp.equals("***")) // hit first triple star, break out and move to next while loop
                break;

            if (temp.equals("release")&&tokenizer.hasMoreTokens()){ // if token is "release" followed by "date", extract year
                if (tokenizer.nextToken().equals("date")){
                    while(tokenizer.hasMoreTokens()){
                        temp = tokenizer.nextToken();
                        for (int i = 0; i<temp.length();i++){
                            if (Character.isDigit(temp.charAt(0))){
                                if (temp.length()>3||Integer.parseInt(temp)>=40){
                                    year = temp; // set year = token if token is a number greater than 40 or has >3 digits
                                    break;
                                }
                            }
                        }
                        if (!year.equals("")){ //if date isn't an empty string, it means we have date and break
                            break;            // out of first while loop
                        }
                    }
                    System.out.println("\n"+year+"\n");// year will still print here
                }
            } // but it is gone if I try to print past this point 
        }

        while (tokenizer.hasMoreTokens()){ // keep grabbing tokens until hit another "***", then break and
            temp = tokenizer.nextToken();  // can begin counting unigrams/bigrams
            if (temp.equals("***"))
                break;
        }

        line = line.substring(line.indexOf(temp)); // form a new document starting from location of previous "***"
        line = line.replaceAll("[^a-z\\s-]", "").replaceAll("\\s+", " ");
        line = line.replaceAll("-+", "-");  /*Many calls to remove excess whitespace and punctuation from entire document*/
        line = line.replaceAll(" - ", " "); 
        line = line.replaceAll("- ", " "); 
        line = line.replaceAll(" -", " ");
        line = line.replaceAll("\\s+", " ");

        StringTokenizer toke = new StringTokenizer(line); //start a new tokenizer with re-formatted file

        while(toke.hasMoreTokens()){//continue to grab tokens until EOF
            temp = toke.nextToken();
            //System.out.println(date);

            if (temp.charAt(0)=='-')
                temp = temp.substring(1);//if word starts or ends with hyphen, remove it
            if (temp.length()>1&&temp.charAt(temp.length()-1)=='-')
                temp = temp.replace('-', ' ');

            if ((!temp.equals(" "))){
                word.set(temp+"\t"+year);   
                context.write(word,one); 
            }
        }
    }
 } 

Best answer

Your code contains year = temp, so what ends up in year seems to depend on your input.

Possible bug:

for (int i = 0; i<temp.length();i++){
    if (Character.isDigit(temp.charAt(0))){

IMHO you meant i rather than 0 in charAt:

for (int i = 0; i<temp.length();i++){
    if (Character.isDigit(temp.charAt(i))){
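
To make the effect of that index concrete, here is a minimal sketch of a digit check that looks at every character of the token. The class and method names (YearTokens, isAllDigits, looksLikeYear) and the "exactly four digits" rule are my own assumptions for illustration, not something from the question. Checking the whole token first also matters because Integer.parseInt throws a NumberFormatException on a token such as "9th", which starts with a digit but is not a number.

```java
final class YearTokens {
    private YearTokens() {}

    // Hypothetical helper: true only if the token is non-empty and
    // every one of its characters is a digit.
    static boolean isAllDigits(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            // Index with i so each character is inspected, not just the first.
            if (!Character.isDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Assumed rule for this sketch: a year is a token of exactly four digits.
    static boolean looksLikeYear(String token) {
        return token.length() == 4 && isAllDigits(token);
    }

    public static void main(String[] args) {
        for (String t : new String[] {"2005", "9th", "12", "1998-"}) {
            System.out.println(t + " -> " + looksLikeYear(t));
        }
    }
}
```

With a check like this, year would only be assigned from a token that is entirely numeric, so stray tokens that merely begin with a digit are skipped instead of being parsed.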

Also consider not using StringTokenizer. From its Javadoc:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

The following example illustrates how the String.split method can be used to break up a string into its basic tokens:

 String[] result = "this is a test".split("\\s");
 for (int x=0; x<result.length; x++)
     System.out.println(result[x]);
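
Applied to your case, a split-based version of the year lookup could look roughly like the sketch below. It is only an illustration under assumptions: the class and method names (ReleaseYear, extractYear) are made up, the sample header line in main is my guess at what a Project Gutenberg header looks like, and "first four-digit token after release date" is an assumed rule, not your original condition.

```java
final class ReleaseYear {
    private ReleaseYear() {}

    // Hypothetical helper: returns the first four-digit token that follows
    // the words "release date" in the given text, or "" if there is none.
    static String extractYear(String text) {
        // Lowercase, drop everything except digits, letters and whitespace,
        // then split on runs of whitespace.
        String cleaned = text.toLowerCase().replaceAll("[^0-9a-z\\s]", "").trim();
        String[] tokens = cleaned.split("\\s+");

        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals("release") && tokens[i + 1].equals("date")) {
                for (int j = i + 2; j < tokens.length; j++) {
                    // matches() must match the whole token, so "12345" is rejected.
                    if (tokens[j].matches("\\d{4}")) {
                        return tokens[j];
                    }
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Assumed sample header line, for illustration only.
        String header = "Release Date: March, 2005 [EBook #12345]";
        System.out.println(extractYear(header)); // prints 2005
    }
}
```

Working on an array of tokens means the search position is an ordinary index, so there is no tokenizer state shared between the nested loops.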

Regarding "Java String variable gets corrupted at runtime", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35281581/
