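Hey, I'm running into a strange bug in a small Java program I'm writing for a school project. I know the code is sloppy (it's still a work in progress), but either way, my String variable "year" somehow gets corrupted after the program leaves a loop. I'm using Java with MapReduce and Hadoop to count unigrams and bigrams and sort them by year/author. Using print statements, I've confirmed that "year" really is set when I assign temp to it, but at any point after the loop that sets it, the variable is somehow corrupted. The year digits are replaced by what looks like a large amount of whitespace (at least that is how it shows up in the console). I've tried year = year.trim() and the regex year = year.replaceAll("[^0-9]", ""), but neither works. Does anyone have any ideas?
I've only included the map class, since that is where the problem is. It should also be noted that the text files being parsed come from Project Gutenberg. I'm working with a small sample of about 40 random texts from that project.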
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public synchronized void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            line = line.toLowerCase();
            line = line.replaceAll("[^0-9a-z\\s-*]", "").replaceAll("\\s+", " ");
            String year = ""; // variable to hold date -- somehow this gets cleared out before I need it
            String temp = ""; // variable to hold each token
            StringTokenizer tokenizer = new StringTokenizer(line); // Splits document into individual words for parsing
            while (tokenizer.hasMoreTokens()) {
                temp = tokenizer.nextToken(); // grab first token of document
                if (temp.equals("***")) // hit first triple star, break out and move to next while loop
                    break;
                if (temp.equals("release") && tokenizer.hasMoreTokens()) { // if token is "release" followed by "date", extract year
                    if (tokenizer.nextToken().equals("date")) {
                        while (tokenizer.hasMoreTokens()) {
                            temp = tokenizer.nextToken();
                            for (int i = 0; i < temp.length(); i++) {
                                if (Character.isDigit(temp.charAt(0))) {
                                    if (temp.length() > 3 || Integer.parseInt(temp) >= 40) {
                                        year = temp; // set year = token if token is a number greater than 40 or has >3 digits
                                        break;
                                    }
                                }
                            }
                            if (!year.equals("")) { // if date isn't an empty string, it means we have date and break
                                break; // out of first while loop
                            }
                        }
                        System.out.println("\n" + year + "\n"); // year will still print here
                    }
                } // but it is gone if I try to print past this point
            }
            while (tokenizer.hasMoreTokens()) { // keep grabbing tokens until hit another "***", then break and
                temp = tokenizer.nextToken(); // can begin counting unigrams/bigrams
                if (temp.equals("***"))
                    break;
            }
            line = line.substring(line.indexOf(temp)); // form a new document starting from location of previous "***"
            line = line.replaceAll("[^a-z\\s-]", "").replaceAll("\\s+", " ");
            line = line.replaceAll("-+", "-"); /* Many calls to remove excess whitespace and punctuation from entire document */
            line = line.replaceAll(" - ", " ");
            line = line.replaceAll("- ", " ");
            line = line.replaceAll(" -", " ");
            line = line.replaceAll("\\s+", " ");
            StringTokenizer toke = new StringTokenizer(line); // start a new tokenizer with re-formatted file
            while (toke.hasMoreTokens()) { // continue to grab tokens until EOF
                temp = toke.nextToken();
                //System.out.println(date);
                if (temp.charAt(0) == '-')
                    temp = temp.substring(1); // if word starts or ends with hyphen, remove it
                if (temp.length() > 1 && temp.charAt(temp.length() - 1) == '-')
                    temp = temp.replace('-', ' ');
                if ((!temp.equals(" "))) {
                    word.set(temp + "\t" + year);
                    context.write(word, one);
                }
            }
        }
    }
Best answer
Your code contains year = temp, so the value that ends up in year appears to depend on your input.
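One way to see what the input actually put into year is to print its length and the numeric value of each character, rather than the string itself, since an empty string surrounded by "\n" looks the same in the console as a string of whitespace. The following is only a minimal sketch; the class name ShowHiddenChars and the helper describe are made up for illustration and are not part of the original program.

```java
public class ShowHiddenChars {
    /**
     * Hypothetical debugging helper: renders a string as its length followed by
     * the hex value of each UTF-16 code unit, so empty strings and invisible
     * characters can be told apart.
     */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder("length=" + s.length() + " [");
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("1998")); // length=4 [U+0031 U+0039 U+0039 U+0038]
        System.out.println(describe(""));     // length=0 []
    }
}
```

Calling describe(year) right after the loop and again before word.set(...) would show whether the variable is truly empty at that point or holds characters that trim() does not remove.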
Possible bug:
for (int i = 0; i<temp.length();i++){
if (Character.isDigit(temp.charAt(0))){
IMHO, you meant i rather than 0 in charAt:
for (int i = 0; i<temp.length();i++){
if (Character.isDigit(temp.charAt(i))){
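Even with charAt(i), a token that mixes digits and letters (for example "5th") would still reach Integer.parseInt and throw a NumberFormatException. A possible variation, shown here only as a sketch, is to check that the whole token is numeric before applying the question's "more than 3 digits or at least 40" rule. The class and method names below are hypothetical.

```java
public class YearTokenCheck {
    /** Returns true only when the token is non-empty and every character is an ASCII digit. */
    public static boolean isAllDigits(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /**
     * Mirrors the rule from the question: an all-digit token that either has
     * more than 3 digits or has a numeric value of at least 40.
     */
    public static boolean looksLikeYear(String token) {
        if (!isAllDigits(token)) {
            return false;
        }
        // parseInt is only reached for tokens of 1 to 3 digits, so it cannot overflow.
        return token.length() > 3 || Integer.parseInt(token) >= 40;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"1998", "5th", "12", "ebook", "76"}) {
            System.out.println(t + " -> " + looksLikeYear(t));
        }
    }
}
```

In the mapper, the inner for loop could then be replaced by a single if (looksLikeYear(temp)) year = temp; check, assuming the intent is to accept only purely numeric tokens.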
Also consider not using StringTokenizer. From its documentation:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
The following example illustrates how the String.split method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
    System.out.println(result[x]);
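Applied to the question, a split-based version of the "release date" search might look like the sketch below. It rests on assumptions that do not come from the original post: that a year is a four-digit token, that punctuation can be replaced by spaces before splitting, and that the sample header line in main (made up for illustration) resembles the real input. The class and method names are hypothetical.

```java
import java.util.Optional;

public class ReleaseYearFinder {
    /**
     * Returns the first four-digit token that appears after the words
     * "release date", or an empty Optional if there is none.
     */
    public static Optional<String> findReleaseYear(String text) {
        // Lowercase, turn anything that is not a digit, letter, or whitespace into a space, then split.
        String[] tokens = text.toLowerCase()
                .replaceAll("[^0-9a-z\\s]", " ")
                .trim()
                .split("\\s+");
        boolean afterReleaseDate = false;
        for (int i = 0; i < tokens.length; i++) {
            if (!afterReleaseDate) {
                if (tokens[i].equals("release")
                        && i + 1 < tokens.length
                        && tokens[i + 1].equals("date")) {
                    afterReleaseDate = true;
                    i++; // skip the "date" token
                }
                continue;
            }
            if (tokens[i].matches("\\d{4}")) {
                return Optional.of(tokens[i]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the original post.
        String sample = "Release Date: August, 1998 [EBook #1342]";
        System.out.println(findReleaseYear(sample).orElse("(none)")); // prints 1998
    }
}
```

Returning an Optional makes the "no year found" case explicit, instead of relying on an empty string that prints as blank space.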
Regarding "Java string variable corrupted at runtime", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35281581/