java - How to get the top 20 words from a text in Java

Tags: java hadoop

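I need to read the lines of a text file and find the 20 most frequent words.
The code uses the student ID argument to select 10,000 random lines from the file.
Here is the code: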

import java.io.File;
import java.lang.reflect.Array;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class MP1 {

    Random generator;
    String userName;
    String inputFileName;

    String delimiters = " \t,;.?!-:@[](){}_*/";


    String[] stopWordsArray = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
            "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
            "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
            "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
            "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
            "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
            "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
            "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
            "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
            "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"};


    void initialRandomGenerator(String seed) throws NoSuchAlgorithmException {
        MessageDigest messageDigest = MessageDigest.getInstance("SHA");
        messageDigest.update(seed.toLowerCase().trim().getBytes());
        byte[] seedMD5 = messageDigest.digest();

        long longSeed = 0;
        for (int i = 0; i < seedMD5.length; i++) {
            longSeed += ((long) seedMD5[i] & 0xffL) << (8 * i);
        }

        this.generator = new Random(longSeed);
    }

    Integer[] getIndexes() throws NoSuchAlgorithmException {
        Integer n = 10000;
        Integer number_of_lines = 50000;
        Integer[] ret = new Integer[n];
        this.initialRandomGenerator(this.userName);
        for (int i = 0; i < n; i++) {
            ret[i] = generator.nextInt(number_of_lines);
        }
        return ret;
    }

    public MP1(String userName, String inputFileName) {
        this.userName = userName;
        this.inputFileName = inputFileName;
    }

    public String[] process() throws Exception {
        String[] ret = new String[20];

        //TODO


        return ret;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1){
            System.out.println("MP1 <User ID>");
        }
        else {
            String userName = args[0];
            String inputFileName = "./input.txt";
            MP1 mp = new MP1(userName, inputFileName);
            String[] topItems = mp.process();
            for (String item: topItems){
                System.out.println(item);
            }
        }
    }
}

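I need to put the top 20 words into the array ret and return it.
I wrote code that starts reading the file line by line, but I don't know how to call getIndexes() to pick the lines and then compute the top 20 words. I also need to ignore all the common words listed in the stopWordsArray variable: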
public String[] process() throws Exception {
    String[] ret = new String[20];

    //TODO
    FileReader in = new FileReader(inputFileName);
    BufferedReader br = new BufferedReader(in);
    String line = br.readLine();


    return ret;
}

Thanks for your help.

Best answer

The solution code looks like this:

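// Requires: import java.io.BufferedReader; import java.io.FileReader; import java.util.Map.Entry;
public String[] process() throws Exception {
    String[] ret = new String[20];

    // Put the stop words in a set so each lookup is constant time.
    Set<String> stopWords = new HashSet<String>(Arrays.asList(stopWordsArray));
    Map<String, Integer> map = new HashMap<String, Integer>();

    // Count every non-stop word, case-insensitively.
    // Note: this reads every line of the file; it does not call getIndexes().
    try (BufferedReader br = new BufferedReader(new FileReader(inputFileName))) {
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, delimiters);
            while (st.hasMoreTokens()) {
                String token = st.nextToken().toLowerCase();
                if (!stopWords.contains(token)) {
                    if (map.containsKey(token)) {
                        map.put(token, map.get(token) + 1);
                    } else {
                        map.put(token, 1);
                    }
                }
            }
        }
    }

    // Repeatedly pull the most frequent remaining word out of the map.
    // If fewer than 20 distinct words exist, the remaining slots are "".
    for (int i = 0; i < ret.length; i++) {
        int max = 0;
        String word = "";
        for (Entry<String, Integer> entry : map.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                word = entry.getKey();
            }
        }
        map.remove(word);
        ret[i] = word;
    }

    return ret;
}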

Hope this helps.
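
The question also asks how to restrict the count to the lines chosen by getIndexes(), which the answer above does not do. A minimal sketch of one way to approach it follows. It assumes the file has already been loaded into a list and that the indexes are 0-based line numbers. The class name SampledWordCount, its two methods, and the alphabetical tie-breaking are hypothetical choices made for illustration, not part of the original skeleton.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

class SampledWordCount {

    // Counts words only on the sampled lines. An index that appears twice
    // is counted twice, because the sample is drawn with replacement.
    static Map<String, Integer> countWords(List<String> lines, Integer[] indexes,
                                           String delimiters, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (int index : indexes) {
            if (index >= lines.size()) {
                continue; // assumption: skip indexes past the end of a short file
            }
            StringTokenizer st = new StringTokenizer(lines.get(index), delimiters);
            while (st.hasMoreTokens()) {
                String token = st.nextToken().toLowerCase().trim();
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns up to n words, most frequent first.
    // Ties are broken alphabetically (an assumption, not stated in the question).
    static String[] topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        String[] top = new String[Math.min(n, entries.size())];
        for (int i = 0; i < top.length; i++) {
            top[i] = entries.get(i).getKey();
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat sat", "a dog; a cat!", "bird");
        Integer[] indexes = {0, 1, 1}; // stand-in for the result of getIndexes()
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        Map<String, Integer> counts =
                countWords(lines, indexes, " \t,;.?!-:@[](){}_*/", stopWords);
        System.out.println(Arrays.toString(topWords(counts, 2))); // prints [cat, dog]
    }
}
```

Inside MP1.process(), the same idea would presumably be wired up by reading the file with Files.readAllLines(Paths.get(inputFileName)) and passing getIndexes(), delimiters, and a set built from stopWordsArray to these helpers. Sorting the entries once also avoids rescanning the whole map for each of the 20 results.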

Regarding "java - How to get the top 20 words from a text in Java", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32232062/
