java - Split a sentence into fixed-size chunks using word boundaries and POS

Tags: java string recursion split nlp

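I'm trying to use Java to split a sentence into fixed-size chunks of keyword phrases, based on word boundaries and POS (part of speech) (see the updated code at the end of this post), where:

1) Certain POS are ignored.

2) Certain POS can't be used as the root keyword.

This should produce the following output: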

**Root Keyword:** In
**Phrase:** None
**Root Keyword:** 2017
**Phrase:** None
**Root Keyword:** Joe Smith
**Phrase:** None
**Root Keyword:** announced
**Phrase 1:** In CD, NNP announced he was
**Phrase 2:** CD, NNP announced he was diagnosed
**Phrase 3:** NNP announced he was diagnosed with
**Phrase 4:** announced he was diagnosed with Lyme
**Root Keyword:** diagnosed
**Phrase 1:** CD, NNP announced he was diagnosed
**Phrase 2:** NNP announced he was diagnosed with
**Phrase 3:** announced he was diagnosed with Lyme
**Phrase 4:** he was diagnosed with Lyme disease

The last possible word that generates a phrase is: disease
**Root Keyword:** disease
**Phrase 1:** he was diagnosed with Lyme disease

So far I have implemented the following code:
import java.util.ArrayList;

public class Sentence {


    public Sentence()
    {

    }


    ArrayList<Word> wordList = new ArrayList<Word>();

    public void addWord(Word word)
    {
        wordList.add(word);
    }

    public ArrayList<Word> getWordList() {
        return wordList;
    }

}
public class Word {

    public Word(String word, String pos) {

        this.word = word;
        this.pos = pos;
    }


    String word;
    String pos;
    ArrayList<String> phraseList = new ArrayList<String>();


    public String getWord() {
        return word;
    }

    public String getPos() {
        return pos;
    }


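    public void setPhraseList(ArrayList<String> phraseList)
    {
        // "this." is required: the parameter shadows the field, so without it
        // the argument would be added to itself and the field would stay empty.
        this.phraseList.addAll(phraseList);
    }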

}
public void generatePhrases()
{


    Sentence sentence = new Sentence();
    sentence.addWord(new Word("In", "IN"));
    sentence.addWord(new Word("2017", "CD"));
    sentence.addWord(new Word(",", "PUNCT"));
    sentence.addWord(new Word("Joe Smith", "NNP"));
    sentence.addWord(new Word("announced", "VB"));
    sentence.addWord(new Word("he", "PRP"));
    sentence.addWord(new Word("was", "VBD"));
    sentence.addWord(new Word("diagnosed", "VBN"));
    sentence.addWord(new Word("with", "IN"));
    sentence.addWord(new Word("Lyme", "NN"));
    sentence.addWord(new Word("disease", "NN"));
    sentence.addWord(new Word(".", "PUNCT"));


    ArrayList<String> posListNotUsedAsRootKeyword = new ArrayList<String>();
    posListNotUsedAsRootKeyword.add("NNP");
    posListNotUsedAsRootKeyword.add("CD");


    ArrayList<String> posListNotCountedTowardMin = new ArrayList<String>();
    posListNotCountedTowardMin.add("VBD");
    posListNotCountedTowardMin.add("IN");
    posListNotCountedTowardMin.add("PRP");
    posListNotCountedTowardMin.add("TO");

    int minPhraseLength = 4; 
    int maxPhraseLength = 6;


    for (int wordCounter = 0; wordCounter < sentence.getWordList().size(); wordCounter++) {

        ArrayList<String> phraseList = new ArrayList<String>();


        Word word = sentence.getWordList().get(wordCounter);
        String wordAsStr = word.getWord();
        String pos = word.getPos();

        if (posListNotUsedAsRootKeyword.contains(pos) || posListNotCountedTowardMin.contains(pos)) {
            continue;
        }


        boolean phraseDesiredLength = false;

        String phrase = wordAsStr;
        int phraseCounter = wordCounter + 1;
        while (!phraseDesiredLength && phraseCounter < sentence.getWordList().size()) {

            Word phraseWord = sentence.getWordList().get(phraseCounter);
            String phraseWordAsStr = phraseWord.getWord();
            String phrasePOS = phraseWord.getPos();


            String appendPhrase = (posListNotUsedAsRootKeyword.contains(phrasePOS)) ? phrasePOS : phraseWordAsStr;
            phrase += " " + appendPhrase;

            if (StringX.countNumberOfWordsInStr(phrase) == minPhraseLength || StringX.countNumberOfWordsInStr(phrase) == maxPhraseLength) {

                phraseDesiredLength = true;
            }


            phraseCounter++;
        }


        System.out.println("PHRASE: " + phrase);

        phraseList.add(phrase);

    }

}

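I'm mainly struggling with generating phrases that start before the root keyword and end after it (recursion?), and with verifying that the phrase length == the min or max phrase length.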
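
One possible reading of the expected output above is that every phrase is a window of exactly maxPhraseLength (6) words that contains the root keyword. The following is a minimal sketch under that assumption only. It needs no recursion, it ignores the posListNotCountedTowardMin rule, and it drops punctuation, whereas the expected output keeps the comma attached to CD. The class name WindowSketch and the method windowsAround are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WindowSketch {

    /** Returns every window of exactly `size` tokens that contains tokens.get(rootIndex). */
    static List<String> windowsAround(List<String> tokens, int rootIndex, int size) {
        List<String> phrases = new ArrayList<>();
        // Earliest start that still keeps the root inside the window, clamped to 0.
        int firstStart = Math.max(0, rootIndex - size + 1);
        // Latest start is the root itself, clamped so the window fits in the sentence.
        int lastStart = Math.min(rootIndex, tokens.size() - size);
        for (int start = firstStart; start <= lastStart; start++) {
            phrases.add(String.join(" ", tokens.subList(start, start + size)));
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Tokens from the question with punctuation removed (an assumption);
        // CD and NNP words are already replaced by their tags, as in the expected output.
        List<String> tokens = Arrays.asList("In", "CD", "NNP", "announced", "he",
                "was", "diagnosed", "with", "Lyme", "disease");
        int rootIndex = tokens.indexOf("diagnosed");
        for (String phrase : windowsAround(tokens, rootIndex, 6)) {
            System.out.println(phrase);
        }
    }
}
```

For the root keyword diagnosed this yields four windows, from "CD NNP announced he was diagnosed" to "he was diagnosed with Lyme disease", and for disease only the last one. That matches the listing above apart from the comma.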

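Best answer

I have a feeling that you are running too many checks on your phrases, and that makes the code confusing.
I would keep a database of key types (VBD, IN, NNP, CD, TO...) and their associated keywords as my "dictionary", and then evaluate:

If there are more unwanted key types than wanted ones, do an if check for the wanted key types,

If there are more wanted key types than unwanted ones, do an if check for the unwanted key types.

This will make the code shorter.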
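
The check for unwanted key types can also be sketched without a database. The example below assumes an in-memory map as a stand-in for the dictionary table, filled with the words and tags from the question. The class name KeyTypeFilter and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KeyTypeFilter {

    // Deny-list of unwanted key types, as named in this answer.
    static final Set<String> UNWANTED =
            new HashSet<>(Arrays.asList("NNP", "CD", "VBD", "IN", "PRP", "TO"));

    /** Keeps words whose key type is known and not on the deny-list, in input order. */
    static List<String> keepWanted(List<String> words, Map<String, String> dictionary) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            String keyType = dictionary.get(word);
            // Words missing from the dictionary are skipped (an assumption).
            if (keyType != null && !UNWANTED.contains(keyType)) {
                kept.add(word);
            }
        }
        return kept;
    }

    /** Hypothetical stand-in for the dictionary table, using the question's tags. */
    static Map<String, String> sampleDictionary() {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("In", "IN");
        dictionary.put("2017", "CD");
        dictionary.put("Joe Smith", "NNP");
        dictionary.put("announced", "VB");
        dictionary.put("he", "PRP");
        dictionary.put("was", "VBD");
        dictionary.put("diagnosed", "VBN");
        dictionary.put("with", "IN");
        dictionary.put("Lyme", "NN");
        dictionary.put("disease", "NN");
        return dictionary;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("In", "2017", "Joe Smith", "announced",
                "he", "was", "diagnosed", "with", "Lyme", "disease");
        // prints [announced, diagnosed, Lyme, disease]
        System.out.println(keepWanted(words, sampleDictionary()));
    }
}
```

A HashSet keeps each lookup cheap, and because the words are visited in input order the result keeps the sentence's original word order.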
Then I would take the user's input text. They would type something like:
Peter Griffin likes small white cats snoring on the couch .

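The sentence would then be parsed in your generatePhrases(), where the first block sorts the words into a string list. That "sorting" looks up each word in the dictionary to determine its key type and checks whether that key type is wanted. The unwanted parts (NNP, CD, VBD, IN, PRP, TO) are then dropped from the list. Since you have more wanted word types than unwanted ones, checking for the unwanted ones is faster.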

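// Assumes an open java.sql.Connection named conn (e.g. to SQLite) and a table
// dictionary(word, keytype); needs java.sql.* and java.util.* imports, and the
// enclosing method must handle SQLException.
String textInput = "Peter Griffin likes small white cats snoring on the couch";
String[] words = textInput.split(" ");
List<String> validWords = new ArrayList<>();

// Parameterized query: returns a row only when the word's key type is wanted.
String sql = "SELECT keytype FROM dictionary WHERE word = ?"
        + " AND keytype NOT IN ('NNP', 'CD', 'VBD', 'IN', 'PRP', 'TO')";

try (PreparedStatement stmt = conn.prepareStatement(sql)) {
    for (String word : words) {
        stmt.setString(1, word);
        try (ResultSet rs = stmt.executeQuery()) {
            if (rs.next()) {
                validWords.add(word); // keep the word itself, in input order
            }
        }
    }
}

if (validWords.size() >= 4 && validWords.size() <= 6) {
    System.out.println("Phrase: " + String.join(" ", validWords));
}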

This leaves me with a list that contains only the words wanted for my keyword phrase. I then check whether the length of the list is between 4 and 6, and join the words, in index order, into a single string.

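Since you will be entering the text in a meaningful order, you don't have to check whether "Snoring couch cat Griffin small Peter" makes sense: the words are already ordered like "Peter Griffin likes small white cats", because that is the order in which they were entered.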

Regarding "java - Split a sentence into fixed-size chunks using word boundaries and POS", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60143830/
