java - Find a sequence of words in a document

Using Java (on Android), I am trying to find a (fast...) way to solve the following problem:

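I have a list of words (roughly 10 to 30) and a document. The document's length also varies, probably somewhere between 2,500 and 10,000 words. The document is a section of a book.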

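What I want is to find the string (a sentence...) in this document that contains the most words from my list. The words in the document must appear in the same order as in my word list. Normally, the words should not be too far apart in the document: there may be at most 2 or 3 words between each word of my list.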

To make this clearer, let's take an example with a small amount of data.

My word list is:

harm piece work day

My document:

just so, with the greatest care. You must see to it that you pull up regularly all the baobabs, at the very first moment when they can be distinguished from the rosebushes which they resemble so closely in their earliest youth. It is very tedious work," the little prince added, "but very easy." And one day he said to me: "You ought to make a beautiful drawing, so that the children where you live can see exactly how all this is. That would be very useful to them if they were to travel some day. Sometimes," he added, "there is no harm in putting off a piece of work until another day. But when it is a matter of baobabs, that always means a catastrophe. I knew a planet that was inhabited by a lazy man. He neglected three little bushes..." So, as the little prince described it to me, I have made a drawing of that planet. I do not much like to take the tone of a moralist. But the danger of the baobabs is so little understood, and such considerable risks would be run by anyone who might get lost on an asteroid, that for once I am breaking through my reserve. "Children," I say plainly, "watch out for the baobabs!"

The goal is to find the string “There is no harm in putting off a piece of work until another day” in the document.

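For now, the only approach I can think of is:

1 - Find the first occurrence in the document of the first word of my list.

2 - Multiply the number of words in my list by 2 or 3 to get the length of the string I have to inspect in the document (to account for the maximum number of document words between the words of my list).

3 - Search this document string (of the length obtained in step 2) for occurrences of the other words of my list, by splitting and looping.

If I consider that not enough of my words occur in this string (maybe around 50%), continue searching the document, starting from the next occurrence of the first word of my list.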

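But I fear this could take long, too long, especially since I am working on a mobile device... So I am here to pick up ideas I may not have thought of, or libraries that could help with this task. I have also considered regular expressions, but I am not sure they would be a better approach.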
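For reference, steps 1 to 3 above could be sketched roughly as follows. This is only one possible reading of the idea, assuming the document has already been lowercased and split into words; the names `WindowScan`, `findWindow`, `factor` and `minRatio` are made up for illustration, and the in-order matching is greedy, so it is not guaranteed to find the best match inside a window.

```java
final class WindowScan {
    // Returns the index (in docWords) of the first occurrence of wordList[0] whose
    // following window contains at least minRatio of the list words in list order,
    // or -1 if there is no such window.
    static int findWindow(String[] docWords, String[] wordList, int factor, double minRatio) {
        if (wordList.length == 0) {
            return -1;
        }
        // Step 2: window length = number of list words times 2 or 3 (assumed "factor").
        int windowLength = wordList.length * factor;
        for (int start = 0; start < docWords.length; start++) {
            // Step 1: only start a window at an occurrence of the first list word.
            if (!docWords[start].equals(wordList[0])) {
                continue;
            }
            int end = Math.min(docWords.length, start + windowLength);
            int matched = 1; // the first list word itself
            int next = 1;    // first list position that may still be matched
            // Step 3: loop over the window, matching the remaining list words in order.
            for (int i = start + 1; i < end && next < wordList.length; i++) {
                for (int k = next; k < wordList.length; k++) {
                    if (docWords[i].equals(wordList[k])) {
                        matched++;
                        next = k + 1; // list words before k are skipped for good
                        break;
                    }
                }
            }
            if (matched >= minRatio * wordList.length) {
                return start;
            }
            // Otherwise continue from the next occurrence of the first list word.
        }
        return -1;
    }
}
```

With the example sentence split on spaces, the list `harm piece work day`, `factor = 3` and `minRatio = 0.5`, the window starting at "harm" matches all four words.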

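@gukoff's proposal

Since, in the end, my word list can never be in a different order from the text, the algorithm becomes simpler. The beginning of @gukoff's answer is enough: there is no need to implement the LIS algorithm or to reverse the list.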

// section     = the input text, already split into sentences
// wordsToFind = the words to find in the text, separated by spaces
private ArrayList<ArrayList<Integer>> test1(String wordsToFind, Section section) {
    // 1. Create the index of your words array.
    String[] wordsArray = wordsToFind.split(" ");

    ArrayList<ArrayList<Integer>> sentenceArrayIndexes = new ArrayList<>();

    for (Sentence sentence : section.getSentences()) {
        // 1-based word-list positions matched in this sentence, in sentence order.
        ArrayList<Integer> indexesSentences = new ArrayList<>();
        for (String sentenceWord : sentence.getWords()) {
            // Positions are added in ascending order; Collections.reverse(...) is not needed here.
            for (int j = 0; j < wordsArray.length; j++) {
                if (wordsArray[j].equals(sentenceWord)) {
                    indexesSentences.add(j + 1);
                }
            }
        }
        // A fresh list is built per sentence, so no clone() or unchecked cast is required.
        sentenceArrayIndexes.add(indexesSentences);
    }
    return sentenceArrayIndexes;
}

import java.util.ArrayList;

// Sentence is not shown in the question; test1 only relies on its getWords() method.
public class Section {
    private ArrayList<Sentence> sentences;

    public Section(String text) {
        sentences = new ArrayList<>();

        // Compare string contents with isEmpty(); == only compares references.
        if (text == null || text.trim().isEmpty()) {
            throw new IllegalArgumentException("Text not valid");
        }
        // Keep only letters, periods and spaces, then lowercase everything.
        String formattedText = text.trim().replaceAll("[^a-zA-Z. ]", "").toLowerCase();
        // Each period ends a sentence.
        String[] sentencesArray = formattedText.split("\\.");
        for (String sentenceStr : sentencesArray) {
            if (!sentenceStr.trim().isEmpty()) {
                sentences.add(new Sentence(sentenceStr));
            }
        }
    }

    public ArrayList<Sentence> getSentences() {
        return sentences;
    }

    public void addSentence(Sentence sentence) {
        sentences.add(sentence);
    }
}

Best answer

So, you have the words to find and a text made up of sentences to check.

Create the index of your words array.

For example, if words = a dog is not a human:

{
    "a": [1, 5],
    "dog": [2],
    "is": [3],
    "not": [4],
    "human": [6]
}
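A minimal sketch of how such an index might be built in Java. The class and method names (`WordIndex`, `build`) are illustrative and not part of the answer; positions are 1-based to match the example above, and the loop avoids newer `Map` helpers so that it also works on older Android API levels.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WordIndex {
    // Maps each distinct word to the ascending list of 1-based positions
    // at which it occurs in the space-separated word list.
    static Map<String, List<Integer>> build(String wordsToFind) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        String trimmed = wordsToFind.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return index; // no words, empty index
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            List<Integer> positions = index.get(words[i]);
            if (positions == null) {
                positions = new ArrayList<>();
                index.put(words[i], positions);
            }
            positions.add(i + 1);
        }
        return index;
    }
}
```

Calling `WordIndex.build("a dog is not a human")` yields the five entries listed above, with "a" mapped to [1, 5].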

In each sentence, replace every word with its index values in descending order. That is, "a" is replaced with [5, 1], "human" with [6], and "tree" with [].

For example, the sentence “not a cat is a human” becomes [4, 5, 1, 3, 5, 1, 6].
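The replacement step might look like the sketch below. `SentenceEncoder` and `encode` are hypothetical names, and the index is assumed to map each word to its positions in ascending order. Writing the positions of one word in descending order means a strictly increasing subsequence can pick at most one position per sentence word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SentenceEncoder {
    // Replaces every sentence word with its word-list positions in descending order.
    // Words that are not in the index contribute nothing.
    static List<Integer> encode(String[] sentenceWords, Map<String, List<Integer>> index) {
        List<Integer> encoded = new ArrayList<>();
        for (String word : sentenceWords) {
            List<Integer> positions = index.get(word);
            if (positions == null) {
                continue; // e.g. "tree" or "cat": replaced with []
            }
            // Assumption: positions are stored ascending, so walk them backwards.
            for (int i = positions.size() - 1; i >= 0; i--) {
                encoded.add(positions.get(i));
            }
        }
        return encoded;
    }
}
```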

Find the longest increasing subsequence (LIS) in each array. Essentially, the LIS is the longest sub-match of the words array within the sentence.

For example, the LIS of [4, 5, 1, 3, 5, 1, 6] is [1, 3, 5, 6], which maps to the sub-match “a is a human”.
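The answer does not say which LIS algorithm to use. As a sketch, the straightforward quadratic dynamic program is probably sufficient for sentence-sized arrays; the names `Lis` and `longestIncreasing` are illustrative only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class Lis {
    // Returns one longest strictly increasing subsequence of seq (O(n^2) time).
    static List<Integer> longestIncreasing(List<Integer> seq) {
        int n = seq.size();
        int[] length = new int[n]; // length[i]: longest increasing run ending at i
        int[] prev = new int[n];   // prev[i]: previous index in that run, or -1
        int bestEnd = -1;
        for (int i = 0; i < n; i++) {
            length[i] = 1;
            prev[i] = -1;
            for (int j = 0; j < i; j++) {
                if (seq.get(j) < seq.get(i) && length[j] + 1 > length[i]) {
                    length[i] = length[j] + 1;
                    prev[i] = j;
                }
            }
            if (bestEnd == -1 || length[i] > length[bestEnd]) {
                bestEnd = i;
            }
        }
        // Walk the prev links back from the best end, then restore the order.
        List<Integer> result = new ArrayList<>();
        for (int i = bestEnd; i != -1; i = prev[i]) {
            result.add(seq.get(i));
        }
        Collections.reverse(result);
        return result;
    }
}
```

For the array [4, 5, 1, 3, 5, 1, 6] this returns [1, 3, 5, 6], the sub-match from the example.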

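But in general, if the words must not be too far from each other, I would suggest using dynamic programming, with an appropriate modification, to find the LIS.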
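The answer leaves the modification unspecified. One possible interpretation, sketched below under that assumption, is a dynamic program over pairs (sentence position, list position) that only extends a match when the previous matched word lies at most `maxGap` sentence words back. `GappedMatch`, `longestMatch` and `maxGap` are hypothetical names.

```java
final class GappedMatch {
    // Returns the largest number of list words that occur in list order in the
    // sentence, with at most maxGap other sentence words between two consecutive
    // matched words.
    static int longestMatch(String[] sentence, String[] wordList, int maxGap) {
        int n = sentence.length;
        int m = wordList.length;
        // best[i][k]: longest match ending with sentence[i] matched to wordList[k];
        // 0 means sentence[i] and wordList[k] are different words.
        int[][] best = new int[n][m];
        int overall = 0;
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < m; k++) {
                if (!sentence[i].equals(wordList[k])) {
                    continue;
                }
                int longest = 1;
                // Only earlier sentence positions within the allowed gap qualify.
                int from = Math.max(0, i - maxGap - 1);
                for (int p = from; p < i; p++) {
                    // Only earlier list positions keep the list order.
                    for (int q = 0; q < k; q++) {
                        longest = Math.max(longest, best[p][q] + 1);
                    }
                }
                best[i][k] = longest;
                overall = Math.max(overall, longest);
            }
        }
        return overall;
    }
}
```

In the question's example sentence, "harm" and "piece" are separated by four words ("in putting off a"), so all four list words are matched with `maxGap = 4`, while `maxGap = 3` only reaches "piece", "work", "day".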

Regarding java - Find a sequence of words in a document, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41507108/
