algorithm - Finding words in a long stream of characters. Auto-tokenization

Tags: algorithm computer-science nlp string-algorithm

How would you find the correct words in a long stream of characters?

Input:

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"

Google's output:

"The revised report on syntactic theories sequential controlandstate"

(Considering how quickly they produce the output, this is close enough.)

How do you think Google does it? How would you improve the accuracy?

Best answer

I would try a recursive algorithm like this:

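Try inserting a space at every position. If the left part is a word, repeat the process on the right part.
For each final output, compute the ratio of valid words to total words. The output with the best ratio is probably your answer.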

For example, given "thesentenceisgood" it would run:

thesentenceisgood
the sentenceisgood
    sent enceisgood
         enceisgood: OUT1: the sent enceisgood, 2/3
    sentence isgood
             is good
                go od: OUT2: the sentence is go od, 4/5
             is good: OUT3: the sentence is good, 4/4
    sentenceisgood: OUT4: the sentenceisgood, 1/2
these ntenceisgood
      ntenceisgood: OUT5: these ntenceisgood, 1/2

So you would pick OUT3 as the answer.
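
The recursion above can be sketched in Python as follows. This is only a minimal illustration, not how Google does it: the tiny `WORDS` set is a made-up stand-in for a real dictionary, the function names (`segmentations`, `score`, `best_segmentation`) are hypothetical, and the sketch assumes that at every step the remainder may also be kept as a single unsplit token.

```python
# Hypothetical toy dictionary; a real system would load a full word list.
WORDS = {"the", "these", "sent", "sentence", "is", "go", "good"}


def segmentations(text, words):
    """Yield every candidate split of text as a list of tokens."""
    if not text:
        yield []
        return
    # Option 1: stop splitting and keep the remainder as one token.
    yield [text]
    # Option 2: split off each proper prefix that is a word, then recurse
    # on the right part.
    for i in range(1, len(text)):
        left, right = text[:i], text[i:]
        if left in words:
            for rest in segmentations(right, words):
                yield [left] + rest


def score(tokens, words):
    """Ratio of valid words to total words in one candidate output."""
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)


def best_segmentation(text, words):
    """Pick the candidate with the highest valid/total ratio."""
    return max(segmentations(text, words), key=lambda toks: score(toks, words))


print(" ".join(best_segmentation("thesentenceisgood", WORDS)))
# prints: the sentence is good
```

This enumerates every candidate, so it can take exponential time on long inputs. Caching the results for each suffix (for example with `functools.lru_cache` on the start index) would be one way to cut the repeated work.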

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/3901266/
