python - Detect the most likely words from text without spaces / combined words

How can I detect and split the words in a combined string?

Example:

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]

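Best answer

Here is a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You will have to find a real word list, but I have included some made-up frequencies for a simple test.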

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

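def split_text(text, word_frequencies, cache):
    """Return (probability, words) for the most likely split of text."""
    if text in cache:
        return cache[text]
    if not text:
        # Empty remainder: neutral probability, no words left.
        return 1, []
    best_freq, best_split = 0, []
    # Try every prefix as a candidate first word.
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word)
        if freq:
            remainder_freq, remainder_split = split_text(
                remainder, word_frequencies, cache)
            # Phrase probability is the product of its word frequencies.
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    # A best_freq of 0 means no split into known words exists.
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))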

--> (1.3653e-08, ['file', 'save', 'as'])
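
Multiplying many small frequencies can underflow to 0.0 on long inputs, which makes a valid split indistinguishable from "no split found". One possible variant, not part of the original answer, sums log-frequencies instead and lets functools.lru_cache do the memoization. The name split_text_log and the (-inf, []) return value for an unsplittable text are assumptions made for this sketch; the frequencies are the same made-up ones as above.

```python
import math
from functools import lru_cache

# Made-up frequencies copied from the answer above; swap in a real word list.
WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555,
}

def split_text_log(text, word_frequencies):
    """Return (log_probability, words) for the most likely split of text.

    Hypothetical helper: returns (-inf, []) when no split into known
    words exists.
    """
    @lru_cache(maxsize=None)
    def best(start):
        # Best split of text[start:], memoized on the start index.
        if start == len(text):
            return 0.0, ()
        best_score, best_words = -math.inf, ()
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            freq = word_frequencies.get(word)
            if not freq:
                continue
            rest_score, rest_words = best(end)
            # Adding logs replaces multiplying probabilities.
            score = math.log(freq) + rest_score
            if score > best_score:
                best_score, best_words = score, (word,) + rest_words
        return best_score, best_words

    score, words = best(0)
    return score, list(words)
```

With the dictionary above, split_text_log('filesaveas', WORD_FREQUENCIES)[1] gives ['file', 'save', 'as'], the same split as the original function. The recursion depth still grows with the length of the text, so very long inputs may need an iterative rewrite.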

This question, "python - Detect the most likely words from text without spaces / combined words", corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/2174093/
