How can I detect and split the words in a string of concatenated words?
Examples:
"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]
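Best answer
This is a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase, that is, the split whose word frequencies have the largest product. You will have to find a real word list, but I have included some made-up frequencies for a simple test.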
WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555,
}

def split_text(text, word_frequencies, cache):
    # Return (score, words) for the most probable split of text.
    # A score of 0 with an empty list means no split into known words exists.
    if text in cache:
        return cache[text]
    if not text:
        # The empty string has exactly one split (no words), with score 1.
        return 1, []
    best_freq, best_split = 0, []
    # Try every prefix as the first word, then split the rest recursively.
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word)
        if freq:
            remainder_freq, remainder_split = split_text(
                remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))
--> approximately (1.3653e-08, ['file', 'save', 'as'])
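For longer inputs, multiplying many small frequencies can underflow to zero, and deep recursion can hit Python's recursion limit. The following is a minimal sketch of one way around both, not part of the original answer: it sums log-frequencies instead of multiplying, and fills a table from left to right instead of recursing. The names `build_frequencies` and `split_text_log` are made up for illustration, and the tiny word list stands in for a real corpus.

```python
import math
from collections import Counter


def build_frequencies(corpus_words):
    """Hypothetical helper: relative frequency of each word in a corpus."""
    counts = Counter(w.lower() for w in corpus_words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def split_text_log(text, word_frequencies):
    """Return (log score, words) for the best split, or None if none exists."""
    if not word_frequencies:
        return None
    max_len = max(len(w) for w in word_frequencies)
    # best[i] is the best (log score, words) for text[:i], or None.
    # The empty prefix has log score log(1) == 0.0 and no words.
    best = [(0.0, [])] + [None] * len(text)
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest known word.
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue  # text[:start] cannot be split into known words
            word = text[start:end]
            freq = word_frequencies.get(word)
            if not freq:
                continue
            score = best[start][0] + math.log(freq)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(text)]


# Assumed toy corpus; a real word list would replace this.
freqs = build_frequencies("file save as file save cd image cd".split())
print(split_text_log("filesaveas", freqs)[1])  # ['file', 'save', 'as']
print(split_text_log("cdimage", freqs)[1])     # ['cd', 'image']
```

Keeping a full word list in every table cell is simple but uses extra memory; storing only the start index of the last word and walking backwards at the end would be leaner.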
Source: the Stack Overflow question "python - Detect most likely words from text without spaces / combined words": https://stackoverflow.com/questions/2174093/