I have a dictionary:
dict = ["as", "ass", "share", "rest"]
and an input string:
string = "xassharest"
I want to display all the possible word sequences that can be built from the string using this dictionary:
[('x', 'as', 's', 'h', 'a', 'rest'), ('x', 'as', 'share', 's', 't'), ('x', 'ass', 'h', 'a', 'rest')]
I have already tried generating every combination of cuts in the string (using the itertools library), but it takes a long time. Here is my code:
from itertools import combinations

def getallpossiblewords(string):
    # preprocessingcorpus is my own helper (not shown); it loads the word list
    allwords = preprocessingcorpus("corpus.txt")
    temp = []
    # collect every substring of the input that is a dictionary word
    for i in range(0, len(string)):
        for j in range(1, len(string) + 1):
            if string[i:j] in allwords:
                temp += [string[i:j]]
    allposwords = sorted(temp, key=len, reverse=True)
    #print(allposwords)
    return allposwords

def wordseg(string):
    a = string
    b = getallpossiblewords(string)
    cuts = []
    allpos = []
    # every possible set of cut positions inside the string
    for i in range(0,len(a)):
        cuts.extend(combinations(range(1,len(a)),i))
    for i in cuts:
        last = 0
        output = []
        for j in i:
            output.append(a[last:j])
            last = j
        output.append(a[last:])
        # keep the split if any of its pieces is a dictionary word
        for x in range(len(output)):
            if output[x] in b:
                allpos += [output]
                #print(output)
    #print(allpos)
    # drop duplicate splits while preserving order
    fixallpos = list()
    for sublist in allpos:
        if sublist not in fixallpos:
            fixallpos.append(sublist)
    return fixallpos
I need the fastest possible algorithm for this, because the input string can be much longer.
Can anyone help me solve this?
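For a sense of why the brute-force attempt slows down so quickly: wordseg builds every subset of the len(a) - 1 possible cut positions, so the number of candidate splits doubles with each extra character. The helper below is a hypothetical illustration (count_cuts is not part of the question's code) that counts those candidates the same way the loop over combinations does; it assumes Python 3.8+ for math.comb.

```python
from math import comb

def count_cuts(n):
    """Number of cut sets wordseg generates for a string of length n.

    Mirrors the loop in the question: for each i in range(n), take every
    combination of i cut positions out of the n - 1 available ones.
    """
    return sum(comb(n - 1, i) for i in range(n))

# The total is always 2 ** (n - 1), so it doubles per added character.
for n in (10, 20, 30):
    print(n, count_cuts(n))
```

For the 10-character example that is only 512 candidate splits, but a 30-character input already produces more than 500 million, before the duplicate filtering at the end is even counted.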
Best answer
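This seems like a perfect recursive use of str.partition(). Below is my example implementation. I won't claim it solves every case (there are effectively no test cases); it is more of a sales pitch for this particular approach: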
def segmented(string):
    segmentations = set()
    for word in words:
        # split around the first occurrence of word, if there is one
        before, match, after = string.partition(word)
        if not match:
            continue
        # a piece with no dictionary word in it falls back to the raw string,
        # which the * unpacking below turns into single characters
        prefixes = segmented(before) or [before]
        suffixes = segmented(after) or [after]
        if prefixes and suffixes:
            for prefix in prefixes:
                for suffix in suffixes:
                    segmentations.add((*prefix, word, *suffix))
        elif prefixes:
            for prefix in prefixes:
                segmentations.add((*prefix, word, *suffixes))
        elif suffixes:
            for suffix in suffixes:
                segmentations.add((*prefixes, word, suffix))
        else:
            segmentations.add((*prefixes, word, *suffixes))
    return segmentations

words = ["as", "ass", "share", "rest"]

print(segmented("xassharest"))
Output
% python3 test.py
{('x', 'as', 's', 'h', 'a', 'rest'), ('x', 'as', 'share', 's', 't'), ('x', 'ass', 'h', 'a', 'rest')}
%
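Since the question asks about longer inputs, one possible refinement is to cache results per substring, because the recursion above tends to revisit the same pieces through different words. What follows is only a sketch of the same partition-based recursion with functools.lru_cache added, not a tested replacement; the make_segmenter wrapper is a hypothetical name introduced here so the word list does not have to be a global.

```python
from functools import lru_cache

def make_segmenter(words):
    """Build a cached segmenter for a fixed word list (hypothetical helper)."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def segmented(string):
        segmentations = set()
        for word in words:
            # Same idea as above: split around the first occurrence of word.
            before, match, after = string.partition(word)
            if not match:
                continue
            # A piece containing no dictionary word becomes a tuple of its
            # characters (an empty piece becomes an empty tuple).
            prefixes = segmented(before) or (tuple(before),)
            suffixes = segmented(after) or (tuple(after),)
            for prefix in prefixes:
                for suffix in suffixes:
                    segmentations.add(prefix + (word,) + suffix)
        # frozenset so that cached results cannot be mutated by callers
        return frozenset(segmentations)

    return segmented

segment = make_segmenter(["as", "ass", "share", "rest"])
for segmentation in sorted(segment("xassharest")):
    print(segmentation)
```

On the example input this should give the same three tuples as the output above. Like the original, it only ever splits on the first occurrence of each word, so it inherits the same untested edge cases.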
Regarding python: dictionary-based word segmentation, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46326968/