I have a string on a single line:
"specificationsinaccordancewithqualityaccreditedstandards"
It needs to be split into tokenized words, for example:
"specifications in accordance with quality accredited standards"
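I have already tried nltk's word_tokenize, but it could not split the string.
Context: I am parsing a PDF document into a text file, and this is the text I get back from the PDF converter. To convert the PDF to text, I am using PDFMiner in Python.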
You can solve this with recursion. First, you need to download a dictionary txt file, which you can get here: https://github.com/Ajax12345/My-Python-Projects/blob/master/the_file.txt
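# Load the word list (one word per line), skipping blank lines so that
# an empty string can never match as a prefix.
with open('the_file.txt') as f:
    dictionary = [line.strip() for line in f if line.strip()]

def get_options(scrambled, flag, totals, last):
    # flag signals the end of the recursion; totals holds the words found so far.
    if flag:
        return totals
    else:
        # Every dictionary word that is a prefix of the remaining text.
        new_list = [i for i in dictionary if scrambled.startswith(i)]
        if new_list:
            # Take the last matching word in file order.
            possible_word = new_list[-1]
            new_totals = totals
            new_totals.append(possible_word)
            new_scrambled = scrambled[len(possible_word):]
            return get_options(new_scrambled, False, new_totals, possible_word)
        else:
            # Nothing matches (or the text is used up): stop and return.
            return get_options("", True, totals, '')

s = "specificationsinaccordancewithqualityaccreditedstandards"
print(' '.join(get_options(s, False, [], '')))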
Output:
specifications in accordance with quality accredited standards
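
One caveat: the function above commits to a single prefix at each step and stops as soon as no dictionary word matches the rest, so an unlucky early choice can leave part of the string unsplit. As a rough sketch of one way around that (not part of the answer above), the variant below backtracks when a choice leads to a dead end and memoizes results per start position. The names `WORDS` and `segment` are hypothetical, and the tiny inline word set is only a stand-in for the real dictionary file.

```python
from functools import lru_cache

# Hypothetical stand-in for the dictionary file; a real word list is far larger.
WORDS = frozenset({
    "specifications", "specification", "in", "accordance", "with",
    "quality", "accredited", "standards", "a", "an", "and",
})


def segment(text, words=WORDS):
    """Return a list of dictionary words that concatenate to text, or None."""
    max_len = max(map(len, words))

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end means everything before it was split successfully.
        if start == len(text):
            return ()
        # Try longer candidates first so the result prefers long words.
        for end in range(min(len(text), start + max_len), start, -1):
            candidate = text[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None  # no valid split of text[start:] exists

    result = solve(0)
    return None if result is None else list(result)


s = "specificationsinaccordancewithqualityaccreditedstandards"
print(" ".join(segment(s)))
```

With this word set, the example string splits into the same seven words as before. The function returns `None` when no full split exists, and because it recurses once per word, very long inputs may need an iterative rewrite to stay under Python's recursion limit.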