I have a txt file containing the following text:
Table of Contents
Preface 1
Chapter 1: Tokenizing Text and WordNet Basics 7
Tokenizing text into sentences 8
Tokenizing sentences into words 10
Tokenizing sentences using regular expressions 12
If my input string is:
input = "Tokenzing sentence using expressions"
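I thought about extracting the sentence by matching its first and last words, but many lines share the same words.
So what is the best way to get the following output?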
Tokenizing sentences using regular expressions
Best answer
If you are prepared to preprocess the chapter titles, removing the page numbers and other extras, then:
import difflib

# Chapter titles with page numbers and "Chapter N:" prefixes already removed
contents = ["Tokenizing Text and WordNet Basics",
            "Tokenizing text into sentences",
            "Tokenizing sentences into words",
            "Tokenizing sentences using regular expressions"]
query = "Tokenzing sentence using expressions"
# n=1 returns only the single closest match
print(difflib.get_close_matches(query, contents, n=1))
will give you this output:
['Tokenizing sentences using regular expressions']
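The preprocessing step mentioned above is not shown in the answer. A minimal sketch of one way to do it follows, assuming every page number is a trailing integer and every chapter prefix has the form "Chapter N:". The `toc_lines` list stands in for the lines read from the txt file, and `clean_title` is a hypothetical helper, not part of `difflib`.

```python
import difflib
import re

# Stand-in for the lines read from the txt file (assumed layout).
toc_lines = [
    "Table of Contents",
    "Preface 1",
    "Chapter 1: Tokenizing Text and WordNet Basics 7",
    "Tokenizing text into sentences 8",
    "Tokenizing sentences into words 10",
    "Tokenizing sentences using regular expressions 12",
]


def clean_title(line):
    """Hypothetical helper: drop a trailing page number and a 'Chapter N:' prefix."""
    line = re.sub(r"\s+\d+$", "", line.strip())      # trailing page number
    line = re.sub(r"^Chapter\s+\d+:\s*", "", line)   # leading chapter label
    return line


titles = [clean_title(line) for line in toc_lines]

query = "Tokenzing sentence using expressions"
# cutoff is the minimum similarity ratio (0.0 to 1.0); 0.6 is difflib's default.
matches = difflib.get_close_matches(query, titles, n=1, cutoff=0.6)
print(matches)
```

With these sample lines, the closest match should again be 'Tokenizing sentences using regular expressions'. If nothing reaches the cutoff, `get_close_matches` returns an empty list, so lowering `cutoff` may help when the query is much shorter than the title.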
Regarding "python - finding similar text present in a string in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44227820/