我正在做一个 Python 练习,从给定的排序 wordlist
中搜索 单词
,其中包含超过 100,000 个单词。
当使用 Python 中的 bisect_left
时 bisect
module ,效率很高,但是使用自己创建的二进制方法效率很低。谁能解释一下原因吗?
这是使用Python bisect
模块的搜索方法:
def in_bisect(word_list, word):
"""Checks whether a word is in a list using bisection search.
Precondition: the words in the list are sorted
word_list: list of strings
word: string
"""
i = bisect_left(word_list, word)
if i != len(word_list) and word_list[i] == word:
return True
else:
return False
我的实现效率确实很低(不知道为什么):
def my_bisect(wordlist,word):
"""search the given word in a wordlist using
bisection search, also known as binary search
"""
if len(wordlist) == 0:
return False
if len(wordlist) == 1:
if wordlist[0] == word:
return True
else:
return False
if word in wordlist[len(wordlist)/2:]:
return True
return my_bisect(wordlist[len(wordlist)/2:],word)
最佳答案
if word in wordlist[len(wordlist)/2:]
将使 Python 搜索一半的 wordlist
,这有点违背了编写二分搜索的初衷。另外,您没有正确地将列表分成两半。二分搜索的策略是每一步将搜索空间减半,然后仅对您的单词
可能所在的一半应用相同的策略。知道哪一半是正确的搜索,wordlist
的排序至关重要。下面是一个示例实现,它跟踪验证 word
是否在 wordlist
中所需的调用次数。
import random
numcalls = 0
def bs(wordlist, word):
# increment numcalls
print('wordlist',wordlist)
global numcalls
numcalls += 1
# base cases
if not wordlist:
return False
length = len(wordlist)
if length == 1:
return wordlist[0] == word
# split the list in half
mid = int(length/2) # mid index
leftlist = wordlist[:mid]
rightlist = wordlist[mid:]
print('leftlist',leftlist)
print('rightlist',rightlist)
print()
# recursion
if word < rightlist[0]:
return bs(leftlist, word) # word can only be in left list
return bs(rightlist, word) # word can only be in right list
alphabet = 'abcdefghijklmnopqrstuvwxyz'
wl = sorted(random.sample(alphabet, 10))
print(bs(wl, 'm'))
print(numcalls)
我添加了一些 print
语句,以便您可以看到发生了什么。这是两个示例输出。首先:word
在 wordlist
中:
wordlist ['b', 'c', 'g', 'i', 'l', 'm', 'n', 'r', 's', 'v']
leftlist ['b', 'c', 'g', 'i', 'l']
rightlist ['m', 'n', 'r', 's', 'v']
wordlist ['m', 'n', 'r', 's', 'v']
leftlist ['m', 'n']
rightlist ['r', 's', 'v']
wordlist ['m', 'n']
leftlist ['m']
rightlist ['n']
wordlist ['m']
True
4
第二:单词
不在单词列表
中:
wordlist ['a', 'c', 'd', 'e', 'g', 'l', 'o', 'q', 't', 'x']
leftlist ['a', 'c', 'd', 'e', 'g']
rightlist ['l', 'o', 'q', 't', 'x']
wordlist ['l', 'o', 'q', 't', 'x']
leftlist ['l', 'o']
rightlist ['q', 't', 'x']
wordlist ['l', 'o']
leftlist ['l']
rightlist ['o']
wordlist ['l']
False
4
请注意,如果将单词列表的大小加倍,即使用
wl = sorted(random.sample(alphabet, 20))
numcalls
平均只比长度为 10 的 wordlist
高 1,因为 wordlist
只需要分成两半一次更多。
关于python - 为什么我的二分搜索实现效率很低?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/24352176/