I'm building a search engine with Haystack, and one of the features I'm working on lets people filter by a version field, declared like this:
version = indexes.CharField(model_attr="version")
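Versions are short strings and are not restricted to semantic "x.y.z"-style versions; one may be as simple as "1".
Unfortunately, after some experimenting, it seems that Haystack ignores filters shorter than 3 characters. So this: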
SearchQuerySet().filter(version="1")
returns nothing at all, whereas this:
SearchQuerySet().filter(content="foo").filter(version="1")
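returns everything that matches the first filter.
After some more experimenting, I found that this depends on the length of the string, not on whether the field is numeric. So all of these behave the same way: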
SearchQuerySet().filter(version="1")
SearchQuerySet().filter(version="a")
SearchQuerySet().filter(version="1a")
These, on the other hand, do work (if an item has version set to "100"):
SearchQuerySet().filter(version=100)
SearchQuerySet().filter(version="100")
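Obviously I don't want this level of granularity for every field, but is there any way to declare that, for a specific field, I want filtering to work even on a single character?
Best answer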
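I'll answer with the Whoosh backend in mind, but the same approach can be applied to other backends by studying their rules.
In the build_schema method of WhooshSearchBackend, django-haystack uses StemmingAnalyzer (imported from whoosh.analysis) for text (char) fields. Looking at whoosh.analysis.StemmingAnalyzer, you can see that it takes a minsize parameter that defaults to 2, which is why you cannot filter on a single character. We need to override the build_schema method of WhooshSearchBackend and set the minsize parameter of StemmingAnalyzer to 1.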
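To make the effect of minsize concrete, here is a minimal pure-Python sketch of a length-based token filter. It is a simplified model written for illustration, not Whoosh's actual implementation; the function name length_filter and the sample tokens are assumptions.

```python
def length_filter(tokens, minsize=2):
    """Keep only tokens that are at least `minsize` characters long.

    Simplified model of a minimum-length token filter; a hypothetical
    helper, not part of Whoosh or Haystack.
    """
    return [token for token in tokens if len(token) >= minsize]


tokens = ["1", "100"]

# With a minimum size of 2, the one-character token is dropped.
default_result = length_filter(tokens)

# Lowering minsize to 1 lets the one-character token through.
relaxed_result = length_filter(tokens, minsize=1)

print(default_result)
print(relaxed_result)
```

In this model a filter value of "1" only survives analysis once minsize is lowered to 1, which is the change the custom backend makes.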
Put this code in search_backends.py:
from haystack.backends.whoosh_backend import (
    WhooshEngine, WhooshSearchBackend, WHOOSH_ID, ID, DJANGO_CT, DJANGO_ID,
    Schema, IDLIST, TEXT, KEYWORD, NUMERIC, BOOLEAN, DATETIME, NGRAM, NGRAMWORDS,
)
from haystack.exceptions import SearchBackendError  # raised below when no fields are found
from whoosh.analysis import StemmingAnalyzer


class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        schema_fields = {
            ID: WHOOSH_ID(stored=True, unique=True),
            DJANGO_CT: WHOOSH_ID(stored=True),
            DJANGO_ID: WHOOSH_ID(stored=True),
        }
        # Grab the number of keys that are hard-coded into Haystack.
        # We'll use this to (possibly) fail slightly more gracefully later.
        initial_key_count = len(schema_fields)
        content_field_name = ''

        for field_name, field_class in fields.items():
            if field_class.is_multivalued:
                if field_class.indexed is False:
                    schema_fields[field_class.index_fieldname] = IDLIST(stored=True, field_boost=field_class.boost)
                else:
                    schema_fields[field_class.index_fieldname] = KEYWORD(stored=True, commas=True, scorable=True, field_boost=field_class.boost)
            elif field_class.field_type in ['date', 'datetime']:
                schema_fields[field_class.index_fieldname] = DATETIME(stored=field_class.stored)
            elif field_class.field_type == 'integer':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=int, field_boost=field_class.boost)
            elif field_class.field_type == 'float':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=float, field_boost=field_class.boost)
            elif field_class.field_type == 'boolean':
                # Field boost isn't supported on BOOLEAN as of 1.8.2.
                schema_fields[field_class.index_fieldname] = BOOLEAN(stored=field_class.stored)
            elif field_class.field_type == 'ngram':
                schema_fields[field_class.index_fieldname] = NGRAM(minsize=3, maxsize=15, stored=field_class.stored, field_boost=field_class.boost)
            elif field_class.field_type == 'edge_ngram':
                schema_fields[field_class.index_fieldname] = NGRAMWORDS(minsize=2, maxsize=15, at='start', stored=field_class.stored, field_boost=field_class.boost)
            else:
                # The only change from the stock backend: minsize=1 keeps one-character tokens.
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1), field_boost=field_class.boost)

            if field_class.document is True:
                content_field_name = field_class.index_fieldname

        # Fail more gracefully than relying on the backend to die if no fields
        # are found.
        if len(schema_fields) <= initial_key_count:
            raise SearchBackendError("No fields were found in any search_indexes. Please correct this before attempting to search.")

        return (content_field_name, Schema(**schema_fields))


class CustomWhooshEngine(WhooshEngine):
    backend = CustomSearchBackend
Now we need to tell Haystack to use our CustomSearchBackend, through the CustomWhooshEngine that wraps it:
import os  # needed for the index path below, if settings.py does not import it already

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'search_backends.CustomWhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}
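After doing this, run the rebuild_index and update_index commands. You should then be able to filter on a single character, except for the letter a, because a is also in STOP_WORDS. If you also want to allow the single character a, pass your own STOP_WORDS, with the letter a removed, to the analyzer in build_schema: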
from whoosh.analysis import STOP_WORDS

# Remove all single-letter stop words.
STOP_WORDS = frozenset([el for el in STOP_WORDS if len(el) > 1])


class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
            # rest of code
            # ------
            else:
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1, stoplist=STOP_WORDS), field_boost=field_class.boost)
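The stop-word trimming above can be tried in isolation. The sketch below uses a small hand-written sample list instead of Whoosh's real STOP_WORDS, so SAMPLE_STOP_WORDS and survives are illustrative assumptions rather than library names.

```python
# Hypothetical sample; Whoosh's real STOP_WORDS list is longer.
SAMPLE_STOP_WORDS = frozenset(["a", "an", "and", "the"])

# Same idea as in the answer: drop every single-letter stop word.
trimmed = frozenset(word for word in SAMPLE_STOP_WORDS if len(word) > 1)


def survives(token, stoplist, minsize=1):
    """Return True if a token is long enough and is not a stop word."""
    return len(token) >= minsize and token not in stoplist


print(survives("a", SAMPLE_STOP_WORDS))
print(survives("a", trimmed))
print(survives("the", trimmed))
```

Only the single-letter entry is removed from the list; multi-letter stop words such as "the" are still filtered out.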
Note: the build_schema code may differ between Haystack versions. The code above is based on whoosh==2.4 and haystack==2.0.0.
Source: "python - Haystack search on a very short character field", a similar question found on Stack Overflow: https://stackoverflow.com/questions/26268705/