python - Haystack: searching a very short char field

I'm building a search engine with Haystack, and one of the features I'm working on lets people filter by a version field, declared like this:

version = indexes.CharField(model_attr="version")

Versions are short strings that are not restricted to semantic "x.y.z" style versions; a version may be as simple as "1".

Unfortunately, after some experimenting, it seems that Haystack ignores filters shorter than 3 characters. So this:

SearchQuerySet().filter(version="1")

returns nothing at all, while this:

SearchQuerySet().filter(content="foo").filter(version="1")

returns everything that matches the first filter.

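After some more experimenting, I found that this depends on the length of the string, not on whether the value is numeric. So all of these behave the same way: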

SearchQuerySet().filter(version="1")
SearchQuerySet().filter(version="a")
SearchQuerySet().filter(version="1a")

What does work is these (if an item has its version set to "100"):

SearchQuerySet().filter(version=100)
SearchQuerySet().filter(version="100")

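Obviously I don't want this level of granularity for every field, but is there any way to declare that, for a specific field, I want filtering to work even on a single character?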

Best answer

My answer assumes the Whoosh backend, but the same approach can be applied to other backends by studying their rules.

django-haystack uses StemmingAnalyzer, imported from whoosh.analysis, for text (char) fields in the build_schema method of WhooshSearchBackend. Looking at whoosh.analysis.StemmingAnalyzer, you can see that it takes a minsize parameter that defaults to 2, which is why you cannot filter on a single character. We need to override the build_schema method of WhooshSearchBackend and set the minsize parameter of StemmingAnalyzer to 1.
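
To make the effect of minsize concrete, here is a minimal pure-Python sketch of the kind of check involved. It is a simplified model, not Whoosh's actual code: the names keep_token and analyze are hypothetical, and the tokenizer here just lowercases and splits on whitespace.

```python
# Simplified model of a length/stop-word filter; NOT Whoosh's implementation.
# keep_token and analyze are hypothetical names used only for illustration.

def keep_token(token, minsize=2, stoplist=frozenset()):
    """Return True if a token survives the length and stop-word checks."""
    return len(token) >= minsize and token not in stoplist


def analyze(text, minsize=2, stoplist=frozenset()):
    """Lowercase, split on whitespace, and drop tokens rejected by keep_token."""
    return [t for t in text.lower().split() if keep_token(t, minsize, stoplist)]


# With the default minsize of 2 the one-character term is dropped,
# so there is nothing left to match against the index.
default_result = analyze("1")

# Lowering minsize to 1 lets the term through.
relaxed_result = analyze("1", minsize=1)
```

Under this model, a query term that is filtered out before it reaches the index cannot match anything, which is consistent with filter(version="1") returning no results.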

Put this code in search_backends.py:

from haystack.backends.whoosh_backend import (
    WhooshEngine, WhooshSearchBackend, WHOOSH_ID, ID, DJANGO_CT, DJANGO_ID,
    Schema, IDLIST, TEXT, KEYWORD, NUMERIC, BOOLEAN, DATETIME, NGRAM, NGRAMWORDS,
)
# Raised by build_schema when no index fields are found.
from haystack.exceptions import SearchBackendError

from whoosh.analysis import StemmingAnalyzer

class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        schema_fields = {
            ID: WHOOSH_ID(stored=True, unique=True),
            DJANGO_CT: WHOOSH_ID(stored=True),
            DJANGO_ID: WHOOSH_ID(stored=True),
        }
        # Grab the number of keys that are hard-coded into Haystack.
        # We'll use this to (possibly) fail slightly more gracefully later.
        initial_key_count = len(schema_fields)
        content_field_name = ''

        for field_name, field_class in fields.items():
            if field_class.is_multivalued:
                if field_class.indexed is False:
                    schema_fields[field_class.index_fieldname] = IDLIST(stored=True, field_boost=field_class.boost)
                else:
                    schema_fields[field_class.index_fieldname] = KEYWORD(stored=True, commas=True, scorable=True, field_boost=field_class.boost)
            elif field_class.field_type in ['date', 'datetime']:
                schema_fields[field_class.index_fieldname] = DATETIME(stored=field_class.stored)
            elif field_class.field_type == 'integer':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=int, field_boost=field_class.boost)
            elif field_class.field_type == 'float':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, type=float, field_boost=field_class.boost)
            elif field_class.field_type == 'boolean':
                # Field boost isn't supported on BOOLEAN as of 1.8.2.
                schema_fields[field_class.index_fieldname] = BOOLEAN(stored=field_class.stored)
            elif field_class.field_type == 'ngram':
                schema_fields[field_class.index_fieldname] = NGRAM(minsize=3, maxsize=15, stored=field_class.stored, field_boost=field_class.boost)
            elif field_class.field_type == 'edge_ngram':
                schema_fields[field_class.index_fieldname] = NGRAMWORDS(minsize=2, maxsize=15, at='start', stored=field_class.stored, field_boost=field_class.boost)
            else:
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1), field_boost=field_class.boost)

            if field_class.document is True:
                content_field_name = field_class.index_fieldname

        # Fail more gracefully than relying on the backend to die if no fields
        # are found.
        if len(schema_fields) <= initial_key_count:
            raise SearchBackendError("No fields were found in any search_indexes. Please correct this before attempting to search.")

        return (content_field_name, Schema(**schema_fields))

class CustomWhooshEngine(WhooshEngine):
    backend = CustomSearchBackend
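
The backend above lowers minsize for every text field, while the question asks for this only on specific fields. One possible way to narrow it, sketched here under assumptions, is a small helper that picks the analyzer keyword arguments per field. SHORT_TOKEN_FIELDS and analyzer_options are hypothetical names and are not part of Haystack or Whoosh.

```python
# Hypothetical helper, not part of Haystack or Whoosh.
# Index field names that should accept one-character terms (an assumption
# for this example; adjust to your own index).
SHORT_TOKEN_FIELDS = frozenset(["version"])


def analyzer_options(index_fieldname, short_fields=SHORT_TOKEN_FIELDS):
    """Return keyword arguments for the analyzer of the given field."""
    if index_fieldname in short_fields:
        return {"minsize": 1}
    # An empty dict leaves the analyzer's own defaults untouched.
    return {}


# Intended use inside the final else branch of build_schema (untested sketch):
#   analyzer=StemmingAnalyzer(**analyzer_options(field_class.index_fieldname))
```

With this, only the fields listed in SHORT_TOKEN_FIELDS would get minsize=1 and all other text fields would keep the default behavior.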

Now we need to tell haystack to use our CustomSearchBackend:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'search_backends.CustomWhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}

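After doing this, run the rebuild_index and update_index commands. You should then be able to filter on any single character except the letter a, because the letter a is also in STOP_WORDS. If you also want to allow the single character a, pass your own STOP_WORDS, with the letter a removed, to the analyzer in build_schema: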

from whoosh.analysis import STOP_WORDS
STOP_WORDS = frozenset([el for el in STOP_WORDS if len(el) > 1]) # remove all single letter stop words

class CustomSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        # rest of code
        # ------
            else:
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(minsize=1, stoplist=STOP_WORDS), field_boost=field_class.boost)
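
The stop-word filtering step can be checked in isolation. The sketch below uses a small made-up stop list (SAMPLE_STOP_WORDS, an assumption for this example) rather than Whoosh's real STOP_WORDS, and a hypothetical helper name.

```python
# SAMPLE_STOP_WORDS is a made-up list for illustration, not Whoosh's STOP_WORDS.
SAMPLE_STOP_WORDS = frozenset(["a", "an", "and", "the", "to"])


def drop_single_letter_words(stop_words):
    """Return a new frozenset without any one-character stop words."""
    return frozenset(word for word in stop_words if len(word) > 1)


filtered = drop_single_letter_words(SAMPLE_STOP_WORDS)
# "a" is no longer treated as a stop word; longer entries are kept.
```

Passing a frozenset keeps the stop list immutable, matching the type used in the snippet above.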

Note: the build_schema code may differ between haystack versions. The code above was tested with whoosh==2.4 and haystack==2.0.0.

For this topic (python - Haystack: searching a very short char field), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26268705/
