java - How to escape a URL for Elasticsearch?

Tags: java url elasticsearch

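In one field of my Elasticsearch documents I store the document's URL (for example http://techcrunch.com/something-great).

When I don't escape the URL, the document is found correctly, but for some URLs I get an EOF error.

When I escape the URL: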

String escapedString = QueryParser.escape(e.getKey().getUrl());

the document is not found: I get zero hits.

So what should I do?

{
    _index: "crawlbot",
    _type: "article",
    _id: "AVFaaFu4w49jUzVInKS5",
    _score: 1,
    _source: {
        job: {
            id: 65,
            name: "wikipedia_en",
            max_pages: 300000,
            crawl_depth: 0,
            processing_patterns: "-Category,-User,-Wikipedia:,-Topic,-Special:,-Talk:,-Portal:,-MOS",
            status: 0,
            days: 0,
            url: [
                "https://en.wikipedia.org"
            ],
            ajax: false,
            min_description: 0
        },
        article: {
            url: "https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Parliament_of_Romania",
            provider_url: "https://en.wikipedia.org",
            provider_name: "",
            provider_display: "en.wikipedia.org",
            favicon_url: "http://www.google.com/s2/u/0/favicons?domain=https://en.wikipedia.org",
            language: "en",
            metadata: {
                authors: []
            },
            entities: [],
            keywords: [],
            videos: [],
            unfilteredKeywords: [],
            published: "",
            published_long: 0
        }
    }
}

I want to retrieve the document for each article.url.

This is the query:

SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
String escapedString = QueryParser.escape(e.getKey().getUrl());
queryBuilder.must(QueryBuilders.queryStringQuery(escapedString).defaultField("article.url"));
queryBuilder.must(QueryBuilders.queryStringQuery(e.getKey().getJob().getId() + "").defaultField("job.id"));

If I don't escape the URL, I get this error:

Exception in thread "main" org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], all shards failed; shardFailures
{[9_T8APppReyWKppSNZWmXw][crawlbot][0]: SearchParseException[[crawlbot][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
{[9_T8APppReyWKppSNZWmXw][crawlbot][1]: SearchParseException[[crawlbot][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
{[9_T8APppReyWKppSNZWmXw][crawlbot][2]: SearchParseException[[crawlbot][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
{[9_T8APppReyWKppSNZWmXw][crawlbot][3]: SearchParseException[[crawlbot][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
{[9_T8APppReyWKppSNZWmXw][crawlbot][4]: SearchParseException[[crawlbot][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111.  Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
    at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Best answer

I suggest changing the mapping of the article.url field to:

"url": {
    "type": "string",
    "index": "not_analyzed"
}

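Otherwise the field is analyzed and becomes hard to query for an exact match, because the standard analyzer splits the URL into multiple tokens.

You can then query the documents with a term query instead of a query_string query.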

SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
queryBuilder.must(QueryBuilders.termQuery("article.url", e.getKey().getUrl()));
...                                 ^
                                    |
                        use a term query instead
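
To illustrate why the term query sidesteps the parse error: query_string passes its input through the Lucene query parser, which gives special meaning to a set of reserved characters, whereas a term query takes the value verbatim. The sketch below is plain Java and not part of the Elasticsearch API; the class and method names are made up for illustration, and the reserved list follows the one in the query_string documentation, with & and | treated as single characters for simplicity (the documentation lists them as && and ||).

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class QueryStringReservedChars {

    // Reserved characters of the query_string syntax (assumption: '&' and '|'
    // are checked individually, which is stricter than "&&" and "||").
    private static final String RESERVED = "+-=&|><!(){}[]^\"~*?:\\/";

    /** Returns the reserved characters that occur in the input, in first-seen order. */
    static Set<Character> reservedCharsIn(String input) {
        Set<Character> found = new LinkedHashSet<>();
        for (char c : input.toCharArray()) {
            if (RESERVED.indexOf(c) >= 0) {
                found.add(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String url = "http://www.zeit.de/wirtschaft/2015-11/"
                + "griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4";
        System.out.println(reservedCharsIn(url)); // prints [:, /, -, ?, =]
    }
}
```

Almost any real URL contains several of these characters, so passing one to query_string means either risking a parse failure or escaping it first. With a term query on a not_analyzed field, neither step is needed.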

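Update

Following Evaldas's comment (kudos to Evaldas!), the final idea is to create a custom analyzer so that the URL is also lowercased.

When creating the index, you can add the new analyzer in the settings and then use it as the analyzer of the article.url field: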

PUT /crawlbot
{
    "settings": {
        "analysis": {
            "analyzer": {
                "url_analyzer": {
                    "type":         "custom",
                    "tokenizer":    "keyword",
                    "filter":       [ "lowercase" ]
                }
            }
        }
    },
    "mappings": {
        "article": {
            "properties": {
                "article": {
                    "properties": {
                        "url": {
                            "type": "string",
                            "analyzer": "url_analyzer"
                        }
                    }
                }
            }
        }
    }
}
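
One consequence worth spelling out: a term query does not analyze its input, so with this analyzer the value passed to the term query has to be lowercased by the caller (or a match query, which does analyze its input, can be used instead). The following is only a rough plain-Java sketch of that behaviour. The class name, the analyze helper, and the map standing in for the index are all hypothetical, and toLowerCase(Locale.ROOT) is assumed to be a close enough stand-in for the lowercase filter on ASCII URLs.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class UrlAnalyzerSketch {

    /** Mimics keyword tokenizer + lowercase filter: the whole input is one lowercased token. */
    static String analyze(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String url = "https://en.wikipedia.org/w/index.php"
                + "?action=history&feed=atom&title=Parliament_of_Romania";

        // Toy stand-in for the index: indexed token -> document id.
        Map<String, String> index = new HashMap<>();
        index.put(analyze(url), "AVFaaFu4w49jUzVInKS5");

        // A term-style lookup with the raw value misses because of the capital letters.
        System.out.println(index.get(url));          // prints null
        // Lowercasing the value first finds the document.
        System.out.println(index.get(analyze(url))); // prints AVFaaFu4w49jUzVInKS5
    }
}
```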

Regarding "java - How to escape a URL for Elasticsearch?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34133218/
