ruby-on-rails - Indexing large documents with pg_search and GIN

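I'm using Rails 3 to build a forum for myself and my friends (for various reasons I'm not using an off-the-shelf forum), and I'm trying to implement full-text search across the whole forum. Nothing fancy: if someone searches for the string "morning", I want to show a list of all forum posts that contain the word "morning". I've been using pg_search for this, but it's slow (5+ seconds), since we already have 300 forum threads and 200k+ posts, some of which are 4k+ characters long. So I have this migration for multisearch: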

class CreatePgSearchDocuments < ActiveRecord::Migration
  def self.up
    say_with_time("Creating table for pg_search multisearch") do
      create_table :pg_search_documents do |t|
        t.text :content
        t.belongs_to :searchable, :polymorphic => true, :index => true
        t.timestamps null: false
      end
      add_index :pg_search_documents, :content, using: "gin"
      PgSearch::Multisearch.rebuild(Post)
      PgSearch::Multisearch.rebuild(Reply)
    end
  end
end

But when I run the migration, it fails with this error:

PG::ProgramLimitExceeded: ERROR:  index row size 3080 exceeds maximum 2712 for index "index_pg_search_documents_on_content"
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

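So far, Googling has turned up the following:

GIN indexes outperform GiST indexes when dealing with more than 100,000 lexemes. To me that implies a GIN index should be able to handle posts that are only 700 words long.
I guessed that the error was about a single value rather than the length of the document, and worried that it was caused by the fact that I allow a subset of HTML tags in forum posts, so instead of storing post.content I now store post.sanitized_content. That strips all HTML, then replaces punctuation with spaces, then collapses repeated spaces, like so: ActionView::Base.full_sanitizer.sanitize(content).gsub(/[^\w ]/, ' ').squeeze(" "). This brought the error message down to index row size 2848 exceeds maximum 2712, so it clearly did something, just not enough.
I then sanity-checked that pg_search actually lets me use a dynamic method like this, and that it wasn't just silently failing. According to the documentation, "However, if you call any dynamic methods in :against, the following strategy will be used", so it seems to handle them fine.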

The relevant part of my Post implementation:

class Post < ActiveRecord::Base
  include PgSearch

  multisearchable against: [:subject, :sanitized_content]

  def sanitized_content
    ActionView::Base.full_sanitizer.sanitize(content).gsub(/[^\w ]/, ' ').squeeze(" ")
  end
end

(I also tried removing :subject from the multisearchable against array, in case an unsanitized subject was causing the problem; that got me down to row size 2800 in the error, but didn't fix it.)
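To make the sanitizing step concrete, here is a minimal plain-Ruby sketch of what the gsub/squeeze chain does to a post body. It runs without Rails, so strip_tags is a hypothetical stand-in for ActionView::Base.full_sanitizer.sanitize (a naive regex, not a real HTML sanitizer), and the sample string is made up for illustration.

```ruby
# Hypothetical stand-in for ActionView::Base.full_sanitizer.sanitize.
# It only drops anything that looks like <...>; do not use it on untrusted HTML.
def strip_tags(html)
  html.gsub(/<[^>]*>/, '')
end

# Same chain as Post#sanitized_content, applied to a plain string.
def sanitized_content(content)
  strip_tags(content)
    .gsub(/[^\w ]/, ' ') # replace every character that is not a word character or a space
    .squeeze(' ')        # collapse runs of spaces into a single space
end

sample = "<p>Good <b>morning</b>,   everyone!!</p>"
puts sanitized_content(sample).inspect
```

Stripping markup and punctuation only makes the stored text shorter. That fits the numbers in the question: the reported row size fell from 3080 to 2848, but the 2712 limit itself did not move.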

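So... what am I missing? Shouldn't a GIN index be able to handle large text documents? Do I need to convert my documents to tsvectors first, as in this answer? The error keeps suggesting "full text indexing", but I thought that's what this was.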

Best answer

For future Googlers: tentatively, using

execute "CREATE INDEX idx_fts_search_content ON pg_search_documents USING gin(to_tsvector('english', content))"

instead of

add_index :pg_search_documents, :content, using: "gin"

has solved it. So far the index doesn't do much, and searching everything takes 8.1 seconds, but at least the migration runs now!

Edit: I missed one important thing. The actual command should be:

execute "CREATE INDEX idx_fts_post_content ON posts USING gin(to_tsvector('english', coalesce(\"posts\".\"content\"::text, '')))"

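Without the coalesce(), the index won't be used, presumably because PostgreSQL only considers an expression index when the query contains a matching expression, and the search query wraps the column in coalesce().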
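One way to keep the index expression and the query expression in sync is to build both from a single helper. The following is only a sketch: tsvector_expression and fts_index_sql are hypothetical helper names, and it assumes the search query uses exactly the coalesce(...::text, '') form shown in the command above.

```ruby
# Builds the tsvector expression in the form assumed to appear in the search query.
# (Assumption: the query wraps the column in coalesce(...::text, '') as shown above.)
def tsvector_expression(table, column, dictionary = 'english')
  %Q[to_tsvector('#{dictionary}', coalesce("#{table}"."#{column}"::text, ''))]
end

# Builds a CREATE INDEX statement for a GIN index over that expression.
def fts_index_sql(index_name, table, column, dictionary = 'english')
  "CREATE INDEX #{index_name} ON #{table} USING gin(#{tsvector_expression(table, column, dictionary)})"
end

# Prints the statement so it can be compared with the hand-written command.
puts fts_index_sql('idx_fts_post_content', 'posts', 'content')
```

Inside a migration, the generated string could be passed to execute in place of the hand-written SQL. Running EXPLAIN on a search query afterwards is a reasonable way to check whether the planner actually picks up the index.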

Regarding ruby-on-rails - Indexing large documents with pg_search and GIN, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39684191/
