node.js - How can $facet improve $lookup performance

Question

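I recently attended a tech meetup and showed some of my code to a more experienced developer. He commented that my pipeline would run into problems because of the $lookup, and that I should consider using $facet to fix it.

I don't remember what problem he said I would run into, nor how $facet could help fix it. I think it had something to do with the 16MB document limit, but that can be addressed by using $unwind after $lookup.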

My code (Node.js)

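I have a collection of Post documents. Some posts are parent posts and others are comments. A post that is a comment is identified by the fact that its parent property is not null.

My goal is to return an array of the most recent parent posts and to attach to each one an int property representing the number of comments it has.

Here is my Post Mongoose schema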

const postSchema = new mongoose.Schema({
    title: { type: String, required: true, trim: true },
    body: { type: String, required: true, trim: true },
    category: { type: String, required: true, trim: true, lowercase: true },
    timestamp: { type: Date, required: true, default: Date.now },
    parent: { type: mongoose.Schema.Types.ObjectId, ref: 'Post', default: null },
});

Here is my pipeline

const pipeline = [
    { $match: { category: query.category } },
    { $sort: { timestamp: -1 } },
    { $skip: (query.page - 1) * query.count },
    { $limit: query.count },
    {
        $lookup: {
            from: 'posts',
            localField: '_id',
            foreignField: 'parent',
            as: 'comments',
        },
    },
    {
        $addFields: {
            comments: { $size: '$comments' },
            id: '$_id',
        },
    },
    { $project: { _id: 0, __v: 0 } },
];

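Best answer

The short answer is that it cannot. But if someone told you that, it deserves an explanation of why the idea is incorrect.
Why not $facet
As noted in the comments, $facet cannot do anything for you here, and the suggestion was likely a miscommunication about what your query is meant to do. If anything, the $facet pipeline stage would cause more problems with the BSON limit, for the plain reason that the only output of a $facet stage is a "single document". Unless you are actually using it for its intended purpose of "summary results", you will almost certainly breach this limit under real-world conditions.
The biggest reason it simply does not apply is that your $lookup source is pulling in data from a different collection. The $facet stage only applies to the "same collection", so you cannot have output from one collection in one "facet" and output from another collection in a different facet. "Pipelines" can only be defined for the same collection on which .aggregate() is being executed.
$lookup is still what you want
The point about the BSON size limit, however, is perfectly valid, because the main failing of your current aggregation pipeline is the use of the $size operator on the returned array. The "array" is the real problem here: it is unbounded, so it has the potential to pull in enough documents from the related collection that the parent document containing the array exceeds the BSON limit in the output.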
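To put a rough number on that risk, here is a back-of-the-envelope sketch. The 500-byte average comment size is a made-up figure for illustration only, and the calculation ignores array overhead and the parent's own fields, so the real ceiling is lower.

```javascript
// MongoDB's maximum BSON document size: 16MB
const MAX_BSON_BYTES = 16 * 1024 * 1024;

// avgCommentBytes is a hypothetical average size of one joined comment document.
// Overhead is ignored, so the result is an upper bound rather than an exact figure.
function maxCommentsBeforeLimit(avgCommentBytes) {
  return Math.floor(MAX_BSON_BYTES / avgCommentBytes);
}

console.log(maxCommentsBeforeLimit(500)); // 33554
```

A busy thread can plausibly collect that many replies, which is why counting without materializing the array is the safer design.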
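So there are two basic approaches you can use to obtain the "size" without actually creating an array of the "whole" related documents.
MongoDB 3.6 and later
Here you would use $lookup with the "sub-pipeline" expression syntax introduced in this version to return just the "reduced count", without actually returning any documents: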

const pipeline = [
    { "$match": { "category": query.category } },
    { "$sort": { "timestamp": -1 } },
    { "$skip": (query.page - 1) * query.count },
    { "$limit": query.count },
    { "$lookup": {
      "from": "posts",
      "let": { "id": "$_id" },
      "pipeline": [
        { "$match": {
          "$expr": { "$eq": [ "$$id", "$parent" ] }
        }},
        { "$count": "count" }
      ],
      "as": "comments",
    }},
    { $addFields: {
        "comments": { 
          "$ifNull": [ { "$arrayElemAt": ["$comments.count", 0] }, 0 ]
        },
        "id": "$_id"
    }}
];

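Quite simply, the new "sub-pipeline" form places only the output of the pipeline expression into the target "array" (it is always an array). Not only do we $match on the local and foreign key values (which is what the other $lookup form now does internally), we also continue the pipeline with the $count stage, which in turn is effectively a synonym for: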

{ "$group": { "_id": null, "count": { "$sum": 1 } } },
{ "$project": { "_id": 0, "count": 1 } }

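The point is that you only ever receive at most "one" document in the array response, which we can then easily convert to a singular value with $arrayElemAt, using $ifNull to get a count of 0 when there are no matches in the foreign collection.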
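The conversion from the $lookup result array to a plain integer can be mirrored in ordinary JavaScript. This is only an illustration of the expression's logic, not server code, and countFromLookup is a hypothetical helper name.

```javascript
// Plain-JavaScript mirror of:
//   { $ifNull: [ { $arrayElemAt: ["$comments.count", 0] }, 0 ] }
function countFromLookup(comments) {
  // "$comments.count" maps the array of sub-pipeline results to their count values
  const counts = comments.map((doc) => doc.count);
  // An out-of-range index yields no value, and the $ifNull step then supplies 0
  const first = counts[0];
  return first === undefined || first === null ? 0 : first;
}

console.log(countFromLookup([{ count: 3 }])); // 3
console.log(countFromLookup([]));             // 0
```

The empty-array case is the one that matters: $count emits no document at all when nothing matches, so without the fallback a parent with no comments would have no count field.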
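Earlier versions
For versions earlier than MongoDB 3.6, the general idea is to $unwind directly after the $lookup. This has a special behavior, described under "$lookup + $unwind Coalescence" in the wider manual section on Aggregation Pipeline Optimization. Personally I see these more as a "hindrance" than an "optimization", since you really should be able to "express what you mean" rather than have things done for you "behind your back". But the basics go like this: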

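const pipeline = [
    { "$match": { "category": query.category } },
    { "$sort": { "timestamp": -1 } },
    { "$skip": (query.page - 1) * query.count },
    { "$limit": query.count },
    { "$lookup": {
      "from": "posts",
      "localField": "_id",
      "foreignField": "parent",
      "as": "comments"
    }},
    // Directly after $lookup so the server can coalesce the two stages
    { "$unwind": "$comments" },
    { "$group": {
      "_id": "$_id",
      // "otherField" is a placeholder: repeat this line for each field to keep
      "otherField": { "$first": "$otherField" },
      "timestamp": { "$first": "$timestamp" },
      // One unwound document per comment, so summing 1 counts the comments
      "comments": { "$sum": 1 }
    }},
    // $group does not guarantee output order, so sort again
    { "$sort": { "timestamp": -1 } }
];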

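The important part here is what actually happens to the $lookup and $unwind stages, which you can see by using explain() to view the parsed pipeline as the server actually expresses it: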

        {
            "$lookup" : {
                "from" : "posts",
                "as" : "comments",
                "localField" : "_id",
                "foreignField" : "parent",
                "unwinding" : {
                        "preserveNullAndEmptyArrays" : false
                }
            }
        }

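That unwinding option is essentially "rolled into" the $lookup, and the $unwind itself "disappears". This is because the combination is translated in this "special way", so that the $lookup emits its results already "unwound" instead of targeting an array. This is done so that if the "array" is never actually created, the BSON limit can never be breached.
The rest is of course pretty simple: you just use $group to "group back" to the original document. You can use $first as the accumulator to keep any fields of the document you want in the response, and simply $sum to count the foreign data returned.
Since this is Mongoose, I have already outlined a process for "automating" the construction of all the fields to include with $first as part of my answer to "Querying after populate in Mongoose", which shows how to inspect the "schema" to obtain that information.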
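As a rough sketch of that automation (not the code from the linked answer), the $group stage can be generated from a list of field names. buildGroupStage is a hypothetical helper. With Mongoose the names could come from something like Object.keys(Post.schema.paths), which is an assumption here; the example passes a literal list so it runs without a database. Dotted (nested) paths are not handled.

```javascript
// Build a $group stage that keeps every listed field via $first
// and counts the unwound comments under countField.
function buildGroupStage(fieldNames, countField = 'comments') {
  const group = { _id: '$_id' };
  for (const name of fieldNames) {
    // _id is the grouping key; the count field is produced by the $sum below
    if (name === '_id' || name === countField) continue;
    group[name] = { $first: `$${name}` };
  }
  group[countField] = { $sum: 1 };
  return { $group: group };
}

// Field names taken from the Post schema shown in the question
const stage = buildGroupStage(['_id', 'title', 'body', 'category', 'timestamp', 'parent', '__v']);
console.log(JSON.stringify(stage, null, 2));
```

Generating the stage keeps the pipeline in step with the schema, so a newly added field is not silently dropped by the $group.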
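Another "wrinkle" is that the $unwind negates the "LEFT JOIN" inherent in $lookup: where there are no matches for the parent content, the "parent document" is removed from the results. I am not quite sure about this at the time of writing (and should look it up later), but the preserveNullAndEmptyArrays option did have a restriction in that it could not be applied in this form of "coalescence". That is at least not the case as of MongoDB 3.6: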

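const pipeline = [
    { "$match": { "category": query.category } },
    { "$sort": { "timestamp": -1 } },
    { "$skip": (query.page - 1) * query.count },
    { "$limit": query.count },
    { "$lookup": {
      "from": "posts",
      "localField": "_id",
      "foreignField": "parent",
      "as": "comments"
    }},
    // Keep parents that have no comments at all
    { "$unwind": { "path": "$comments", "preserveNullAndEmptyArrays": true } },
    { "$group": {
      "_id": "$_id",
      // "otherField" is a placeholder: repeat this line for each field to keep
      "otherField": { "$first": "$otherField" },
      "timestamp": { "$first": "$timestamp" },
      "comments": {
        "$sum": {
          "$cond": {
            // A parent kept without matches has no "comments" value;
            // $ifNull covers both a null and a missing field
            "if": { "$ifNull": [ "$comments", false ] },
            "then": 1,
            "else": 0
          }
        }
      }
    }},
    // $group does not guarantee output order, so sort again
    { "$sort": { "timestamp": -1 } }
];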

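Since I cannot actually confirm that this works properly in anything other than MongoDB 3.6, it is somewhat pointless, because with the newer version you should be using the other form of $lookup anyway. I do know that there was at least an initial problem in MongoDB 3.2, where preserveNullAndEmptyArrays cancelled the "coalescence", so the $lookup still returned its output as an "array" and only after that stage was the array "unwound". That defeats the purpose of doing this to avoid the BSON limit.

Doing it in code
All that said, ultimately you are just looking for "counts" of the "related" comments to add to the results. As long as you are not pulling in pages with "hundreds of items", your $limit condition should keep the result set small enough that you can simply fire count() queries to obtain the matching document count for each key without "too much" overhead: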

// Get documents
let posts = await Post.find({ "category": query.category })
    .sort({ "timestamp": -1 })
    .skip((query.page - 1) * query.count)
    .limit(query.count)
    .lean().exec();

// Map counts to each document. Comments live in the same Post collection,
// so the count runs against Post. Later Mongoose releases deprecate count()
// in favor of countDocuments().
posts = (await Promise.all(
  posts.map(post => Post.count({ "parent": post._id }) )
)).map((comments,i) => ({ ...posts[i], comments }) );

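The "trade-off" here is that running all those count() queries in "parallel" means additional requests to the server, but the overhead of each query is actually really low. Obtaining the "cursor count" of a query result is a lot more efficient than using something like the $count aggregation pipeline stage shown above.
This puts load on the database connection while executing, but it does not carry the same "processing load", since you are only looking at "counts" and no documents are returned over the wire or even "fetched" from the collection while processing cursor results.
So that last approach is basically an "optimization" of the Mongoose populate() process, in which we do not actually ask for the "documents" and only get a count for each query. Technically populate() would use "one" query here, with $in for all the documents in the prior result. That will not work here because you want the total per "parent", which is essentially an aggregation within a single query and response. Hence the "multiple requests" issued here.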
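The count-per-parent mapping can be pulled into a small helper that takes the counting function as a parameter, which makes it easy to exercise without a database. This is a sketch only: attachCommentCounts and the stub are hypothetical names, and in real use the second argument would wrap a Mongoose count query on the parent field.

```javascript
// Run one count per post in parallel and attach the result as "comments".
// countComments is any function taking a parent id and returning a promise of a number.
async function attachCommentCounts(posts, countComments) {
  const counts = await Promise.all(posts.map((post) => countComments(post._id)));
  return posts.map((post, i) => ({ ...post, comments: counts[i] }));
}

// Stand-in for the database: fixed counts keyed by parent id (illustrative data)
const fakeCounts = { a: 2, b: 0 };
const stubCount = async (parentId) => fakeCounts[parentId] || 0;

attachCommentCounts([{ _id: 'a', title: 'First' }, { _id: 'b', title: 'Second' }], stubCount)
  .then((result) => console.log(result));
```

Promise.all resolves in input order, so each count lines up with its post by index.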
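Summary
So in order to avoid the BSON limit problems, what you are really looking for is any technique that avoids returning the "array" of related documents from the $lookup stage used for "joining", either by obtaining "reduced data" or by using the "cursor count" technique.
There is a bit more "depth" on the BSON size limit and how to handle it in "Aggregate $lookup Total size of documents in matching pipeline exceeds maximum document size" on this site. Note that the techniques demonstrated there for causing the error apply to the $facet stage as well, since the 16MB limit is a constant for any "document", and just about "everything" in MongoDB is a BSON document, so working within the limits is extremely important.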

NOTE: Purely from a "performance" perspective the biggest problem outside of the potential BSON Size Limit breach inherent in your current query is actually the $skip and $limit processing. If what you are actually implementing is more of a "Load more results..." type of functionality, then something like Implementing pagination in mongodb where you would use a "range" to start the next "page" selection by excluding previous results is a lot more performance oriented than $skip and $limit.

Paging with $skip and $limit should only be used where you really have no other option, such as "numbered paging" where the user can jump to any numbered page. And even then, it's still far better to instead "cache" results into pre-defined sets.

But that really is a "whole other question" than the essential question here about the BSON Size Limit.
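For reference, the "range" style of paging mentioned in the note can be sketched as a pipeline builder. buildPagePipeline and the shape of the last argument are assumptions for illustration. The idea is to remember the timestamp and _id of the last post already shown and match only older posts, with _id as a tie-breaker for equal timestamps.

```javascript
// Build the paging stages for a "load more" request.
// last is undefined for the first page, otherwise { timestamp, _id } of the last post shown.
function buildPagePipeline(category, count, last) {
  const match = { category };
  if (last) {
    // Only posts strictly after the last seen one in (timestamp desc, _id desc) order
    match.$or = [
      { timestamp: { $lt: last.timestamp } },
      { timestamp: last.timestamp, _id: { $lt: last._id } },
    ];
  }
  return [
    { $match: match },
    { $sort: { timestamp: -1, _id: -1 } },
    { $limit: count },
  ];
}

const examplePipeline = buildPagePipeline('news', 10, {
  timestamp: new Date('2018-05-13T00:00:00Z'),
  _id: 'abc', // an ObjectId in real use; a string here to stay self-contained
});
console.log(JSON.stringify(examplePipeline, null, 2));
```

Unlike $skip, the server does not have to walk past every earlier result, so the cost of fetching a page does not grow with the page number.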

Regarding "node.js - How can $facet improve $lookup performance", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50317379/
