django - 检索邻居的查询太慢

我有一组表示有向图的 Django ORM 模型，我正在尝试检索给定顶点的所有相邻顶点，忽略边方向:

class Vertex(models.Model):
    pass

class Edge(models.Model):
    orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True)
    dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True)
    # ... other data about this edge ...

查询 Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() 返回正确的结果，但在我的例子中它花费的时间太长了执行。

对于我的应用程序，在任何给定时间通常会有大约 50-100 个顶点和大约一百万条边。即使将其减少到只有 20 个顶点和 100000 条边，该查询也需要大约一分半钟才能执行:

for i in range(20):
    Vertex().save()

vxs = list(Vertex.objects.all())
for i in tqdm.tqdm(range(100000)):
    Edge(orig = random.sample(vxs,1)[0], dest = random.sample(vxs,1)[0]).save()

v = vxs[0]
def f1():
    return list( Vertex.objects.filter(
        Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() )

t1 = timeit.Timer(f1)

print( t1.timeit(number=1) ) # 84.21138522100227

另一方面，如果我将查询分成两部分，我可以在几毫秒内得到完全相同的结果:

def f2():
    q1 = Vertex.objects.filter(edge_orig__dest=v).distinct()
    q2 = Vertex.objects.filter(edge_dest__orig=v).distinct()
    return list( {x for x in itertools.chain(q1, q2)} )

t2 = timeit.Timer(f2)
print( t2.timeit(number=100)/100 ) # 0.0109818680600074

虽然第二个版本有问题:

它不是原子的。边缘列表几乎可以保证在我的应用程序中的两个查询之间发生变化，这意味着结果不会代表单个时间点。
如果不手动循环，我无法对结果执行额外的处理和聚合。 (例如，如果我想要 Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field')))

为什么第二个版本比第一个版本运行得快那么多？我该怎么做，有没有办法让第一个运行得足够快以供实际使用？

我目前正在使用 Python 3.5.2、PostgreSQL 9.5.6 和 Django 1.11 进行测试，尽管如果这是其中之一的问题，我不会坚持使用它们。

这里是每个查询生成的SQL，以及Postgres的explan:

第一个:

Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v))

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061 
       OR T4."orig_id" = 1061)

HashAggregate  (cost=8275151.47..8275151.67 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Left Join  (cost=3183.45..8154147.45 rows=48401608 width=4)
        Hash Cond: (app_vertex.id = app_edge.orig_id)
        Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
        ->  Hash Right Join  (cost=1.45..2917.45 rows=100000 width=8)
              Hash Cond: (t4.dest_id = app_vertex.id)
              ->  Seq Scan on app_edge t4  (cost=0.00..1541.00 rows=100000 width=8)
              ->  Hash  (cost=1.20..1.20 rows=20 width=4)
                    ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)
        ->  Hash  (cost=1541.00..1541.00 rows=100000 width=8)
              ->  Seq Scan on app_edge  (cost=0.00..1541.00 rows=100000 width=8)

第二个:

Vertex.objects.filter(edge_orig__dest=v).distinct()

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
WHERE "app_edge"."dest_id" = 1061

HashAggregate  (cost=1531.42..1531.62 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Join  (cost=848.11..1519.04 rows=4950 width=4)
        Hash Cond: (app_edge.orig_id = app_vertex.id)
        ->  Bitmap Heap Scan on app_edge  (cost=846.65..1449.53 rows=4950 width=4)
              Recheck Cond: (dest_id = 1061)
              ->  Bitmap Index Scan on app_edge_dest_id_4254b90f  (cost=0.00..845.42 rows=4950 width=0)
                    Index Cond: (dest_id = 1061)
        ->  Hash  (cost=1.20..1.20 rows=20 width=4)
              ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)

@khampson 的版本也需要一分半钟才能运行，所以也不行。

Vertex.objects.raw( ... )

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
       OR T4."orig_id" = 1061);

HashAggregate  (cost=8275347.47..8275347.67 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Join  (cost=3183.45..8154343.45 rows=48401608 width=4)
        Hash Cond: (app_vertex.id = app_edge.orig_id)
        Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
        ->  Hash Join  (cost=1.45..2917.45 rows=100000 width=12)
              Hash Cond: (t4.dest_id = app_vertex.id)
              ->  Seq Scan on app_edge t4  (cost=0.00..1541.00 rows=100000 width=8)
              ->  Hash  (cost=1.20..1.20 rows=20 width=4)
                    ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)
        ->  Hash  (cost=1541.00..1541.00 rows=100000 width=8)
              ->  Seq Scan on app_edge  (cost=0.00..1541.00 rows=100000 width=8)

最佳答案

这两个查询的查询计划完全不同。第一个(较慢的)一个没有命中任何索引，并且正在执行两个 left join，这两者都会导致更多的行被处理和返回。从我对 Django ORM 语法意图的解释来看，听起来您并不真的想在这里执行 left join。

在这种情况下，我建议考虑从 Django ORM 中下降到原始 SQL，并将两者混合。例如如果你拿第一个，然后把它改成这样:

SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
       OR T4."orig_id" = 1061);

这里有两个问题:该版本的性能如何，它是否为您提供了您正在寻找的结果？

有关原始查询的更多信息，请查看 this section Django 文档。

对 OP 评论的回应:

我建议的查询的查询计划还显示它没有命中任何索引。

对于所涉及的列，您是否在两个表上都有索引？我怀疑不是，特别是因为对于这个特定的查询，我们正在寻找一个单一的值，这意味着如果有索引，如果查询规划器确定顺序扫描是更好的选择，我会感到非常惊讶(OTOH，如果你是寻找范围广泛的行，比如说，表中超过 10% 的行，查询规划器可能会正确地做出这样的决定。

关于django - 检索邻居的查询太慢，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/44960197/

django - 检索邻居的查询太慢

上一篇：sql - 在 PostgreSQL 中计算累进定价

下一篇：java - 使用 jOOQ 更新 Postgres 日期范围