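I have a column, score, which holds an integer between 1 and 5 (inclusive).
I am trying to select n samples (2000 in this case) for each score.
My own hacking and other SO questions led me to construct the following query: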
select * from (select text, score from data where score= 1 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 2 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 3 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 4 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 5 and LENGTH(text) > 45 limit 2000)
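This feels like the worst possible way to do it. More importantly, when I run each query separately it gives me 2k results, but when I run this union I get fewer than 10k rows.
I am looking for help optimizing this query, but above all I want to understand why the union returns the wrong number of results.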
Best answer
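As for why the query returns the wrong number of results, I would bet that the rows in the result sets returned by the individual queries are not all distinct. When you use union, it returns only the distinct rows across the entire combined result set, so duplicate (text, score) pairs are collapsed into one.
Try changing it to union all: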
select * from (select text, score from data where score= 1 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 2 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 3 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 4 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 5 and LENGTH(text) > 45 limit 2000)
Here's a condensed demo showing the difference.
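A minimal sketch of that difference, assuming SQLite (through Python's built-in sqlite3 module) and a tiny made-up data table containing one duplicated (text, score) pair. The sample rows and the limit of 2 are illustrative assumptions, not the asker's data:

```python
import sqlite3

# In-memory database with the same column names as the question (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("create table data (id integer primary key, text text, score integer)")

# Hypothetical rows: score 1 has two identical (text, score) pairs.
rows = [
    ("a" * 50, 1),
    ("a" * 50, 1),  # exact duplicate of the row above
    ("b" * 50, 2),
]
conn.executemany("insert into data (text, score) values (?, ?)", rows)

# Same shape as the query in the question, with the limit shrunk to 2.
query = """
select * from (select text, score from data where score = 1 and length(text) > 45 limit 2)
{op}
select * from (select text, score from data where score = 2 and length(text) > 45 limit 2)
"""

union_count = len(conn.execute(query.format(op="union")).fetchall())
union_all_count = len(conn.execute(query.format(op="union all")).fetchall())

# union drops the duplicate row; union all keeps every row.
print(union_count, union_all_count)
```

With these rows, union yields 2 rows because the duplicate is removed, while union all yields all 3.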
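If you have a primary key (such as an auto-increment column), here is another approach, which generates a row_number within each score group (this assumes an id primary key):
select text, score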
from (
select text, score,
(select count(*) from data b
where a.id >= b.id and
a.score = b.score and
length(b.text) > 45) rn
from data a
where length(text) > 45
) t
where rn <= 2000
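The row-number query above can be tried on a small scale as follows. This is only a sketch, assuming SQLite through Python's sqlite3 module, a made-up data table, and n = 2 instead of 2000; the sample rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table data (id integer primary key, text text, score integer)")

# Hypothetical rows: three for score 1, four for score 2, one for score 3.
rows = [(f"{i}-" + "x" * 50, s) for i, s in enumerate((1, 1, 1, 2, 2, 2, 2, 3))]
rows.append(("short", 1))  # filtered out by length(text) > 45
conn.executemany("insert into data (text, score) values (?, ?)", rows)

n = 2  # stand-in for 2000 in the original query

# Same query as above: the correlated subquery counts qualifying rows with the
# same score and an id no larger than the current row's id, giving a
# per-score row number.
sampled = conn.execute(
    """
    select text, score
    from (
      select text, score,
             (select count(*) from data b
              where a.id >= b.id and
                    a.score = b.score and
                    length(b.text) > 45) rn
      from data a
      where length(text) > 45
    ) t
    where rn <= ?
    """,
    (n,),
).fetchall()

# Tally how many rows came back for each score.
counts = {}
for _, score in sampled:
    counts[score] = counts.get(score, 0) + 1
print(counts)
```

Each score contributes at most n rows, and a score with fewer qualifying rows contributes all of them. Note that this keeps the n rows with the lowest id per score rather than a random sample, and that the correlated subquery is re-evaluated for every row, so it may be slow on a large table.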
A similar question on this topic (sql - select n samples from each category) can be found on Stack Overflow: https://stackoverflow.com/questions/50803974/