SQL random sample with groups

I have a database of university graduates and would like to extract a random sample of about 1,000 records.

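I want the sample to be representative of the population, so it should contain each class in the same proportion as the full table, for example:

[Image: example class proportions; not reproduced here]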

I can do this with the following:

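-- T-SQL allows ORDER BY only at the end of a UNION, so each TOP ... ORDER BY sits in its own derived table.
select id from (select top 500 id from degree where coursecode = 1 order by newid()) as c1
union all
select id from (select top 300 id from degree where coursecode = 2 order by newid()) as c2
union all
select id from (select top 200 id from degree where coursecode = 3 order by newid()) as c3;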

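But we have hundreds of class codes, so this would be very time consuming. I would also like to reuse the code for different sample sizes, and I would rather not go through the queries hard-coding each sample size.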

Any help would be greatly appreciated.

Best answer

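You want a stratified sample. I would recommend doing this by sorting the data by course code and taking every nth row. Here is one method, which works best when the population is large: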

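select d.*
from (select d.*,
             -- newid() must be called as a function; it randomizes the order within each coursecode
             row_number() over (order by coursecode, newid()) as seqnum,
             count(*) over () as cnt
      from degree d
     ) d
-- Keep every (cnt / 500)-th row, for a sample of roughly 500 rows.
-- cnt / 500 is integer division, so this assumes cnt >= 1000.
where seqnum % (cnt / 500) = 1;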
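
To see why taking every nth row of a list sorted by course code preserves the class proportions, here is a minimal Python sketch of the same logic outside the database. The function name `systematic_sample`, the seeded generator, and the 50% / 30% / 20% population are assumptions made up for illustration; they are not part of the original answer. The sketch compares against `1 % step` rather than `1` so that a step of 1 keeps every row, an edge case the T-SQL condition does not cover.

```python
import random
from collections import Counter

def systematic_sample(rows, target, key, rng):
    """Sort by stratum (random order within each stratum), then keep every step-th row."""
    # Integer division, like cnt / 500 in the T-SQL query; never let the step drop to 0.
    step = max(len(rows) // target, 1)
    # The random tie-breaker plays the role of newid() in the ORDER BY.
    ordered = sorted(rows, key=lambda r: (key(r), rng.random()))
    # seqnum is 1-based, like row_number(); this keeps rows 1, 1 + step, 1 + 2 * step, ...
    return [r for seqnum, r in enumerate(ordered, start=1)
            if seqnum % step == 1 % step]

rng = random.Random(0)
# Hypothetical population: 10,000 (id, coursecode) rows split 50% / 30% / 20%.
population = ([(i, 1) for i in range(5000)]
              + [(i, 2) for i in range(5000, 8000)]
              + [(i, 3) for i in range(8000, 10000)])

sample = systematic_sample(population, 500, key=lambda r: r[1], rng=rng)
counts = Counter(code for _, code in sample)
print(len(sample), dict(counts))
```

Because each class occupies a contiguous run of the sorted list, a class holding 30% of the rows also receives about 30% of the evenly spaced picks.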

Edit:

You can also calculate the population size of each group "on the fly":

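select d.*
from (select d.*,
             -- newid() must be called as a function; rows are numbered in random order within each coursecode
             row_number() over (partition by coursecode order by newid()) as seqnum,
             count(*) over () as cnt,
             count(*) over (partition by coursecode) as cc_cnt
      from degree d
     ) d
-- Each coursecode keeps its share of the 500-row target.
-- <= is used so that a group whose share is a whole number gets exactly that many rows.
where seqnum <= 500 * (cc_cnt * 1.0 / cnt);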
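
The per-group quota can be checked the same way. The following Python sketch is an illustration under assumptions, not the answer's code: the name `proportional_sample` and the test population are hypothetical, and the target is a parameter so the same function can be reused for different sample sizes, which is what the question asks for. It keeps the first `floor(quota)` rows of each shuffled group, which corresponds to an inclusive comparison of the row number against the quota.

```python
import random
from collections import Counter, defaultdict

def proportional_sample(rows, target, key, rng):
    """Take from each group a number of random rows proportional to the group's size."""
    groups = defaultdict(list)
    for r in rows:
        groups[key(r)].append(r)
    total = len(rows)                       # count(*) over ()
    picked = []
    for members in groups.values():
        # target * (cc_cnt * 1.0 / cnt): this group's share of the target size.
        quota = target * len(members) / total
        rng.shuffle(members)                # random order within the group, like newid()
        # Keeping rows whose 1-based position is <= quota means the first floor(quota) rows.
        picked.extend(members[:int(quota)])
    return picked

rng = random.Random(0)
# Hypothetical population: 10,000 (id, coursecode) rows split 50% / 30% / 20%.
population = ([(i, 1) for i in range(5000)]
              + [(i, 2) for i in range(5000, 8000)]
              + [(i, 3) for i in range(8000, 10000)])

picked = proportional_sample(population, 1000, key=lambda r: r[1], rng=rng)
picked_counts = Counter(code for _, code in picked)
print(len(picked), dict(picked_counts))
```

Quotas are rounded down, so with hundreds of small classes the total can come out somewhat below the target, and a class whose quota is below 1 contributes no rows.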

Source: a similar question about SQL random sampling with groups on Stack Overflow: https://stackoverflow.com/questions/30235542/
