google-bigquery - 如何计算bigquery数组字段中元素的频率

标签 google-bigquery bigquery-standard-sql

我有一张看起来像这样的表:

enter image description here

我正在寻找一个表格,它给出了字段中元素的频率计数 l_0, l_1, l_2, l_3 .

例如,输出应如下所示:

| author_id  | year | l_o.name         | l_0.count| l1.name    | l1.count | l2.name             | l2.count| l3.name            | l3.count|
| 2164089123 | 1987 | biology          | 3        | botany     | 3        |                     |         |                    |         |
| 2595831531 | 1987 | computer science | 2        | simulation | 2        | computer simulation | 2       | mathematical model | 2       |

编辑:

在某些情况下,数组字段可能有不止一种类型的元素。例如 l_0可能是 ['biology', 'biology', 'geometry', 'geometry'] .在这种情况下,字段 l_0, l_1, l_2, l_3 的输出将是一个嵌套的重复字段,其中包含 l_0.name 中的所有元素以及 l_0.count 中的所有相应计数.

最佳答案

这应该有效,假设您想以每个数组为基础:

SELECT
  author_id,
  year,
  (SELECT AS STRUCT ANY_VALUE(l_0) AS name, COUNT(*) AS count
   FROM UNNEST(l_0) AS l_0) AS l_0,
  (SELECT AS STRUCT ANY_VALUE(l_1) AS name, COUNT(*) AS count
   FROM UNNEST(l_1) AS l_1) AS l_1,
  (SELECT AS STRUCT ANY_VALUE(l_2) AS name, COUNT(*) AS count
   FROM UNNEST(l_2) AS l_2) AS l_2,
  (SELECT AS STRUCT ANY_VALUE(l_3) AS name, COUNT(*) AS count
   FROM UNNEST(l_3) AS l_3) AS l_3
FROM YourTable;

为了避免如此多的重复,您可以使用 SQL UDF:
CREATE TEMP FUNCTION GetNameAndCount(elements ARRAY<STRING>) AS (
  (SELECT AS STRUCT ANY_VALUE(elem) AS name, COUNT(*) AS count
   FROM UNNEST(elements) AS elem)
);

SELECT
  author_id,
  year,
  GetNameAndCount(l_0) AS l_0,
  GetNameAndCount(l_1) AS l_1,
  GetNameAndCount(l_2) AS l_2,
  GetNameAndCount(l_3) AS l_3
FROM YourTable;

如果您可能需要考虑数组中的多个不同名称,您可以让 UDF 返回一个包含相关计数的数组:
CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
  ARRAY(
    SELECT AS STRUCT elem AS name, COUNT(*) AS count
    FROM UNNEST(elements) AS elem
    GROUP BY elem
    ORDER BY count
  )
);

SELECT
  author_id,
  year,
  GetNamesAndCounts(l_0) AS l_0,
  GetNamesAndCounts(l_1) AS l_1,
  GetNamesAndCounts(l_2) AS l_2,
  GetNamesAndCounts(l_3) AS l_3
FROM YourTable;

请注意,如果要执行跨行计数,则需要取消嵌套数组并执行 GROUP BY在外部层面,但看起来这不是您基于问题的意图。

关于google-bigquery - 如何计算bigquery数组字段中元素的频率,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48411331/

相关文章:

python - 处理使用 docker 读取和写入数据到 google big query 的 python 脚本的自动身份验证

google-bigquery - 大查询分析函数提高查询性能

sql - 如何在 Bigquery 中使用 except 子句?

sql - 如何在标准 SQL 中解码 BASE64?

google-bigquery - 当字段包含 @sign BigQuery 标准 SQL 时,如何使用函数从 Google BigQuery 中提取 json

sql - Google Big Query SQL - 获取最新的列值

sql - Big Query - 将数组/json 对象转置为列

google-bigquery - 无法在 SELECT 中选择 _PARTIONTIME,但可以在 INSERT 中选择它

sql - 使用 BigQuery 移动(滚动)中位数

sql - 大查询 : Union two different tables which are based on federated Google Spreadsheet