regex - BigQuery standard SQL dialect: REGEXP_REPLACE input type

Tags: regex google-bigquery gdelt

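I am following this tutorial to explore what Google BigQuery can do with the GDELT database, but its SQL is written in the "legacy" dialect and I would like to use the standard dialect instead.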

In the legacy dialect:

SELECT
  theme,
  COUNT(*) AS count
FROM (
  SELECT
    REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*',"") theme
  FROM [gdelt-bq:gdeltv2.gkg]
  WHERE DATE > 20150302000000 AND DATE < 20150304000000 AND V2Persons LIKE '%Netanyahu%'
)
GROUP BY theme
ORDER BY 2 DESC
LIMIT 300

When I try to translate it to the standard dialect:

SELECT
  theme,
  COUNT(*) AS count
FROM (
  SELECT
    REGEXP_REPLACE(SPLIT(V2Themes,';') , r',.*', " ") AS theme
    FROM
      `gdelt-bq.gdeltv2.gkg`
    WHERE
      DATE>20150302000000
      AND DATE < 20150304000000
      AND V2Persons LIKE '%Netanyahu%' )
  GROUP BY
    theme
  ORDER BY
    2 DESC
  LIMIT
    300

It raises the following error:

No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:5]

It seems I have to convert the result of the SPLIT() operation to a string. How do I do that?

Update: I found a talk that explains the UNNEST operation:

SELECT
  COUNT(*),
  REGEXP_REPLACE(themes,",.*","") AS theme
FROM
  `gdelt-bq.gdeltv2.gkg_partitioned`,
  UNNEST( SPLIT(V2Themes,";") ) AS themes
WHERE
  _PARTITIONTIME >= "2018-08-09 00:00:00"
  AND _PARTITIONTIME < "2018-08-10 00:00:00"
  AND V2Persons LIKE '%Netanyahu%'
GROUP BY
  theme
ORDER BY
  2 DESC
LIMIT
  100

Best answer

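SPLIT() returns an ARRAY<STRING>, while REGEXP_REPLACE accepts only a STRING (or BYTES), so flatten the array first: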

SELECT
  REGEXP_REPLACE(theme, r',.*', " ") AS theme,
  COUNT(*) AS count
FROM
  `gdelt-bq.gdeltv2.gkg`,
  UNNEST(SPLIT(V2Themes, ';')) AS theme
WHERE
  DATE > 20150302000000
  AND DATE < 20150304000000
  AND V2Persons LIKE '%Netanyahu%'
GROUP BY
  theme
ORDER BY
  2 DESC
LIMIT
  300

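The legacy SQL query in your question actually has the effect of flattening the array as well, although there it happens implicitly through the GROUP BY on theme.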
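
To try the flattening step without scanning the GDELT table, here is a minimal sketch in standard SQL that runs against inline data. The `sample` rows, the `raw_theme` alias, and the `theme_count` alias are assumptions made up for illustration; only `V2Themes`, `SPLIT`, `UNNEST`, and `REGEXP_REPLACE` come from the queries above. It replaces the match with an empty string, as the legacy query does, rather than with a single space.

```sql
-- Hypothetical inline rows standing in for the V2Themes column (illustrative values only).
WITH sample AS (
  SELECT 'TAX_FNCACT,120;LEADER,45;TAX_FNCACT,300' AS V2Themes
  UNION ALL
  SELECT 'LEADER,10;ELECTION,88'
)
SELECT
  -- raw_theme is a single STRING element, so it matches the
  -- REGEXP_REPLACE(STRING, STRING, STRING) signature.
  REGEXP_REPLACE(raw_theme, r',.*', '') AS theme,
  COUNT(*) AS theme_count
FROM
  sample,
  -- The comma is an implicit CROSS JOIN: one output row per array element.
  UNNEST(SPLIT(V2Themes, ';')) AS raw_theme
GROUP BY
  theme
ORDER BY
  theme_count DESC,
  theme
```

Giving the UNNEST alias (`raw_theme`) a different name from the output column (`theme`) leaves no doubt about which one GROUP BY refers to.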

Source: a similar question on Stack Overflow, "Bigquery standard dialect REGEXP_REPLACE input type": https://stackoverflow.com/questions/51768648/