sql - Need to format a table in BigQuery for specific words within a body of text

Tags: sql regex google-bigquery

I am using Google BigQuery to scrape a database of reddit comments. I'll start with the query I am working on:

SELECT
  DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
  subreddit,
  author AS comment_author,
  ups AS upvotes,
  LOWER(body)
FROM
  [fh-bigquery:reddit_comments.2015_01]
WHERE
  body CONTAINS 'acid'
  OR body CONTAINS 'ecstasy'
  OR body CONTAINS 'fire'
  OR body CONTAINS 'heroin'
LIMIT 10;

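I need to scrape the reddit database for a list of about 30 drug-related words (the query above shows only a few of them for brevity).

I am having trouble with two things:

  1. I want to query the database correctly, but many of the returned results do not meet the criteria, that is, they do not contain any of the words being matched.
  2. I want to create a column that shows the specific word that matched, so that if a comment matches the word "drug", that word appears in a "word_matched" column alongside the body, author, date, and so on.

I have also tried matching the words with a regular expression, but that did not help either: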

  WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))

Any help would be greatly appreciated. Thanks, all!

Best answer

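The query below addresses both points of the question:
1. Return only comments in which a term matches as a whole word, not as part of another, different word. This is easy to do with the REGEXP_MATCH function and \b word boundaries.
2. Have a column with all matched words. (I think having all matched words makes more sense than having only one, as the question asks.)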

SELECT
    [date],
    subreddit,
    comment_author,
    upvotes,
    GROUP_CONCAT(word) AS matches, 
    body
FROM (
  SELECT 
    [date],
    subreddit,
    comment_author,
    upvotes,
    body,
    word
  FROM (
    SELECT
      DATE(SEC_TO_TIMESTAMP(created_utc)) AS [date],
      subreddit,
      author AS comment_author,
      ups AS upvotes,
      LOWER(body) AS body
    FROM
      [fh-bigquery:reddit_comments.2015_01]
    WHERE REGEXP_MATCH(LOWER(body), r'\b(drug|ecstasy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)\b')
  ) x 
  CROSS JOIN (
    SELECT SPLIT(list,'|') AS word FROM 
    (SELECT 'drug|ecstasy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers' AS list)
  ) y
  HAVING body CONTAINS word
)
GROUP BY [date], subreddit, comment_author, upvotes, body
LIMIT 1000

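The solution above produces a best-effort list of matched words, so note:
If the matches column contains a single word, that word is definitely an exact (whole-word) match.
If the column contains several words, at least one of them is an exact match, but the others may only be substring matches.
I think that for lengthy bodies the extra words are still valuable, at least as hints about what to look for. For example: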

drug,meth,heroin,alcohol,benzos it also inhibits the reuptake of serotonin and norepinephrine which gives a hell of a lot worse withdrawal symptoms than most other drugs(incl. heroin, meth, coke and etc.). from what i have heard the only things that rival tramadol it terms of withdrawal are benzos and alcohol.
liquor,beer,alcohol,booze       1. reinforce #3 - it is not cheap to live here. not by any stretch. expect to pay more than the rest of the country pays for everything. even franchises that operate nation-wide have special wa/perth pricing. 2. petrol has literally just dropped to $1 this past month, i wouldn't go as far as quoting that as our average price just yet. average is still between $1.20-1.30. 3. parking is free at beaches & parks, do not expect to get free parking anywhere in the city though. if you're using public parking in the city all day, expect to pay $50 unless you get in early. 4. forget bribing the cops, don't even call them "mate". last time i was pulled over (last week, random stop) i said "evening mate" as i was handing him my license and was responded with "don't call me mate, i'm not your friend, i don't know you". 5. unlike the rest of the world, regular stores do not sell alcohol here. liquor stores only, don't expect to buy beer from a gas station or grocery store. 6. rent is expensive, food is expensive, booze is expensive, being alive is expensive.
drug,meth,heroin,beer           that's simply not true. first there's a difference between legalization and decriminalization. second, some european countries have places to go to safely use drugs. there is middle ground between allowing heroin to be sold all over town and having users go to prison. heroin, meth and some other drugs are not good things for society and their use should encouraged by making it as easy to buy as a 6 pack of beer. i'm not really sure why you can't see a middle ground because it's clearly not as black and white as you say. you can go after the dealers while leaving the users alone.
drug,fire,joint,smoke           not a story about a rave, but still relevant i think: i was working a job called "fire watch," which is just what it sounds like, at a nine inch nails concert a few years ago. our comrades, the security workers, were far from seasoned professionals. they were mostly college temps with a yellow security tee shirt and a flashlight; they didn't even have radios. the job is basically to make sure people don't go into restricted areas. ...but this one boy scout took it upon himself to tame the metal masses. mid-concert, he pulled me close and shouted "they're smoking pot!" i shrugged, and shot him an "and?" look. i guess he thought i should care because technically a joint is a tiny dangerous drug fire, and i was on the fire crew. he then proceeded to disappear into the crowd, shoving people out of the way on his heroic journey toward the countless smoke puff origins. the next time i saw him he was bleeding out of his face and getting a flashlight in the eyes from an onsite emt. i guess it's pretty harsh to say that he deserved the beating, but it's hard to argue that he didn't go asking for it. i guess the moral of my story is that security people are just people, and some people's shittyness is inflamed when combined with authority. it sounds like your event just happened to be warded by a gaggle of douches, probably being captained by king fuckwad who really wanted to be a cop, but couldn't pass the exams.
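
To see why some words in the matches column can be partial hits only, here is a minimal Python sketch (not BigQuery code) that contrasts the two checks the query combines: a plain substring test, which behaves like `body CONTAINS word`, and a whole-word test, which behaves like the `\b(...)\b` filter. The names `TERMS`, `substring_matches`, and `whole_word_matches` are hypothetical, and the term list is shortened for illustration.

```python
import re

# Shortened, illustrative term list; the query above uses the full list.
TERMS = ["drug", "heroin", "meth", "beer", "fire"]


def substring_matches(body, terms=TERMS):
    # Substring test: a term counts even when it is part of a longer word.
    return [t for t in terms if t in body]


def whole_word_matches(body, terms=TERMS):
    # Whole-word test: the term must be delimited by word boundaries.
    return [t for t in terms
            if re.search(r"\b" + re.escape(t) + r"\b", body)]


# Excerpt from the first sample row above (already lowercased).
body = ("worse withdrawal symptoms than most other drugs"
        "(incl. heroin, meth, coke and etc.)")

print(substring_matches(body))   # ['drug', 'heroin', 'meth']
print(whole_word_matches(body))  # ['heroin', 'meth']
```

Here "drug" is reported by the substring test only because it occurs inside "drugs", which is the kind of non-exact entry the caveat describes.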

Note: if you need only the list of exact matches, that is still relatively easy to do with BigQuery User-Defined Functions.
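
The answer does not show such a function, so the following is only a sketch of the per-row logic one might compute, written in Python for illustration rather than as an actual BigQuery UDF. The names `TERMS`, `PATTERN`, and `exact_matches` are assumptions, and the term list is shortened.

```python
import re

# Shortened, illustrative term list (an assumption, not the full list).
TERMS = ["drug", "heroin", "meth", "alcohol", "benzos"]

# One alternation wrapped in word boundaries, mirroring the query's filter.
PATTERN = re.compile(r"\b(" + "|".join(re.escape(t) for t in TERMS) + r")\b")


def exact_matches(body):
    # Collect whole-word hits in order of first appearance, without repeats.
    seen = []
    for word in PATTERN.findall(body.lower()):
        if word not in seen:
            seen.append(word)
    return ",".join(seen)


print(exact_matches("Heroin, meth and some other drugs; heroin again"))
# heroin,meth
```

Because only whole-word hits are collected, "drugs" does not contribute "drug" to the result, unlike in the best-effort matches column.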

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/34933379/
