sql - How to generate all n-grams in Hive

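I want to create a list of n-grams using HiveQL. My idea was to use a regular expression with a lookahead together with the split function, but this does not work: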

select split('This is my sentence', '(\\S+) +(?=(\\S+))');

The input is a column of the form:

|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output should be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

Hive has a built-in n-gram UDF, but that function directly computes the frequencies of the n-grams. I want a list of all n-grams instead.
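
For context, the built-in function is typically called roughly as in the sketch below. The table name my_table is hypothetical; the call aggregates over the whole table and returns the top-k bigrams with estimated frequencies (an array of structs), not one list per input row:

```sql
-- my_table is a hypothetical table name; sentence is the column shown above.
-- sentences() tokenizes each string into an array of word arrays,
-- ngrams(..., 2, 10) then estimates the 10 most frequent bigrams overall.
select ngrams(sentences(sentence), 2, 10) as top_bigrams
  from my_table;
```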

Thanks a lot in advance!

Best answer

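This may not be the most optimized solution, but it works. Split the sentence by a delimiter (in my example, one or more spaces or commas), then explode the words with their positions and self-join each word to the one that follows it to get the n-grams. Finally, assemble the array of n-grams using collect_set (if you need unique n-grams) or collect_list: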

with src as 
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence', 
                     'This is another sentence'
                     ) as sentence
      ) source_data 
        --split and explode words
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams 
      from src s1 
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos              
group by s1.sentence;

Result:

OK
This is another sentence        ["This is","is another","another sentence"]
This is my sentence             ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
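
As a variation on the query above, the self-join can probably be replaced by the lead() window function, which reads the next word within the same sentence. This is only a sketch, not part of the original answer: the pairs CTE and the next_word alias are names introduced here for illustration, and it assumes the sentences are distinct, since words are partitioned by sentence text.

```sql
-- Sketch: same sample data as above, with lead() instead of a self-join.
with src as
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence',
                     'This is another sentence'
                     ) as sentence
      ) source_data
        --split and explode words, keeping each word's position
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
),
pairs as
(
--pair every word with its successor in the same sentence
select sentence, pos, word,
       lead(word) over (partition by sentence order by pos) as next_word
  from src
)
select sentence, collect_list(concat_ws(' ', word, next_word)) as ngrams
  from pairs
 where next_word is not null --the last word of a sentence has no successor
group by sentence;
```

Neither collect_set nor collect_list guarantees the order of the elements in the resulting array, so do not rely on the n-grams appearing in sentence order. For trigrams, lead(word, 2) gives the word two positions ahead in the same way.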

Regarding "sql - How to generate all n-grams in Hive", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52782188/
