lucene - Word co-occurrence - finding co-occurrences of a term in a set of n-grams

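How would I go about writing a co-occurrence class in something like Java that takes a file full of n-grams and calculates word co-occurrences for a given input term?

Are there any libraries or packages that work with Lucene (an index), or with something like map-reduce over a list of n-grams in Hadoop?

Thanks.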

Best answer

OK, assuming you want to find the co-occurrences of two different words within a file of n-grams...

Here is some pseudocode-ish Java:

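import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Co-occurrence matrix: word -> (other word -> count)
Map<String, Map<String, Integer>> map = new HashMap<>();

// List of n-grams
List<List<String>> ngrams = ..... // assume we've loaded them into here already

// Build the matrix
for (List<String> ngram : ngrams) {
  // Calculate word co-occurrence within this n-gram for all words.
  // The result maps word -> (other word -> count), with the two words
  // of each pair kept in alphabetical order.
  Map<String, Map<String, Integer>> wordCooccurrence = cooccurrence(ngram); // assume we have this

  // Then just merge this into the overall matrix
  for (Map.Entry<String, Map<String, Integer>> row : wordCooccurrence.entrySet()) {
    Map<String, Integer> totals = map.computeIfAbsent(row.getKey(), k -> new HashMap<>());
    row.getValue().forEach((other, count) -> totals.merge(other, count, Integer::sum));
  }
}

// And just query with the two words in alphabetical order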
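
The outline above leaves the `cooccurrence(ngram)` helper and the final lookup unspecified. One possible way to fill them in is sketched below. The class name `CooccurrenceSketch`, the method names, and the choice to count each unordered pair of distinct words once per n-gram are all assumptions made for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper class; all names here are illustrative.
class CooccurrenceSketch {

    // Assumption: each unordered pair of distinct words counts once per n-gram.
    // Pairs are stored as first -> second, with first before second alphabetically.
    static Map<String, Map<String, Integer>> cooccurrence(List<String> ngram) {
        Map<String, Map<String, Integer>> result = new HashMap<>();
        // TreeSet drops duplicate words and sorts the rest alphabetically
        List<String> words = new ArrayList<>(new TreeSet<>(ngram));
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                result.computeIfAbsent(words.get(i), k -> new HashMap<>())
                      .merge(words.get(j), 1, Integer::sum);
            }
        }
        return result;
    }

    // Sums the per-n-gram counts into one overall matrix.
    static Map<String, Map<String, Integer>> buildMatrix(List<List<String>> ngrams) {
        Map<String, Map<String, Integer>> matrix = new HashMap<>();
        for (List<String> ngram : ngrams) {
            cooccurrence(ngram).forEach((first, row) -> {
                Map<String, Integer> totals = matrix.computeIfAbsent(first, k -> new HashMap<>());
                row.forEach((second, count) -> totals.merge(second, count, Integer::sum));
            });
        }
        return matrix;
    }

    // Looks up a pair after putting the two words in alphabetical order.
    static int count(Map<String, Map<String, Integer>> matrix, String a, String b) {
        String first = a.compareTo(b) <= 0 ? a : b;
        String second = a.compareTo(b) <= 0 ? b : a;
        return matrix.getOrDefault(first, Map.of()).getOrDefault(second, 0);
    }

    public static void main(String[] args) {
        List<List<String>> ngrams = List.of(
            List.of("new", "york", "city"),
            List.of("york", "city", "hall"),
            List.of("new", "city", "hall"));
        Map<String, Map<String, Integer>> matrix = buildMatrix(ngrams);
        System.out.println(count(matrix, "york", "city")); // 2
        System.out.println(count(matrix, "hall", "new"));  // 1
    }
}
```

Storing each pair only under its alphabetically first word halves the size of the matrix, at the cost of having to sort the two query words before every lookup.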

Pig would probably be a nice fit for this kind of counting, but you are likely more familiar with it than I am.
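
No Pig script is given here, but the shape of such a job (emit words, group them, sum the counts) can be imitated in memory with plain Java streams. The sketch below is not Pig or Hadoop code. It assumes a hypothetical input format of one n-gram per line with whitespace-separated words, and the class and method names are made up for illustration. It counts, for a given term, how many n-grams each other word shares with it.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class; an in-memory stand-in for a group-and-count job.
class TermCooccurrenceSketch {

    // Assumed format: one n-gram per line, words separated by whitespace.
    static Map<String, Long> countFor(String term, Stream<String> lines) {
        return lines
            // "map" step: split each line into its words
            .map(line -> Arrays.asList(line.trim().split("\\s+")))
            // keep only the n-grams that contain the query term
            .filter(words -> words.contains(term))
            // emit each distinct word of those n-grams once
            .flatMap(words -> words.stream().distinct())
            .filter(word -> !word.equals(term))
            // "reduce" step: group by word and count occurrences
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of("new york city", "york city hall", "new city hall");
        System.out.println(countFor("york", lines));
    }
}
```

For a real file, `java.nio.file.Files.lines(path)` could supply the stream of lines instead of the hard-coded sample.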

Regarding "lucene - Word co-occurrence - finding co-occurrences of a term in a set of n-grams", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6510338/
