Ruby & MongoDB with an inverted index produce some interesting results

For my program, I am building an inverted index from Twitter feed data. However, some interesting problems show up when the tweets are parsed and written to MongoDB.

A typical entry should look like this:

{"ax"=>1, "easyjet"=>1, "from"=>2}

However, after parsing some tweets, they end up in the database looking like this:

{""=>{""=>{""=>{""=>{""=>{"giants"=>{"dhem"=>1, "giants"=>1, "giantss"=>1}}}}

These are the lines that split the tweet and increment the values in the database:

def pull_hash_tags(tweet, lang)
    hash_tags = tweet.split.find_all { |word| /^#.+/.match word }
    t = tweet.gsub(/https?:\/\/[\S]+/,"") # removing urls
    t = t.gsub(/#\w+/,"") # removing hash tags
    t = t.gsub(/[^0-9a-z ]/i, '') # removing non-alphanumerics and keeping spaces
    t = t.gsub(/\r/," ")
    t = t.gsub(/\n/," ")
    hash_tags.each { |tag| add_to_hash(lang, tag, t) }
end

def add_to_hash(lang, tag, t)
    t.gsub(/\W+/, ' ').split.each do |word|
        @db.collection.update({"_id" => lang}, {"$inc" => {"#{tag}.#{word}" => 1}}, { :upsert => true })
    end
end

I am trying to get plain words (alphanumeric characters only), with no double spaces, no carriage returns, and so on.
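The goal described above can be sketched in plain Ruby, without touching the database. This is only an illustration: plain_words is a hypothetical helper name that does not appear in the original post. It replaces line breaks with spaces before stripping punctuation, so that words on adjacent lines are not glued together.

```ruby
# Hypothetical helper (not part of the original post): turn a tweet into
# a list of plain alphanumeric words.
def plain_words(tweet)
  tweet
    .gsub(%r{https?://\S+}, ' ')  # drop URLs
    .gsub(/#\w+/, ' ')            # drop hash tags
    .gsub(/[\r\n]+/, ' ')         # turn line breaks into spaces first
    .gsub(/[^0-9a-z ]/i, '')      # keep only alphanumerics and spaces
    .split                        # split on whitespace runs, no empty strings
end
```

Because String#split with no arguments splits on runs of whitespace, double spaces never produce empty words in the result.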

Best answer

You should add t.strip!, since it looks like the problem may be caused by leading/trailing whitespace.
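A minimal sketch of that suggestion follows, together with one extra guard that is an assumption and not confirmed by the original post. MongoDB treats a dot in an update field name as a path separator, and the /^#.+/ pattern in the question accepts any characters after the #, so a tag such as "#.....giants" would be read as a path with several empty segments. Since String#split with no arguments already skips leading and trailing whitespace, stripping alone may not be enough in that case. The names safe_key and inc_paths are hypothetical; the sketch only builds the "tag.word" strings that would be passed to "$inc" and makes no database calls.

```ruby
# Hypothetical helper: reduce a raw hash tag to a key with no dots, no
# whitespace, and no leading '#'. (Dropping the '#' is a choice made for
# this sketch; the original code keeps it.)
def safe_key(raw)
  raw.strip.gsub(/[^0-9a-z_]/i, '')
end

# Hypothetical helper: build the dotted "$inc" field names for one tag.
# Returns an empty list when nothing usable is left of the tag.
def inc_paths(tag, text)
  key = safe_key(tag)
  return [] if key.empty?

  text.strip.split.map { |word| "#{key}.#{word}" }
end
```

For example, inc_paths("#.....giants", " dhem giants ") yields "giants.dhem" and "giants.giants", each containing exactly one dot, so each would address a single nested counter.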

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/9212710/
