ruby - A better regular expression for splitting sentences in Ruby?

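I am working on something that counts how often each word occurs in a body of text, reports which sentences it appears in, and sorts the results by word frequency. For example: sample input and output

Here is what I have so far: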

# open the file named "sample_text.txt" for reading
File.open('sample_text.txt', 'r') do |f|
  content = f.read # read the whole file into one long string

  # split the string into sentences
  sentences = content.split(/\.|\?|\!/).each do |es|
    # split each sentence into individual words
    es.split(/\W|\s/).each do |w|
      # for each word, find matched words in the content
    end
  end
end

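Questions:

1. Is there a better regular expression for splitting sentences? Right now, split(/\.|\?|\!/) treats web 2.0 as two sentences, web 2 and 0.

2. Can anyone give me some hints on how to do the part that returns an array of the sentences a word appears in?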

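Best answer

How about requiring whitespace after the period (or other punctuation such as ? or !), then optionally preventing it from being preceded by certain well-known abbreviations (for example vs. or Mr. or Mrs. or i.e. or e.g.), and perhaps also requiring a capital letter to follow?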
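One way those three rules might be combined is sketched below. This is only an illustration, not the answerer's code: the method name split_sentences and the ABBREVIATIONS list are assumptions made for the example, and the list is far from exhaustive. It splits on whitespace that follows ., ? or ! and precedes a capital letter, then re-joins any piece that ended in a known abbreviation.

```ruby
# Illustrative list only; extend it for your own text.
ABBREVIATIONS = %w[vs. Mr. Mrs. i.e. e.g.].freeze

# Hypothetical helper: returns an array of sentence strings.
def split_sentences(text)
  # Split on whitespace that follows ., ? or ! and precedes a capital letter.
  # "web 2.0" stays intact because no whitespace follows its period.
  pieces = text.strip.split(/(?<=[.?!])\s+(?=[A-Z])/)

  # Re-join a piece with the previous one when the previous one
  # ended in a known abbreviation such as "Mr.".
  pieces.each_with_object([]) do |piece, sentences|
    last_word = sentences.last&.split&.last
    if ABBREVIATIONS.include?(last_word)
      sentences.last << " " << piece
    else
      sentences << piece.dup
    end
  end
end

p split_sentences("I like web 2.0 a lot. Mr. Smith agrees! Do you?")
# => ["I like web 2.0 a lot.", "Mr. Smith agrees!", "Do you?"]
```

Note that re-joining uses a single space, so the original whitespace after an abbreviation is not preserved, and a sentence that genuinely ends in an abbreviation will be merged with the next one.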

Given an array of sentence strings and a method that splits each sentence into an array of words (I'll leave that to you), you could do this:

sentences_for_word = Hash.new{ |h,k| h[k] = [] }
sentences.each do |sentence|
  words_for_sentence(sentence).each do |word|
    sentences_for_word[word] << sentence
  end
end
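For completeness, here is a rough sketch of how the missing pieces might look. The answer deliberately leaves words_for_sentence to the reader, so the version below is just one possible assumption (lowercased words, keeping inner apostrophes and periods so that "2.0" and "don't" survive). The names index_sentences and sorted_frequencies are also made up for this example.

```ruby
# Hypothetical implementation of the helper the answer leaves open:
# lowercase the sentence and pull out runs of letters/digits, allowing
# a single ' or . between them (so "2.0" and "don't" stay whole).
def words_for_sentence(sentence)
  sentence.downcase.scan(/[[:alnum:]]+(?:['.][[:alnum:]]+)*/)
end

# Maps each word to the sentences it appears in (each sentence listed once).
def index_sentences(sentences)
  sentences_for_word = Hash.new { |h, k| h[k] = [] }
  sentences.each do |sentence|
    words_for_sentence(sentence).uniq.each do |word|
      sentences_for_word[word] << sentence
    end
  end
  sentences_for_word
end

# Returns [word, count] pairs, most frequent first, ties sorted by word.
def sorted_frequencies(sentences)
  counts = Hash.new(0)
  sentences.each do |sentence|
    words_for_sentence(sentence).each { |word| counts[word] += 1 }
  end
  counts.sort_by { |word, count| [-count, word] }
end

sentences = ["The cat sat.", "The dog saw the cat."]
index = index_sentences(sentences)

sorted_frequencies(sentences).each do |word, count|
  puts "#{word} (#{count}): #{index[word].inspect}"
end
```

The uniq call is a design choice: without it, a sentence that contains the same word twice would be listed twice under that word.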

Regarding "ruby - A better regular expression for splitting sentences in Ruby?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8351579/
