ruby - Chunking a Ruby string using regular expressions

Tags: ruby regex

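I have a text file that is structured in sections, and I'd like to break it into an array with one string element per section. The contents of each section will then be handled differently depending on the section. I'm currently working in irb and will most likely move this into a separate Ruby script file.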

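I've created a string object and a file object from the input file ("sample" and "sample_file", respectively) to test different approaches. I'm sure a file-reading loop could work here, but I believe a simple match is all I need.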

The file looks like this:

*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section

Example output:

[*** Section Header ***\r\n\r\n randomly formatted content\r multiple lines, **** Another Header\r this sections info,*** sub header and its info, ...etc.]

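That is, [string of section, string of section, string of section]. Most of my attempts have failed because of complications caused by the inconsistent opening and closing conditions or by the multi-line nature of what I need.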

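Here are my closest attempts. They either create unwanted elements (for example, a string containing the closing asterisks of one header and the opening asterisks of the next) or capture only the headers.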

This matches the headers:

sample.scan(/\*{3}.*/)

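This matches the headers and the sections, but it also creates elements from the closing and opening asterisks. I don't fully understand lookahead and lookbehind assertions, but based on my searching I think the solution looks something like this: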

sample.scan(/(?<=\*{3}).*?(?=\*{3})/m)

I'm now looking into matching lines that begin with whitespace and/or asterisks, but I'm not there yet!

sample.scan(/^(\s+\*+|\*+).*/)

Any direction is greatly appreciated.

Best answer

Ruby's Enumerable includes slice_before, which is very useful for this kind of task:

str = "*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section
"
str.split("\n").slice_before(/^\s*\*{3}/).to_a
# => [["*** Section Header ***",
#      "",
#      "randomly formatted content",
#      "multiple lines",
#      ""],
#     [" *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)",
#      "",
#      "This sections info"],
#     ["       **** sub headers sometime occur***",
#      "           I'm okay with treating this as normal headers for now.",
#      "           I think sub headers may have something consistent about them.",
#      "",
#      ""],
#     ["*** Header ***", "  info for this section"]]

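Using slice_before lets me use a very simple pattern to locate a landmark that indicates where a sub-array break should occur. /^\s*\*{3}/ finds lines that start with an optional run of whitespace followed by three '*' characters. Each time one is found, a new sub-array begins.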
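To see the chunking rule in isolation, here is a minimal sketch on a tiny, made-up array of lines. The variable names and sample data are illustrative assumptions; only slice_before and the header pattern come from the answer. It also shows that any lines preceding the first header end up in a chunk of their own.

```ruby
# Hypothetical miniature input; only the header pattern is from the answer.
lines = [
  "*** A ***",
  "alpha",
  "  *** B",
  "beta",
  "gamma"
]

# A new sub-array starts at every line that matches the header pattern.
sections = lines.slice_before(/^\s*\*{3}/).to_a
p sections
# => [["*** A ***", "alpha"], ["  *** B", "beta", "gamma"]]

# Lines that come before the first header form their own leading chunk.
with_preamble = ["intro", "*** A", "x"].slice_before(/^\s*\*{3}/).to_a
p with_preamble
# => [["intro"], ["*** A", "x"]]
```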

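If you want each sub-array to be a single string rather than an array of the lines in the block, map(&:join) is your friend. (join with no argument concatenates the lines with no separator; the runs of spaces in the outputs that follow suggest the string was indented by four spaces when they were produced.)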

str.split("\n").slice_before(/^\s*\*{3}/).map(&:join)
# => ["*** Section Header ***    randomly formatted content    multiple lines",
#     "     *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)    This sections info",
#     "           **** sub headers sometime occur***               I'm okay with treating this as normal headers for now.               I think sub headers may have something consistent about them.",
#     "    *** Header ***      info for this section    "]

And if you want to remove the leading and trailing whitespace, you can use strip together with map:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip }
# => ["*** Section Header ***    randomly formatted content    multiple lines",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)    This sections info",
#     "**** sub headers sometime occur***               I'm okay with treating this as normal headers for now.               I think sub headers may have something consistent about them.",
#     "*** Header ***      info for this section"]

Or:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.map(&:strip).join(' ') }
# => ["*** Section Header ***  randomly formatted content multiple lines ",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)  This sections info",
#     "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.  ",
#     "*** Header *** info for this section "]

Or:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip.squeeze(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
#     "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
#     "*** Header *** info for this section"]

It depends on what you want to do.
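Since the question mentions handling each section differently depending on which section it is, one possible follow-up is to key each chunk by its header. The sketch below is only an illustration under stated assumptions: the sample text, the variable names, and the title-cleaning regex are hypothetical and not part of the original answer. It also assumes the text starts with a header and that titles are unique, because a repeated title would overwrite the earlier entry in the hash.

```ruby
# Hypothetical sample text; the real file would be read from disk instead.
text = [
  "*** Section Header ***",
  "",
  "first line",
  "second line",
  " *** Another Header",
  "more content"
].join("\n")

header_re = /^\s*\*{3}/

# Each chunk is [header_line, *body_lines]; destructure it in the block.
pairs = text.split(/\r?\n/).slice_before(header_re).map { |header, *body|
  # Drop surrounding whitespace and asterisks to get a bare title.
  title = header.strip.gsub(/\A\*+\s*|\s*\*+\z/, "")
  # Keep only the non-blank body lines, trimmed.
  [title, body.map(&:strip).reject(&:empty?)]
}

sections = pairs.to_h
p sections
# => {"Section Header"=>["first line", "second line"], "Another Header"=>["more content"]}
```

With a hash like this, per-section handling can become a lookup on the title, for example a case statement over sections.each.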


Splitting by "\r" produces a better output on my real file than "\n"

str.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a

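That uses /\r?\n/, a regular expression that looks for an optional carriage return followed by a newline. Windows uses the "\r\n" combination to mark the end of a line, while Mac OS and *nix use only "\n". By doing it this way, you aren't tying your code to Windows only.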
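The difference is easy to see on a small string that mixes both line endings. The sample string and variable names below are made up for illustration:

```ruby
# Hypothetical sample mixing Windows ("\r\n") and Unix ("\n") line endings.
mixed = "*** A ***\r\nalpha\n*** B ***\r\nbeta\r\n"

# Splitting on "\n" alone leaves a stray "\r" on the Windows-style lines.
unix_only = mixed.split("\n")
p unix_only
# => ["*** A ***\r", "alpha", "*** B ***\r", "beta\r"]

# Splitting on /\r?\n/ consumes the optional "\r" along with the "\n".
either = mixed.split(/\r?\n/)
p either
# => ["*** A ***", "alpha", "*** B ***", "beta"]
```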

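I don't know whether slice_before was developed for this particular use, but I've used it to tear apart text files and break them into paragraphs, and to split network device configurations into blocks, which made the parsing much easier in both cases.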

Regarding "ruby - Chunking a Ruby string using regular expressions", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26317294/
