ruby - Chunking a Ruby string using a regular expression

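I have a text file built up in sections that I want to break into an array, with each section as a string element. Each section's contents will be handled differently depending on the section. I'm currently working in irb and will most likely move this into a separate Ruby script file.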

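I created a string object and a file object from the input file ("sample" and "sample_file" respectively) to test different approaches. I'm sure a file-reading loop could work here, but I believe a simple match is all I need.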

The file looks like this:

*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section

Sample output:

[*** Section Header ***\r\n\r\n randomly formatted content\r multiple lines, **** Another Header\r this sections info,*** sub header and its info, ...etc.]

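That is, [section string, section string, section string]. Most of my attempts have failed because of complications from the inconsistent opening and closing conditions, or from the multi-line nature of what I need.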

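These are my closest attempts. They either create unwanted elements (for example, a string containing the closing asterisks of one header and the opening asterisks of the next) or capture only the headers.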

This matches the headers:

sample.scan(/\*{3}.*/)

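This matches headers and sections, but it creates elements from the closing and opening asterisks. I don't fully understand lookahead and lookbehind assertions, but based on my searching I think the solution looks something like this: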

sample.scan(/(?<=\*{3}).*?(?=\*{3})/m)

I'm now looking into matching lines that start with whitespace and/or asterisks, but I'm not there yet!

sample.scan(/^(\s+\*+|\*+).*/)

Any direction is greatly appreciated.

Best answer

Ruby's Enumerable includes slice_before, which is very useful for this kind of task:

str = "*** Section Header ***

randomly formatted content
multiple lines

 *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)

This sections info
       **** sub headers sometime occur***
           I'm okay with treating this as normal headers for now.
           I think sub headers may have something consistent about them.


*** Header ***
  info for this section
"
str.split("\n").slice_before(/^\s*\*{3}/).to_a
# => [["*** Section Header ***",
#      "",
#      "randomly formatted content",
#      "multiple lines",
#      ""],
#     [" *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)",
#      "",
#      "This sections info"],
#     ["       **** sub headers sometime occur***",
#      "           I'm okay with treating this as normal headers for now.",
#      "           I think sub headers may have something consistent about them.",
#      "",
#      ""],
#     ["*** Header ***", "  info for this section"]]

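Using slice_before lets me use a very simple pattern to locate a flag/target that indicates where a sub-array break occurs. /^\s*\*{3}/ finds lines that start with an optional run of whitespace followed by three '*'. Once such a line is found, a new sub-array begins.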
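To see which lines the pattern treats as headers, a small check like the following may help. This is only a sketch: the sample lines, the constant names and the helper `header_lines` are made up for illustration, and only the regular expression comes from the answer.

```ruby
# The pattern is the one used above; the constant name is hypothetical.
HEADER_PATTERN = /^\s*\*{3}/

# Hypothetical sample lines, not taken from the real file.
SAMPLE_LINES = [
  "*** Header ***",
  "body text",
  "   **** indented sub header***",
  "** only two asterisks",
  "text with *** in the middle"
]

# Returns only the lines that would start a new sub-array.
def header_lines(lines)
  lines.grep(HEADER_PATTERN)
end

p header_lines(SAMPLE_LINES)
# => ["*** Header ***", "   **** indented sub header***"]

# slice_before starts a new sub-array at each of those lines.
p SAMPLE_LINES.slice_before(HEADER_PATTERN).to_a
```

Lines with fewer than three asterisks, or with asterisks that are not at the start of the line, stay inside the current sub-array.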

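If you want each sub-array to actually be a single string rather than an array of the lines in the block, map(&:join) is your friend. Note that join with no argument puts nothing between the lines, so any spacing in the result comes from whitespace in the lines themselves: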

str.split("\n").slice_before(/^\s*\*{3}/).map(&:join)
# => ["*** Section Header ***    randomly formatted content    multiple lines",
#     "     *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)    This sections info",
#     "           **** sub headers sometime occur***               I'm okay with treating this as normal headers for now.               I think sub headers may have something consistent about them.",
#     "    *** Header ***      info for this section    "]

And if you want to strip leading and trailing whitespace, you can combine strip with map:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip }
# => ["*** Section Header ***    randomly formatted content    multiple lines",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)    This sections info",
#     "**** sub headers sometime occur***               I'm okay with treating this as normal headers for now.               I think sub headers may have something consistent about them.",
#     "*** Header ***      info for this section"]

Or:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.map(&:strip).join(' ') }
# => ["*** Section Header ***  randomly formatted content multiple lines ",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)  This sections info",
#     "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.  ",
#     "*** Header *** info for this section "]

Or:

str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip.squeeze(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines",
#     "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
#     "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
#     "*** Header *** info for this section"]

It depends on what you want to do.
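Since the question mentions handling each section differently, another option is to keep the line breaks inside each section and separate the header from its body. The following is a minimal sketch under that assumption; the method name `sections`, the constant names and the sample text are hypothetical.

```ruby
# Same pattern as in the answer; the constant name is hypothetical.
HEADER_PATTERN = /^\s*\*{3}/

# Splits text into sections and returns [header, body] pairs.
# The first line of each chunk is taken as the header, the rest as the body.
def sections(text)
  text.split(/\r?\n/).slice_before(HEADER_PATTERN).map do |lines|
    header, *body = lines
    [header.strip, body.join("\n").strip]
  end
end

# Hypothetical sample text, much shorter than the real file.
SAMPLE_TEXT = "*** First ***\n\nline one\nline two\n\n *** Second\n  indented info\n"

sections(SAMPLE_TEXT).each do |header, body|
  puts "#{header.inspect} => #{body.inspect}"
end
```

If the text does not begin with a header line, the first pair will use whatever the first line is as its "header", so that case would need its own check.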

Splitting by "\r" produces a better output on my real file than "\n"

str.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a

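Use /\r?\n/, a regular expression that looks for an optional carriage return followed by a newline. Windows uses the "\r\n" combination to mark the end of a line, while Mac OS and *nix use only "\n". By doing this you're not tying your code to Windows only.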
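The effect of the optional "\r" can be checked on two short strings. This is a sketch with made-up sample text; `split_sections` and the constant names are hypothetical.

```ruby
# Same pattern as in the answer; the constant name is hypothetical.
HEADER_PATTERN = /^\s*\*{3}/

# Splits on "\n" or "\r\n" and groups the lines into sections.
def split_sections(text)
  text.split(/\r?\n/).slice_before(HEADER_PATTERN).to_a
end

# Hypothetical sample text with Unix and Windows line endings.
UNIX_TEXT    = "*** A ***\nfoo\n*** B ***\nbar\n"
WINDOWS_TEXT = "*** A ***\r\nfoo\r\n*** B ***\r\nbar\r\n"

p split_sections(UNIX_TEXT)
p split_sections(WINDOWS_TEXT)
# Both print [["*** A ***", "foo"], ["*** B ***", "bar"]]

# Splitting the Windows text on "\n" alone leaves a stray "\r" on each line.
p WINDOWS_TEXT.split("\n").first
# => "*** A ***\r"
```

If a file turned out to use bare "\r" line endings, /\r?\n/ would not split it at all; a pattern such as /\r\n|\r|\n/ would be one way to cover that case as well.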

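I don't know whether slice_before was developed for this particular use, but I've used it to tear apart text files and break them into paragraphs, and to split network device configurations into blocks, which made parsing much easier in both cases.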
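Because slice_before comes from Enumerable, the same approach should also work on the lines of a file object rather than a string that has already been read. The sketch below uses a StringIO as a stand-in for the file so that it runs without touching the disk; `sections_from_io` and the sample content are hypothetical.

```ruby
require 'stringio'

# Same pattern as in the answer; the constant name is hypothetical.
HEADER_PATTERN = /^\s*\*{3}/

# Accepts any IO-like object that responds to each_line (File, StringIO, ...).
# chomp removes the trailing "\n" or "\r\n" from each line before grouping.
def sections_from_io(io)
  io.each_line
    .map(&:chomp)
    .slice_before(HEADER_PATTERN)
    .map { |lines| lines.join("\n").strip }
end

# Hypothetical content standing in for the real file.
FAKE_FILE_CONTENT = "*** Intro ***\nhello\n\n *** Details\nmore text\n"

p sections_from_io(StringIO.new(FAKE_FILE_CONTENT))
# => ["*** Intro ***\nhello", "*** Details\nmore text"]
```

With a real file, something like File.open(path) { |f| sections_from_io(f) } would be the equivalent call, where path is whatever file you are reading.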

For more on chunking a Ruby string using a regular expression, see the similar question on Stack Overflow: https://stackoverflow.com/questions/26317294/
