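I have a text file built in sections that I'd like to break into an array of string elements, one per section. Each section's content will be processed differently depending on the section. I'm currently working in irb and will most likely move this into a separate Ruby script file.
I created a string object and a file object from the input file ("sample" and "sample_file" respectively) to test different methods. I'm sure a file-reading loop could work here, but I believe a simple match is all I need.
The file looks like this: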
*** Section Header ***
randomly formatted content
multiple lines
*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)
This sections info
**** sub headers sometime occur***
I'm okay with treating this as normal headers for now.
I think sub headers may have something consistent about them.
*** Header ***
info for this section
Example output:
[*** Section Header ***\r\n\r\n randomly formatted content\r multiple lines, **** Another Header\r this sections info,*** sub header and its info, ...etc.]
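That is, [string of section, string of section, string of section]. Most of my attempts have failed because of complications from the inconsistent opening and closing conditions, or from the multi-line nature of what I need.
Here are my closest attempts, which either create unwanted elements (for example, a string containing the closing asterisks of one header and the opening asterisks of the next) or capture only a header.
This matches the headers: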
sample.scan(/\*{3}.*/)
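This matches headers and sections, but it creates elements from the closing and opening asterisks. I don't fully understand lookahead and lookbehind assertions, but based on my searching I think the solution looks something like this: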
sample.scan(/(?<=\*{3}).*?(?=\*{3})/m)
I'm now working on finding lines that start with whitespace and/or asterisks, but I'm not there yet!
sample.scan(/^(\s+\*+|\*+).*/)
Any direction is greatly appreciated.
Best Answer
Ruby's Enumerable includes slice_before, which is very useful for this sort of task:
str = "*** Section Header ***
randomly formatted content
multiple lines
*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)
This sections info
**** sub headers sometime occur***
I'm okay with treating this as normal headers for now.
I think sub headers may have something consistent about them.
*** Header ***
info for this section
"
str.split("\n").slice_before(/^\s*\*{3}/).to_a
# => [["*** Section Header ***",
# "",
# "randomly formatted content",
# "multiple lines",
# ""],
# [" *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)",
# "",
# "This sections info"],
# [" **** sub headers sometime occur***",
# " I'm okay with treating this as normal headers for now.",
# " I think sub headers may have something consistent about them.",
# "",
# ""],
# ["*** Header ***", " info for this section"]]
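Using slice_before lets me use a very simple pattern to locate a flag/target that marks where each sub-array break occurs. The pattern /^\s*\*{3}/ finds lines that start with optional whitespace followed by three '*' characters. Each time one is found, a new sub-array begins.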
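To make the slicing rule concrete, here is a minimal sketch on a tiny, made-up array of lines (the input is hypothetical; only the header pattern comes from the answer above). It also shows that any lines appearing before the first header end up in their own leading group.

```ruby
# Hypothetical miniature input, used only to illustrate the slicing rule.
demo_lines = ["preamble", "*** A ***", "a1", "  *** B", "b1", "b2"]

# Same pattern as above: optional leading whitespace, then three asterisks.
header_pattern = /^\s*\*{3}/

# slice_before starts a new sub-array at every element the pattern matches.
demo_groups = demo_lines.slice_before(header_pattern).to_a

p demo_groups
# => [["preamble"], ["*** A ***", "a1"], ["  *** B", "b1", "b2"]]
```

If text before the first header is not wanted, that leading group can be dropped after checking that its first line does not match the pattern.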
If you want each sub-array to be a single string rather than an array of the lines in the block, map(&:join) is your friend:
str.split("\n").slice_before(/^\s*\*{3}/).map(&:join)
# => ["*** Section Header *** randomly formatted content multiple lines",
# " *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# " **** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# " *** Header *** info for this section "]
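One detail worth checking on your own data: join with no argument concatenates the lines with nothing between them (assuming the default output field separator), so whether words run together depends on the whitespace already present in each line. The sketch below uses a made-up group of lines to compare the separators; the variable names are illustrative only.

```ruby
# Hypothetical group of lines with no leading or trailing whitespace.
group = ["*** Header ***", "first line", "second line"]

no_separator = group.join        # lines are glued together directly
with_space   = group.join(" ")   # one space between lines
with_newline = group.join("\n")  # original line structure preserved

puts no_separator   # *** Header ***first linesecond line
puts with_space     # *** Header *** first line second line
```

Passing an explicit separator makes the result independent of whatever whitespace the file happens to contain.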
And, if you want to strip leading and trailing whitespace, you can combine strip with map:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip }
# => ["*** Section Header *** randomly formatted content multiple lines",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# "*** Header *** info for this section"]
Or:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.map(&:strip).join(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines ",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them. ",
# "*** Header *** info for this section "]
Or:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip.squeeze(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# "*** Header *** info for this section"]
It depends on what you want to do.
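Since the question mentions handling each section differently, one possible follow-up is to key each section's body by its header text. This is only a sketch under assumptions: the sample text is abbreviated, the names sample_text and sections_by_title are made up, and stripping every run of asterisks from the header is a simplification that may not suit real headers.

```ruby
# Abbreviated, hypothetical sample in the same shape as the question's file.
sample_text = "*** Section Header ***\nrandomly formatted content\nmultiple lines\n" \
              " *** Another Header\nThis sections info\n"

section_start = /^\s*\*{3}/

sections_by_title = sample_text.split("\n").slice_before(section_start).map do |header, *body|
  # Remove the asterisk decoration so the header can serve as a lookup key.
  title = header.gsub(/\*+/, "").strip
  # Keep only non-blank body lines, trimmed of surrounding whitespace.
  [title, body.map(&:strip).reject(&:empty?)]
end.to_h

p sections_by_title["Section Header"]
# => ["randomly formatted content", "multiple lines"]
```

Note that a hash keeps only the last section for any repeated title, so an array of pairs may be safer if headers can repeat.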
Regarding the comment: Splitting by "\r" produces a better output on my real file than "\n"
str.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a
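Use /\r?\n/, which is a regex that looks for an optional carriage return followed by a newline. Windows uses the "\r\n" combination to mark the end of a line, whereas macOS and *nix use only "\n". Doing this keeps your code from being tied to Windows only.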
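If a file also contains bare "\r" line endings (the comment about splitting on "\r" hints that this might be the case, but that is an assumption), /\r?\n/ will not split on them. A slightly broader pattern, sketched here on a made-up string, covers "\r\n", "\r", and "\n" together.

```ruby
# Hypothetical input mixing all three line-ending styles.
mixed_text = "*** A ***\r\nalpha\r*** B ***\nbeta"

# "\r\n?" matches "\r\n" or a lone "\r"; the alternative matches a lone "\n".
mixed_lines = mixed_text.split(/\r\n?|\n/)

mixed_groups = mixed_lines.slice_before(/^\s*\*{3}/).to_a

p mixed_groups
# => [["*** A ***", "alpha"], ["*** B ***", "beta"]]
```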
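I don't know whether slice_before was developed for this particular use, but I've used it to tear apart text files and break them into paragraphs, and to split network device configurations into blocks, which made parsing much easier in both cases.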
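As a rough illustration of the configuration use case, the sketch below assumes a hypothetical config format in which each block starts with an unindented line and its settings are indented beneath it. The data and names are invented for the example and are not taken from any real device.

```ruby
# Hypothetical configuration lines: top-level statements start in column 0,
# and their child settings are indented by one space.
config_lines = [
  "interface Ethernet0",
  " description uplink",
  " no shutdown",
  "interface Ethernet1",
  " shutdown",
  "hostname edge-router"
]

# A new block begins at every line whose first character is not whitespace.
config_blocks = config_lines.slice_before(/^\S/).to_a

config_blocks.each { |block| p block }
# ["interface Ethernet0", " description uplink", " no shutdown"]
# ["interface Ethernet1", " shutdown"]
# ["hostname edge-router"]
```

The same idea applies to any format where a recognizable line marks the start of a block; only the pattern changes.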
Source: "ruby - Chunk a Ruby string using regex", a similar question found on Stack Overflow: https://stackoverflow.com/questions/26317294/