I have text of this kind:
Song of Solomon 1:1: The song of songs, which is Solomon’s.
John 3:16:For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
III John 1:8: We therefore ought to receive such, that we might be fellowhelpers to the truth.
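I'm trying to remove the verse reference (or metadata, if you prefer) and keep only the plain text. The sample shows three kinds of book name (multi-word, single-word, and Roman numeral plus word), so I think the easiest approach is to match everything from the start of each line up to and including "number:number:" and replace it with "" (the empty string).
I tested a regex that seems to work (as I described):
- first, match everything up to "number:number:" without including it [that is: .+?(?=(\s+)(\d+)(:)(\d+)(:))],
- then add the "number:number:" pattern itself [that is: (\s+)(\d+)(:)(\d+)(:)]
This gives the following regex: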
.+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)
The regex seems to work fine in an online tester, but when I try to use it with sed, it doesn't work:
$ sed 's/.+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)//g' testcase.txt
It prints the input unchanged, when it should produce:
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
Any help, please?
Many thanks!
Best answer
This awk should do it:
awk -F": *" '{print $3}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
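To see why $3 holds the verse text, it can help to print every field that -F": *" produces. This is a minimal sketch using a shortened copy of the John 3:16 sample line; show_fields is only an illustrative helper name, not part of the answer above:

```shell
# show_fields (hypothetical helper): print each field awk sees when
# splitting on a colon followed by any number of spaces.
show_fields() {
  awk -F': *' '{ for (i = 1; i <= NF; i++) printf "field %d = [%s]\n", i, $i }'
}

printf '%s\n' 'John 3:16:For God so loved the world' | show_fields
# field 1 = [John 3]
# field 2 = [16]
# field 3 = [For God so loved the world]
```

The book name and chapter land in field 1, the verse number in field 2, and the text in field 3, so this only works as long as the text itself contains no colon.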
To make it safer by matching number:number: explicitly, use this:
awk -F"[0-9]+:[0-9]+: *" '{print $2}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
This also avoids problems when a : appears in the text.
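The difference between the two commands shows up on a line whose text contains a colon. A small sketch; the input line is made up for illustration and does not come from the sample file:

```shell
# Hypothetical input line with a colon inside the verse text.
line='Book 1:2: first part: second part'

# Splitting on every colon cuts the text off at the later colon.
printf '%s\n' "$line" | awk -F': *' '{print $3}'
# first part

# Splitting only on number:number: keeps the whole text.
printf '%s\n' "$line" | awk -F'[0-9]+:[0-9]+: *' '{print $2}'
# first part: second part
```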
Using Adam's regex, we can shorten it a bit:
awk -F"([0-9]+:){2} ?" '{print $2}' file
or
awk -F"([0-9]+:){2} ?" '{$0=$2}1' file
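As for why the original sed command changes nothing: sed uses POSIX regular expressions, which do not support lazy quantifiers such as +?, lookahead such as (?=...), or the \d shorthand. Those are Perl-style (PCRE) features, which is presumably why the pattern matched in the tester but not in sed. Without -E, sed also treats + and ( ) as literal characters. A sed version that avoids those features could look like the sketch below, assuming a sed that accepts -E (GNU and BSD sed both do); strip_refs is a hypothetical helper name and the input lines are shortened copies of the samples:

```shell
# strip_refs (hypothetical helper): delete everything from the start of the
# line through "number:number:" and any spaces that follow.
# [^:]* cannot cross a colon, so the match always ends at the first reference.
strip_refs() {
  sed -E 's/^[^:]*:[0-9]+: *//' "$@"
}

printf '%s\n' \
  'Song of Solomon 1:1: The song of songs' \
  'John 3:16:For God so loved the world' \
  'III John 1:8: We therefore ought to receive such' |
  strip_refs
# The song of songs
# For God so loved the world
# We therefore ought to receive such
```

With a file, the same helper can be called as strip_refs testcase.txt.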
Source: a similar question on Stack Overflow, "regex - Why doesn't this working regex work with sed?": https://stackoverflow.com/questions/28639112/