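regex - Multiline regular expression

Tags: regex bash csv sed

I've been stuck on this for a few hours now, cycling through a whole bunch of different tools to get the job done, without success. It would be great if someone could help me out with this.

Here is the problem:

I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this: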

This is a long abstract describing something. What follows is the tile for this sentence."   
,Title1  
This is another sentence that is running on one line. On the next line you can find the title.   
,Title2

As you can probably see, the titles ",Title1" and ",Title2" should actually be on the same line as the preceding sentence. Then it would look something like this:

This is a long abstract describing something. What follows is the tile for this sentence.",Title1  
This is another sentence that is running on one line. On the next line you can find the title.,Title2

Please note that the sentence may or may not end with a quote. In the end, those should be replaced too.

Here is what I came up with so far:

sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv

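This should actually do the job of matching a multiline expression. Unfortunately, it doesn't :)

The expression looks for the dot at the end of the sentence, an optional quote, and the newline character, which I am trying to match with .*

Any help is much appreciated. It doesn't matter which tool gets the job done (awk, perl, sed, tr, etc.).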

Best answer

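Multiline matching in sed isn't necessarily tricky per se; it's just that it relies on commands most people aren't familiar with, and those commands have certain side effects, such as separating the current line from the next one with a '\n' when you use 'N' to append the next line to the pattern space.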
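To see that side effect directly, here is a minimal sketch (the two input lines are made up for illustration). It uses 'N' to pull the second line into the pattern space and 'l' to print the pattern space unambiguously, so the embedded newline becomes visible:

```shell
# Two hypothetical input lines; -n suppresses the automatic print.
# N appends the second line to the pattern space, and l prints the
# pattern space with escapes, so the embedded newline shows up as \n
# and the end of the pattern space is marked with $.
shown=$(printf 'first\nsecond\n' | sed -n 'N;l')
printf '%s\n' "$shown"
# prints: first\nsecond$
```

That embedded '\n' is what a multiline expression has to match.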

Anyway, it's much easier if you match on lines that start with a comma to decide whether or not to remove the newline, so that's what I did here:

sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
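For readers unfamiliar with these commands, the same script can be spread over several lines with a comment per command. This is only a sketch of how I read the one-liner: the temporary file and its contents are hypothetical stand-ins for title_csv, and the '\?' syntax assumes GNU sed.

```shell
# Hypothetical sample file, created here so the sketch runs on its own.
input=$(mktemp)
cat > "$input" <<'EOF'
keep this line
An abstract that ends with a quote."
,Title1
An abstract without a quote.
,Title2
EOF

joined=$(sed '
  # Append the next input line to the pattern space, separated by \n.
  N
  # If the appended line starts with a comma, drop the optional quote,
  # any trailing spaces, and the newline, which joins the two lines.
  /\n,/ s/"\? *\n//
  # Print the pattern space up to the first newline (all of it if none).
  P
  # Delete up to the first newline and restart the script on what is left,
  # without reading a new line; if no newline is left, start a normal cycle.
  D
' "$input")

printf '%s\n' "$joined"
rm -f "$input"
```

Because 'D' restarts the script on whatever is left, the pattern space acts as a sliding two-line window, so only two lines are held in memory at a time, which matters for a 400 MB file.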

Input

$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line

Output

$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
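Since the question allows any tool, here is a rough awk sketch of the same idea for systems without GNU sed ('\?' is a GNU extension to basic regular expressions). It is not part of the original answer; the temporary file and the variable names prev and have_prev are made up for illustration. It holds back one line and glues a comma-led line onto it:

```shell
# Hypothetical sample file, created here so the sketch runs on its own.
input=$(mktemp)
cat > "$input" <<'EOF'
keep this line
An abstract that ends with a quote."
,Title1
An abstract without a quote.
,Title2
keep this line too
EOF

joined_awk=$(awk '
  # A line starting with a comma belongs to the line held in prev.
  /^,/ && have_prev {
    sub(/"? *$/, "", prev)   # strip an optional quote and trailing spaces
    prev = prev $0           # glue the title onto the held line
    next
  }
  # Any other line: flush the held line, then hold the current one.
  have_prev { print prev }
  { prev = $0; have_prev = 1 }
  # Flush the last held line at end of input.
  END { if (have_prev) print prev }
' "$input")

printf '%s\n' "$joined_awk"
rm -f "$input"
```

One difference from the sed command: if several comma-led lines follow each other, this sketch folds all of them into the held line, whereas the sed command joins only the first.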

Regarding regex - Multiline regular expression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4510813/
