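I've been at this for a few hours now and have cycled through a lot of different tools to get the job done, without success. It would be great if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this: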
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
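This should actually do the job of matching the expression across multiple lines. Unfortunately it doesn't :)
The expression looks for the dot at the end of the sentence, the optional quote, and the newline character, which I'm trying to match with .*.
Help is much appreciated. It doesn't really matter which tool gets the job done (awk, perl, sed, tr, etc.).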
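Best answer
Multiline in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects, such as separating the current line from the next one with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier to decide whether to remove the newline by matching on a line that starts with a comma, so that's what I did here: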
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
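For readers unfamiliar with N, P and D, here is the same command spread over several lines with a comment per step. This is only an annotated sketch: the wrapper function name join_titles is made up for illustration, and it assumes GNU sed, since \? is a GNU extension and GNU sed prints the pattern space when N hits the end of input.

```shell
# Hypothetical wrapper around the one-liner above; assumes GNU sed.
join_titles() {
  sed '
    # Append the next input line to the pattern space, separated by "\n".
    N
    # If the appended line starts with a comma, remove an optional quote,
    # any spaces, and the newline, which joins the two lines into one.
    /\n,/s/"\? *\n//
    # Print the pattern space up to the first newline (all of it if none is left).
    P
    # Delete up to the first newline. If text remains, restart the cycle with it
    # without reading new input; otherwise start a normal new cycle.
    D
  ' "$@"
}
```

It can be called as join_titles title_csv, or fed from a pipe. At most two lines sit in the pattern space at any time, so the size of the file should not matter much.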
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
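Since the question says the tool doesn't matter, the same idea can also be sketched in awk, which avoids the GNU-only \? and may be easier to read. This is not part of the original answer: the function name join_titles_awk and the variables held and have are made up for illustration. It holds one line back and glues the next line onto it whenever that next line starts with a comma.

```shell
# Hypothetical awk variant of the same line-joining idea (names are illustrative).
join_titles_awk() {
  awk '
    # A line starting with a comma is a title: strip an optional trailing
    # quote and spaces from the held line, then glue the title onto it.
    have && /^,/ { sub(/"? *$/, "", held); held = held $0; next }
    # Any other line: flush the held line first.
    have { print held }
    # Hold the current line until we know what follows it.
    { held = $0; have = 1 }
    # Flush the last held line at end of input.
    END { if (have) print held }
  ' "$@"
}
```

Usage would be join_titles_awk out.csv > out1.csv. Like the sed version, it only keeps one line in memory, unlike the attempt in the question, which collects the whole file in the hold space before substituting.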
Source: regex - multiline regular expression, a similar question found on Stack Overflow: https://stackoverflow.com/questions/4510813/