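I need to process a large number of Mechanical Turk HITs quickly, so I'm trying to write an AWK command/script.
I need it to do three things:
Grab and print specific columns.
Remove some text from certain fields.
Output the fields with quotes around them.
I want to take my input: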
"1","2","Input.image_url","Answer.main"
"1","2","http://i.imgur.com/rGJA3YU.jpg","text"
and output:
"image_url","main"
"http://i.imgur.com/rGJA3YU.jpg","text"
So far I have:
awk -F'","|^"|"$' '{sub("^\"","")} {print $3 ", "$4}' test.csv > output.csv
which prints:
Input.image_url, Answer.main
http://i.imgur.com/rGJA3YU.jpg, text
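How should I change this?
Thanks for looking, I really appreciate it. I'm new to AWK.
Edit: This snippet works for the sample I provided, but unfortunately there are still problems. I thought I could simplify the input/output to make things easier, but it seems I left something out. So I'll fill in the details...
When I use: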
awk 'BEGIN{FS=OFS=","} {gsub(/"[[:alpha:]]+\./,"\""); print $28, $31}' test.csv > output.csv
on:
"HITId","HITTypeId","Title","Description","Keywords","Reward","CreationTime","MaxAssignments","RequesterAnnotation","AssignmentDurationInSeconds","AutoApprovalDelayInSeconds","Expiration","NumberOfSimilarHITs","LifetimeInSeconds","AssignmentId","WorkerId","AssignmentStatus","AcceptTime","SubmitTime","AutoApprovalTime","ApprovalTime","RejectionTime","RequesterFeedback","WorkTimeInSeconds","LifetimeApprovalRate","Last30DaysApprovalRate","Last7DaysApprovalRate","Input.image_url","Input.main_text","Answer.SEND","Answer.SUBJECT","Answer.main","Approve","Reject"
"373L46LKP7703E3YWZRRTZTZNUJJKX","3H9KHFULG43TZRE1KD4ITGVT4OWCEU","Transcribe the text contained in the image","Transcribe the text contained in the image","transcribe, image, text","$0.01","Mon Aug 25 20:47:26 GMT 2014","1","BatchId:1651513;","900","60","Mon Sep 01 20:47:26 GMT 2014","","","33IZTU6J812191JP8EKV0EN8HD7XS2","A1GOJEDZM2CQTN","Submitted","Mon Aug 25 20:48:15 GMT 2014","Mon Aug 25 20:48:26 GMT 2014","Mon Aug 25 13:49:26 PDT 2014","","","","11","100% (3/3)","100% (3/3)","0% (0/0)","http://i.imgur.com/rGJA3YU.jpg","hippy hay","","","text"
it prints:
"image_url","main"
"100% (3/3)",""
But I need:
"image_url","main"
"http://i.imgur.com/rGJA3YU.jpg","text"
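The first line comes out fine, but for some reason the second line returns a different column.
Best answer
You didn't say in your question how to determine which text should be removed from the fields, so this may or may not be what you want: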
$ awk 'BEGIN{FS=OFS=","} {gsub(/"[[:alpha:]]+\./,"\""); print $3, $4}' file
"image_url","main"
"http://i.imgur.com/rGJA3YU.jpg","text"
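On the full data shown in the edit, splitting on a bare comma miscounts the data row: the Keywords value "transcribe, image, text" contains two commas inside its quotes, so every later field shifts by two and $28 lands on "100% (3/3)" instead of the image URL. One possible workaround, sketched below, is to split on the three-character sequence "," after stripping the outer quotes, and to look columns up by header name rather than by number. This is only a sketch: the function name extract_columns, the variable want and the file sample.csv are made up for illustration, and it assumes every field is quoted and that no field contains an escaped quote, the sequence ",", or a line break.

```shell
# Hypothetical sample: the second column contains commas inside its quotes,
# like the Keywords column in the real export.
cat > sample.csv <<'EOF'
"HITId","Keywords","Input.image_url","Answer.main"
"373L","transcribe, image, text","http://i.imgur.com/rGJA3YU.jpg","text"
EOF

# extract_columns FILE: print the columns named in "want", quoted,
# with the Input./Answer. prefix removed from the header names.
# Assumes every wanted name exists in the header row.
extract_columns() {
  awk -v want='Input.image_url,Answer.main' '
  BEGIN { FS = "\",\""; n = split(want, name, ",") }

  # Drop the outer quotes; changing $0 makes awk re-split the record on FS.
  { sub(/^"/, ""); sub(/"$/, "") }

  # Header row: remember which column number each name lives in.
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }

  {
    line = ""
    for (k = 1; k <= n; k++) {
      value = $(col[name[k]])
      # Header only: turn Input.image_url into image_url.
      if (NR == 1) sub(/^[A-Za-z]+\./, "", value)
      line = line (k > 1 ? "," : "") "\"" value "\""
    }
    print line
  }' "$1"
}

extract_columns sample.csv
```

Looking columns up by name also sidesteps counting to the right field number by hand. For files that break the assumptions above, a real CSV parser is the safer choice.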
Regarding csv - AWK: output CSV with quotes and removed segments for Mechanical Turk, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25494489/