csv - AWK: Output CSV with quotes and removed segments for Mechanical Turk

Tags: csv awk mechanicalturk

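I need to process a large number of Mechanical Turk HITs quickly, so I am trying to write an AWK command/script.

I need it to do three things:

  1. Grab and print specific columns.

  2. Remove some text from certain fields.

  3. Output the fields with quotes.

I want to take my input: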

"1","2","Input.image_url","Answer.main"
"1","2","http://i.imgur.com/rGJA3YU.jpg","text"

and output:

"image_url","main"
"http://i.imgur.com/rGJA3YU.jpg","text"

So far I have:

awk -F'","|^"|"$' '{sub("^\"","")} {print $3 ", "$4}' test.csv > output.csv

which prints:

Input.image_url, Answer.main
http://i.imgur.com/rGJA3YU.jpg, text

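How can I change this?

Thanks for looking, I really appreciate it. I am new to AWK.

EDIT: The snippet in the answer works for the example I provided, but unfortunately something is still wrong. I thought I could simplify the input/output to make this easier, but it seems I left something out. So here are the full details...

When I use: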

awk 'BEGIN{FS=OFS=","} {gsub(/"[[:alpha:]]+\./,"\""); print $28, $32}' test.csv > output.csv

on:

"HITId","HITTypeId","Title","Description","Keywords","Reward","CreationTime","MaxAssignments","RequesterAnnotation","AssignmentDurationInSeconds","AutoApprovalDelayInSeconds","Expiration","NumberOfSimilarHITs","LifetimeInSeconds","AssignmentId","WorkerId","AssignmentStatus","AcceptTime","SubmitTime","AutoApprovalTime","ApprovalTime","RejectionTime","RequesterFeedback","WorkTimeInSeconds","LifetimeApprovalRate","Last30DaysApprovalRate","Last7DaysApprovalRate","Input.image_url","Input.main_text","Answer.SEND","Answer.SUBJECT","Answer.main","Approve","Reject"
"373L46LKP7703E3YWZRRTZTZNUJJKX","3H9KHFULG43TZRE1KD4ITGVT4OWCEU","Transcribe the text contained in the image","Transcribe the text contained in the image","transcribe, image, text","$0.01","Mon Aug 25 20:47:26 GMT 2014","1","BatchId:1651513;","900","60","Mon Sep 01 20:47:26 GMT 2014","","","33IZTU6J812191JP8EKV0EN8HD7XS2","A1GOJEDZM2CQTN","Submitted","Mon Aug 25 20:48:15 GMT 2014","Mon Aug 25 20:48:26 GMT 2014","Mon Aug 25 13:49:26 PDT 2014","","","","11","100% (3/3)","100% (3/3)","0% (0/0)","http://i.imgur.com/rGJA3YU.jpg","hippy hay","","","text"

it prints:

"image_url","main"
"100% (3/3)",""

But I need:

"image_url","main"
"http://i.imgur.com/rGJA3YU.jpg","text"

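The first line works fine, but for some reason the second line returns a different column.

Best answer

You didn't say in your question how to decide which text to remove from the fields, so this may or may not be what you want: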

$ awk 'BEGIN{FS=OFS=","} {gsub(/"[[:alpha:]]+\./,"\""); print $3, $4}' file
"image_url","main"
"http://i.imgur.com/rGJA3YU.jpg","text"
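
On the full Mechanical Turk file from the edit, the Keywords value "transcribe, image, text" contains two commas. With FS="," every later field in that row therefore shifts by two positions, which is why $28 lands on "100% (3/3)" in the data row while the header row is unaffected. One possible workaround, sketched below, is to split on the three-character sequence `","` instead of a bare comma and to find the columns by header name rather than by number. It assumes that every field is double-quoted and that no field contains an embedded `","` sequence or an escaped quote. The function name `extract_columns` is hypothetical, and `[A-Za-z]` stands in for `[[:alpha:]]` only for the sake of older awk builds.

```shell
# Hypothetical helper: print the Input.image_url and Answer.main columns
# of a fully quoted MTurk results file, with the prefixes removed.
extract_columns() {
    awk -F'","' '
        {
            # Drop the quote that opens the first field and the one that
            # closes the last field; changing $0 re-splits the record.
            sub(/^"/, "")
            sub(/"$/, "")
        }
        NR == 1 {
            # Locate the wanted columns by name, then strip prefixes
            # such as Input. and Answer. from the header fields.
            for (i = 1; i <= NF; i++) {
                if ($i == "Input.image_url") url_col = i
                if ($i == "Answer.main") main_col = i
                sub(/^[A-Za-z]+\./, "", $i)
            }
            if (!url_col || !main_col) {
                print "expected header columns not found" > "/dev/stderr"
                exit 1
            }
        }
        # Re-add the quotes on output.
        { printf "\"%s\",\"%s\"\n", $url_col, $main_col }
    ' "$1"
}
```

Calling `extract_columns test.csv > output.csv` on the sample above should give the two requested lines. Because the prefix is stripped only in the header row, values in data rows are left exactly as they were.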

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/25494489/
