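I have a type of CSV file that I need to parse. The sample below shows exactly the conditions I need to account for (missing column headers, line feeds inside quotes, missing data, etc.):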
ID,NAME,TITLE,DESCRIPTION,,
PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING
HARD ON BEING A GOOD
SERVENT."
PRO1235,"KEITH SMITH",ENGINEER,"keith has been working
hard on being a good
servent."
PRO1235,"KENNY SMITH",,"keith has been working
hard on being a good
servent."
PRO1235,"RICK SMITH",,,
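You will notice that there are line feeds inside the descriptions as well as the line feeds that separate data rows.
I wrote this regex to find the line feeds outside of quotes, and it works well.
The code, using Node.js: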
var fs = require('fs');
function parseCSV(filename){
    var rx = new RegExp(/\n(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g);
    var strFile = fs.readFileSync(filename).toString();
    console.log("line feed count via match: " + strFile.match(rx).length);
    var csv = strFile.split(rx);
    console.log("csv length: " + csv.length);
    console.log("csv items ###############################");
    csv.forEach(function(e,i,a){
        console.log("item e: " + e);
    });
}
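When I run this, you will see that the line feed count returned via match() is correct, which is 4. However, when I use the same regex with String.split(), it returns 17 items and the resulting array is erratic: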
line feed count via match: 4
csv length: 17
csv items ###############################
item e: ID,NAME,TITLE,DESCRIPTION,,
item e:
PRO1235,"RICK SMITH"
item e: "RICK SMITH"
item e: undefined
item e: PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING
HARD ON BEING A GOOD
SERVENT."
item e:
PRO1235,"RICK SMITH"
item e: "RICK SMITH"
item e: undefined
item e: PRO1235,"KEITH SMITH",ENGINEER,"keith has been working
hard on being a good
servent."
item e:
PRO1235,"RICK SMITH"
item e: "RICK SMITH"
item e: undefined
item e: PRO1235,"KENNY SMITH",,"keith has been working
hard on being a good
servent."
item e: PRO1235,"RICK SMITH"
item e: "RICK SMITH"
item e: undefined
item e: PRO1235,"RICK SMITH",,,
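What am I doing wrong with split? My thinking is that if I can identify the 4 line feeds, which works perfectly with match(), then the same regex should give the positions at which to "split" the string.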
Best answer
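You have too many capture groups. When the separator regex contains capture groups, split includes the captured text in the array it returns. Consider this simple example: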
var simpleString = "111aaa222bbb";
var regxNoCaptureGroup = /\d+/;
var regxWithCaptureGroup = /(\d+)/;
var regxWithNoncapturingGroup = /(?:\d+)/;
simpleString.split(regxNoCaptureGroup); //["", "aaa", "bbb"]
simpleString.split(regxWithNoncapturingGroup); //same as above
simpleString.split(regxWithCaptureGroup); //["", "111", "aaa", "222", "bbb"] - includes captured groups
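You have capture groups within capture groups. Remember that split finds the match and removes it to produce the pieces, so splitting around the digits (as in the first example) returns only the letters. In your case the match itself is just the line feed, because the lookahead is zero-width, and that is all split removes. But for every match, split also inserts the value of each capture group into the result, using undefined for a group that did not participate. Your regex has three capture groups inside the lookahead, so each of the 4 line feeds adds 3 extra items: 5 rows + 4 x 3 = 17. If you plan to use split with a regex, build the regex so that it captures only what you actually want back, and make every other group non-capturing with (?:...).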
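As a rough sketch of that fix, here is the question's pattern with each group rewritten as non-capturing. The sample text is a shortened, hypothetical stand-in for the file in the question (built inline rather than read from disk), and the variable names are only illustrative. I have not changed the matching logic, only the group syntax.

```javascript
// Hypothetical sample: a header row plus two records,
// one of which has line feeds inside a quoted field.
const csvText = [
  'ID,NAME,TITLE,DESCRIPTION,,',
  'PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING',
  'HARD ON BEING A GOOD',
  'SERVENT."',
  'PRO1235,"RICK SMITH",,,'
].join('\n');

// Pattern from the question: three capturing groups inside the lookahead.
const rxCapturing = /\n(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

// Same pattern with every group made non-capturing via (?:...).
const rxNonCapturing = /\n(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

const noisy = csvText.split(rxCapturing);
const rows = csvText.split(rxNonCapturing);

console.log(noisy.length); // 9: 3 rows plus 3 group values for each of the 2 splits
console.log(rows.length);  // 3
rows.forEach(function (row) {
  console.log("row: " + row);
});
```

With the non-capturing version, split should return one item per record, and the line feeds inside the quoted description stay part of that record. For real-world files a dedicated CSV parser is likely to be more robust than a single regex, but that is a separate decision from the split behavior explained here.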
Source: "javascript - RegExp works with String.match but not with String.split", a question on Stack Overflow: https://stackoverflow.com/questions/26000415/