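Update
To clarify, I am specifically looking for a regular expression that will:
Split on newlines... unless the newline is inside double quotes.
If the newline is inside double quotes, it should:
- ignore the newline inside the double quotes
- not include the outer double quotes in the result
- convert any doubled double quotes ("") inside the outer double quotes into a single quote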
I have a data grid that looks like the one below.
After copying and pasting it, the resulting text is as follows:
Data
Data Data
Data Data Data
Data Data Data"
Data Data "Da
ta"
Data Data "Da
ta"""
Data Data Data""
Data Data """Da
ta"""
Data Data """Da
ta"""
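The resulting text is a bit odd, because a line break inside a cell triggers some strange behaviour:
- the cell's content is wrapped in double quotes
- any existing double quotes inside that cell are converted to doubled double quotes ("").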
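The quoting behaviour described above can be mimicked with a small sketch. `encodeCell` and `encodeGrid` are hypothetical names, and the rule (quote only cells that contain a line break, double any embedded quotes, join cells with tabs and rows with CRLF) is inferred from the sample output, not taken from any spreadsheet documentation.

```javascript
// Hypothetical helper: mimics how a grid cell appears to be serialized
// when copied, based only on the sample output shown in the question.
function encodeCell(cell) {
  // Cells without a line break are emitted unchanged, even if they contain quotes.
  if (!/[\r\n]/.test(cell)) {
    return cell;
  }
  // Cells with a line break are wrapped in quotes and embedded quotes are doubled.
  return '"' + cell.replace(/"/g, '""') + '"';
}

// Hypothetical helper: cells joined by tabs, rows joined by CRLF (an assumption).
function encodeGrid(rows) {
  return rows.map(function (row) {
    return row.map(encodeCell).join('\t');
  }).join('\r\n');
}

// Show the encoded form of a multi-line cell that ends with a quote.
console.log(JSON.stringify(encodeCell('Da\nta"')));
```

Reversing exactly this transformation is what the parser has to do.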
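I would like to be able to paste that text into a textarea and then recreate the original grid as a table in HTML, even with the quirky behaviour mentioned above.
I found and slightly modified this code, which I think is close, but I don't think the regex is quite right, so I also added the regex from this answer as an option (commented out, because it causes an "out of memory" exception):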
function splitOnNewlineExceptInDoubleQuotes(string) {
    // The parentheses in the regex create a capture group inside the quotes
    var myRegexp = /[^\n"]+|"([^"]*)"/gim;
    //var myRegexp = /(\n)(?=(?:[^\"]|\"[^\"]*\")*$)/m;
    var myString = string;
    var myArray = [];
    do {
        // Each call to exec returns the next regex match as an array
        var match = myRegexp.exec(myString);
        if (match != null)
        {
            // Index 1 in the array is the captured group if it exists
            // Index 0 is the matched text, which we use if no captured group exists
            myArray.push(match[1] ? match[1] : match[0]);
        }
    } while (match != null);
    return myArray;
}
So I think this can be done with a regex (rather than a full-blown state machine), but I'm not quite sure how.
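Best answer
Parsing all the data
Here is a regular expression that matches each component of the source, one at a time, into numbered capture groups (1 to 4, in this order):
- Tab delimiter
- End of line / new line
- Quoted data
- Unquoted data
This works on a single line of data or on all lines at once.
It also handles both CRLF (\r\n) and LF (\n) line endings.
Expression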
/(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))/
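To see which capture group fires for each piece of input, a minimal sketch can label every match. The `tokenize` function and the token names are hypothetical and only for illustration. The expression itself is the one above, with the `g` flag added so that `exec` can walk through the string.

```javascript
// Hypothetical helper: labels each match by the capture group that participated.
function tokenize(text) {
  var re = /(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))/g;
  var tokens = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    // Groups that did not participate in the match are undefined.
    if (m[1] !== undefined) tokens.push(['tab', m[1]]);
    else if (m[2] !== undefined) tokens.push(['newline', m[2]]);
    else if (m[3] !== undefined) tokens.push(['quoted', m[3]]);
    else tokens.push(['unquoted', m[4]]);
  }
  return tokens;
}

// One row with a plain cell and a quoted multi-line cell, then a second row.
console.log(tokenize('a\t"b\nc""d"\r\ne'));
```

The quoted token keeps its inner `""` pair. Unescaping is left to the caller.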
Usage example
Here we use the captured groups to decide what to do.
This outputs the array of rows to the console.
var str =
'Data ' + "\r\n" +
'Data Data ' + "\r\n" +
'Data Data Data' + "\r\n" +
'Data Data Data"' + "\r\n" +
'Data Data "Da' + "\r\n" +
'ta"' + "\r\n" +
'Data Data "Da' + "\r\n" +
'ta"""' + "\r\n" +
'Data Data Data""' + "\r\n" +
'Data Data """Da' + "\r\n" +
'ta"""' + "\r\n" +
'Data Data """Da' + "\r\n" +
'' + "\r\n" +
'ta"""';
var myregexp = /(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))/ig;
var match = myregexp.exec(str);
var emptyRow = [];
var row = emptyRow.slice();
var rows = [];
var prevTab = false;
while (match != null) {
    if (match[4]) {
        // Unquoted data
        row.push(match[4]);
        prevTab = false;
    } else if (match[3] !== undefined) {
        // Quoted data (replace escaped double quotes with single).
        // Compared against undefined so that an empty quoted cell ("") is kept.
        row.push(match[3].replace(/""/g, "'"));
        prevTab = false;
    } else if (match[1]) {
        // Tab separator
        if (prevTab) {
            // Two tabs in a row mean empty data
            row.push('');
        }
        prevTab = true;
    } else if (match[2]) {
        // End of the row
        if (prevTab) {
            // Previously had a tab, so include the empty data
            row.push('');
        }
        prevTab = false;
        rows.push(row);
        // Here we are ensuring the new empty row doesn't reference the old one.
        row = emptyRow.slice();
    }
    match = myregexp.exec(str);
}
// Handles a missing new line at the end of the string
if (row.length) {
    if (prevTab) {
        // Previously had a tab, so include the empty data
        row.push('');
    }
    rows.push(row);
}
console.log('rows', rows);
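Since the question's goal is an HTML table, here is a rough sketch of how a rows array like the one above could be turned into markup. `escapeHtml` and `rowsToHtmlTable` are hypothetical helpers, not part of the original answer. Rendering in-cell line breaks as `<br>` is an assumption about the desired display.

```javascript
// Hypothetical helper: escapes characters that are special in HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: turns an array of rows (arrays of strings) into table markup.
// Line breaks inside a cell become <br> so multi-line cells stay visible.
function rowsToHtmlTable(rows) {
  var body = rows.map(function (row) {
    var cells = row.map(function (cell) {
      return '<td>' + escapeHtml(cell).replace(/\r?\n/g, '<br>') + '</td>';
    });
    return '<tr>' + cells.join('') + '</tr>';
  });
  return '<table>' + body.join('') + '</table>';
}

// Two rows: one with a multi-line cell, one with markup characters and an empty cell.
console.log(rowsToHtmlTable([['Data', 'Da\nta'], ['a < b', '']]));
```

The returned string could then be assigned to a container's `innerHTML`. Escaping first keeps pasted text from being interpreted as markup.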
Annotated regular expression
// (?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))
//
// Options: Case insensitive; ^$ don’t match at line breaks
//
// Match the regular expression below «(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))»
// Match this alternative (attempting the next alternative only if this one fails) «(\t)»
// Match the regex below and capture its match into backreference number 1 «(\t)»
// Match the tab character «\t»
// Or match this alternative (attempting the next alternative only if this one fails) «(\r?\n)»
// Match the regex below and capture its match into backreference number 2 «(\r?\n)»
// Match the carriage return character «\r?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the line feed character «\n»
// Or match this alternative (attempting the next alternative only if this one fails) «"((?:[^"]+|"")*)"»
// Match the character “"” literally «"»
// Match the regex below and capture its match into backreference number 3 «((?:[^"]+|"")*)»
// Match the regular expression below «(?:[^"]+|"")*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match this alternative (attempting the next alternative only if this one fails) «[^"]+»
// Match any character that is NOT a “"” «[^"]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Or match this alternative (the entire group fails if this one fails to match) «""»
// Match the character string “""” literally «""»
// Match the character “"” literally «"»
// Or match this alternative (the entire group fails if this one fails to match) «([^\t\r\n]+)»
// Match the regex below and capture its match into backreference number 4 «([^\t\r\n]+)»
// Match any single character NOT present in the list below «[^\t\r\n]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// The tab character «\t»
// The carriage return character «\r»
// The line feed character «\n»
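One possible hardening, offered as a suggestion and not part of the original answer: the quoted alternative nests `[^"]+` inside a `*` group. In backtracking regex engines that shape can backtrack heavily when an opening quote is never closed. Consuming one character (or one `""` pair) per iteration with `(?:[^"]|"")*` accepts the same quoted strings without the nested quantifier. `SAFER_CELL_REGEX` and `firstToken` are hypothetical names.

```javascript
// Variant of the answer's expression (an assumption, not from the original answer):
// the quoted branch consumes one character or one "" pair per iteration.
var SAFER_CELL_REGEX = /(?:(\t)|(\r?\n)|"((?:[^"]|"")*)"|([^\t\r\n]+))/g;

// Hypothetical helper: returns capture groups 1 to 4 of the first match in text.
function firstToken(text) {
  SAFER_CELL_REGEX.lastIndex = 0; // reset state because the regex is global
  var m = SAFER_CELL_REGEX.exec(text);
  return m === null ? null : m.slice(1);
}

// A quoted multi-line cell is still captured by group 3 (index 2 here).
console.log(firstToken('"Da\nta"""'));

// An unterminated quote falls through to the unquoted branch (group 4, index 3 here).
var unterminated = '"' + 'a'.repeat(2000);
console.log(firstToken(unterminated)[3].length);
```

The rest of the parsing loop would stay the same, since the capture group numbering is unchanged.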
About javascript - Split on new lines unless inside double quotes: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57101907/