javascript - Split on newlines unless inside double quotes

Tags: javascript regex

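Update

To clarify, I am specifically looking for a regular expression that will:

Split on newlines... unless the newline is inside double quotes.

If the newline is inside double quotes, it should:

  1. Ignore the newline inside the double quotes
  2. Exclude the outer double quotes from the result
  3. Convert any doubled double quotes ("") inside the outer double quotes to single quotes

I have a data grid that looks like this.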

[Image: screenshot of the data grid]

After copying and pasting it, the resulting text looks like this:

Data        
Data    Data    
Data    Data    Data
Data    Data    Data"
Data    Data    "Da
ta"
Data    Data    "Da
ta"""
Data    Data    Data""
Data    Data    """Da
ta"""
Data    Data    """Da

ta"""

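The resulting text is a little odd, because a line break inside a cell triggers some unusual behavior:

  1. The cell's contents are wrapped in double quotes
  2. Any existing double quotes inside that cell are converted to doubled double quotes ("").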
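
The two rules above can be sketched from the other direction. This is only an illustration inferred from the sample text, not the grid's actual implementation: `encodeCell` is a hypothetical name, and it assumes that only cells containing a line break are quoted.

```javascript
// Hypothetical helper: mimics the quoting seen in the pasted sample for one cell.
// Assumption: only cells that contain a line break are wrapped in quotes.
function encodeCell(value) {
  if (!/[\r\n]/.test(value)) {
    // No line break, so the cell is copied as-is, even if it contains quotes.
    return value;
  }
  // Existing double quotes are doubled, then the whole cell is wrapped in quotes.
  return '"' + value.replace(/"/g, '""') + '"';
}

console.log(encodeCell('Data"'));
console.log(encodeCell('Da\nta'));
console.log(encodeCell('"Da\nta"'));
```

Under that assumption, a cell such as `Data"` stays untouched, while a cell with a line break and surrounding quotes ends up with three quotes on each side, which matches the shape of the pasted text.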

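I would like to be able to paste that text into a textarea and then recreate the original grid as a table in HTML, even with the quirky behavior described above.

I found and slightly modified this code, which I think is close, but I don't think its regular expression is quite right. So I also added the regular expression from this answer as an option (commented out, because it causes an "out of memory" exception):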

function splitOnNewlineExceptInDoubleQuotes(string) {
    // The parentheses in the regex create a capture group for the text inside the quotes
    var myRegexp = /[^\n"]+|"([^"]*)"/gim;
    //var myRegexp = /(\n)(?=(?:[^\"]|\"[^\"]*\")*$)/m;
    var myString = string;
    var myArray = [];

    do {
        //Each call to exec returns the next regex match as an array
        var match = myRegexp.exec(myString);
        if (match != null)
        {
            //Index 1 in the array is the captured group if it exists
            //Index 0 is the matched text, which we use if no captured group exists
            myArray.push(match[1] ? match[1] : match[0]);
        }
    } while (match != null);

    return myArray;
}

So I think this can be done with a regular expression (rather than a full-blown state machine), but I am not sure how.

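Best answer

Parsing all the data

Here is a regular expression that matches each component of the source, one at a time, into numbered capture groups:

  1. Tab separator
  2. End of line / newline
  3. Quoted data
  4. Unquoted data

It works on a single row of data or on all rows at once. It also handles both CRLF (\r\n) and LF (\n) line endings.

Expression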

/(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))/
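
To see which capture group fires for each piece of input, the following sketch labels every match. `classifyTokens` is a hypothetical helper written for illustration, and it assumes an environment with `String.prototype.matchAll` (ES2020). It tests the groups against `undefined` rather than for truthiness, so an empty quoted cell would still be reported.

```javascript
// Same expression as above, with the g flag so matchAll can walk the input.
var tokenRegex = /(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))/g;

// Hypothetical helper: labels each match by the capture group that matched.
function classifyTokens(text) {
  var tokens = [];
  for (var match of text.matchAll(tokenRegex)) {
    if (match[1] !== undefined) {
      tokens.push(['tab', match[1]]);
    } else if (match[2] !== undefined) {
      tokens.push(['newline', match[2]]);
    } else if (match[3] !== undefined) {
      // Group 3 holds the text between the outer quotes, "" still doubled.
      tokens.push(['quoted', match[3]]);
    } else {
      tokens.push(['unquoted', match[4]]);
    }
  }
  return tokens;
}

console.log(classifyTokens('Data\t"Da\nta"""\r\nData"'));
```

In this input the quoted cell is reported as a single token even though it contains a line break, while the trailing `Data"` is treated as unquoted data because it does not start with a quote.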

Usage example

Here we use the capture groups to decide what to do.

This outputs the array of rows to the console.

var str =
  'Data		' + "\r\n" +
  'Data	Data	' + "\r\n" +
  'Data	Data	Data' + "\r\n" +
  'Data	Data	Data"' + "\r\n" +
  'Data	Data	"Da' + "\r\n" +
  'ta"' + "\r\n" +
  'Data	Data	"Da' + "\r\n" +
  'ta"""' + "\r\n" +
  'Data	Data	Data""' + "\r\n" +
  'Data	Data	"""Da' + "\r\n" +
  'ta"""' + "\r\n" +
  'Data	Data	"""Da' + "\r\n" +
  '' + "\r\n" +
  'ta"""';



var myregexp = /(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))/ig;
var match = myregexp.exec(str);
var emptyRow = [];
var row = emptyRow.slice();
var rows = [];
var prevTab = false;
while (match != null) {
  if (match[4]) {
    // Unquoted data
    row.push(match[4]);
    prevTab = false;
  } else if (match[3]) {
    // Quoted data (replace escaped double quotes with single)
    row.push(match[3].replace(/""/g, "'"));
    prevTab = false;
  } else if (match[1]) {
    // Tab separator
    if (prevTab) {
      // Two tabs means empty data
      row.push('');
    }
    prevTab = true;
  } else if (match[2]) {
    // End of the row
    if (prevTab) {
      // Previously had a tab, so include the empty data
      row.push('');
    }
    prevTab = false;
    rows.push(row);
    
    // Here we are ensuring the new empty row doesn't reference the old one.
    row = emptyRow.slice();
  }
  match = myregexp.exec(str);
}

// Handles missing new line at end of string
if (row.length) {
  if (prevTab) {
    // Previously had a tab, so include the empty data
    row.push('');
  }
  rows.push(row);
}

console.log('rows', rows);
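
Since the question asks for an HTML table, a possible next step is sketched below. It is not part of the original answer: `escapeHtml` and `rowsToHtmlTable` are hypothetical helpers, and `sampleRows` stands in for the `rows` array built above so that the snippet runs on its own.

```javascript
// Hypothetical helper: escapes characters that are special in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: turns an array of rows into table markup.
function rowsToHtmlTable(rows) {
  var body = rows.map(function (row) {
    var cells = row.map(function (cell) {
      // Line breaks inside a cell become <br> so they remain visible.
      return '<td>' + escapeHtml(cell).replace(/\r?\n/g, '<br>') + '</td>';
    });
    return '<tr>' + cells.join('') + '</tr>';
  });
  return '<table>' + body.join('') + '</table>';
}

// Stand-in for the rows array produced by the parsing loop.
var sampleRows = [['Data', '', ''], ['Data', 'Data', 'Da\r\nta']];
console.log(rowsToHtmlTable(sampleRows));
```

Building the markup as a string keeps the sketch independent of the DOM; in a browser the result could be assigned to a container's `innerHTML`, which is why the cell text is escaped first.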

Annotated regex

// (?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))
// 
// Options: Case insensitive; ^$ don’t match at line breaks
// 
// Match the regular expression below «(?:(\t)|(\r?\n)|"((?:[^"]+|"")*)"|([^\t\r\n]+))»
//    Match this alternative (attempting the next alternative only if this one fails) «(\t)»
//       Match the regex below and capture its match into backreference number 1 «(\t)»
//          Match the tab character «\t»
//    Or match this alternative (attempting the next alternative only if this one fails) «(\r?\n)»
//       Match the regex below and capture its match into backreference number 2 «(\r?\n)»
//          Match the carriage return character «\r?»
//             Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
//          Match the line feed character «\n»
//    Or match this alternative (attempting the next alternative only if this one fails) «"((?:[^"]+|"")*)"»
//       Match the character “"” literally «"»
//       Match the regex below and capture its match into backreference number 3 «((?:[^"]+|"")*)»
//          Match the regular expression below «(?:[^"]+|"")*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match this alternative (attempting the next alternative only if this one fails) «[^"]+»
//                Match any character that is NOT a “"” «[^"]+»
//                   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//             Or match this alternative (the entire group fails if this one fails to match) «""»
//                Match the character string “""” literally «""»
//       Match the character “"” literally «"»
//    Or match this alternative (the entire group fails if this one fails to match) «([^\t\r\n]+)»
//       Match the regex below and capture its match into backreference number 4 «([^\t\r\n]+)»
//          Match any single character NOT present in the list below «[^\t\r\n]+»
//             Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//             The tab character «\t»
//             The carriage return character «\r»
//             The line feed character «\n»
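
As a readability variant of the annotated expression, the same four alternatives can be written with named capture groups and wrapped in a function. This is a sketch, not the answer's code: `parseGrid` and the group names `tab`, `eol`, `quoted` and `plain` are made up for illustration, and it assumes an engine that supports named groups (ES2018) and `matchAll` (ES2020). It keeps the answer's choice of replacing `""` with `'`.

```javascript
// Same alternatives as the annotated pattern, using named groups.
var namedRegex = /(?<tab>\t)|(?<eol>\r?\n)|"(?<quoted>(?:[^"]+|"")*)"|(?<plain>[^\t\r\n]+)/g;

// Hypothetical helper: parses pasted grid text into an array of rows.
function parseGrid(text) {
  var rows = [];
  var row = [];
  var prevTab = false;
  for (var match of text.matchAll(namedRegex)) {
    var g = match.groups;
    if (g.plain !== undefined) {
      row.push(g.plain);
      prevTab = false;
    } else if (g.quoted !== undefined) {
      // As in the answer above, doubled quotes become a single quote character.
      row.push(g.quoted.replace(/""/g, "'"));
      prevTab = false;
    } else if (g.tab !== undefined) {
      // Two tabs in a row mean an empty cell between them.
      if (prevTab) row.push('');
      prevTab = true;
    } else {
      // End of line: a trailing tab means one more empty cell.
      if (prevTab) row.push('');
      rows.push(row);
      row = [];
      prevTab = false;
    }
  }
  // Handles a missing newline at the end of the input.
  if (row.length) {
    if (prevTab) row.push('');
    rows.push(row);
  }
  return rows;
}

console.log(parseGrid('Data\t\t\r\nData\tData\t"Da\r\nta"""'));
```

Comparing each group with `undefined` instead of relying on truthiness means a quoted cell that is empty still produces an empty string in the row.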

Regarding "javascript - Split on newlines unless inside double quotes", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57101907/
