unicode - Matching Unicode punctuation with LPeg

I am trying to create an LPeg pattern that matches any Unicode punctuation character in UTF-8 encoded input. I came up with the following combination of Selene Unicode and LPeg:

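local unicode     = require("unicode")
local lpeg        = require("lpeg")
local any         = lpeg.P(1) -- any single byte
-- Grab 1 to 4 bytes and let Selene Unicode decide whether they start with punctuation.
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i+#match
  end
end)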

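This appears to work, but I see three problems. It will miss punctuation characters that are a combination of several Unicode code points (if such characters exist), because I only read 4 bytes ahead. It probably kills the performance of the parser. And it is undefined what the library's match function will do when I feed it a string that contains a truncated UTF-8 character (although it appears to work for now).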

I would like to know whether this is a correct approach, or whether there is a better way to achieve what I am trying to do.

Best answer

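The correct way to match a UTF-8 character is shown in one of the examples on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes belong to it: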

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
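
To see what this pattern does on its own, here is a minimal sketch that splits a string into its UTF-8 characters. The names utf8char, split and parts are mine (I renamed the pattern so it does not shadow the utf8 library in Lua 5.3+), and the test string is written with byte escapes so the example does not depend on the source file's encoding. It only assumes LPeg is installed.

```lua
local lpeg = require("lpeg")

local cont = lpeg.R("\128\191") -- continuation byte

-- Same pattern as above, under a different (illustrative) name.
local utf8char = lpeg.R("\0\127")
               + lpeg.R("\194\223") * cont
               + lpeg.R("\224\239") * cont * cont
               + lpeg.R("\240\244") * cont * cont * cont

-- Capture each character into a table; "* -1" demands that the whole
-- subject is consumed, so a truncated sequence makes the match fail.
local split = lpeg.Ct(lpeg.C(utf8char)^0) * -1

-- "a", U+00BF (2 bytes) and U+2026 (3 bytes): 6 bytes, 3 characters.
local parts = split:match("a\194\191\226\128\166")
print(#parts) -- 3

-- A lead byte with only one of its two continuation bytes does not match.
print(split:match("\226\128")) -- nil
```

Each capture is the complete byte sequence of one character, which is exactly what gets handed to the function in a match-time capture.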

Building on this utf8 pattern, we can use lpeg.Cmt and the Selene Unicode match function, much like you proposed:

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if unicode.utf8.match(c, "%p") then
        return i
    end
end)

Note that we return i, in accordance with what Cmt expects:

The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.

This means we should return the same number the function receives, that is, the position immediately after the UTF-8 character.
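
If you want to check the position handling without Selene Unicode installed, the following sketch may help. It replaces unicode.utf8.match(c, "%p") with a tiny hand-written lookup table (is_punct, a hypothetical stand-in that covers only a handful of characters), so it is an illustration of the Cmt mechanics and not a real punctuation test. The names utf8char, all_punct and found are also mine.

```lua
local lpeg = require("lpeg")

local cont = lpeg.R("\128\191") -- continuation byte
local utf8char = lpeg.R("\0\127")
               + lpeg.R("\194\223") * cont
               + lpeg.R("\224\239") * cont * cont
               + lpeg.R("\240\244") * cont * cont * cont

-- Hypothetical stand-in for unicode.utf8.match(c, "%p"): a few characters only.
local is_punct = {
  ["!"] = true, [","] = true, ["."] = true,
  ["\194\191"]     = true, -- U+00BF inverted question mark
  ["\226\128\166"] = true, -- U+2026 horizontal ellipsis
}

local punctuation = lpeg.Cmt(lpeg.C(utf8char), function(s, i, c)
  if is_punct[c] then
    return i -- i already points just past the matched character
  end
  -- falling through returns no value, which makes the match fail
end)

-- With no captures left over, match returns the position after the match.
print(punctuation:match("\226\128\166x")) -- 4 (3-byte character, then "x")
print(punctuation:match("x"))             -- nil

-- Collect every punctuation character, skipping over everything else.
local all_punct = lpeg.Ct((lpeg.C(punctuation) + utf8char)^0)
local found = all_punct:match("Hi, \194\191que\226\128\166")
print(#found) -- 3
```

Returning i + #match, as in the question, would move the position past the character a second time. The LPeg documentation also requires the returned number to lie in the range [i, len(s) + 1], so that version can go out of range near the end of the subject.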

Regarding unicode - Matching Unicode punctuation with LPeg, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39006753/
