java - Detecting Unicode text ligatures in Clojure/Java

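A ligature is a Unicode character that is represented by more than one code point. For example, in Devanagari त्र is a ligature made up of the code points त + ् + र.

When viewed in a simple text editor such as Notepad, त्र is displayed as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is displayed as a proper ligature.

So my question is: how can I detect such ligatures programmatically while reading the file from my code? Since Firefox does it, there must be a way to do it programmatically. Is there a Unicode property that carries this information, or do I need to keep a map of all such ligatures?

The SVG CSS property text-rendering, when set to optimizeLegibility, does the same thing (it combines code points into the proper ligature).

PS: I am using Java.

EDIT

The purpose of my code is to count the characters in Unicode text, treating a ligature as a single character. So I need a way to collapse multiple code points into a single ligature.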

Best answer

The Wikipedia page on Computer Typesetting says:

The Computer Modern Roman typeface provided with TeX includes the five common ligatures ff, fi, fl, ffi, and ffl. When TeX finds these combinations in a text it substitutes the appropriate ligature, unless overridden by the typesetter.

This suggests that it is the editor (the typesetting software) that performs the substitution. Furthermore,

Unicode maintains that ligaturing is a presentation issue rather than a character definition issue, and that, for example, "if a modern font is asked to display 'h' followed by 'r', and the font has an 'hr' ligature in it, it can display the ligature."

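As far as I know (I have some interest in this topic and have just read a few articles on it), the instructions for ligature substitution are embedded in the font. I dug a little deeper and found these for you: GSUB - The Glyph Substitution Table and the Ligature Substitution Subtable, both from the OpenType file format specification.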
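To make the idea of font-embedded substitution rules more concrete, here is a toy sketch of how a ligature lookup could be applied to a run of glyphs. It does not parse a real font: the class `ToyLigatureTable`, the `Rule` type, and the use of strings in place of numeric glyph IDs are all made up for illustration, and the rule set is just the five TeX ligatures quoted above. The only thing it borrows from the OpenType model is the general shape: rules are grouped by their first glyph and tried in order of preference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: real fonts store numeric glyph IDs in binary GSUB tables.
final class ToyLigatureTable {

    // Hypothetical rule: the glyphs that must follow the first glyph,
    // and the ligature glyph that replaces the whole sequence.
    static final class Rule {
        final List<String> rest;
        final String ligature;

        Rule(List<String> rest, String ligature) {
            this.rest = rest;
            this.ligature = ligature;
        }
    }

    // Rules keyed by first glyph, listed in order of preference
    // (longer sequences first, so "ffi" wins over "ff").
    static final Map<String, List<Rule>> RULES = Map.of(
            "f", List.of(
                    new Rule(List.of("f", "i"), "ffi"),
                    new Rule(List.of("f", "l"), "ffl"),
                    new Rule(List.of("f"), "ff"),
                    new Rule(List.of("i"), "fi"),
                    new Rule(List.of("l"), "fl")));

    // Walks the glyph run left to right and applies the first matching rule.
    public static List<String> substitute(List<String> glyphs) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < glyphs.size()) {
            String first = glyphs.get(i);
            Rule matched = null;
            for (Rule rule : RULES.getOrDefault(first, List.of())) {
                int end = i + 1 + rule.rest.size();
                if (end <= glyphs.size()
                        && glyphs.subList(i + 1, end).equals(rule.rest)) {
                    matched = rule;
                    break;
                }
            }
            if (matched != null) {
                out.add(matched.ligature);      // one glyph replaces the sequence
                i += 1 + matched.rest.size();
            } else {
                out.add(first);                 // no rule applies, keep the glyph
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "office" as six glyphs becomes four: o, ffi, c, e
        System.out.println(substitute(List.of("o", "f", "f", "i", "c", "e")));
    }
}
```

The point of the sketch is that the text itself is unchanged; only the glyph run produced for display gets shorter, and only if the font in use happens to carry a matching rule.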

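Next, you need to find a library that lets you look inside OpenType font files, i.e. a file parser that gives you quick access to these tables. Reading the following two discussions may give you some pointers on how these substitutions are carried out:

Chromium bug http://code.google.com/p/chromium/issues/detail?id=22240
Firefox bug https://bugs.launchpad.net/firefox/+bug/37828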
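For the counting goal stated in the question's edit, a font-independent approximation may be enough. The sketch below is a heuristic, not what Firefox or any shaping engine does: it assumes that a Devanagari virama (U+094D) followed by a letter joins that letter to the preceding unit, and that combining marks belong to the unit before them. The class name `ConjunctCounter` and the method `countUnits` are invented for this example. It ignores ZWJ/ZWNJ and the virama characters of other scripts, and whether a given conjunct is really drawn as one glyph still depends on the font.

```java
// Heuristic sketch: counts "units" in text, folding consonant + virama +
// consonant sequences and trailing combining marks into a single unit.
final class ConjunctCounter {

    // U+094D DEVANAGARI SIGN VIRAMA (assumption: only this script is handled)
    private static final int VIRAMA = 0x094D;

    public static int countUnits(String text) {
        int units = 0;
        boolean afterVirama = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            if (cp == VIRAMA) {
                // The virama stays with the current unit and may join the next letter.
                afterVirama = units > 0;
                continue;
            }

            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (isMark) {
                // Vowel signs and other marks attach to the preceding unit.
                afterVirama = false;
                continue;
            }

            // A letter right after a virama continues the current unit;
            // anything else starts a new one.
            if (!(afterVirama && Character.isLetter(cp))) {
                units++;
            }
            afterVirama = false;
        }
        return units;
    }

    public static void main(String[] args) {
        String tra = "\u0924\u094D\u0930"; // त + ् + र
        System.out.println(tra.codePointCount(0, tra.length())); // 3 code points
        System.out.println(countUnits(tra));                     // 1 unit
    }
}
```

This answers "how many user-visible units are there, roughly" without touching any font, which is why it can disagree with what a particular renderer shows.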

Source: the original question, "Detecting Unicode text ligatures in Clojure/Java", on Stack Overflow: https://stackoverflow.com/questions/3466565/
