In fact, I can't get this to work in English either. I am looking for an expression that combines two regular expressions:
const string = 'Joe shouted, \"The pitchforks are coming!\"';
const zeroWidthSplitter = /(?=e)/g; // splits before every "e"
string.split(zeroWidthSplitter);
// ["Jo", "e shout", "ed, \"Th", "e pitchforks ar", "e coming!\""]
const wordRegex = /[a-zA-Z]+/g; // matches runs of English letters,
// discarding spaces and punctuation
string.match(wordRegex);
// ["Joe", "shouted", "The", "pitchforks", "are", "coming"]
What I want is a zeroWidthWordDelimiter that behaves like splitting on word boundaries, keeping spaces and punctuation separate from the words:
string.split(/(?:\b)/gm);
//the string is split strictly with words and non-words
0: "Joe"
1: " "
2: "shouted"
3: ", \""
4: "The"
5: " "
6: "pitchforks"
7: " "
8: "are"
9: " "
10: "coming"
11: "!\""
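The same result can be obtained without `\b` by spelling the boundary out with lookarounds. The following is a minimal sketch, assuming a "word" is simply a run of `[a-zA-Z]`; the names `sample`, `letterBoundary`, and `pieces` are illustrative and not from the question. It relies on lookbehind assertions, which need an ES2018-capable engine.

```javascript
// Sketch: emulate \b with an explicit letter class.
// Assumption: a "word" is a run of [a-zA-Z]; all names here are illustrative.
const sample = 'Joe shouted, "The pitchforks are coming!"';

// Zero-width match at the position where a letter run ends,
// or at the position where a letter run begins.
const letterBoundary = /(?<=[a-zA-Z])(?![a-zA-Z])|(?<![a-zA-Z])(?=[a-zA-Z])/g;

const pieces = sample.split(letterBoundary);
console.log(pieces);
// ["Joe", " ", "shouted", ", \"", "The", " ", "pitchforks", " ", "are", " ", "coming", "!\""]
```

Because the letter class is written out explicitly, it can be swapped for any other alphabet, which `\b` does not allow.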
But I want to split a string of foreign-language (Bengali) characters, and word boundaries do not recognize those characters.
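The reason is that JavaScript defines `\b` in terms of `\w`, and `\w` covers only `[A-Za-z0-9_]`, even with the `u` flag. A short check, using a fragment of the Bengali sentence from this question (the variable names are illustrative):

```javascript
// \b depends on \w, which is ASCII-only, so Bengali text has no word boundaries.
const bengaliSample = 'আমি যেমন টা করবো';

console.log(/\w/u.test(bengaliSample)); // false: no character counts as a "word" character

const byWordBoundary = bengaliSample.split(/\b/u);
console.log(byWordBoundary.length); // 1: the string comes back in a single piece
```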
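I can successfully group the words, and successfully group the gaps, by putting all of the Bengali letters into a [character class]+.
Wiktor's suggestion was a big improvement: replace the bracketed character class with a non-capturing group of alternatives, (?:a|b|c|d). This groups the words successfully but loses the punctuation.
The same is true of Peter's more flexible regex, /[^\p{Script=Bengali}]+/u, which uses Unicode property escapes.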
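One possible way to keep the punctuation with a gap pattern like Peter's is to wrap it in a capturing group, because `String.prototype.split` includes captured text in its result. This is a sketch under that assumption, not something proposed in the thread; `gapsCaptured` and `tokens` are illustrative names, and the sample is a shortened variation of the sentence from this question.

```javascript
// Sketch: split on runs of non-Bengali characters but keep them in the output.
// Assumption: every character with Script=Bengali belongs to a word.
const sample = 'আমি যেমন, টা ।';

// The parentheses make split() return the separators as well.
const gapsCaptured = /([^\p{Script=Bengali}]+)/u;

// A separator at the very start or end yields an empty string, so drop those.
const tokens = sample.split(gapsCaptured).filter((t) => t !== '');
console.log(tokens);
console.log(tokens.join('') === sample); // true: nothing was lost
```

The danda (।) has the Unicode script value Common rather than Bengali, so it ends up in a gap token and not in a word.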
BengaliString = 'হঠাৎ একটা মেয়ে বাকি দু’জন কে কানে কানে বললো,“আমি যেমন টা করবো তোরা সেরকম আমার সাথে থাকবি ।”' ;
const BengaliRegex = /[ড়ঢ়ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফববভমমযরলশষসহািীুূৃৄেৈোৌ্ৎড়ঢ়য়]+/gm ; //groups words
const BengaliGapsRegex = /[^ড়ঢ়ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফববভমমযরলশষসহািীুূৃৄেৈোৌ্ৎড়ঢ়য়]+/gm ; //groups gaps
const BengaliDelimiter = /(?=[^ড়ঢ়ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফববভমমযরলশষসহািীুূৃৄেৈোৌ্ৎড়ঢ়য়]+)/gm ;
//zero width but breaks apart many words
const BengaliRegexWiktor = /(?:ড়|ঢ|়|ঁ|ং|ঃ|অ|আ|ই|ঈ|উ|ঊ|ঋ|ঌ|এ|ঐ|ও|ঔ|ক|খ|গ|ঘ|ঙ|চ|ছ|জ|ঝ|ঞ|ট|ঠ|ড|ঢ|ণ|ত|থ|দ|ধ|ন|প|ফ|ব|ব|ভ|ম|ম|য|র|ল|শ|ষ|স|হ|া|ি|ী|ু|ূ|ৃ|ৄ|ে|ৈ|ো|ৌ|্|ৎ|ড়|ঢ়|য়)+/mg
//groups words perfectly
const BengaliSplitterWiktor = /(?=(?:ড়|ঢ|়|ঁ|ং|ঃ|অ|আ|ই|ঈ|উ|ঊ|ঋ|ঌ|এ|ঐ|ও|ঔ|ক|খ|গ|ঘ|ঙ|চ|ছ|জ|ঝ|ঞ|ট|ঠ|ড|ঢ|ণ|ত|থ|দ|ধ|ন|প|ফ|ব|ব|ভ|ম|ম|য|র|ল|শ|ষ|স|হ|া|ি|ী|ু|ূ|ৃ|ৄ|ে|ৈ|ো|ৌ|্|ৎ|ড়|ঢ়|য়)+)/gm ;
//doesn't group multiple letters using +
const BengaliRegexPeter = /[^\p{Script=Bengali}]+/u ;
//beautiful! but doesn't keep punctuation and spacing
console.log("Bengali Gaps Regex: " + BengaliString.split(BengaliGapsRegex));
console.log("Regex Delimiter: " + BengaliString.split(BengaliDelimiter));
console.log("Bengali Regex Wiktor: " + BengaliString.match(BengaliRegexWiktor));
console.log("Bengali Splitter Wiktor: "+ BengaliString.split(BengaliSplitterWiktor));
console.log("Regex Peter: " + BengaliString.split(BengaliRegexPeter));
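The lookahead-only splitters above fall apart for a reason that is easier to see on a small ASCII string. A lookahead consumes nothing, so the `+` inside it does not change where it matches: it succeeds at every position that is followed by at least one letter. The sketch below uses illustrative names and `[a-z]` as a stand-in for the Bengali letter class.

```javascript
// Sketch of the pitfall, with [a-z] standing in for the Bengali letters.
const text = 'ab, cd';

// Lookahead only: matches before EVERY letter, so words are cut into letters.
const lookaheadOnly = /(?=[a-z]+)/g;
console.log(text.split(lookaheadOnly)); // ["a", "b, ", "c", "d"]

// Adding a negative lookbehind restricts the match to the START of a letter run.
const runStart = /(?<![a-z])(?=[a-z])/g;
console.log(text.split(runStart)); // ["ab, ", "cd"]
```

A delimiter that should also cut at the end of each word needs the mirror-image condition as a second alternative.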
Best answer
Use this regex pattern to detect Bengali letters:
/[\u0985-\u0994\u0995-\u09a7-\u09a8-\u09ce\u0981\u0982\u0983\u09e6-\u09ef-]/g
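The answer gives only the letter class. One way to turn it into the zero-width delimiter the question asks for is to place it inside lookarounds on both sides. This is a sketch of that idea rather than the answer's own method: `bengaliLetter`, `bengaliBoundary`, and `sentence` are illustrative names, and lookbehind support (ES2018) is assumed. Note that the class as written also matches a literal hyphen, and it does not cover the precomposed letters U+09DC, U+09DD, and U+09DF.

```javascript
// Letter class copied verbatim from the answer (it also matches a literal "-").
const bengaliLetter = /[\u0985-\u0994\u0995-\u09a7-\u09a8-\u09ce\u0981\u0982\u0983\u09e6-\u09ef-]/;
const L = bengaliLetter.source;

// Zero-width boundary: the end of a letter run, or the start of one.
const bengaliBoundary = new RegExp(`(?<=${L})(?!${L})|(?<!${L})(?=${L})`, 'g');

// Sample built from words in the question's sentence.
const sentence = 'আমি যেমন টা করবো, বললো ।';
const parts = sentence.split(bengaliBoundary);
console.log(parts);
console.log(parts.join('') === sentence); // true: words and gaps alternate, nothing is dropped
```

Replacing the class with `\p{Script=Bengali}` and adding the `u` flag would be a shorter alternative where Unicode property escapes are available.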
Regarding "javascript - Splitting a (Bengali) string by regex without a delimiter, zero-width lookahead not working", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68800391/