regex - Golang regular expressions: Getting index position of variable

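I have a regular expression that contains a variable (a named capture group), (?P<next_tok>). How do I get the index at which that variable matched?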

Here is the full regular expression: \S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))

I want to get both the matched text and the index of any variable in a regex match. Is this possible in Go?

Edit: I could not work out how to get next_tok by name, but I was able to get all the submatches with FindAllStringSubmatchIndex.

http://play.golang.org/p/SEaCLVKisr

Best answer

You can use FindAllStringSubmatchIndex:

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
)

func main() {
    text := "Here... are some initials E.R.B. and also an etc. in the middle.\nPeriods that form part of an abbreviation but are taken to be end-of-sentence markers\nor vice versa do not only introduce errors in the determination of sentence boundaries.\nSegmentation errors propagate into further components which rely on accurate\nsentence segmentation and subsequent analyses are most likely affected negatively.\nWalker et al. (2001), for example, stress the importance of correct sentence boundary\ndisambiguation for machine translation and Kiss and Strunk (2002b) show that errors\nin sentence boundary detection lead to a higher error rate in part-of-speech tagging.\nIn this paper, we present an approach to sentence boundary detection that builds\non language-independent methods and determines sentence boundaries with high accuracy.\nIt does not make use of additional annotations, part-of-speech tagging, or precompiled\nlists to support sentence boundary detection but extracts all necessary data\nfrom the corpus to be segmented. Also, it does not use orthographic information as primary\nevidence and is thus suited to process single-case text. It focuses on robustness\nand flexibility in that it can be applied with good results to a variety of languages without\nany further adjustments. At the same time, the modular structure of the proposed\nsystem makes it possible in principle to integrate language-specific methods and clues\nto further improve its accuracy. The basic algorithm has been determined experimentally\non the basis of an unannotated development corpus of English. We have applied\nthe resulting system to further corpora of English text as well as to corpora from ten\nother languages: Brazilian Portuguese, Dutch, Estonian, French, German, Italian, Norwegian,\nSpanish, Swedish, and Turkish. Without further additions or amendments to\nthe system produced through experimentation on the development corpus, the mean\naccuracy of sentence boundary detection on newspaper corpora in eleven languages is\n98.74 %."

    var periodContextFmt string = `\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`
    sent := regexp.MustCompile(periodContextFmt)
    matches := sent.FindAllStringSubmatchIndex(text, -1)

    for _, match := range matches {
        // match holds byte offsets: [0:2] whole match, [2:4] after_tok, [4:6] next_tok.
        fmt.Println("context: ", text[match[0]:match[1]])
        if match[4] >= 0 { // -1 when next_tok did not participate in the match
            // Slice with byte offsets; report positions as rune (character) indices.
            fmt.Println("next_tok: ", text[match[4]:match[5]])
            fmt.Println("start: ", utf8.RuneCountInString(text[:match[4]]))
            fmt.Println("end: ", utf8.RuneCountInString(text[:match[5]]))
        }
        fmt.Println("------")
    }
}

See the Go demo.
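The question also asks how to reach next_tok by name rather than by position. One possible sketch, assuming Go 1.15 or later (where Regexp.SubexpIndex is available), is shown here. The helper namedGroupSpan, the variable periodContext, and the short sample text are made up for illustration and are not part of the original answer. On older Go versions, looping over SubexpNames() to find the index should work the same way.

```go
package main

import (
	"fmt"
	"regexp"
)

// periodContext is the pattern from the question, unchanged.
var periodContext = regexp.MustCompile(`\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`)

// namedGroupSpan (hypothetical helper) returns the byte offsets [start, end)
// of the named group inside one match slice produced by
// FindStringSubmatchIndex or FindAllStringSubmatchIndex.
// ok is false if the name is unknown or the group did not participate.
func namedGroupSpan(re *regexp.Regexp, match []int, name string) (start, end int, ok bool) {
	i := re.SubexpIndex(name) // -1 if no group has this name
	if i < 0 || 2*i+1 >= len(match) || match[2*i] < 0 {
		return 0, 0, false
	}
	return match[2*i], match[2*i+1], true
}

func main() {
	text := "Wait. Go on!)" // illustrative input only
	for _, m := range periodContext.FindAllStringSubmatchIndex(text, -1) {
		if s, e, ok := namedGroupSpan(periodContext, m, "next_tok"); ok {
			fmt.Printf("next_tok %q at bytes [%d,%d)\n", text[s:e], s, e)
		} else {
			fmt.Printf("next_tok did not participate in %q\n", text[m[0]:m[1]])
		}
	}
}
```

The bounds and -1 checks matter because next_tok sits in only one branch of the alternation, so some matches leave it unset.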

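Note that the indices returned by FindAllStringSubmatchIndex are byte offsets. The unicode/utf8 import and utf8.RuneCountInString are needed to turn such an offset into a Unicode character (rune) index in a Unicode string; without them, you only have the byte offsets. To slice the string itself, use the byte offsets as they are. See "Identify the correct hashtag indexes in tweet messages".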
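To make the difference between byte offsets and character indices concrete, here is a small sketch on a string with multi-byte characters. The helper runeIndex, the hashtag pattern, and the sample text are assumptions chosen for illustration; they do not come from the original answer.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// runeIndex (hypothetical helper) converts a byte offset into s to the number
// of Unicode code points that precede it. The offset must lie on a rune
// boundary, which holds for offsets the regexp package returns on valid UTF-8.
func runeIndex(s string, byteOff int) int {
	return utf8.RuneCountInString(s[:byteOff])
}

func main() {
	re := regexp.MustCompile(`#(\w+)`)
	// "é" takes 2 bytes and the cup symbol takes 3 bytes in UTF-8.
	text := "caf\u00e9 \u2615 #golang"

	m := re.FindStringSubmatchIndex(text)
	if m == nil {
		return
	}
	start, end := m[2], m[3] // byte offsets of capture group 1

	fmt.Println("tag:", text[start:end]) // slicing uses byte offsets
	fmt.Println("byte start:", start)
	fmt.Println("rune start:", runeIndex(text, start))
}
```

For pure ASCII input the two numbers coincide, which is why the distinction is easy to miss until non-ASCII text shows up.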

This post on "regex - Golang regular expressions: Getting index position of variable" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/31976346/
