regex - Golang regex: Getting index position of variable

Tags: regex go

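I have a regular expression that contains a variable (a named capture group), (?P<next_tok>). How can I get the index at which that variable matched?

This is the full regular expression: \S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))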

Example: http://play.golang.org/p/7CYfK50W2Q

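I want to get both the matched text and the index of any variable within the regex match. Is this possible in Go?

Edit: I don't know how to get next_tok by name, but I was able to get all submatches with FindAllStringSubmatchIndex: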

http://play.golang.org/p/SEaCLVKisr

Best answer

You can use FindAllStringSubmatchIndex:

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
)

func main() {
    text := "Here... are some initials E.R.B. and also an etc. in the middle.\nPeriods that form part of an abbreviation but are taken to be end-of-sentence markers\nor vice versa do not only introduce errors in the determination of sentence boundaries.\nSegmentation errors propagate into further components which rely on accurate\nsentence segmentation and subsequent analyses are most likely affected negatively.\nWalker et al. (2001), for example, stress the importance of correct sentence boundary\ndisambiguation for machine translation and Kiss and Strunk (2002b) show that errors\nin sentence boundary detection lead to a higher error rate in part-of-speech tagging.\nIn this paper, we present an approach to sentence boundary detection that builds\non language-independent methods and determines sentence boundaries with high accuracy.\nIt does not make use of additional annotations, part-of-speech tagging, or precompiled\nlists to support sentence boundary detection but extracts all necessary data\nfrom the corpus to be segmented. Also, it does not use orthographic information as primary\nevidence and is thus suited to process single-case text. It focuses on robustness\nand flexibility in that it can be applied with good results to a variety of languages without\nany further adjustments. At the same time, the modular structure of the proposed\nsystem makes it possible in principle to integrate language-specific methods and clues\nto further improve its accuracy. The basic algorithm has been determined experimentally\non the basis of an unannotated development corpus of English. We have applied\nthe resulting system to further corpora of English text as well as to corpora from ten\nother languages: Brazilian Portuguese, Dutch, Estonian, French, German, Italian, Norwegian,\nSpanish, Swedish, and Turkish. Without further additions or amendments to\nthe system produced through experimentation on the development corpus, the mean\naccuracy of sentence boundary detection on newspaper corpora in eleven languages is\n98.74 %."

    var periodContextFmt string = `\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`
    sent := regexp.MustCompile(periodContextFmt)
    matches := sent.FindAllStringSubmatchIndex(text, -1)

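    for _, match := range matches {
        // match holds byte offsets: [0],[1] whole match, [2],[3] after_tok, [4],[5] next_tok.
        // Go strings are sliced by byte offset, so the raw offsets are used for slicing.
        fmt.Println("context: ", text[match[0]:match[1]])
        // next_tok is -1 when the first alternative of after_tok matched instead.
        if match[4] >= 0 {
            fmt.Println("next_tok: ", text[match[4]:match[5]])
            // Convert the byte offsets of next_tok to rune (character) indexes for reporting.
            fmt.Println("start: ", utf8.RuneCountInString(text[:match[4]]))
            fmt.Println("end: ", utf8.RuneCountInString(text[:match[5]]))
        }
        fmt.Println("------")
    }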
}

See the Go demo.
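The answer's loop reads next_tok from the fixed positions match[4] and match[5]. For the by-name lookup asked about in the question, one possible approach is to find the group's position with SubexpNames and index the match slice from there. The following is a minimal sketch, not part of the original answer; namedSpans and the sample text are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// periodContext is the pattern from the question, unchanged.
var periodContext = regexp.MustCompile(`\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`)

// namedSpans (a hypothetical helper) returns the byte offsets [start, end) of
// the capture group called name for every match in text. Matches in which the
// group did not participate are skipped.
func namedSpans(re *regexp.Regexp, text, name string) [][2]int {
	// Find the group's position once; SubexpNames()[0] is the whole match ("").
	idx := -1
	for i, n := range re.SubexpNames() {
		if n == name {
			idx = i
			break
		}
	}
	if idx < 0 {
		return nil // no such group in the pattern
	}
	var spans [][2]int
	for _, m := range re.FindAllStringSubmatchIndex(text, -1) {
		// Group idx occupies m[2*idx] and m[2*idx+1]; both are -1 if unmatched.
		if m[2*idx] >= 0 {
			spans = append(spans, [2]int{m[2*idx], m[2*idx+1]})
		}
	}
	return spans
}

func main() {
	// "Wait!)" matches through the first alternative, so it has no next_tok.
	text := "Wait!) An etc. in text"
	for _, s := range namedSpans(periodContext, text, "next_tok") {
		fmt.Printf("next_tok=%q bytes[%d:%d]\n", text[s[0]:s[1]], s[0], s[1])
	}
}
```

On Go 1.15 and later, re.SubexpIndex(name) should be able to replace the lookup loop.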

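Note that FindAllStringSubmatchIndex returns byte offsets. The unicode/utf8 import and utf8.RuneCountInString are what convert those offsets into Unicode character (rune) indexes in a Unicode string; without them you only get byte offsets. See Identify the correct hashtag indexes in tweet messages.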
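To make the difference between byte offsets and character indexes concrete, here is a small sketch on a string that contains one non-ASCII character. The pattern is a simplified stand-in for the one in the question, and runeIndex is a hypothetical helper. Slicing a Go string always takes byte offsets, so the rune index is used only for reporting.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// nextTok is a simplified stand-in for the question's pattern (an assumption
// made to keep the example short): punctuation, whitespace, then a token.
var nextTok = regexp.MustCompile(`[.?!]\s+(?P<next_tok>\S+)`)

// runeIndex (a hypothetical helper) converts a byte offset in s into a rune
// index by counting the runes that precede it. It assumes byteOff lies on a
// rune boundary, as regexp offsets do for valid UTF-8 input.
func runeIndex(s string, byteOff int) int {
	return utf8.RuneCountInString(s[:byteOff])
}

func main() {
	text := "Straße etc. ok" // "ß" is one rune but two bytes in UTF-8
	m := nextTok.FindStringSubmatchIndex(text)
	if m == nil {
		return
	}
	start, end := m[2], m[3] // byte offsets of next_tok (group 1 here)

	fmt.Println(text[start:end])                                        // ok
	fmt.Println("bytes:", start, end)                                   // bytes: 13 15
	fmt.Println("runes:", runeIndex(text, start), runeIndex(text, end)) // runes: 12 14
}
```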

Regarding regex - Golang regex: Getting index position of variable, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31976346/