python - How do I get the start and end character indices of a sentence in spaCy?

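I have just started using spaCy. I have a scenario where I need the start and end index of each sentence within the text. If I use doc.sents, I get the sentences. sent.start and sent.end give token indices, but I want character indices.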

for sent in doc.sents:
    print(sent.start,sent.end)     #prints token index

Example:

import spacy

completeText = "Hi, I am using StackOverflow. The community is great."
nlp = spacy.load('en_core_web_sm')
# spaCy v2 API; in spaCy v3 this line would be nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(completeText)
for sent in doc.sents:
    print(sent.start, sent.end)  # prints 0 7 and 7 12, the token indices

The print statement above prints only the token indices, not the character indices. The output I want is 0,29 and 30,54.

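I tried deriving the offsets from the sentence lengths as follows. I added an if statement at the end because the space after the period is not included in the sentence.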

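sents = list(doc.sents)
start = [0] * len(sents)
end = [0] * len(sents)
length = 0  # running character count over the sentences seen so far
for index, sent in enumerate(sents):
    if index != 0:
        start[index] = end[index - 1] + 1

    length += len(str(sent))
    end[index] = length

    # The space after the period is not part of the sentence text,
    # so count it separately when it is the next character.
    if end[index] < len(completeText) and completeText[end[index]] == " ":
        length += 1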

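This works when only a single space follows the period. But on my full text (more than 10,000 lines) I do not get the right answer. Does spaCy leave any characters out of a sentence other than the one mentioned above?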
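To illustrate why running-length bookkeeping drifts on real text, here is a minimal sketch in plain Python with no spaCy involved. The `text` and `sentences` values are hypothetical stand-ins for a document and its `sent.text` strings, and both helper functions are illustrative names, not part of any library. The sketch assumes sentence strings carry no surrounding whitespace, so a double space or a newline between sentences is enough to throw the naive count off.

```python
# Hypothetical input: two spaces after the first sentence, a newline after the second.
text = "Hi, I am using StackOverflow.  The community is great.\nThanks!"
# Stand-ins for the sentence strings, without surrounding whitespace.
sentences = ["Hi, I am using StackOverflow.", "The community is great.", "Thanks!"]


def naive_offsets(sentences):
    """Running-length bookkeeping that assumes exactly one separator character."""
    spans, cursor = [], 0
    for s in sentences:
        spans.append((cursor, cursor + len(s)))
        cursor += len(s) + 1  # assumes a single space follows every sentence
    return spans


def searched_offsets(text, sentences):
    """Locate each sentence in the original text, scanning forward from the previous end."""
    spans, cursor = [], 0
    for s in sentences:
        start = text.index(s, cursor)
        end = start + len(s)
        spans.append((start, end))
        cursor = end
    return spans


naive = naive_offsets(sentences)
found = searched_offsets(text, sentences)
for (ns, ne), (fs, fe) in zip(naive, found):
    # Left: what the naive offsets slice out; right: the offsets found by searching.
    print((ns, ne), repr(text[ns:ne]), "|", (fs, fe), repr(text[fs:fe]))
```

From the second sentence onward the naive slices are shifted by one character, and the error would keep growing with every extra whitespace character. Searching forward in the original string avoids the drift, but it is only a workaround for when no offsets are available.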

Is there a better way?

Best answer

You can simply use start_char and end_char.

for sent in doc.sents:
    print(sent.start_char,sent.end_char)

A sentence in spaCy is a Span, which comes with many useful attributes that are covered in the docs.
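As a self-contained check of the answer, the sketch below uses a blank English pipeline with the rule-based sentencizer instead of `en_core_web_sm`, so that no model download is needed. That substitution is an assumption made for illustration, as is the spaCy v3 form `nlp.add_pipe("sentencizer")`; a trained model may segment other texts differently.

```python
import spacy

# Assumption: spaCy v3, where add_pipe takes the component name as a string.
# A blank pipeline plus the sentencizer stands in for the question's model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "Hi, I am using StackOverflow. The community is great."
doc = nlp(text)

# start_char is inclusive and end_char is exclusive, like Python slice bounds.
offsets = [(sent.start_char, sent.end_char) for sent in doc.sents]
for start, end in offsets:
    # Slicing the original string with the offsets gives back the sentence text.
    print(start, end, repr(text[start:end]))
```

With this input the loop should print 0 29 and 30 53. Because end_char is exclusive, the second sentence ends at 53, the length of the string, rather than the 54 guessed in the question.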

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/64202958/
