python - Extracting specific information from data

Tags: python python-3.x nltk stanford-nlp information-retrieval

How can I convert data in a format such as:

James Smith was born on November 17, 1948

into something like

("James Smith", DOB, "November 17, 1948")

without relying on the positional index of the string?

I have tried the following:

import nltk
from nltk import word_tokenize, pos_tag

new = "James Smith was born on November 17, 1948"
sentences = word_tokenize(new)       # split the sentence into tokens
sentences = pos_tag(sentences)       # tag each token with its part of speech
grammar = "Chunk: {<NNP*><NNP*>}"    # chunk two consecutive proper-noun tags
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentences)
print(result)

How do I go further and get the output in the desired format?

Best answer

You can always use a regular expression. The regex (\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+) will match and return the data, but only for the exact string format shown above.

Here it is in action: https://regex101.com/r/W2ykKS/1

The regex in Python:

import re

regex = r"(\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)"
test_str = "James Smith was born on November 17, 1948"

matches = re.search(regex, test_str)

# group(0) is the entire match; the captured pieces start at group(1)

print(matches.group(1)) # James
print(matches.group(2)) # Smith
print(matches.group(3)) # November
print(matches.group(4)) # 17
print(matches.group(5)) # 1948
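
To get from the individual groups to the tuple asked for in the question, one option is to use named groups and assemble the result in a small helper. This is only a sketch under a few assumptions: the pattern name DOB_PATTERN and the function extract_dob are made up for illustration, the label DOB is returned as the string "DOB", and the sentence is assumed to follow the "<name> was born on <Month> <day>, <year>" shape. Other phrasings will not match.

```python
import re

# Hypothetical pattern (not from the original answer): a name of one or more
# words, the literal phrase "was born on", then a "Month day, year" date.
DOB_PATTERN = re.compile(
    r"(?P<name>\S+(?:\s\S+)*?)\s+was born on\s+"
    r"(?P<date>\S+\s\d{1,2},\s\d{4})"
)


def extract_dob(sentence):
    """Return (name, "DOB", date), or None if the sentence does not match."""
    match = DOB_PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("name"), "DOB", match.group("date"))


print(extract_dob("James Smith was born on November 17, 1948"))
# ('James Smith', 'DOB', 'November 17, 1948')
```

Returning None for a non-matching sentence avoids the AttributeError that calling .group() on a failed re.search would raise.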

Source: this question, "python - Extracting specific information from data", was originally asked on Stack Overflow: https://stackoverflow.com/questions/39928277/
