python - How to extract names from a string using NLTK

Tags: python nlp nltk stanford-nlp

I am trying to extract a person's name (an Indian name) from an unstructured string.

Here is my code:

text = "Balaji Chandrasekaran Bangalore |  Senior Business Analyst/ Lead Business Analyst An accomplished Senior Business Analyst with a track record of handling complex projects in given period of time, exceeding above the expectation. Successful at developing product road maps and leading cross-functional software teams from prototype to release. Professional Competencies Systems Development Life Cycle (SDLC) Agile methodologies Business process improvement Requirements gathering & Analysis Project Management UML Specification UI & UX (Wireframe Designing) Functional Specification Test Scenario Creation SharePoint Admin Work History Senior Business Analyst (Aug 2012 Current) YouBox Technology pvt ltd, Chennai Translating business goals, feature concepts and customer needs into prioritized product requirements and use cases. Expertized in designing innovative wireframes combining user experience analysis and technology models. Extensive Experience in implementing soft wares for Shipping/Logistics firms to handle CRM, Finance, Logistics, Operations, Intermodal, and documentation. Strong interpersonal skills, highly adept at diplomatically facilitating discussions and negotiations with stakeholders. Education Bachelor of Engineering: Electronics & Communication, 2011 CES Tech Hosur Accomplishment Successful onsite implementation at various locations around the globe for Europe Shipping Company. - (Pre Study, General Design, and Functional Specification) Organized Business Analyst Forum and conducted various activities to develop skill sets of Business Analysts."
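import nltk

# Requires the NLTK tokenizer and POS-tagger models (installed via nltk.download).
pronouns = []  # despite the name, this collects proper nouns (NNP tokens)
if text != "":
    # Chunk every single proper noun (NNP) as a PERSON candidate
    grammar = """PERSON: {<NNP>}"""
    chunkParser = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunkParser.parse(tagged)

    for subtree in tree.subtrees():
        if subtree.label() == "PERSON":
            pronouns.append(' '.join([c[0] for c in subtree]))

    print(pronouns)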

['Balaji', 'Chandrasekaran', 'Bangalore', '|','Senior', 'Business', 'Analys', '/', 'Lead', 'Business', 'Analyst', 'Senior', 'Business', 'Analyst', 'Successful', 'Development', 'Life', 'Cycle', 'SDLC', 'Agile', 'Business', 'Requirements', 'Analysis', 'Project', 'Management', 'UML', 'Specification', 'UI', 'UX', 'Wireframe', 'Designing', 'Functional', 'Specification', 'Test', 'Scenario', 'Creation', 'SharePoint', 'Admin', 'Work', 'History', 'Senior', 'Business', 'Analyst', 'Aug', 'Current', 'Technology', 'Chennai', 'Translating', 'CRM', 'Finance', 'Logistics', 'Operations', 'Intermodal', 'Education', 'Bachelor', 'Engineering', 'Electronics', 'Communication', 'Accomplishment', 'Successful', 'Mediterranean', 'Ship', 'Company', 'MSC', 'Georgia', 'MSC', 'Cambodia', 'MSC', 'MSC', 'South', 'Successful', 'Stake', 'MSC', 'Geneva', 'Switzerland', 'Pre', 'Study', 'General', 'Design', 'Functional', 'Specification', 'O', 'Business', 'Analyst', 'Forum', 'Business']

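But I actually only need to get Balaji Chandrasekaran. I even tried the Stanford NER library, but it fails to pick out Balaji Chandrasekaran.

Can anyone help me extract the name from an unstructured string, or recommend a good tutorial on how to do this?

Thanks in advance.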

Best answer

As I said in the comments, you have to create your own corpus of Indian names and test your text against it. The NLTK Book shows how to do this in Chapter 2 (section 1.9, to be exact).

import nltk
from nltk.corpus import PlaintextCorpusReader

# You can use a regular expression to find the files, or pass a list of files
files = r".*\.txt"

# "/path/" is a placeholder for the directory that holds the corpus files
new_corpus = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(new_corpus.words())

See also: Creating a new corpus with NLTK
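
The answer stops at loading the corpus, so here is a minimal sketch of what "testing your text against it" might look like. It is an illustration, not the answerer's code: the helper find_known_names, the short indian_names list, and the plain regex tokenizer are all assumptions made to keep the example self-contained. With the corpus above, the name list could instead be built from new_corpus.words(), and nltk.word_tokenize could replace the regex.

```python
import re


def find_known_names(text, known_names):
    """Return maximal runs of consecutive tokens that all appear in known_names.

    Hypothetical helper: matching is case-insensitive and works on
    alphabetic tokens only, so punctuation between words is ignored.
    """
    known = {name.lower() for name in known_names}
    tokens = re.findall(r"[A-Za-z]+", text)

    matches, run = [], []
    for token in tokens:
        if token.lower() in known:
            run.append(token)          # extend the current run of name tokens
        else:
            if run:                    # a non-name token closes the run
                matches.append(" ".join(run))
            run = []
    if run:                            # flush a run that reaches the end of the text
        matches.append(" ".join(run))
    return matches


# Assumed, hand-written name list standing in for a real names corpus,
# e.g. set(new_corpus.words()) from the reader shown above.
indian_names = ["Balaji", "Chandrasekaran", "Priya", "Raman"]

sample = "Balaji Chandrasekaran Bangalore | Senior Business Analyst/ Lead Business Analyst"
print(find_known_names(sample, indian_names))  # ['Balaji Chandrasekaran']
```

Joining adjacent matches is what turns the separate tokens into the full name. The result is only as good as the name list: a name missing from the corpus is never found, and a place or company name that happens to be in the list will be reported as a match.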

Source: this question on Stack Overflow, "How to extract names from a string using NLTK": https://stackoverflow.com/questions/47123094/
