nlp - How to build a topic hierarchy using DBpedia properties?

Tags: nlp semantic-web dbpedia topic-modeling spotlight-dbpedia

I am trying to build a topic hierarchy by following the two DBpedia properties listed below.

  • the skos:broader property
  • the dcterms:subject property

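My goal is to identify the topics of a given term. For example, given the term "support vector machine", I want to identify topics such as classification algorithms, machine learning, etc.

However, I am sometimes unsure how to build the topic hierarchy, because I get more than 5 URIs for the subject property and many URIs for the broader property. Is there a way to measure relevance (or something similar), discard the extra URIs I get from DBpedia, and assign only the most probable ones?

There seem to be two problems here:
  • How to limit the number of DBpedia Spotlight results.
  • How to limit the number of subjects and categories for a particular result.

My current code is as follows.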

    from SPARQLWrapper import SPARQLWrapper, JSON
    import requests
    import urllib.parse
    
    ## initial consts
    BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
    TEXT = 'First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89. Following German reunification in 1990, the city regained its status as the capital of Germany, hosting 147 foreign embassies.'
    CONFIDENCE = '0.5'
    SUPPORT = '120'
    REQUEST = BASE_URL.format(
        text=urllib.parse.quote_plus(TEXT), 
        confidence=CONFIDENCE, 
        support=SUPPORT
    )
    HEADERS = {'Accept': 'application/json'}
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    all_urls = []
    
    r = requests.get(url=REQUEST, headers=HEADERS)
    response = r.json()
    resources = response['Resources']
    
    for res in resources:
        all_urls.append(res['@URI'])
    
    for url in all_urls:
        sparql.setQuery("""
            SELECT * WHERE {<"""
                 +url+
                """>skos:broader|dct:subject ?resource 
                }
        """)
    
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
    
        for result in results["results"]["bindings"]:
            print('resource ---- ', result['resource']['value'])
    

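I am happy to provide more examples if needed.

Best answer

You seem to be trying to retrieve the Wikipedia categories relevant to a given paragraph.

Minor suggestions

First, I suggest performing a single request by collecting the DBpedia Spotlight results into a VALUES block, for example like this: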

    values = '(<{0}>)'.format('>) (<'.join(all_urls))
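
Spotlight usually returns the same URI once per mention, so the joined string can contain many repeated rows. A slightly more defensive variant is sketched below. The helper name `build_values_clause` is hypothetical, and the de-duplication and the check for characters that may not appear inside a SPARQL IRI are additions that go beyond the one-liner above.

```python
def build_values_clause(uris):
    """Render URIs as rows for a one-variable VALUES block: (<a>) (<b>) ..."""
    forbidden = set('<>"{}|^`\\ ')          # characters not allowed inside <...>
    unique = list(dict.fromkeys(uris))      # drop repeats, keep first-seen order
    for uri in unique:
        if any(ch in forbidden for ch in uri):
            raise ValueError('not usable as a SPARQL IRI: %r' % uri)
    return ' '.join('(<{0}>)'.format(uri) for uri in unique)

# Sample input with a repeated mention, as Spotlight would return it.
all_urls = [
    'http://dbpedia.org/resource/Berlin',
    'http://dbpedia.org/resource/Prussia',
    'http://dbpedia.org/resource/Berlin',
]
values = build_values_clause(all_urls)
print(values)
```

The resulting string can be spliced into `VALUES (?s) { ... }` in the same way as in the queries that follow.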
    

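Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths.

These two suggestions are somewhat incompatible. Virtuoso is very inefficient when a query contains both multiple starting points (i.e. VALUES) and arbitrary-length paths (i.e. the * and + operators).

Below I use the dct:subject/skos:broader property path, i.e. I retrieve the "next-level" categories.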
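
One way to sidestep that combination, sketched here under assumptions, is to drop VALUES and send one small query per starting resource whenever an arbitrary-length path is needed. The helper name `broader_query` is hypothetical, the query relies on the `dct:` and `skos:` prefixes being predefined on the endpoint (as the question's code already does), and whether this is actually faster on the public endpoint has not been measured here.

```python
def broader_query(uri, limit=50):
    """Build a query for all categories reachable from ONE resource.

    A single fixed subject replaces VALUES, so the arbitrary-length
    path skos:broader* has only one starting point.
    """
    return (
        'SELECT DISTINCT ?resource WHERE {\n'
        '    <%s> dct:subject/skos:broader* ?resource .\n'
        '} LIMIT %d' % (uri, limit)
    )

query = broader_query('http://dbpedia.org/resource/Berlin', limit=20)
print(query)
```

Each generated string can be passed to `sparql.setQuery()` inside a loop over `all_urls`, as in the question's code.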
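
If more than one level is wanted without resorting to `*` or `+`, the path can be unrolled to a fixed depth using only the sequence (`/`) and alternative (`|`) operators of SPARQL 1.1 property paths. This is a minimal sketch; `bounded_category_path` is a hypothetical helper and not part of the answer's queries.

```python
def bounded_category_path(max_broader_steps):
    """Unroll dct:subject followed by 0..N skos:broader steps into one path."""
    if max_broader_steps < 0:
        raise ValueError('max_broader_steps must be >= 0')
    alternatives = []
    for steps in range(max_broader_steps + 1):
        # e.g. steps=2 -> dct:subject/skos:broader/skos:broader
        alternatives.append('/'.join(['dct:subject'] + ['skos:broader'] * steps))
    return '(' + '|'.join(alternatives) + ')'

print(bounded_category_path(1))
# (dct:subject|dct:subject/skos:broader)
```

With depth 1, the pattern `?s (dct:subject|dct:subject/skos:broader) ?resource` matches the direct categories together with their parent categories.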

Approach 1

The first approach is to order the resources by their general popularity, e.g. their PageRank:

    values = '(<{0}>)'.format('>) (<'.join(all_urls))
    
    sparql.setQuery(
        """PREFIX vrank:<http://purl.org/voc/vrank#>
           SELECT DISTINCT ?resource ?rank
           FROM <http://dbpedia.org> 
           FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
           WHERE {
               VALUES (?s) {""" + values + 
        """    }
           ?s dct:subject/skos:broader ?resource .
           ?resource vrank:hasRank/vrank:rankValue ?rank.
           } ORDER BY DESC(?rank)
             LIMIT 10
        """)
    

The results are:
    dbc:Member_states_of_the_United_Nations
    dbc:Country_subdivisions_of_Europe
    dbc:Republics
    dbc:Demography
    dbc:Population
    dbc:Countries_in_Europe
    dbc:Third-level_administrative_country_subdivisions
    dbc:International_law
    dbc:Former_countries_in_Europe
    dbc:History_of_the_Soviet_Union_and_Soviet_Russia
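
To read such a ranking back in Python, the bindings returned by `sparql.query().convert()` can be turned into (category, rank) pairs. The sketch below runs on a hand-made sample in the SPARQL JSON results layout. The rank numbers in it are invented for illustration and are not real PageRank values, and `ranked_categories` is a hypothetical helper.

```python
def ranked_categories(results):
    """Return (category URI, rank) pairs sorted by rank, highest first."""
    pairs = [
        (b['resource']['value'], float(b['rank']['value']))
        for b in results['results']['bindings']
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Made-up sample in the SPARQL JSON results layout (rank values are illustrative).
sample = {
    'results': {'bindings': [
        {'resource': {'type': 'uri',
                      'value': 'http://dbpedia.org/resource/Category:Republics'},
         'rank': {'type': 'literal', 'value': '12.5'}},
        {'resource': {'type': 'uri',
                      'value': 'http://dbpedia.org/resource/Category:Demography'},
         'rank': {'type': 'literal', 'value': '30.25'}},
    ]}
}

for uri, rank in ranked_categories(sample):
    print(rank, uri)
```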
    

Approach 2

The second approach is to compute the category frequency for the given text...

    values = '(<{0}>)'.format('>) (<'.join(all_urls))
    
    sparql.setQuery(
        """SELECT ?resource count(?resource) AS ?count WHERE {
               VALUES (?s) {""" + values + 
        """    }
           ?s dct:subject ?resource
           } GROUP BY ?resource
             # https://github.com/openlink/virtuoso-opensource/issues/254
             HAVING (count(?resource) > 1)
             ORDER BY DESC(count(?resource))
             LIMIT 10
        """)
    

The results are:
    dbc:Wars_by_country
    dbc:Wars_involving_the_states_and_peoples_of_Europe
    dbc:Wars_involving_the_states_and_peoples_of_Asia
    dbc:Wars_involving_the_states_and_peoples_of_North_America
    dbc:20th_century_in_Germany
    dbc:Modern_history_of_Germany
    dbc:Wars_involving_the_Balkans
    dbc:Decades_in_Germany
    dbc:Modern_Europe
    dbc:Wars_involving_the_states_and_peoples_of_South_America
    

With dct:subject instead of dct:subject/skos:broader, the results are better:
    dbc:Former_polities_of_the_Cold_War
    dbc:Former_republics
    dbc:States_and_territories_established_in_1949
    dbc:20th_century_in_Germany_by_period
    dbc:1930s_in_Germany
    dbc:Modern_history_of_Germany
    dbc:1990_disestablishments_in_West_Germany
    dbc:1933_disestablishments_in_Germany
    dbc:1949_establishments_in_West_Germany
    dbc:1949_establishments_in_Germany
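
The same frequency count can also be done on the client, by fetching plain (resource, category) rows and aggregating them in Python. This is a sketch on invented placeholder rows. `frequent_categories` is a hypothetical helper, and counting each (resource, category) pair only once is a design choice of this sketch, so that repeated mentions of the same resource do not inflate the counts.

```python
from collections import Counter

def frequent_categories(rows, min_count=2, top=10):
    """Count how many distinct resources point at each category."""
    unique_rows = dict.fromkeys(rows)   # de-duplicate, keep first-seen order
    counts = Counter(category for _, category in unique_rows)
    return [(c, n) for c, n in counts.most_common(top) if n >= min_count]

# Placeholder rows, not real DBpedia data.
rows = [
    ('ex:r1', 'ex:catA'),
    ('ex:r2', 'ex:catA'),
    ('ex:r2', 'ex:catB'),
    ('ex:r3', 'ex:catB'),
    ('ex:r3', 'ex:catA'),
    ('ex:r3', 'ex:catC'),
    ('ex:r1', 'ex:catA'),   # repeated mention, counted once
]
print(frequent_categories(rows))
# [('ex:catA', 3), ('ex:catB', 2)]
```

Here `min_count=2` plays the role of `HAVING (count(?resource) > 1)` and `top` the role of `LIMIT`.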
    

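Conclusion

The results are not very good. I see two reasons: DBpedia categories are quite random, and the tools are quite primitive. Combining approaches 1 and 2 might give better results. In any case, experiments on a large corpus are needed.

Regarding "nlp - How to build a topic hierarchy using DBpedia properties?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49848040/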
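
As one possible starting point for combining the two approaches, the sketch below multiplies each category's frequency in the text by the logarithm of its PageRank. The scoring formula, the function name `combine` and the sample numbers are all assumptions made for illustration, and nothing here has been evaluated on a corpus.

```python
import math

def combine(frequencies, ranks, top=10):
    """Score = frequency in the text * log(1 + PageRank), highest first.

    Categories without a known rank are treated as rank 0 and score 0.
    """
    scored = {
        cat: freq * math.log1p(ranks.get(cat, 0.0))
        for cat, freq in frequencies.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:top]

# Invented numbers, for illustration only.
frequencies = {'ex:catA': 3, 'ex:catB': 2, 'ex:catC': 5}
ranks = {'ex:catA': 10.0, 'ex:catB': 100.0}

for cat, score in combine(frequencies, ranks):
    print(cat, round(score, 2))
```

The logarithm is used here only to keep very popular, generic categories from dominating the ordering. Other weightings may work better and would need the corpus experiments mentioned above.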
