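I am trying to build a topic hierarchy by following the two DBpedia properties used in the code below (skos:broader and dct:subject).
My goal is to identify the topics of a given term. For example, given the term "support vector machine", I want to identify topics such as classification algorithms, machine learning, etc.
However, I am sometimes unsure how to build the topic hierarchy, because I get more than 5 URIs for the subject property and many URIs for the broader property. Is there a way to measure strength (or something similar) so that I can discard the extra URIs I get from DBpedia and assign only the most probable ones?
There seem to be two questions here.
My current code is as follows.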
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse
## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = 'First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89. Following German reunification in 1990, the city regained its status as the capital of Germany, hosting 147 foreign embassies.'
CONFIDENCE = '0.5'
SUPPORT = '120'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
all_urls = []
# Annotate the text with DBpedia Spotlight and collect the entity URIs
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])
# For each entity, fetch its categories (dct:subject) and broader concepts (skos:broader)
for url in all_urls:
    sparql.setQuery("""
        SELECT * WHERE {<"""
        + url +
        """> skos:broader|dct:subject ?resource
        }
    """)
    results = sparql.query().convert()
    for result in results["results"]["bindings"]:
        print('resource ---- ', result['resource']['value'])
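I am happy to provide more examples if needed.
Best answer
It seems that you are trying to retrieve the Wikipedia categories relevant to a given paragraph.
Minor suggestions
First, I suggest performing a single request, collecting the DBpedia Spotlight results into VALUES, for example like this: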
values = '(<{0}>)'.format('>) (<'.join(all_urls))
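As a minimal sketch of the same idea, the VALUES rows can be built by a small helper that also rejects strings that cannot be valid SPARQL IRIs. The helper name `build_values_rows` and the sample URIs are assumptions made for illustration; they are not part of the original code.

```python
def build_values_rows(uris):
    """Format URIs as rows of a single-variable SPARQL VALUES block."""
    rows = []
    for uri in uris:
        # Characters such as <, >, ", {, }, |, ^, ` and \ and whitespace
        # are not allowed inside a SPARQL IRI reference (<...>).
        if any(ch in uri for ch in '<>"{}|^`\\') or any(ch.isspace() for ch in uri):
            raise ValueError(f"unsafe IRI: {uri!r}")
        rows.append(f"(<{uri}>)")
    return " ".join(rows)


# Hypothetical sample input; in the question these come from DBpedia Spotlight.
uris = [
    "http://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/resource/Prussia",
]
values = build_values_rows(uris)
query = "SELECT * WHERE { VALUES (?s) { %s } ?s dct:subject ?resource }" % values
print(query)
```

For well-formed URIs this produces the same string as the one-liner above; the only difference is the early failure on malformed input.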
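Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths.
These two suggestions are somewhat incompatible: Virtuoso is very inefficient when a query contains both multiple starting points (i.e. VALUES) and arbitrary-length paths (i.e. the * and + operators). Below I use the dct:subject/skos:broader property path, i.e. I retrieve the "next-level" categories.
Approach 1
The first approach is to order resources by their general popularity, e.g. their PageRank: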
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """PREFIX vrank:<http://purl.org/voc/vrank#>
       SELECT DISTINCT ?resource ?rank
       FROM <http://dbpedia.org>
       FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
       WHERE {
           VALUES (?s) {""" + values +
    """    }
           ?s dct:subject/skos:broader ?resource .
           ?resource vrank:hasRank/vrank:rankValue ?rank.
       } ORDER BY DESC(?rank)
         LIMIT 10
    """)
The results are:
dbc:Member_states_of_the_United_Nations
dbc:Country_subdivisions_of_Europe
dbc:Republics
dbc:Demography
dbc:Population
dbc:Countries_in_Europe
dbc:Third-level_administrative_country_subdivisions
dbc:International_law
dbc:Former_countries_in_Europe
dbc:History_of_the_Soviet_Union_and_Soviet_Russia
Approach 2
The second approach is to calculate category frequency for the given text...
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """SELECT ?resource (count(?resource) AS ?count) WHERE {
           VALUES (?s) {""" + values +
    """    }
           ?s dct:subject ?resource
       } GROUP BY ?resource
         # https://github.com/openlink/virtuoso-opensource/issues/254
         HAVING (count(?resource) > 1)
         ORDER BY DESC(count(?resource))
         LIMIT 10
    """)
The results are:
dbc:Wars_by_country
dbc:Wars_involving_the_states_and_peoples_of_Europe
dbc:Wars_involving_the_states_and_peoples_of_Asia
dbc:Wars_involving_the_states_and_peoples_of_North_America
dbc:20th_century_in_Germany
dbc:Modern_history_of_Germany
dbc:Wars_involving_the_Balkans
dbc:Decades_in_Germany
dbc:Modern_Europe
dbc:Wars_involving_the_states_and_peoples_of_South_America
With dct:subject instead of dct:subject/skos:broader, the results are better:
dbc:Former_polities_of_the_Cold_War
dbc:Former_republics
dbc:States_and_territories_established_in_1949
dbc:20th_century_in_Germany_by_period
dbc:1930s_in_Germany
dbc:Modern_history_of_Germany
dbc:1990_disestablishments_in_West_Germany
dbc:1933_disestablishments_in_Germany
dbc:1949_establishments_in_West_Germany
dbc:1949_establishments_in_Germany
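The frequency counting of approach 2 can also be done client-side once the (resource, category) pairs have been fetched. The following is only a sketch under that assumption: the function name `category_frequencies` and the placeholder pairs are hypothetical, and `min_count=2` mirrors the `HAVING (count(?resource) > 1)` clause of the query.

```python
from collections import Counter


def category_frequencies(pairs, min_count=2, limit=10):
    """Count how often each category occurs and keep the most frequent ones."""
    counts = Counter(category for _, category in pairs)
    # Keep only categories shared by at least `min_count` rows.
    frequent = [(cat, n) for cat, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically for a stable order.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return frequent[:limit]


# Placeholder data, not real DBpedia output.
pairs = [
    ("dbr:A", "dbc:X"),
    ("dbr:B", "dbc:X"),
    ("dbr:B", "dbc:Y"),
    ("dbr:C", "dbc:X"),
    ("dbr:C", "dbc:Y"),
    ("dbr:C", "dbc:Z"),
]
top = category_frequencies(pairs)
print(top)
```

Counting locally avoids the Virtuoso-specific HAVING workaround, at the cost of transferring every pair to the client.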
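Conclusion
The results are not very good. I see two reasons: DBpedia categories are quite random, and the tools are quite primitive. Combining approaches 1 and 2 might give better results. In any case, experiments with a large corpus are needed.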
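One possible way to combine the two approaches, offered only as an untested sketch, is to normalize the category frequency and the PageRank value to the range 0..1 and take a weighted sum. The function `combine_scores`, the `weight` parameter and the toy numbers are all assumptions; nothing here has been evaluated on real data.

```python
def combine_scores(counts, ranks, weight=0.5):
    """Rank categories by a weighted sum of normalized frequency and rank."""
    max_count = max(counts.values(), default=0) or 1
    max_rank = max(ranks.values(), default=0) or 1
    scores = {}
    for category in set(counts) | set(ranks):
        freq = counts.get(category, 0) / max_count
        rank = ranks.get(category, 0.0) / max_rank
        scores[category] = weight * freq + (1 - weight) * rank
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy numbers for illustration only.
counts = {"dbc:X": 4, "dbc:Y": 2}
ranks = {"dbc:X": 10.0, "dbc:Y": 40.0, "dbc:Z": 20.0}
ranked = combine_scores(counts, ranks)
print(ranked)
```

The weight would have to be tuned on a corpus, which is exactly the kind of experiment the conclusion calls for.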
Regarding "nlp - How to build a topic hierarchy using DBpedia properties?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49848040/