I'm stuck on computing TF_IDF in my Rexster graph database. Here's what I've got:
Suppose I have a graph consisting of a set of vertices representing terms, T, and a set of vertices representing documents, D.
There are edges E between terms in T and documents in D. Each edge has a term frequency tf.
E.g. (pseudocode):
#x, y, and z are arbitrary IDs.
T(x) - E(y) -> D(z)
E(y).tf = 20
T(x).outE()
=> A set of edges.
T(x).outE().inV()
=> A list of Documents, a subset of D
How would I write a Gremlin script that computes TF_IDF for the following operations?
- A: Given a term t, compute the TF_IDF of each document directly related to t.
- B: Given a set of terms Ts, compute for each document in Ts.outE().inV() the sum of the TF_IDFs relative to each applicable term in Ts.
What I have so far:
// I know this does not work
term = g.v(404)
term.outE().inV().as('docs').path().
groupBy{it.last()}{
    it.findAll{it instanceof Edge}.
    collect{it.getProperty('frequency')} // I would actually like to use augmented frequency (i.e. frequency_of_t_in_document / max_frequency_of_any_t_in_document)
}.collect{d,tf-> [d,
    tf * ??log(??g.V.has('isDocument') / docs.count() ?? ) ??
]}
// I feel I am close, but I can't quite make this work.
Best answer
I might not have gotten to this part yet:
B: ...in relation to each applicable term in Ts.
...but the rest should work as expected. I wrote a small helper function that accepts a single term as well as multiple terms:
tfidf = { g, terms, N ->
    def closure = {
        // all (term)-[occursIn]->(document) paths for this term vertex
        def paths = it.outE("occursIn").inV().path().toList()
        def numPaths = paths.size()  // document frequency of the term
        [it.getProperty("term"), paths.collectEntries({
            def title = it[2].getProperty("title")   // path = [term vertex, edge, document vertex]
            def tf = it[1].getProperty("frequency")
            def idf = Math.log10(N / numPaths)
            [title, tf * idf]
        })]
    }
    def single = terms instanceof String
    def pipe = single ? g.V("term", terms) : g.V().has("term", T.in, terms)
    def result = pipe.collect(closure).collectEntries()
    single ? result[terms] : result
}
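The score per edge is plain tf · idf with a base-10 logarithm. As a standalone sanity check of the arithmetic (plain Python, independent of Gremlin; the numbers mirror the two-document toy graph used for testing below):

```python
import math

def tfidf_score(tf, df, n_docs):
    """tf * log10(N / df) -- the same formula as the Gremlin closure."""
    return tf * math.log10(n_docs / df)

# "this" occurs in both of the 2 documents (df=2) with tf=1 -> idf = log10(1) = 0
print(tfidf_score(1, 2, 2))  # 0.0
# "example" occurs in only one document (df=1) with tf=3 -> 3 * log10(2) ~ 0.903
print(tfidf_score(3, 1, 2))
```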
Then I took the example from Wikipedia to test it:
g = new TinkerGraph()
g.createKeyIndex("type", Vertex.class)
g.createKeyIndex("term", Vertex.class)
t1 = g.addVertex(["type":"term","term":"this"])
t2 = g.addVertex(["type":"term","term":"is"])
t3 = g.addVertex(["type":"term","term":"a"])
t4 = g.addVertex(["type":"term","term":"sample"])
t5 = g.addVertex(["type":"term","term":"another"])
t6 = g.addVertex(["type":"term","term":"example"])
d1 = g.addVertex(["type":"document","title":"Document 1"])
d2 = g.addVertex(["type":"document","title":"Document 2"])
t1.addEdge("occursIn", d1, ["frequency":1])
t1.addEdge("occursIn", d2, ["frequency":1])
t2.addEdge("occursIn", d1, ["frequency":1])
t2.addEdge("occursIn", d2, ["frequency":1])
t3.addEdge("occursIn", d1, ["frequency":2])
t4.addEdge("occursIn", d1, ["frequency":1])
t5.addEdge("occursIn", d2, ["frequency":2])
t6.addEdge("occursIn", d2, ["frequency":3])
N = g.V("type","document").count()
tfidf(g, "this", N)
tfidf(g, "example", N)
tfidf(g, ["this", "example"], N)
Output:
gremlin> tfidf(g, "this", N)
==>Document 1=0.0
==>Document 2=0.0
gremlin> tfidf(g, "example", N)
==>Document 2=0.9030899869919435
gremlin> tfidf(g, ["this", "example"], N)
==>this={Document 1=0.0, Document 2=0.0}
==>example={Document 2=0.9030899869919435}
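Regarding the augmented frequency mentioned in the question (tf divided by the maximum tf of any term in the same document): that only requires the per-document maximum frequency to be known before scoring. A hedged sketch of the adjusted formula in plain Python (the function name is mine, not part of the Gremlin script above):

```python
import math

def augmented_tfidf(tf, max_tf_in_doc, df, n_docs):
    """Augmented TF as defined in the question (f(t,d) / max f in d) times log10 IDF."""
    return (tf / max_tf_in_doc) * math.log10(n_docs / df)

# "example" in Document 2: tf=3, which is also the max tf in that document,
# and it appears in 1 of the N=2 documents -> 1.0 * log10(2) ~ 0.301
print(augmented_tfidf(3, 3, 1, 2))
```

In the graph version this means one extra pass per document to find its maximum edge frequency before computing the scores.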
I hope this helps.
Cheers, Daniel
On graph-databases - TF-IDF algorithm in Gremlin, see the similar question on Stack Overflow: https://stackoverflow.com/questions/23524529/