apache-spark - How Spark HashingTF works

I'm new to Spark 2.
I tried the Spark tfidf example:

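from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single labelled sentence
sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

# Lowercase the sentence and split it into words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# Hash each word into one of 32 buckets
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)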

It outputs:

Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))

I expected rawFeatures to contain term frequencies like {0:0.2, 1:0.2, 2:0.2, 3:0.2, 4:0.2}, because term frequency is:

tf(w) = (Number of times the word appears in a document) / (Total number of words in the document)

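In our case that is tf(w) = 1/5 = 0.2 for each word, because each word appears once in the document.
If we read the rawFeatures dictionary as mapping a word index to the number of times that word appears in the document, why is key 1 equal to 3.0? No word appears 3 times in the document.
This is confusing to me. What am I missing?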
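
For reference, the expectation above can be written out in plain Python. This is only a sketch of the formula applied to the example sentence, with a simple lowercase-and-split standing in for the Tokenizer; it is not how HashingTF computes rawFeatures. `term_frequencies` is an illustrative name, not a Spark function.

```python
from collections import Counter

def term_frequencies(words):
    """Normalized term frequency: count(word) / total number of words."""
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}

# Stand-in for the Tokenizer step (assumption: lowercase + whitespace split).
words = "Hi I heard about Spark".lower().split()

raw_counts = Counter(words)    # how often each word occurs
tf = term_frequencies(words)   # the normalized values expected in the question

print(dict(raw_counts))
print(tf)
```

Every raw count here is 1 and every normalized value is 0.2. The values in rawFeatures above (3.0, 1.0, 1.0) sum to 5, the number of words, so they look like per-index counts rather than normalized frequencies.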

Best answer

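TL;DR: It is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket, and with a very low number of buckets like here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.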
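
To put a rough number on "collisions are to be expected", here is a back-of-the-envelope sketch in plain Python. It assumes the hash spreads distinct words uniformly and independently over the buckets, which only approximates a real hash function; `collision_probability` is an illustrative helper, not part of Spark.

```python
def collision_probability(num_words: int, num_buckets: int) -> float:
    """Chance that at least two of num_words distinct words share a bucket.

    Assumption: each word lands in a uniformly random bucket, independently
    of the others (the classic birthday-problem model).
    """
    p_all_distinct = 1.0
    for i in range(num_words):
        # The (i+1)-th word must avoid the i buckets already taken.
        p_all_distinct *= (num_buckets - i) / num_buckets
    return 1.0 - p_all_distinct

p_small = collision_probability(5, 32)         # the setting from the question
p_large = collision_probability(5, 1 << 18)    # 2**18 buckets

print(f"5 words, 32 buckets:    {p_small:.4f}")
print(f"5 words, 2**18 buckets: {p_large:.6f}")
```

Under this model, five words in 32 buckets collide a bit more than a quarter of the time, while with 2**18 buckets (the default numFeatures of HashingTF in pyspark.ml) the chance drops to a few in 100,000.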

In detail: HashingTF uses Murmur hash by default. [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source, you'll see that the implementation is equivalent to:

import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 => 
    hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))

Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)

When you take the non-negative modulo, you get these values [24, 1, 13, 1, 1]:

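// Local equivalent of Spark's internal Utils.nonNegativeMod, so the snippet runs on its own
def nonNegativeMod(x: Int, mod: Int): Int = ((x % mod) + mod) % mod

List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))

List[Int] = List(24, 1, 13, 1, 1)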

Three words from the list (i, about and spark) hash to the same bucket, and each occurs once, hence the result you get.
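
The last step can be replayed in plain Python as a quick check. This sketch does not recompute the Murmur hashes; it takes the hash values quoted above as given and only redoes the modulo and the per-bucket counting. The variable names are illustrative.

```python
from collections import Counter

# Hash values quoted in the answer for each token (not recomputed here).
hashes = {
    "hi": -537608040,
    "i": -1265344671,
    "heard": 266149357,
    "about": 146891777,
    "spark": 2101843105,
}
num_features = 32

# Python's % returns a non-negative result for a positive modulus,
# so it matches the non-negative modulo used above.
buckets = {word: h % num_features for word, h in hashes.items()}

# Every token adds 1.0 to its bucket; tokens that collide add up.
raw_features = Counter()
for word in ["hi", "i", "heard", "about", "spark"]:
    raw_features[buckets[word]] += 1.0

print(buckets)
print(dict(sorted(raw_features.items())))
```

The resulting counts per bucket match the SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}) from the question: bucket 1 holds i, about and spark.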

Related:

What hashing function does Spark use for HashingTF and how do I duplicate it?

How to get word details from TF Vector RDD in Spark ML Lib?

Regarding apache-spark - How Spark HashingTF works, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42283766/
