machine-learning - What does a weighted word embedding mean?

Tags: machine-learning nlp word2vec tf-idf word-embedding

In a paper I am trying to implement, it says:

In this work, tweets were modeled using three types of text representation. The first one is a bag-of-words model weighted by tf-idf (term frequency - inverse document frequency) (Section 2.1.1). The second represents a sentence by averaging the word embeddings of all words (in the sentence) and the third represents a sentence by averaging the weighted word embeddings of all words, the weight of a word is given by tf-idf (Section 2.1.2).

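I am not sure about the third representation, which is described as weighted word embeddings with the word weights given by tf-idf. I am not even sure whether the two can be used together.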

Best answer

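Averaging word embeddings (possibly with weights) makes sense, although depending on the main algorithm and the training data this sentence representation may not be the best one. The intuition is as follows: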

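You probably want to handle sentences of different lengths, hence the averaging (which is better than a plain sum).
Some words in a sentence are usually more valuable than others. TF-IDF is the simplest measure of a word's value. Note that the weighting does not change the scale of the result.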
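
To make the two points above concrete, here is a minimal Python sketch of both sentence representations: a plain average and a tf-idf-weighted average of word vectors. Everything in it is an assumption made for illustration and is not taken from the paper: the toy 3-dimensional embeddings, the tiny corpus, the function names, and the smoothed IDF formula (one common variant). In particular, dividing by the sum of the weights is only one possible reading of "averaging the weighted word embeddings"; the paper may normalize differently.

```python
import math
from collections import Counter

# Toy 3-dimensional embeddings (hypothetical values standing in for word2vec vectors).
EMBEDDINGS = {
    "the": [0.1, 0.0, 0.1],
    "movie": [0.7, 0.2, 0.0],
    "was": [0.0, 0.1, 0.1],
    "great": [0.1, 0.9, 0.3],
    "boring": [0.2, -0.8, 0.4],
}

# Tiny tokenized corpus (hypothetical) used only to estimate document frequencies.
CORPUS = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "boring"],
    ["the", "great", "movie"],
]


def inverse_document_frequency(corpus):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def sentence_vector(tokens, embeddings, idf=None):
    """Average the embeddings of `tokens`.

    With idf=None every occurrence has weight 1 (plain average).
    Otherwise each word is weighted by tf * idf, and the sum is divided
    by the total weight, so the result stays a weighted mean.
    """
    counts = Counter(t for t in tokens if t in embeddings)  # term frequencies
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tf in counts.items():
        weight = float(tf) if idf is None else tf * idf.get(word, 0.0)
        weight_sum += weight
        for i, x in enumerate(embeddings[word]):
            total[i] += weight * x
    if weight_sum == 0.0:
        raise ValueError("no token with a known embedding and a non-zero weight")
    return [x / weight_sum for x in total]


idf = inverse_document_frequency(CORPUS)
sentence = ["the", "movie", "was", "great"]
plain = sentence_vector(sentence, EMBEDDINGS)          # simple average
weighted = sentence_vector(sentence, EMBEDDINGS, idf)  # tf-idf-weighted average
print(plain)
print(weighted)
```

In this toy setup "the" and "movie" occur in every document, so they get the lowest IDF, and the weighted vector is pulled toward rarer words such as "great". Because the weights are normalized, each component still lies within the range spanned by the word vectors, which is one way to read the remark that the scale does not change.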

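See also this paper by Kenter et al. There is also a nice post that compares the two approaches across different algorithms and concludes that neither is clearly better than the other: some algorithms favor simple averaging, while others perform better with TF-IDF weighting.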

Regarding "machine-learning - What does a weighted word embedding mean?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47727078/
