python - word2Vec vector representation for a text classification algorithm

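I am trying to use word2vec in a text classification algorithm. I want to build a vectorizer with word2vec, and I used the script below. However, I cannot get a single row per document; instead I get a matrix with a different number of rows for each document. For example, the first document gives a 31x100 matrix, the second a 163x100 matrix, the third a 73x100 matrix, and so on. What I actually need is one 1x100 vector per document, so that I can use these vectors as input features for training a model.

Can anyone help me?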

import os
import pandas as pd       
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords # Import the stop word list
import gensim
import numpy as np

train = pd.read_csv("Data.csv",encoding='cp1252')
wordnet_lemmatizer = WordNetLemmatizer()

def Description_to_words(raw_Description):
    Description_text = BeautifulSoup(raw_Description).get_text() 
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    words = word_tokenize(letters_only.lower())    
    stops = set(stopwords.words("english")) 
    meaningful_words = [w for w in words if not w in stops]
    return( " ".join(wordnet_lemmatizer.lemmatize(w) for w in meaningful_words))

num_Descriptions = train["Summary"].size
clean_train_Descriptions = []
print("Cleaning and parsing the training set ticket Descriptions...\n")
clean_train_Descriptions = []
for i in range( 0, num_Descriptions ):
    if( (i+1)%1000 == 0 ):
        print("Description %d of %d\n" % ( i+1, num_Descriptions ))
    clean_train_Descriptions.append(Description_to_words( train["Summary"][i] ))

model = gensim.models.Word2Vec(clean_train_Descriptions, size=100)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        #self.dim = len(word2vec.itervalues().next())
        self.dim = 100

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

a=MeanEmbeddingVectorizer(w2v)
clean_train_Descriptions[1]
a.transform(clean_train_Descriptions[1])

train_Descriptions = []
for i in range( 0, num_Descriptions ):
    if( (i+1)%1000 == 0 ):
        print("Description %d of %d\n" % ( i+1, num_Descriptions ))
    train_Descriptions.append(a.transform(" ".join(clean_train_Descriptions[i])))

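Best answer

Two problems in your code cause this, and both are easy to fix.

First, Word2Vec expects each sentence to be a list of words, not the sentence as a single string. So in your Description_to_words, just return the list without joining it.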

return [wordnet_lemmatizer.lemmatize(w) for w in meaningful_words]

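Word2Vec iterates over each sentence to get its words. Previously it was iterating over a string, so the vectors you got from wv were actually character-level embeddings.

Second, the way you call transform has a similar problem: X is expected to be a list of documents, not a single document. So when for words in X runs on a single string, you are actually iterating over its characters and creating an embedding for each one. Your output is therefore one embedding per character of the sentence, which is why the row count differs from document to document. The change is simple: transform all the documents at once!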
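To make the difference concrete, here is a minimal sketch in plain Python. It does not call gensim; the helper units_seen and the sample text are hypothetical and only mimic how any consumer that iterates over a "sentence" sees its items.

```python
# Hypothetical toy document, not taken from the original dataset.
doc_as_string = "printer not working"
doc_as_tokens = doc_as_string.split()


def units_seen(sentence):
    """Return the items a consumer sees when it iterates over a sentence."""
    return [unit for unit in sentence]


# Iterating over a string yields single characters (including spaces).
print(units_seen(doc_as_string)[:5])

# Iterating over a list of tokens yields whole words.
print(units_seen(doc_as_tokens))

# The "vocabulary" that would be built from each form of the same document.
vocab_from_string = sorted(set(units_seen(doc_as_string)))
vocab_from_tokens = sorted(set(units_seen(doc_as_tokens)))
print(vocab_from_string)  # only single characters
print(vocab_from_tokens)  # actual words
```

With the string form, every vocabulary entry is one character long, which matches the character-level embeddings described above.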

train_Descriptions = a.transform(clean_train_Descriptions)

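(To transform one document at a time, wrap it in a list ( [clean_train_Descriptions[1]] ), or select a single item with a slice ( clean_train_Descriptions[1:2] ).)

With these two changes, you should get one row back for each input sentence.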
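The effect on the output shape can be checked with a small sketch. It assumes a toy word-to-vector dictionary (toy_w2v, with a made-up dimensionality of 4 instead of 100) in place of a trained model, and a hypothetical function mean_embed that follows the same averaging logic as MeanEmbeddingVectorizer.transform.

```python
import numpy as np

DIM = 4  # hypothetical small dimensionality; the question uses 100

# Hypothetical stand-in for the word -> vector mapping of a trained model.
toy_w2v = {
    "printer": np.array([1.0, 0.0, 0.0, 0.0]),
    "broken": np.array([0.0, 1.0, 0.0, 0.0]),
    "screen": np.array([0.0, 0.0, 1.0, 0.0]),
}


def mean_embed(docs, word2vec, dim):
    """Average the known word vectors of each tokenized document."""
    rows = []
    for words in docs:
        vectors = [word2vec[w] for w in words if w in word2vec]
        # A document with no known words becomes a zero vector.
        rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.array(rows)


docs = [["printer", "broken"], ["screen"], ["unknown"]]

all_rows = mean_embed(docs, toy_w2v, DIM)       # list of documents
one_row = mean_embed([docs[0]], toy_w2v, DIM)   # single document, wrapped in a list
unwrapped = mean_embed(docs[0], toy_w2v, DIM)   # single document passed directly

print(all_rows.shape)   # (3, 4): one row per document
print(one_row.shape)    # (1, 4): one row for the single document
print(unwrapped.shape)  # (2, 4): one row per item of the document, not per document
```

Passing the document without wrapping it makes the function treat each item of the document as a document of its own, which is the same mistake as in the question, just one level up.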

Regarding python - word2Vec vector representation for a text classification algorithm, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47936578/
