python - 如何在预测中返回多个项目?

标签 python machine-learning logistic-regression

我有一个文本分类算法,正在运行。 但是,它只返回预测中的 1 项。 我想知道如何或是否有可能返回每个新条目 3 项。

火车:

def train(texts, marcas):
    print("Training.....")
    tvect = TfidfTransformer()  # -> Obsolete <- TfidfVectorizer(min_df=1, max_df=2, lowercase=False, preprocessor=None)
    # gera a vetorização (posição das palavras) para treino e predição
    count_vect = CountVectorizer(stop_words=None)
    vect_count_text = count_vect.fit_transform(texts)  # [vectorize_text(texto, tradutor) for texto in texts]
    vetoresDeTexto = tvect.fit_transform(vect_count_text)
    # Define o conjunto de dados X
    X = vetoresDeTexto  # np.array(vetoresDeTexto)
    # Define o conjunto de dados Y (labels)
    Y = np.array(marcas.tolist())
    # Define porcentagem do treino
    porcentagem_de_treino = 0.8
    # Separa o tamanho do treino a partir da porcentagem
    tamanho_do_treino = int(porcentagem_de_treino * len(Y))
    # O restante fica para a validacao
    tamanho_de_validacao = (len(Y) - tamanho_do_treino)
    # Separa os dados de treino
    treino_dados = X[0:tamanho_do_treino]
    # Separa as marcacoes de treino
    treino_marcacoes = Y[0:tamanho_do_treino]
    # Separa os dados de validacao
    validacao_dados = X[tamanho_do_treino:]
    # Separa as marcacoes de validacao
    validacao_marcacoes = Y[tamanho_do_treino:]
    print("Validacao Marcacoes: ")
    print(validacao_marcacoes)
    clf = LogisticRegression(class_weight=None)  # MultinomialNB() obtive 62% de acerto#GaussianNB()
    clf.fit(treino_dados, treino_marcacoes)
    # accuracy
    accuracy = clf.score(validacao_dados, validacao_marcacoes)
    file_name = 'train_data.pkl'
    cPickle.dump(clf, open(file_name, 'wb'))
    # fit_file = joblib.dump(clf, file_name)
    print("Accuracy: ")
    print("%.2f " % round(accuracy * 100) + "%\n")
    print("End of train...")
    predict(file_name, tvect, count_vect, treino_marcacoes, clf, treino_dados)
    # To get a fit_file
    # return fit_file

我的预测代码:

# to predict
def predict(fit, tvect, count_vect, y_test, clf, treinoX):
    print("\nPredict......")
    # new text to predict
    newTextToPredict = ["Just a new text to predict"]  # returns label J44

    new = count_vect.transform(newTextToPredict)
    # carrega o modelo treinado
    loaded_model = cPickle.load(open(fit, 'rb'))
    # faz a predição do novo texto de entrada
    result = loaded_model.predict(new)
    probs = clf.predict_proba(new)
    # precision_score(result, treinoY, average='samples')
    print(result)

当前输出:

....
End of train...

Predict......
['Z000']

此输出是否有可能向我显示 3 个最可能的结果?

编辑:

我尝试使用predict_proba,但我不明白结果,它遵循输出:

print(probs)

Predict......
['Z000']
[[0.00472141 0.00468681 0.00545111 0.00473597 0.00742972 0.00459905
  0.00472848 0.00471651 0.00830986 0.00472729 0.00537823 0.00539556
  0.00463566 0.00469166 0.00473597 0.00469889 0.00473122 0.00510944
  0.00475248 0.00475248 0.00472681 0.00465737 0.0046238  0.00538928
  0.0053852  0.00469701 0.00470745 0.0052977  0.00468655 0.00472517
  0.00601271 0.00540062 0.00471387 0.00471311 0.00471592 0.00468392
  0.00470526 0.00454069 0.00467939 0.00471795 0.00706113 0.00475248
  0.00470356 0.00451991 0.00473597 0.02389303 0.00472151 0.00475248
  0.00573423 0.00469125 0.00471707 0.00450935 0.00458729 0.00607249
  0.00556578 0.00661622 0.00747174 0.00528275 0.00469896 0.00527276
  0.00537725 0.0046918  0.00472592 0.00523041 0.00466061 0.00523704
  0.00535152 0.00471286 0.00456425 0.00473597 0.00466597 0.00475248
  0.00471198 0.00470039 0.00545111 0.00473597 0.0059082  0.00471645
  0.0050765  0.00536772 0.00469146 0.0047054  0.00583113 0.00556937
  0.00530836 0.00724415 0.00499861 0.00469217 0.00471454 0.00456743
  0.00473241 0.00468181 0.00545604 0.00471984 0.00466745 0.00606397
  0.01230014 0.00467241 0.00472609 0.00541621 0.00473499 0.00468064
  0.00472712 0.00470356 0.00497979 0.00453495 0.00469214 0.00668041
  0.00528025 0.00468329 0.00777699 0.00468618 0.00537916 0.00455798
  0.0046802  0.00468039 0.00534045 0.00466915 0.00521349 0.00465117
  0.00466947 0.00688886 0.00460614 0.00648024 0.00469368 0.00456555
  0.2215044  0.01841092 0.00594679 0.00467938 0.01121442 0.00537937
  0.00468134 0.00472712 0.00470844 0.00470639 0.00580538 0.00535144
  0.00473597 0.00465237 0.00577107 0.00539569 0.00472306 0.00538426
  0.00472506]]

最佳答案

尝试打印probs:

probs = clf.predict_proba(new)
print(probs)

您将获得一系列概率。这些概率的总和等于1。 然后,您可以从此数组中选择 3 个最高元素的索引。

top3_classes = np.argsort(probs)[:3]

这些是您需要的前 3 个类别标签索引。 所以你可以这样做:

print(your_classes[top3_classes])

您获得了预测的前 3 个类别。

关于python - 如何在预测中返回多个项目?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/53158830/

相关文章:

hadoop - 在Mahout中使用Logistic回归

python - 比 If Else 更聪明

python - 如何将一维扁平化 MNIST Keras 转换为 LSTM 模型而不需要取消扁平化?

python - 从 'train model from_config' 导入 'deeppavlov.core.commands.train' 时出错

machine-learning - 迁移学习需要来自更相关的领域吗?

python - Sklearn 和 StatsModels 给出了截然不同的逻辑回归答案

python - 加载和预测新数据 sklearn

python - WatsonException : Error: unsupported text language, 代码:400

python - 特征是文本(标签)和数字组合的方法

python - 为什么我无法从 matplotlib.colors 导入 ListedColorMap?