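machine-learning - Why does Mallet text classification output the same value 1.0 for all test files?

Tags: machine-learning nlp classification text-classification mallet

I am learning the Mallet text classification command line. The estimated output values for the different classes are all the same, 1.0. I don't know where I went wrong. Can you help?

Mallet version: E:\Mallet\mallet-2.0.8RC3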

//there is a txt file about cat breed (catmaterial.txt) in cat dir.
//command 1
C:\Users\toshiba>mallet import-dir --input E:\Mallet\testmaterial\cat --output E:\Mallet\testmaterial\cat.mallet --remove-stopwords

//command 1 output
Labels =
   E:\Mallet\testmaterial\cat

//command 2, save classifier as catClass.classifier
C:\Users\toshiba>mallet train-classifier --input E:\Mallet\testmaterial\cat.mallet --trainer NaiveBayes --output-classifier E:\Mallet\testmaterial\catClass.classifier

//command 2 output
Training portion = 1.0
Unlabeled training sub-portion = 0.0
Validation portion = 0.0
Testing portion = 0.0

-------------------- Trial 0  --------------------

Trial 0 Training NaiveBayesTrainer with 1 instances
Trial 0 Training NaiveBayesTrainer finished
No examples with predicted label !
No examples with true label !
No examples with predicted label !
No examples with true label !
Trial 0 Trainer NaiveBayesTrainer training data accuracy = 1.0
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
No examples with predicted label !
Trial 0 Trainer NaiveBayesTrainer test data precision() = 1.0
No examples with true label !
Trial 0 Trainer NaiveBayesTrainer test data recall() = 1.0
No examples with predicted label !
No examples with true label !
Trial 0 Trainer NaiveBayesTrainer test data F1() = 1.0
Trial 0 Trainer NaiveBayesTrainer test data accuracy = NaN

NaiveBayesTrainer
Summary. train accuracy mean = 1.0 stddev = 0.0 stderr = 0.0
Summary. test accuracy mean = NaN stddev = NaN stderr = NaN
Summary. test precision() mean = 1.0 stddev = 0.0 stderr = 0.0
Summary. test recall() mean = 1.0 stddev = 0.0 stderr = 0.0
Summary. test f1() mean = 1.0 stddev = 0.0 stderr = 0.0

//command 3, estimate classes of the three files about cat, deer and dog. The cat file is the same as the one for cat.mallet
C:\Users\toshiba>mallet classify-dir --input E:\Mallet\testmaterial\test_cat_dir --output - --classifier E:\Mallet\testmaterial\catClass.classifier


//command 3 output
file:/E:/Mallet/testmaterial/test_cat_dir/catmaterial.txt               1.0
file:/E:/Mallet/testmaterial/test_cat_dir/deertext.txt          1.0
file:/E:/Mallet/testmaterial/test_cat_dir/dogmaterial.txt               1.0

// why the three classes are all 1.0 ?

C:\Users\toshiba>

Can you help? Thanks.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

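Update:

Thanks for your answer, but all files still output 1.0.

My idea is to put some dog files in the dog directory, treat those dog files as instances, train the model, and then classify some files in test_dir to see the results.

I tried this based on my understanding of your suggestion, but the output is still the same 1.0.

Could you help me with the command lines below?

In E:\Mallet\train_dir\dog there are 4 dog txt files (dog 2.txt, dog4.txt, dog5.txt, dogmaterial.txt).

In E:\Mallet\test_dir there are 9 txt files (cat2.txt, catmaterial.txt, deermaterial.txt, dog3.txt, dog6.txt, dog 2.txt, dog4.txt, dog5.txt, dogmaterial.txt).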

C:\Users\toshiba>mallet import-dir --input E:\Mallet\train_dir\dog --output E:\Mallet\classifier_dir\3animal.mallet --remove-stopwords
Labels =
   E:\Mallet\train_dir\dog


C:\Users\toshiba>mallet train-classifier --input E:\Mallet\classifier_dir\3animal.mallet --trainer NaiveBayes --output-classifier E:\Mallet\classifier_dir\3animalClass.classifier
Training portion = 1.0
Unlabeled training sub-portion = 0.0
Validation portion = 0.0
Testing portion = 0.0                          
-------------------- Trial 0  --------------------

Trial 0 Training NaiveBayesTrainer with 4 instances
Trial 0 Training NaiveBayesTrainer finished
No examples with predicted label !
No examples with true label !
No examples with predicted label !
No examples with true label !
Trial 0 Trainer NaiveBayesTrainer training data accuracy = 1.0
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
No examples with predicted label !
Trial 0 Trainer NaiveBayesTrainer test data precision() = 1.0
No examples with true label !
Trial 0 Trainer NaiveBayesTrainer test data recall() = 1.0
No examples with predicted label !
No examples with true label !
Trial 0 Trainer NaiveBayesTrainer test data F1() = 1.0
Trial 0 Trainer NaiveBayesTrainer test data accuracy = NaN

NaiveBayesTrainer
Summary. train accuracy mean = 1.0 stddev = 0.0 stderr = 0.0
Summary. test accuracy mean = NaN stddev = NaN stderr = NaN
Summary. test precision() mean = 1.0 stddev = 0.0 stderr = 0.0
Summary. test recall() mean = 1.0 stddev = 0.0 stderr = 0.0
Summary. test f1() mean = 1.0 stddev = 0.0 stderr = 0.0


C:\Users\toshiba>mallet classify-dir --input E:\Mallet\test_dir --output - --classifier E:\Mallet\classifier_dir\3animalClass.classifier

file:/E:/Mallet/test_dir/cat2.txt               1.0
file:/E:/Mallet/test_dir/catmaterial.txt                1.0
file:/E:/Mallet/test_dir/deertext.txt           1.0
file:/E:/Mallet/test_dir/dog%202.txt            1.0
file:/E:/Mallet/test_dir/dog3.txt               1.0
file:/E:/Mallet/test_dir/dog4.txt               1.0
file:/E:/Mallet/test_dir/dog5.txt               1.0
file:/E:/Mallet/test_dir/dog6.txt               1.0
file:/E:/Mallet/test_dir/dogmaterial.txt                1.0
C:\Users\toshiba>

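Thanks.

Best answer

There are two input options. import-dir treats directories as classes and each file in each directory as an input instance. import-file reads an input file line by line and treats the fields within each line as the label and the instance data. You are using the files-in-directories input type, so you are creating a classifier with one class and one instance. I'm guessing that you want the lines-in-file type.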
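
To see why a classifier trained on a single class reports 1.0 for every file, here is a minimal Python sketch of a multinomial Naive Bayes posterior. It is not Mallet's implementation; the function names (`train`, `posterior`), the add-one smoothing, and the tiny token lists are all assumptions made for illustration. The point it shows is that the score is a probability normalized over the labels seen in training, so with only one label it can only be 1.0, whatever the text says.

```python
import math
from collections import Counter


def train(docs_by_label):
    """Collect word counts per label. docs_by_label maps label -> list of token lists."""
    vocab = {w for docs in docs_by_label.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in docs_by_label.values())
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter(w for doc in docs for w in doc)
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab


def posterior(model, vocab, tokens):
    """Return P(label | tokens), normalized over the labels seen in training."""
    log_scores = {}
    for label, m in model.items():
        score = m["log_prior"]
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing, an assumption of this sketch
                score += math.log((m["counts"][w] + 1) / (m["total"] + len(vocab)))
        log_scores[label] = score
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {label: v / z for label, v in unnormalized.items()}


# Hypothetical toy data standing in for the txt files in the question.
dog_docs = [["dog", "bark", "puppy"], ["dog", "leash", "walk"]]
cat_docs = [["cat", "purr", "kitten"], ["cat", "whiskers", "nap"]]
deer_text = ["deer", "forest", "antler"]
cat_text = ["cat", "kitten", "nap"]

one_class, vocab1 = train({"dog": dog_docs})
two_class, vocab2 = train({"dog": dog_docs, "cat": cat_docs})

print(posterior(one_class, vocab1, deer_text))  # only one label to choose from
print(posterior(one_class, vocab1, cat_text))   # same result for any text
print(posterior(two_class, vocab2, cat_text))   # scores now differ per label
```

With one training label, both of the first two calls return 1.0 for "dog". Scores only start to vary once the training data contains two or more classes, which with import-dir would mean passing one directory per class (for example a cat directory and a dog directory, both hypothetical here) rather than a single directory.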

Regarding "machine-learning - Why does Mallet text classification output the same value 1.0 for all test files?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49649946/
