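I am building a bag-of-words module from scratch, and I am not sure whether removing punctuation is best practice for this approach. Consider this sentence: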
I've been "DMX world center" for long time ago.Are u?
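Question: for a bag of words, which token should I count?
- DMX (without the quotation mark) or "DMX (with the left quotation mark)
- u (without the question mark) or u? (with the question mark)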
In short, should I remove all punctuation when collecting the distinct words?
Thanks in advance.
Update: this is the code I implemented.
Sample text: ham, me.. on a ski trip. I was wondering if you plan to get everyone together before we leave... for a meet-and-greet kind of activity? Cheers,
HashSet<String> bagOfWords = new HashSet<String>();
BufferedReader reader = new BufferedReader(new FileReader(path));
while (reader.ready()) {
    String msg = reader.readLine().split("\t", 2)[1].toLowerCase(); // I take only the 2nd part; the 1st part indicates whether the message is spam or ham
    String[] words = msg.split("[\\s+\n.\t!?+,]"); // this is the regex that I've used to split words
    for (String word : words) {
        bagOfWords.add(word);
    }
}
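To see what that split does with the sentence above, here is a small self-contained check. The class name `SplitDemo` and the helper `tokens` are made up for illustration, and `Locale.ROOT` is added only so that lowercasing does not depend on the machine's locale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SplitDemo {
    // Hypothetical helper: lowercases the text and splits it with the
    // same character class as the code above.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[\\s+\n.\t!?+,]"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("I've been \"DMX world center\" for long time ago.Are u?"));
    }
}
```

With this regex the question mark is already dropped from u?, but the quotation marks are not in the character class, so "dmx and center" stay in the bag with the quotes attached.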
Best answer
Try replacing your loop with this:
while (reader.ready()) {
    String msg = reader.readLine().split("\t", 2)[1].toLowerCase(); // take only the 2nd part; the 1st part indicates whether the message is spam or ham
    String[] words = msg.split("[\\s+\n.\t!?+,]"); // same split regex as in the question
    for (String word : words) {
        // removes the special characters you mentioned
        // note: inside [...] "!-+" is a range (! through +), so it also matches " # $ % & ' ( ) *
        bagOfWords.add(word.replaceAll("[!-+.^:,\"?]"," ").trim());
    }
}
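As an alternative, here is a minimal sketch that does the splitting and the punctuation removal in one pass. It is not the code from the answer: the class `BagOfWordsSketch` and the method `bagOfWords` are hypothetical names, and keeping apostrophes inside words (so that i've stays one token) is an assumption about what is wanted. Whether punctuation should be dropped at all depends on the task, so treat this as one possible choice.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class BagOfWordsSketch {
    // Hypothetical helper: lowercases the text, splits on runs of characters
    // that are not letters, digits, or apostrophes, and skips empty tokens.
    static Set<String> bagOfWords(String text) {
        Set<String> bag = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            // Drop apostrophes left at the edges, e.g. from 'quoted' words.
            String word = token.replaceAll("^'+|'+$", "");
            if (!word.isEmpty()) {
                bag.add(word);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(bagOfWords("I've been \"DMX world center\" for long time ago.Are u?"));
    }
}
```

For the example sentence this yields dmx and u without the quotation mark or the question mark, and no empty string is added to the set.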
Regarding "java - Is punctuation kept in a bag of words?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18605846/