java - 计算余弦相似度

标签 java jdbc

如何使用 jdbc 计算余弦相似度来完成我的搜索引擎项目。 我有表术语频率查询来存储用户的输入,表术语频率文档来存储有关文档的所有信息,我已经完成了计算查询和文档加权。 计算余弦相似度后的输出是显示与用户输入的查询相关的文档。 我不知道,也不知道如何计算,因为它涉及数据库中的表。

最佳答案

这是计算两个句子之间的余弦相似度的程序,我希望您可以进行所需的更改以获得您想要的结果。

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * 
* @author Xiao Ma
* mail : 409791952@qq.com
*`enter code here`
*/
  public class SimilarityUtil {

public static double consineTextSimilarity(String[] left, String[] right) {
    Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
    Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
    Set<String> uniqueSet = new HashSet<String>();
    Integer temp = null;
    for (String leftWord : left) {
        temp = leftWordCountMap.get(leftWord);
        if (temp == null) {
            leftWordCountMap.put(leftWord, 1);
            uniqueSet.add(leftWord);
        } else {
            leftWordCountMap.put(leftWord, temp + 1);
        }
    }
    for (String rightWord : right) {
        temp = rightWordCountMap.get(rightWord);
        if (temp == null) {
            rightWordCountMap.put(rightWord, 1);
            uniqueSet.add(rightWord);
        } else {
            rightWordCountMap.put(rightWord, temp + 1);
        }
    }
    int[] leftVector = new int[uniqueSet.size()];
    int[] rightVector = new int[uniqueSet.size()];
    int index = 0;
    Integer tempCount = 0;
    for (String uniqueWord : uniqueSet) {
        tempCount = leftWordCountMap.get(uniqueWord);
        leftVector[index] = tempCount == null ? 0 : tempCount;
        tempCount = rightWordCountMap.get(uniqueWord);
        rightVector[index] = tempCount == null ? 0 : tempCount;
        index++;
    }
    return consineVectorSimilarity(leftVector, rightVector);
}

/**
 * The resulting similarity ranges from −1 meaning exactly opposite, to 1
 * meaning exactly the same, with 0 usually indicating independence, and
 * in-between values indicating intermediate similarity or dissimilarity.
 * 
 * For text matching, the attribute vectors A and B are usually the term
 * frequency vectors of the documents. The cosine similarity can be seen as
 * a method of normalizing document length during comparison.
 * 
 * In the case of information retrieval, the cosine similarity of two
 * documents will range from 0 to 1, since the term frequencies (tf-idf
 * weights) cannot be negative. The angle between two term frequency vectors
 * cannot be greater than 90°.
 * 
 * @param leftVector
 * @param rightVector
 * @return
 */
private static double consineVectorSimilarity(int[] leftVector,
        int[] rightVector) {
    if (leftVector.length != rightVector.length)
        return 1;
    double dotProduct = 0;
    double leftNorm = 0;
    double rightNorm = 0;
    for (int i = 0; i < leftVector.length; i++) {
        dotProduct += leftVector[i] * rightVector[i];
        leftNorm += leftVector[i] * leftVector[i];
        rightNorm += rightVector[i] * rightVector[i];
    }

    double result = dotProduct
            / (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
    return result;
}

public static void main(String[] args) {
    String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
            "loves", "me" };
    String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
            "loves", "me" };
    System.out.println(consineTextSimilarity(left,right));
}
}

关于java - 计算余弦相似度,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/5638526/

相关文章:

java - 如何从 JDBC 接收 Statement 的所有 ResultSet?

java - long.Class 和 Long.TYPE 的区别

java - JAR 文件的 Proguard 混淆

java - 是否有任何孤立的配置文件测试应用程序的技术?

java - Solr - 索引大型数据库

scala - 无法从azure databricks连接到sql server托管实例

java - 将一些节点放在Dot中的同一级别上

Java 检查一个字符串是有效的 JSON 还是有效的 XML 或者两者都不是

postgresql - clojure.java.jdbc 中的嵌套事务

java - Spring JcbcTemplate 调用 Oracle Stored Proc。 Spring 3.2