java - Mahout: Adjusted cosine similarity for item-based recommender

Tags: java mahout recommendation-engine mahout-recommender cosine-similarity

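For an assignment I am supposed to test different types of recommenders, which I have to implement first. I have been looking around for a good library to do that (I had thought of Weka at first) and stumbled upon Mahout.

I should therefore say up front: a) I am completely new to Mahout, b) I do not have a strong background in recommenders or their algorithms (otherwise I wouldn't be taking this class...) and c) sorry, but I am far from being the best developer in the world ==> I would appreciate it if you could use layman's terms (as far as possible...) :)

I have been following some tutorials (e.g. this one, as well as part 2) and got some preliminary results with item-based and user-based recommenders.

However, I am not very happy with the item-based predictions. So far I have only found similarity functions that do not take the users' rating biases into account. I was wondering whether there is something like adjusted cosine similarity. Any hints?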
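For reference, adjusted cosine similarity between two items is usually defined as below. The notation is an assumption made here for illustration and is not taken from the post: U_ij is the set of users who rated both items i and j, r_{u,i} is user u's rating of item i, and \bar{r}_u is the mean of all ratings given by user u. Subtracting \bar{r}_u is what removes each user's rating bias.

```latex
\mathrm{sim}(i, j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)\,(r_{u,j} - \bar{r}_u)}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)^2}\;
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_u)^2}}
```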

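Best answer

Here is an example of an AdjustedCosineSimilarity that I created. Keep in mind that, because of the sqrt computations, it will be slower than PearsonCorrelationSimilarity, but it can produce better results. At least for my dataset the results were much better. You have to weigh quality against performance and use whichever implementation fits your needs.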

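// Assumption: AbstractSimilarity and its computeResult(...) hook are
// package-private in Mahout, so this class has to live in the same package.
package org.apache.mahout.cf.taste.impl.similarity;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

import com.google.common.base.Preconditions;

/**
 * Adjusted cosine similarity between items: each rating is centered on
 * the mean rating of the user who gave it before the cosine is taken.
 * 
 * @author dmilchevski
 *
 */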
public class AdjustedCosineSimilarity extends AbstractSimilarity {

    /**
     * Creates new {@link AdjustedCosineSimilarity}
     * 
     * @param dataModel
     * @throws TasteException
     */
    public AdjustedCosineSimilarity(DataModel dataModel)
            throws TasteException {
        this(dataModel, Weighting.UNWEIGHTED);
    }

    /**
     * Creates new {@link AdjustedCosineSimilarity}
     * 
     * @param dataModel
     * @param weighting
     * @throws TasteException
     */
    public AdjustedCosineSimilarity(DataModel dataModel, Weighting weighting)
            throws TasteException {
        super(dataModel, weighting, true);
        Preconditions.checkArgument(dataModel.hasPreferenceValues(),
                "DataModel doesn't have preference values");
    }

    /**
     * Compute the result
     */
    @Override
    double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
        if (n == 0) {
            return Double.NaN;
        }
        // The sums of X and Y don't appear here: itemSimilarity() has
        // already centered every rating on its user's mean.
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            // For at least one item, every co-rating equals the rater's
            // mean, so the similarity is undefined under this measure
            return Double.NaN;
        }
        return sumXY / denominator;
    }

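    /**
     * Gets the average preference value of a preference array
     * @param prefs the preferences to average
     * @return the mean preference value, or 0.0 if prefs is empty
     */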
    private double averagePreference(PreferenceArray prefs){
        double sum = 0.0;
        int n = prefs.length();
        for(int i=0; i<n; i++){
            sum+=prefs.getValue(i);
        }
        if(n>0){
            return sum/n;
        }
        return 0.0d;
    }

    /**
     * Compute the item similarity between two items
     */
    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        DataModel dataModel = getDataModel();
        PreferenceArray xPrefs = dataModel.getPreferencesForItem(itemID1);
        PreferenceArray yPrefs = dataModel.getPreferencesForItem(itemID2);
        int xLength = xPrefs.length();
        int yLength = yPrefs.length();

        if (xLength == 0 || yLength == 0) {
            return Double.NaN;
        }

        long xIndex = xPrefs.getUserID(0);
        long yIndex = yPrefs.getUserID(0);
        int xPrefIndex = 0;
        int yPrefIndex = 0;

        double sumX2 = 0.0;
        double sumY2 = 0.0;
        double sumXY = 0.0;
        double sumXYdiff2 = 0.0;
        int count = 0;

        // No, pref inferrers and transforms don't apply here. I think.

        // Merge-style walk over both arrays; this relies on them being
        // ordered by user ID.
        while (true) {
            int compare = xIndex < yIndex ? -1 : xIndex > yIndex ? 1 : 0;
            if (compare == 0) {
                // The same user rated both items: center both ratings on
                // that user's mean rating to remove the user's rating bias
                double x = xPrefs.getValue(xPrefIndex);
                double y = yPrefs.getValue(yPrefIndex);
                long userID = xPrefs.getUserID(xPrefIndex);
                double userMean = averagePreference(dataModel.getPreferencesFromUser(userID));

                sumXY += (x - userMean) * (y - userMean);
                sumX2 += (x - userMean) * (x - userMean);
                sumY2 += (y - userMean) * (y - userMean);
                double diff = x - y;
                sumXYdiff2 += diff * diff;
                count++;
            }
            if (compare <= 0) {
                if (++xPrefIndex == xLength) {
                    break;
                }
                xIndex = xPrefs.getUserID(xPrefIndex);
            }
            if (compare >= 0) {
                if (++yPrefIndex == yLength) {
                    break;
                }
                yIndex = yPrefs.getUserID(yPrefIndex);
            }
        }

        // The sums are already centered on the user means, so they are
        // passed on as-is (no Pearson-style re-centering on item means).
        double result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);

        if (!Double.isNaN(result)) {
            result = normalizeWeightResult(result, count,
                    dataModel.getNumUsers());
        }
        return result;
    }

}
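
To see what the class above computes without pulling in Mahout, here is a minimal, dependency-free sketch of the same arithmetic on a toy rating matrix. Everything in it is made up for illustration: the class name AdjustedCosineDemo, the ratings[user][item] layout, and the use of NaN to mean "not rated" are assumptions, not part of Mahout or of the answer.

```java
public class AdjustedCosineDemo {

    /** Hypothetical marker for "this user did not rate this item". */
    public static final double MISSING = Double.NaN;

    /** Mean of the ratings a user actually gave; NaN if the user rated nothing. */
    public static double userMean(double[] userRatings) {
        double sum = 0.0;
        int n = 0;
        for (double r : userRatings) {
            if (!Double.isNaN(r)) {
                sum += r;
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    /**
     * Adjusted cosine similarity between items i and j, where
     * ratings[u][k] is user u's rating of item k (or MISSING).
     */
    public static double adjustedCosine(double[][] ratings, int i, int j) {
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (double[] user : ratings) {
            // Only users who rated both items contribute
            if (Double.isNaN(user[i]) || Double.isNaN(user[j])) {
                continue;
            }
            // Center both ratings on this user's own mean rating
            double mean = userMean(user);
            double x = user[i] - mean;
            double y = user[j] - mean;
            sumXY += x * y;
            sumX2 += x * x;
            sumY2 += y * y;
        }
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        // Undefined when there are no co-ratings or one item has no spread
        return denominator == 0.0 ? Double.NaN : sumXY / denominator;
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 3, 1},
            {4, 2, MISSING},
            {1, 3, 5},
        };
        System.out.println("sim(0, 1) = " + adjustedCosine(ratings, 0, 1));
        System.out.println("sim(0, 2) = " + adjustedCosine(ratings, 0, 2));
        System.out.println("sim(1, 2) = " + adjustedCosine(ratings, 1, 2));
    }
}
```

Note that this sketch recomputes a user's mean for every item pair, as the Mahout class does through averagePreference. On a real dataset, caching the per-user means would likely be the first thing to try if the similarity turns out to be too slow.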

Source: the original question "java - Mahout: Adjusted cosine similarity for item-based recommender" on Stack Overflow: https://stackoverflow.com/questions/29419222/
