python - How to add two sparse vectors in Spark using Python

Tags: python apache-spark sparse-matrix

I have searched everywhere but can't find how to add two sparse vectors using Python. I want to add two sparse vectors like these:

(1048576, {110522: 0.6931, 521365: 1.0986, 697409: 1.0986, 725041: 0.6931, 749730: 0.6931, 962395: 0.6931})

(1048576, {4471: 1.0986, 725041: 0.6931, 850325: 1.0986, 962395: 0.6931})

Best Answer

Something like this should work:

from pyspark.mllib.linalg import Vectors, SparseVector, DenseVector
import numpy as np

def add(v1, v2):
    """Add two sparse vectors
    >>> v1 = Vectors.sparse(3, {0: 1.0, 2: 1.0})
    >>> v2 = Vectors.sparse(3, {1: 1.0})
    >>> add(v1, v2)
    SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0})
    """
    assert isinstance(v1, SparseVector) and isinstance(v2, SparseVector)
    assert v1.size == v2.size
    # Compute union of indices
    indices = set(v1.indices).union(set(v2.indices))
    # Not particularly efficient but we are limited by SPARK-10973
    # Create index: value dicts
    v1d = dict(zip(v1.indices, v1.values))
    v2d = dict(zip(v2.indices, v2.values))
    zero = np.float64(0)
    # Create dictionary index: (v1[index] + v2[index])
    values = {i: v1d.get(i, zero) + v2d.get(i, zero)
              for i in indices
              if v1d.get(i, zero) + v2d.get(i, zero) != zero}

    return Vectors.sparse(v1.size, values)
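The core merge logic here is independent of Spark: take the union of the index sets, sum the values, and drop entries that cancel to zero. A minimal sketch on plain dicts (the helper name `add_sparse_dicts` is hypothetical, not part of any API) illustrates it without needing a pyspark installation:

```python
def add_sparse_dicts(size, d1, d2):
    """Merge two {index: value} dicts over the union of their indices,
    dropping entries whose sum is zero (mirrors the SparseVector helper above)."""
    indices = set(d1) | set(d2)
    values = {i: d1.get(i, 0.0) + d2.get(i, 0.0)
              for i in indices
              if d1.get(i, 0.0) + d2.get(i, 0.0) != 0.0}
    return (size, values)

# Same data as the doctest above:
pair = add_sparse_dicts(3, {0: 1.0, 2: 1.0}, {1: 1.0})
```

Note the `!= 0.0` filter: it is what keeps the result canonical (no stored zeros), at the cost of evaluating each sum twice.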

If you prefer a single pass and don't mind the introduced zeros, you can modify the code above like this:

from collections import defaultdict

def add(v1, v2):
    assert isinstance(v1, SparseVector) and isinstance(v2, SparseVector)
    assert v1.size == v2.size
    values = defaultdict(float) # Dictionary with default value 0.0
    # Add values from v1
    for i in range(v1.indices.size):
        values[v1.indices[i]] += v1.values[i]
    # Add values from v2
    for i in range(v2.indices.size):
        values[v2.indices[i]] += v2.values[i]
    return Vectors.sparse(v1.size, dict(values))
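The trade-off of this single-pass version is that entries which cancel out remain stored as explicit zeros. A dict-based sketch (the function name `add_single_pass` is illustrative, not an existing API) makes the difference visible:

```python
from collections import defaultdict

def add_single_pass(size, d1, d2):
    """Single-pass merge of two {index: value} dicts; entries that sum
    to zero are kept as explicit 0.0 values (like the defaultdict version above)."""
    values = defaultdict(float)
    for i, x in d1.items():
        values[i] += x
    for i, x in d2.items():
        values[i] += x
    return (size, dict(values))

# Cancelling values leave an explicit zero behind:
pair = add_single_pass(3, {0: 1.0}, {0: -1.0})
```

Whether those stored zeros matter depends on what you do downstream; operations like `numNonzeros` on the resulting vector would count them.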

If you want, you can try to monkey patch SparseVector:

SparseVector.__add__ = add
v1 = Vectors.sparse(5, {0: 1.0, 2: 3.0})
v2 = Vectors.sparse(5, {0: -3.0, 2: -3.0, 4: 10})
v1 + v2
## SparseVector(5, {0: -2.0, 4: 10.0})

Alternatively, you should be able to use scipy.sparse:

from scipy.sparse import csc_matrix
from pyspark.mllib.regression import LabeledPoint

m1 = csc_matrix((
   v1.values,
   (v1.indices, [0] * v1.numNonzeros())),
   shape=(v1.size, 1))

m2 = csc_matrix((
   v2.values,
   (v2.indices, [0] * v2.numNonzeros())),
   shape=(v2.size, 1))

LabeledPoint(0, m1 + m2)
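The scipy route can also be tested without pyspark, since the addition itself happens in scipy. A sketch with stand-in arrays for `v1.indices`/`v1.values` (the variable names below are assumptions for illustration) shows how to recover `{index: value}` pairs from the summed column matrix, e.g. to rebuild a `Vectors.sparse`:

```python
import numpy as np
from scipy.sparse import csc_matrix

# Stand-ins for v1/v2 from the monkey-patch example above
idx1, val1 = np.array([0, 2]), np.array([1.0, 3.0])
idx2, val2 = np.array([0, 2, 4]), np.array([-3.0, -3.0, 10.0])
size = 5

# Build size-by-1 column matrices, as in the snippet above
m1 = csc_matrix((val1, (idx1, [0] * len(idx1))), shape=(size, 1))
m2 = csc_matrix((val2, (idx2, [0] * len(idx2))), shape=(size, 1))

# COO format exposes the (row, value) pairs of the sum directly
total = (m1 + m2).tocoo()
result = dict(zip(total.row.tolist(), total.data.tolist()))
```

In my runs scipy's sparse addition drops entries that sum to exactly zero, so `result` matches the monkey-patch output `{0: -2.0, 4: 10.0}`, but treat that elimination behavior as an implementation detail rather than a guarantee.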

Regarding "python - How to add two sparse vectors in Spark using Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32981875/
