Spark - Dense Vector


1 - About

A DenseVector is a class in the pyspark.mllib.linalg module.

DenseVector is used to store arrays of numeric values for use in PySpark.

3 - Implementation

DenseVector actually stores values in a NumPy array and delegates calculations to that object.

Note that:

  • DenseVector stores all values as np.float64, so even if you pass in a NumPy array of integers, the resulting DenseVector will contain floating-point numbers.
  • DenseVector objects exist locally and are not inherently distributed. To use them in a distributed setting, either pass functions that close over them to resilient distributed dataset (RDD) transformations or distribute them directly as RDDs.
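The float64 coercion described in the first point can be seen with NumPy alone. The following is a sketch of the equivalent conversion, not the actual pyspark source:

```python
import numpy as np

# An integer NumPy array, like one you might pass to DenseVector()
int_input = np.array([1, 2, 3])

# DenseVector stores values as np.float64; the coercion is roughly
# equivalent to this NumPy call
coerced = np.array(int_input, dtype=np.float64)

print(int_input.dtype)  # an integer dtype (platform dependent, e.g. int64)
print(coerced.dtype)    # float64
```

Every element of the resulting array is a float, regardless of the input dtype.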

4 - Construction

You can create a new DenseVector by calling DenseVector() with a NumPy array or a Python list.

from pyspark.mllib.linalg import DenseVector

# Create a DenseVector consisting of the values [3.0, 4.0, 5.0]
myDenseVector = DenseVector([3.0, 4.0, 5.0])

5 - Functions

5.1 - dot

The dot method operates just like np.ndarray.dot().

import numpy as np
from pyspark.mllib.linalg import DenseVector

# NumPy vector
numpyVector = np.array([-3, -4, 5])
print('\nnumpyVector:\n{0}'.format(numpyVector))

# Dense vector
myDenseVector = DenseVector([3.0, 4.0, 5.0])
print('myDenseVector:\n{0}'.format(myDenseVector))

# The dot product between the two vectors:
# (3 * -3) + (4 * -4) + (5 * 5) = -9 - 16 + 25 = 0
denseDotProduct = myDenseVector.dot(numpyVector)
print('\ndenseDotProduct:\n{0}'.format(denseDotProduct))

Output:

numpyVector:
[-3 -4  5]
myDenseVector:
[3.0,4.0,5.0]
denseDotProduct:
0.0
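Because DenseVector delegates its calculations to NumPy, the zero result above can be reproduced (and sanity-checked) with np.dot alone:

```python
import numpy as np

# The same two vectors as above, as plain NumPy arrays
a = np.array([3.0, 4.0, 5.0])
b = np.array([-3, -4, 5])

# DenseVector.dot delegates to NumPy, so np.dot gives the same answer:
# -9.0 - 16.0 + 25.0 = 0.0
print(np.dot(a, b))
```

The two vectors happen to be orthogonal, which is why the dot product is exactly zero.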