(Prediction|Recommender System) - Collaborative filtering


1 - About

Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).

More generally, collaborative filtering is the process of filtering for information or patterns using techniques that involve collaboration among multiple agents, viewpoints, data sources, etc.

Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes.

These predictions are built upon the existing ratings of other users whose ratings are similar to those of the active user.


3 - Assumption

The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly.

4 - Example

The predictions are built upon the existing ratings of other users whose ratings are similar to those of the active user. For instance, in the image below, two persons (with a green background) have ratings similar to the active user's, and both rated the movie negatively. The predicted rating for the movie is therefore negative, i.e. the active user is predicted not to like the movie.
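The example above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the ratings matrix is made up, and the similarity measure (cosine similarity of mean-centered ratings) is one common choice, not something the text prescribes. The names cosine_similarity and predict are hypothetical helpers.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, np.nan means "not rated".
ratings = np.array([
    [5.0, 1.0, 4.0, np.nan],   # active user, rating of movie 3 is unknown
    [5.0, 1.0, 5.0, 1.0],      # similar taste, disliked movie 3
    [4.0, 2.0, 4.0, 2.0],      # similar taste, disliked movie 3
    [1.0, 5.0, 1.0, 5.0],      # opposite taste, liked movie 3
])

def cosine_similarity(a, b):
    """Cosine similarity of mean-centered ratings, restricted to co-rated movies."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    x = a[both] - np.nanmean(a)
    y = b[both] - np.nanmean(b)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

def predict(ratings, user, item):
    """User's mean rating plus the similarity-weighted deviations of similar users."""
    user_mean = np.nanmean(ratings[user])
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, item]):
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        if sim <= 0:               # keep only users with similar ratings
            continue
        num += sim * (ratings[other, item] - np.nanmean(ratings[other]))
        den += sim
    return user_mean if den == 0 else user_mean + num / den

predicted = predict(ratings, user=0, item=3)
```

With this data, the two similar users both rated movie 3 below their own average, so the prediction for the active user falls below the active user's average rating.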

5 - Implementation

With collaborative filtering, the idea is to approximate the ratings matrix by factorizing it as the product of two matrices:

  • one that describes properties of each user,
  • and one that describes properties of each rated object (movies,…).

We want to select these two matrices such that the error for the users/movie pairs where we know the correct ratings is minimized.
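As a rough sketch of that error, the snippet below measures the squared error of a candidate factorization on the known ratings only. The matrix R, the convention that 0 marks an unknown rating, the rank of 2 and the helper squared_error are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 3 users x 4 movies ratings matrix; 0 marks an unknown rating.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])
known = R > 0                        # user/movie pairs whose rating is known

rank = 2                             # illustrative number of properties per user/movie
rng = np.random.default_rng(0)
U = rng.random((R.shape[0], rank))   # one row of properties per user
M = rng.random((R.shape[1], rank))   # one row of properties per movie

def squared_error(R, known, U, M):
    """Sum of squared differences between R and U @ M.T over the known ratings only."""
    diff = (R - U @ M.T)[known]
    return float(diff @ diff)

error = squared_error(R, known, U, M)
```

Unknown ratings do not contribute to the error, which is what lets the factorization fill them in afterwards with U @ M.T.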

The Alternating Least Squares (ALS) algorithm does this by first randomly filling the users matrix with values and then optimizing the values of the movies matrix such that the error is minimized. Then, it holds the movies matrix constant and optimizes the values of the users matrix. This alternation between which matrix to optimize is the reason for the “alternating” in the name.
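The alternation can be sketched with NumPy as follows. This is a minimal illustration, not Spark's implementation: the function als, its parameters, the tiny ratings matrix and the use of a ridge penalty reg are assumptions chosen to keep the example small.

```python
import numpy as np

def als(R, known, rank=2, iterations=10, reg=0.1, seed=0):
    """Alternating least squares on the known entries of R (users x movies)."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    U = rng.random((n_users, rank))   # users matrix, randomly filled
    M = np.zeros((n_movies, rank))    # movies matrix, computed in the first step
    penalty = reg * np.eye(rank)      # regularization keeps each solve well-posed

    for _ in range(iterations):
        # Hold the users matrix constant: one regularized least-squares solve per movie.
        for j in range(n_movies):
            rows = known[:, j]
            A = U[rows].T @ U[rows] + penalty
            b = U[rows].T @ R[rows, j]
            M[j] = np.linalg.solve(A, b)
        # Hold the movies matrix constant: one solve per user.
        for i in range(n_users):
            cols = known[i, :]
            A = M[cols].T @ M[cols] + penalty
            b = M[cols].T @ R[i, cols]
            U[i] = np.linalg.solve(A, b)
    return U, M

# Hypothetical ratings; 0 marks an unknown rating.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])
known = R > 0
U, M = als(R, known, rank=2, iterations=10, reg=0.1)
predictions = U @ M.T                 # includes estimates for the unknown ratings
```

Each inner solve is an ordinary regularized least-squares problem because one of the two matrices is held fixed, which is why the regularized error cannot increase from one step to the next.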

5.1 - Spark

See Spark ALS.

5.1.1 - Model Training

Create a model with ALS.train():

from pyspark.mllib.recommendation import ALS

model = ALS.train(trainingRDD, rank, seed=seed, iterations=iterations, lambda_=regularizationParameter)


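where:

  • trainingRDD is an RDD of tuples of the form (UserID, ObjectID, rating) used to train the model,
  • rank is an integer (for instance, 4, 8, or 12),
  • seed is the seed of the random initialization, which makes a run reproducible,
  • iterations is the number of iterations to execute (for instance, 5),
  • and regularizationParameter is the regularization coefficient (for instance, 0.1).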

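The most important parameter to ALS.train() is the rank, which is the number of properties (latent factors) kept for each user and each rated object. It is the dimension shared by the two matrices: when the ratings matrix is written as users matrix × objects matrix, the rank is the number of columns of the first matrix and the number of rows of the second. In general, a lower rank will mean higher error on the training dataset, but a high rank may lead to overfitting.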
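The effect of the rank on the training error can be illustrated without Spark. The sketch below swaps ALS for a truncated SVD on a small, fully observed, randomly generated matrix (both are assumptions made for the illustration), because the best rank-k approximation is then available in closed form. It shows only the training side of the trade-off, not overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical fully observed ratings between 1 and 5 (6 users x 5 movies).
R = rng.integers(1, 6, size=(6, 5)).astype(float)

# Thin SVD: R = U * diag(s) * Vt; keeping the first k terms gives a rank-k approximation.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

def training_rmse(k):
    """RMSE between R and its rank-k approximation."""
    approx = (U[:, :k] * s[:k]) @ Vt[:k]
    return float(np.sqrt(np.mean((R - approx) ** 2)))

errors = {k: training_rmse(k) for k in range(1, 6)}
```

The values in errors never increase as k grows. A low training error at a high rank says nothing about unseen ratings, which is why the rank is usually chosen on a separate validation set.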


5.1.2 - Prediction

Predict rating values by calling model.predictAll():

predictionRDD = model.predictAll(userProductRDD)

where:

  • model is the model generated with ALS.train(),
  • userProductRDD is an input RDD with the format (userID, objectID) for each entry,
  • predictionRDD is an output RDD with the format (userID, objectID, rating) for each entry.

5.1.3 - Evaluation

Evaluate the quality of the model with regression accuracy metrics.

See Spark RegressionMetrics.
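One widely used regression accuracy metric is the root mean squared error (RMSE). The plain-Python sketch below shows what it measures on (userID, objectID, rating) tuples. The helper rmse and the sample data are hypothetical; with Spark, the same pairing would be done by joining the predicted and actual RDDs on (userID, objectID).

```python
import math

def rmse(predicted, actual):
    """RMSE over the (userID, objectID) pairs present in both collections."""
    predicted_by_pair = {(u, o): r for u, o, r in predicted}
    squared_errors = [
        (predicted_by_pair[(u, o)] - r) ** 2
        for u, o, r in actual
        if (u, o) in predicted_by_pair
    ]
    if not squared_errors:
        raise ValueError("no (userID, objectID) pair in common")
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions and held-out ratings.
predicted = [(1, 10, 4.0), (1, 11, 2.0), (2, 10, 3.0)]
actual = [(1, 10, 5.0), (1, 11, 2.0), (2, 10, 1.0)]
score = rmse(predicted, actual)   # sqrt((1 + 0 + 4) / 3)
```

A lower score means the predicted ratings are closer to the actual ones. The score should be computed on ratings that were not used for training.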
