Data Mining - Similarity

> (Statistics|Probability|Machine Learning|Data Mining|Data and Knowledge Discovery|Pattern Recognition|Data Science|Data Analysis)

1 - About

Similarity is usually expressed as a distance: the closer two items are, the more similar they are.

You can find similarities by looking at:

  • the metadata:
    • Were they created at roughly the same time?
    • Do they tend to get the same ratings?
  • user behavior (browsing, playing, searching)

What’s similar depends on who you’re talking about. Take director Pedro Almodóvar. You might have four very different movies by Almodóvar. But he’s such a strong voice that, by himself, he makes those videos similar to one another. For a different director—say, Spielberg—that might not be the case.

Similarity is symmetric.

3 - Function

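  • Regular (“Euclidean”) distance: the square root of the sum of squared differences. When distances are only compared with one another (for instance, to find the closest instance), the square root can be skipped because it does not change the ordering.
  • Manhattan (“city‐block”) distance: the sum of absolute differences.
  • Nominal attributes: distance = 1 if the values are different, 0 if they are the same.
  • Cosine similarity (a measure of the angle between two vectors, used for example for string similarity): the dot product of two vectors equals the product of their lengths and the cosine of the angle between them, so the cosine is the dot product divided by the product of the lengths.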
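
The functions listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample points are assumptions made for the example, and instances are assumed to be equal-length sequences of values.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared differences (Euclidean distance without the square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Regular Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nominal(a, b):
    """Nominal attributes: each attribute contributes 1 if different, 0 if same."""
    return sum(0 if x == y else 1 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product over product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero-length vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical sample data: a query point and three candidate instances.
query = (0, 0)
candidates = [(3, 4), (1, 1), (6, 8)]

# Ranking by squared distance gives the same order as ranking by true distance,
# which is why the square root can be skipped when only comparing.
by_squared = sorted(candidates, key=lambda c: squared_euclidean(query, c))
by_true = sorted(candidates, key=lambda c: euclidean(query, c))
```

Here `euclidean((0, 0), (3, 4))` is 5.0 and `manhattan((0, 0), (3, 4))` is 7, while `by_squared` and `by_true` hold the candidates in the same order. Note that cosine similarity is a similarity, not a distance: larger values mean the vectors point in more similar directions.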