Natural Language - Document (Cosine) Similarity

About

Cosine similarity can be used to measure how similar two documents are.

Implementation

Each document is represented as a vector in a high-dimensional space, with one dimension per term. To compare two documents, we compute the cosine of the angle between their vectors.
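Written out, the comparison is the standard cosine formula. The symbols are assumptions introduced here for illustration: a and b are the two document vectors, and a_i, b_i are their weights for term i.

```latex
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
```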

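The dot product and the norms are simple functions of the bag-of-words document representations. With raw term counts as weights, the dot product sums the products of the counts of the terms the two documents share, and each norm is the square root of the sum of one document's squared counts.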
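A minimal Python sketch of these computations, assuming raw term counts as weights and a deliberately naive whitespace tokenizer. The function names (bag_of_words, dot, norm, cosine_similarity) are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count tokens after lowercasing and splitting on whitespace (naive tokenizer)."""
    return Counter(text.lower().split())


def dot(bag_a, bag_b):
    """Dot product of two term-count vectors; only shared terms contribute."""
    return sum(count * bag_b[term] for term, count in bag_a.items() if term in bag_b)


def norm(bag):
    """Euclidean length of a term-count vector."""
    return math.sqrt(sum(count * count for count in bag.values()))


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two bag-of-words vectors."""
    denominator = norm(bag_a) * norm(bag_b)
    if denominator == 0:
        return 0.0  # convention assumed here: an empty document matches nothing
    return dot(bag_a, bag_b) / denominator
```

Storing each document as a Counter keeps the vectors sparse: terms that a document does not contain are never stored, and they contribute nothing to the dot product.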

The geometric interpretation is more intuitive. When the angle between two document vectors is small, the vectors point in roughly the same direction because the documents share many tokens.

  • If the angle is small (the documents share many words), the cosine is large (close to 1).
  • If the angle is large (the documents share few words), the cosine is small (close to 0).
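The two cases can be illustrated with a small Python sketch that converts the cosine back into an angle. The helper names (cosine, angle_degrees) and the three sample sentences are invented for this example, and tokenization is plain whitespace splitting on raw term counts.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity of two texts using raw term counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0


def angle_degrees(text_a, text_b):
    """Angle between the two document vectors, in degrees."""
    # Clamp to [0, 1] to guard against floating-point overshoot before acos.
    return math.degrees(math.acos(min(1.0, max(0.0, cosine(text_a, text_b)))))


doc1 = "the cat sat on the mat"
doc2 = "the cat lay on the mat"
doc3 = "stock prices fell sharply today"

# Many shared words: small angle (about 29 degrees), cosine about 0.875.
print(round(cosine(doc1, doc2), 3), round(angle_degrees(doc1, doc2), 1))

# No shared words: the vectors are orthogonal, so the cosine is 0 and the angle is 90 degrees.
print(cosine(doc1, doc3), angle_degrees(doc1, doc3))
```

Because term counts are never negative, the cosine stays between 0 and 1 and the angle between 0 and 90 degrees.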




