Cosine similarity, Pearson correlation, and OLS coefficients can all be viewed as variants of the inner product, adjusted in different ways for centering and magnitude (i.e., location and scale).
Details:
You have two vectors \(x\) and \(y\) and want to measure similarity between them. A basic similarity function is the inner product
\[ Inner(x,y) = \sum_i x_i y_i = \langle x, y \rangle \]
If x tends to be high where y is also high, and low where y is low, the inner product will be high — the vectors are more similar.
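For concreteness, here is the inner product in numpy; a minimal sketch, with made-up example vectors that get reused below.

```python
import numpy as np

# Made-up toy vectors, just for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

inner = np.dot(x, y)   # same as (x * y).sum()
print(inner)           # 28.0
```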
The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors’ L2 norms, giving the cosine similarity
\[ CosSim(x,y) = \frac{\sum_i x_i y_i}{ \sqrt{ \sum_i x_i^2} \sqrt{ \sum_i y_i^2 } }
= \frac{ \langle x,y \rangle }{ ||x||\ ||y|| }
\]
This is actually bounded between 0 and 1 if \(x\) and \(y\) are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors; this is easy to illustrate for vectors in \(\mathbb{R}^2\).
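And the cosine version, again as a small numpy sketch (cos_sim is my own helper, not a library function):

```python
import numpy as np

def cos_sim(x, y):
    """Inner product divided by the two L2 norms."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

print(cos_sim(x, y))        # between 0 and 1 here, since both inputs are non-negative
print(cos_sim(x, 10 * y))   # unchanged: rescaling an input doesn't change the angle
```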
Cosine similarity is not invariant to shifts. If \(x\) were shifted to \(x+1\), the cosine similarity would change. What is invariant, though, is the Pearson correlation. Let \(\bar{x}\) and \(\bar{y}\) denote the means of \(x\) and \(y\):
\begin{align}
Corr(x,y) &= \frac{ \sum_i (x_i-\bar{x}) (y_i-\bar{y}) }{
\sqrt{\sum (x_i-\bar{x})^2} \sqrt{ \sum (y_i-\bar{y})^2 } }
\\
& = \frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{
||x-\bar{x}||\ ||y-\bar{y}||} \\
& = CosSim(x-\bar{x}, y-\bar{y})
\end{align}
Correlation is the cosine similarity between centered versions of \(x\) and \(y\), again bounded between -1 and 1. People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples. Cosine similarity is already invariant to (positive) rescaling of either input; the correlation is additionally invariant to location shifts of \(x\) and \(y\).
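A quick numerical check of \(Corr(x,y) = CosSim(x-\bar{x},\ y-\bar{y})\), sketched with numpy (np.corrcoef is numpy's stock Pearson correlation; the toy data is the same as above):

```python
import numpy as np

def cos_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

centered_cos = cos_sim(x - x.mean(), y - y.mean())
pearson = np.corrcoef(x, y)[0, 1]
print(np.isclose(centered_cos, pearson))                      # True

# Correlation shrugs off shifting and (positive) rescaling of either input...
print(np.isclose(np.corrcoef(2 * x + 5, y)[0, 1], pearson))   # True
# ...while cosine similarity does not survive the shift.
print(np.isclose(cos_sim(x + 5, y), cos_sim(x, y)))           # False
```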
Viewing the correlation as a centered cosine isn't the usual way to derive it; usually the Pearson correlation is presented as a normalized form of the covariance, which is an averaged inner product of the centered vectors (with no normalization by their norms)
\[ Cov(x,y) = \frac{\sum (x_i-\bar{x})(y_i-\bar{y}) }{n}
= \frac{ \langle x-\bar{x},\ y-\bar{y} \rangle }{n} \]
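(For what it's worth, numpy's np.cov defaults to the \(n-1\) sample convention; bias=True matches the \(n\) version written here.)

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

cov_by_hand = np.dot(x - x.mean(), y - y.mean()) / len(x)
print(np.isclose(cov_by_hand, np.cov(x, y, bias=True)[0, 1]))   # True
```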
Finally, these are all related to the coefficient in a one-variable linear regression. For the OLS model \(y_i \approx ax_i\) with Gaussian noise, whose MLE is the least-squares problem \(\arg\min_a \sum (y_i - ax_i)^2\), a few lines of calculus show that \(a\) is
\begin{align}
OLSCoef(x,y) &= \frac{ \sum x_i y_i }{ \sum x_i^2 }
= \frac{ \langle x, y \rangle}{ ||x||^2 }
\end{align}
This looks like another normalized inner product. But unlike cosine similarity, we aren’t normalizing by \(y\)’s norm — instead we only use \(x\)’s norm (and use it twice): denominator of \(||x||\ ||y||\) versus \(||x||^2\).
Not normalizing for \(y\) is what you want for the linear regression: if \(y\) were stretched to span a larger range, you would need to increase \(a\) to match, so that your predictions spread out too.
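A minimal check of the closed form against numpy's generic least-squares solver, on the same toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

a_closed_form = np.dot(x, y) / np.dot(x, x)              # <x, y> / ||x||^2

# The same no-intercept regression y ~ a*x, solved by lstsq:
a_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
print(np.isclose(a_closed_form, a_lstsq))                # True

# No normalization for y: stretching y stretches the coefficient along with it.
print(np.isclose(np.dot(x, 10 * y) / np.dot(x, x), 10 * a_closed_form))   # True
```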
Often it's desirable to fit the OLS model with an intercept term: \(\min_{a,b} \sum (y_i - ax_i - b)^2\). Then \(a\) is
\begin{align}
OLSCoefWithIntercept(x,y) &= \frac
{ \sum (x_i - \bar{x}) y_i }
{ \sum (x_i - \bar{x})^2 }
= \frac{\langle x-\bar{x},\ y \rangle}{||x-\bar{x}||^2}
\\
&= OLSCoef(x-\bar{x}, y)
\end{align}
It's different because the intercept term picks up the slack associated with where \(x\)'s center is, so OLSCoefWithIntercept is invariant to shifts of \(x\). It's still different from cosine similarity, since it still doesn't normalize for \(y\) at all. Though, subtly, it does actually control for shifts of \(y\). This isn't obvious from the equation, but with a little arithmetic it's easy to derive that \(
\langle x-\bar{x},\ y \rangle = \langle x-\bar{x},\ y+c \rangle \) for any constant \(c\). (There must be a nice geometric interpretation of this.)
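Filling in the arithmetic, and the geometry along with it: the centered vector \(x-\bar{x}\) sums to zero, so the constant shift contributes nothing,
\[ \langle x-\bar{x},\ y+c \rangle = \langle x-\bar{x},\ y \rangle + c \sum_i (x_i-\bar{x}) = \langle x-\bar{x},\ y \rangle. \]
Geometrically, \(x-\bar{x}\) is orthogonal to the all-ones vector, and shifting \(y\) by \(c\) only moves it along that direction. Here is a numerical check of both the closed form and the shift invariance, sketched with numpy (np.polyfit with degree 1 fits the same slope-plus-intercept model; the toy data is made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

xc = x - x.mean()
a_closed_form = np.dot(xc, y) / np.dot(xc, xc)    # <x - xbar, y> / ||x - xbar||^2
a_polyfit = np.polyfit(x, y, deg=1)[0]            # slope of the fit y ~ a*x + b
print(np.isclose(a_closed_form, a_polyfit))       # True

# Invariant to shifting either input; the intercept soaks up the difference.
print(np.isclose(np.polyfit(x + 7.0, y - 3.0, deg=1)[0], a_polyfit))   # True
```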
Finally, what if x and y are standardized: both centered and normalized to unit standard deviation? The OLS coefficient for that is the same as the Pearson correlation between the original vectors. I’m not sure what this means or if it’s a useful fact, but:
\[ OLSCoef\left(
\sqrt{n}\frac{x-\bar{x}}{||x-\bar{x}||},
\sqrt{n}\frac{y-\bar{y}}{||y-\bar{y}||} \right) = Corr(x,y) \]
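And a numerical check of this last identity; standardize and ols_coef below are my own helper names, and the \(\sqrt{n}\) factor is what gives the standardized vectors unit (population) standard deviation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])
n = len(x)

def standardize(v):
    """Center, then scale to unit (population) standard deviation."""
    vc = v - v.mean()
    return np.sqrt(n) * vc / np.linalg.norm(vc)

def ols_coef(x, y):
    """No-intercept OLS coefficient: <x, y> / ||x||^2."""
    return np.dot(x, y) / np.dot(x, x)

print(np.isclose(ols_coef(standardize(x), standardize(y)),
                 np.corrcoef(x, y)[0, 1]))   # True
```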
Summarizing: Cosine similarity is a normalized inner product. Pearson correlation is a centered cosine similarity. A one-variable OLS coefficient is like a cosine similarity but with one-sided normalization. With an intercept, it's also centered.
Of course we need a summary table. “Symmetric” means, if you swap the inputs, do you get the same answer. “Invariant to shift in input” means, if you add an arbitrary constant to either input, do you get the same answer.
| Function | Equation | Symmetric? | Output range | Invariant to shift in input? | Pithy explanation in terms of something else |
|---|---|---|---|---|---|
| Inner(x,y) | \(\langle x, y\rangle\) | Yes | \(\mathbb{R}\) | No | |
| CosSim(x,y) | \(\frac{\langle x,y \rangle}{\Vert x\Vert\ \Vert y\Vert}\) | Yes | \([-1,1]\); \([0,1]\) if inputs are non-negative | No | normalized inner product |
| Corr(x,y) | \(\frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{\Vert x-\bar{x}\Vert\ \Vert y-\bar{y}\Vert}\) | Yes | \([-1,1]\) | Yes | centered cosine; or normalized covariance |
| Cov(x,y) | \(\frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{n}\) | Yes | \(\mathbb{R}\) | Yes | centered inner product |
| OLSCoefNoIntcpt(x,y) | \(\frac{\langle x, y \rangle}{\Vert x\Vert^2}\) | No | \(\mathbb{R}\) | No | (compare to CosSim) |
| OLSCoefWithIntcpt(x,y) | \(\frac{\langle x-\bar{x},\ y \rangle}{\Vert x-\bar{x}\Vert^2}\) | No | \(\mathbb{R}\) | Yes | |
Are there any implications? I've been wondering for a while why cosine similarity tends to be so useful for natural language processing applications. Maybe this has something to do with it. Or not. One implication of all this inner product structure is computational: there are strategies for computing these quantities quickly on high-dimensional sparse data; the Friedman et al. 2010 glmnet paper talks about this in the context of coordinate descent text regression. I've heard Dhillon et al., NIPS 2011 apply LSH in a similar setting (but haven't read it yet). And there's lots of work using LSH for cosine similarity; e.g. van Durme and Lall 2010 [slides].
Any other cool identities? Any corrections to the above?
References: I use Hastie et al 2009, chapter 3 to look up linear regression, but it’s covered in zillions of other places. I linked to a nice chapter in Tufte’s little 1974 book that he wrote before he went off and did all that visualization stuff. (He calls it “two-variable regression”, but I think “one-variable regression” is a better term. “one-feature” or “one-covariate” might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts.