The Dice similarity coefficient is the same as the F1-score, and both are monotonic in the Jaccard similarity. I worked this out recently but couldn’t find anything about it online, so here’s a writeup.

Let \(A\) be the set of found items, \(B\) the set of wanted items, and write \(AB\) for the intersection \(A \cap B\). Then \(Prec=|AB|/|A|\) and \(Rec=|AB|/|B|\). Their harmonic mean, the \(F1\)-measure, is the same as the Dice coefficient:

\begin{align*}

F1(A,B)

&= \frac{2}{1/P+ 1/R}

= \frac{2}{|A|/|AB| + |B|/|AB|}
= \frac{2|AB|}{|A| + |B|} \\

Dice(A,B)

&= \frac{2|AB|}{ |A| + |B| } \\

&= \frac{2 |AB|}{ (|AB| + |A \setminus B|) + (|AB| + |B \setminus A|)} \\

&= \frac{|AB|}{|AB| + \frac{1}{2}|A \setminus B| + \frac{1}{2} |B \setminus A|}

\end{align*}
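As a quick numerical check of the identity above, here is a minimal sketch (the two sets are made-up examples, not from the post) that computes F1 from precision and recall and compares it to the Dice coefficient computed directly:

```python
# Hypothetical example sets: A = found items, B = wanted items
A = {"cat", "dog", "bird", "fish"}
B = {"cat", "dog", "horse"}

inter = len(A & B)                     # |AB|, the intersection size
prec = inter / len(A)                  # |AB| / |A|
rec = inter / len(B)                   # |AB| / |B|

f1 = 2 / (1 / prec + 1 / rec)          # harmonic mean of precision and recall
dice = 2 * inter / (len(A) + len(B))   # Dice coefficient

assert abs(f1 - dice) < 1e-12          # they agree exactly (up to float error)
```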

It’s nice to decompose the set comparison into the three mutually exclusive partitions \(AB\), \(A \setminus B\), and \(B \setminus A\). This illustrates Dice’s close relationship to the Jaccard metric,

\begin{align*}

Jacc(A,B)

&= \frac{|AB|}{|A \cup B|} \\

&= \frac{|AB|}{|AB| + |A \setminus B| + |B \setminus A|}

\end{align*}

And in fact \(J = D/(2-D)\) and \(D=2J/(1+J)\) for any input, so they are monotonic in one another.
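The \(J = D/(2-D)\) and \(D = 2J/(1+J)\) identities can be spot-checked numerically; here is a sketch (random sets over a small universe, an arbitrary choice for illustration) verifying both directions:

```python
import random

random.seed(0)
for _ in range(100):
    # Random subsets of a small universe (arbitrary sizes for illustration)
    A = set(random.sample(range(20), random.randint(1, 10)))
    B = set(random.sample(range(20), random.randint(1, 10)))

    inter = len(A & B)
    D = 2 * inter / (len(A) + len(B))   # Dice
    J = inter / len(A | B)              # Jaccard

    # Both conversion formulas hold for every input
    assert abs(J - D / (2 - D)) < 1e-12
    assert abs(D - 2 * J / (1 + J)) < 1e-12
```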

The Tversky index (1977) generalizes them both,

\begin{align*}

Tversky(A,B;\alpha,\beta)

&= \frac{|AB|}{|AB| + \alpha|A\setminus B| + \beta|B \setminus A|}

\end{align*}

where \(\alpha\) and \(\beta\) control the magnitude of the penalties for false positive versus false negative errors. It’s easy to work out that the weighted F-measures correspond exactly to the cases where \(\alpha+\beta=1\). The Tversky index just gives a spectrum of ways to normalize the size of a two-way set intersection. (I always thought Tversky’s more mathematical earlier work (before the famous T&K heuristics-and-biases stuff) was pretty cool. In the 1977 paper he actually does an axiomatic derivation of set similarity measures, though as far as I can tell this index doesn’t strictly derive from them. Then there’s a whole debate in cognitive psych about whether similarity is a good way to characterize reasoning about objects, but that’s another story.)
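A small sketch of the Tversky index as defined above, checking the two special cases: \(\alpha=\beta=1/2\) recovers Dice (and hence F1), and \(\alpha=\beta=1\) recovers Jaccard. The function name and example sets are mine, not from the post:

```python
def tversky(A, B, alpha, beta):
    """Tversky (1977) index: |AB| / (|AB| + a|A\\B| + b|B\\A|)."""
    inter = len(A & B)
    return inter / (inter + alpha * len(A - B) + beta * len(B - A))

# Hypothetical example sets
A = {1, 2, 3, 4}
B = {3, 4, 5}

inter = len(A & B)
dice = 2 * inter / (len(A) + len(B))
jacc = inter / len(A | B)

assert abs(tversky(A, B, 0.5, 0.5) - dice) < 1e-12  # alpha = beta = 1/2 -> Dice
assert abs(tversky(A, B, 1.0, 1.0) - jacc) < 1e-12  # alpha = beta = 1   -> Jaccard
```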

So you could use either Jaccard or Dice/F1 to measure retrieval or classifier performance, since each is a monotonic function of the other. Jaccard might be a little unintuitive, though, because it’s always less than or equal to \(\min(Prec, Rec)\), while Dice/F1 always lies between the two.
