Count-Based Index
term: [{ docID: word_count }]
Motivation
In an inverted index that contains only document information (previous page), the features are binary (1 contains a term, 0 otherwise). This information is important, but it is too coarse to find the best few documents when there are a lot of possible matches.
For instance, consider the query "tropical fish". Three documents match this query: .
The data in the document-based index gives us no reason to prefer any of these documents over any other.
Additional Data
Look at the index above. This index has the same words and the same number of postings, and the first number in each posting is the document ID, which is the same as in the previous index.
However, each posting now has a second number. This second number is the number of times the word appears in the document.
{
tropical: [{ 1: 2 }, { 2: 2 }, { 3: 1 }],
fish: [{ 1: 2 }, { 2: 3 }, { 3: 2 }, { 4: 2 }],
}
With this small amount of additional data, we are able to prefer over and for the query "tropical fish", since contains "tropical" twice and "fish" three times.
Why Word Count
In general, word counts can be a powerful predictor of document relevance.
In particular, word counts can help distinguish documents that are about a particular subject from those that discuss that subject in passing.
Imagine two documents:
- doc about tropical fish.
- doc about tropical islands.
The doc about tropical islands would probably contain the word "fish", but only a few times. On the other hand, the doc about tropical fish would contain the word "fish" many times.