Skip to main content

Count-Based Index

word count inverted list

term: [{ docID: word_count }]

Motivation

In an inverted index that contains only document information (previous page), the features are binary (1 contains a term, 0 otherwise). This information is important, but it is too coarse to find the best few documents when there are a lot of possible matches.

For instance, consider the query "tropical fish". Three documents match this query: S1,S2,S3S_1, S_2, S_3.

The data in the document-based index gives us no reason to prefer any of these documents over any other.

Additional Data

Look at the index above. This index has the same words and the same number of postings, and the first number in each posting is the document ID, which is the same as in the previous index.

However, each posting now has a second number. This second number is the number of times the word appears in the document.

{
tropical: [{ 1: 2 }, { 2: 2 }, { 3: 1 }],
fish: [{ 1: 2 }, { 2: 3 }, { 3: 2 }, { 4: 2 }],
}

With this small amount of additional data, we are able to prefer S2S_2 over S1S_1 and S3S_3 for the query "tropical fish", since S2S_2 contains "tropical" twice and "fish" three times.

Why Word Count

In general, word counts can be a powerful predictor of document relevance.

In particular, word counts can help distinguish documents that are about a particular subject from those that discuss that subject in passing.

Example

Imagine two documents:

  • doc about tropical fish.
  • doc about tropical islands.

The doc about tropical islands would probably contain the word "fish", but only a few times. On the other hand, the doc about tropical fish would contain the word "fish" many times.