Query Topic Models Using SPUD
Query topic modelling (QTM) expands an initial query with terms that are topically related to it. It extracts terms from the top-retrieved documents of an initial retrieval run that are more likely to have been generated by the topical aspect of those documents than by a background model. The approach is more effective than the relevance modelling (RM) approach (Lavrenko, 2001), mainly because it penalises noisy terms in a theoretically consistent manner, unlike the RM3 approach. QTMs can also be used as a fast way of automatically creating topic models from a few initial seed words. For each term, the approach outputs the probability that the term was drawn from the query topic model.

For example, consider the expansion terms for the query {us gun violence} below. The relevance modelling approach ranks many common words highly (e.g. have, use, also, he). Although these could be removed using a stopword list, the ranking of terms still differs considerably from that of the QTM approach (and search engines nowadays use only limited stopword removal). The Lucene code is available here.
SPUD-RM3 term | p(t\|RM) | SPUD-QTM term | p(Q\|t) |
---|---|---|---|
gun | 0.0332 | firearm | 0.9928 |
have | 0.0143 | gun | 0.9845 |
violence | 0.0119 | violence | 0.9767 |
firearm | 0.0116 | control | 0.8328 |
law | 0.0091 | law | 0.8168 |
control | 0.0068 | weapon | 0.8119 |
use | 0.0061 | ban | 0.7995 |
from | 0.0060 | crime | 0.7804 |
state | 0.0060 | advocate | 0.7500 |
united | 0.0055 | policy | 0.7227 |
us | 0.0049 | handgun | 0.7172 |
other | 0.0048 | rifle | 0.7096 |
states | 0.0048 | prevent | 0.7001 |
weapon | 0.0046 | us | 0.7000 |
crime | 0.0045 | enforcement | 0.6952 |
which | 0.0041 | legislation | 0.6749 |
also | 0.0040 | state | 0.6442 |
handgun | 0.0039 | homicide | 0.6441 |
he | 0.0039 | firearms | 0.6406 |
rate | 0.0036 | criminal | 0.6395 |
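
The contrast in the table can be illustrated with a deliberately simplified sketch. It is not the SPUD model and not the Lucene implementation. It fits a plain two-component multinomial mixture (topic versus background) with EM, in the style of mixture-model feedback, and reports the posterior probability that each term came from the topic. The function name `topic_posteriors`, the mixing weight `lam`, the toy documents and the background probabilities are all assumptions made up for illustration.

```python
from collections import Counter


def topic_posteriors(feedback_docs, background, lam=0.2, iters=50):
    """Return {term: P(topic | term)} under a two-component multinomial mixture.

    Assumptions (illustrative only, not the SPUD-QTM estimator):
      * feedback_docs is a list of token lists (the top-retrieved documents),
      * background maps every feedback term to a non-zero collection probability,
      * lam is a fixed prior probability that a token is topical.
    """
    counts = Counter(t for doc in feedback_docs for t in doc)
    total = sum(counts.values())
    # Start the topic model at the pooled relative frequencies.
    topic = {t: c / total for t, c in counts.items()}
    post = {}
    for _ in range(iters):
        # E-step: posterior that an occurrence of t was generated by the topic.
        post = {
            t: lam * topic[t] / (lam * topic[t] + (1 - lam) * background[t])
            for t in counts
        }
        # M-step: re-estimate the topic model from the expected topical counts.
        weighted = {t: counts[t] * post[t] for t in counts}
        norm = sum(weighted.values())
        topic = {t: w / norm for t, w in weighted.items()}
    return post


# Toy feedback set for the query {us gun violence}; the data is invented.
feedback_docs = [
    ["gun", "violence", "have", "law", "gun"],
    ["firearm", "have", "gun", "control", "use"],
    ["violence", "have", "use", "crime", "gun"],
]
# Invented background probabilities: common words are far more frequent.
background = {
    "gun": 0.001, "violence": 0.001, "firearm": 0.0005, "law": 0.002,
    "control": 0.002, "crime": 0.001, "have": 0.05, "use": 0.03,
}

# Crude stand-in for an RM-style estimate: pooled relative frequency,
# i.e. uniform document weights over equal-length documents.
pooled = Counter(t for doc in feedback_docs for t in doc)
rm_like = {t: c / sum(pooled.values()) for t, c in pooled.items()}

posteriors = topic_posteriors(feedback_docs, background)

for term in sorted(posteriors, key=posteriors.get, reverse=True):
    print(f"{term:10s} rm_like={rm_like[term]:.4f} p_topic={posteriors[term]:.4f}")
```

In this toy setting the frequency-based estimate ranks "have" above "firearm" simply because it occurs more often, whereas the mixture posterior pushes "have" and "use" down because the background explains them well. This mirrors the qualitative pattern in the table, under the stated assumptions.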