Ronan Cummins

Query Topic Models Using SPUD

Query topic modelling (QTM) expands an initial query with terms that are topically related to it. The approach is more effective than the relevance modelling (RM) approach (Lavrenko, 2001). In particular, it extracts terms from the top-ranked documents of an initial retrieval run that are more likely to have been generated by the topical aspect of those documents than by a background model. In short, the main reason for its improved effectiveness is that it penalises noisy terms in a theoretically consistent manner, unlike the RM3 approach.

QTMs can also be used as a fast way of automatically creating topic models from a few initial seed words. The output of the approach is the probability of a term being drawn from the query topic model.

For example, consider the expansion terms for the query {us gun violence} below. The relevance modelling approach ranks many common words highly (e.g. have, use, also, he). Although these could be removed using a stopword list, the ranking of terms is quite different from that of the QTM approach (and nowadays search engines use only limited stopword removal). The Lucene code is available here.

Related terms and their term-selection score on Wikipedia for the topic ***us gun violence***

| SPUD-RM3 term | p(t\|RM) | SPUD-QTM term | p(Q\|t) |
| --- | --- | --- | --- |
| gun | 0.0332 | firearm | 0.9928 |
| have | 0.0143 | gun | 0.9845 |
| violence | 0.0119 | violence | 0.9767 |
| firearm | 0.0116 | control | 0.8328 |
| law | 0.0091 | law | 0.8168 |
| control | 0.0068 | weapon | 0.8119 |
| use | 0.0061 | ban | 0.7995 |
| from | 0.0060 | crime | 0.7804 |
| state | 0.0060 | advocate | 0.7500 |
| united | 0.0055 | policy | 0.7227 |
| us | 0.0049 | handgun | 0.7172 |
| other | 0.0048 | rifle | 0.7096 |
| states | 0.0048 | prevent | 0.7001 |
| weapon | 0.0046 | us | 0.7000 |
| crime | 0.0045 | enforcement | 0.6952 |
| which | 0.0041 | legislation | 0.6749 |
| also | 0.0040 | state | 0.6442 |
| handgun | 0.0039 | homicide | 0.6441 |
| he | 0.0039 | firearms | 0.6406 |
| rate | 0.0036 | criminal | 0.6395 |
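
The difference between the two columns can be illustrated with a toy sketch. This is not the SPUD model and not the released Lucene code: it swaps in plain maximum-likelihood multinomials and assumes a simple two-component mixture of a feedback (topical) model and a background model with an equal prior. The function names, the `prior` parameter, and the tiny corpus are all hypothetical, chosen only to show why a posterior of the form p(topic|t) pushes common words down while an RM-style p(t|RM) does not.

```python
from collections import Counter


def term_probs(docs):
    """Maximum-likelihood unigram model over a list of tokenised documents."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def rm_scores(feedback_docs):
    """RM-style score: average of per-document p(t|d) with uniform document weights.

    Assumption: real relevance models weight documents by query likelihood.
    """
    scores = Counter()
    for d in feedback_docs:
        for t, c in Counter(d).items():
            scores[t] += (c / len(d)) / len(feedback_docs)
    return dict(scores)


def topic_posterior(feedback_docs, background_docs, prior=0.5):
    """Posterior that a term came from the topical component, not the background.

    Assumption: a two-component multinomial mixture with a fixed prior,
    standing in for the SPUD-based model described in the text.
    """
    p_topic = term_probs(feedback_docs)
    # The background includes the feedback documents, so every term has p > 0.
    p_bg = term_probs(background_docs + feedback_docs)
    return {
        t: prior * p_topic[t] / (prior * p_topic[t] + (1 - prior) * p_bg[t])
        for t in p_topic
    }


# Hypothetical toy data: two top-ranked documents and three background documents.
feedback = [
    "gun violence have law firearm also".split(),
    "firearm gun control have law".split(),
]
background = [
    "have also use from which other".split(),
    "have state also from rate he".split(),
    "music have film also use".split(),
]

rm = rm_scores(feedback)
post = topic_posterior(feedback, background)

for term in sorted(post, key=post.get, reverse=True):
    print(f"{term:10s} rm={rm[term]:.4f} posterior={post[term]:.4f}")
```

In this toy data the common word `have` scores as highly as `gun` under the RM-style average, because both occur once in each feedback document, whereas the posterior ranks `have` below `firearm` because `have` is also frequent in the background.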

