A core area of study in information retrieval is clustering, also known as cluster-based retrieval. Cluster-based retrieval rests on the Cluster Hypothesis:
Closely associated documents tend to be relevant for the same requests.
Clustering is of great interest in the search engine field because it can improve both the relevance of results and the user experience.
In practice, clustering means running an algorithm, based on predefined heuristics, that measures the similarity between the documents in a set. Generally this is done at query time (live). The algorithm can be run across the entire corpus of documents or across pre-screened candidate sets that are known to be related to the query. Documents whose similarity to one another exceeds a certain threshold are grouped into a cluster. The clusters are given semantically meaningful titles and then presented to the user alongside related clusters. The order of presentation is important: clusters should be listed in descending order, beginning with the cluster most relevant to the query and ending with the cluster that is still related to the query but least likely to provide relevant information to the user.
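As a rough illustration of the steps above, here is a minimal Python sketch. It is only one way to do it, under assumptions of my own: documents are compared as bag-of-words vectors using cosine similarity, grouping is a simple single-pass scheme, and the function names (`vectorize`, `cluster_documents`, `rank_clusters`, `label`), the sample documents, and the 0.4 threshold are all hypothetical. Real systems use far more sophisticated heuristics.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_documents(docs, threshold=0.4):
    """Single-pass grouping (an assumed, simplified scheme): a document joins
    the first cluster it resembles at or above the threshold, otherwise it
    starts a new cluster."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [str]}
    for doc in docs:
        vec = vectorize(doc)
        for cluster in clusters:
            if cosine(vec, cluster["centroid"]) >= threshold:
                cluster["members"].append(doc)
                # The summed vector stands in for the centroid; cosine
                # similarity is unaffected by the missing division.
                cluster["centroid"] += vec
                break
        else:
            clusters.append({"centroid": Counter(vec), "members": [doc]})
    return clusters


def rank_clusters(clusters, query):
    """Order clusters from most to least similar to the query."""
    q = vectorize(query)
    return sorted(clusters, key=lambda c: cosine(q, c["centroid"]), reverse=True)


def label(cluster, query, n=2):
    """Naive title: the n most frequent cluster terms not in the query."""
    query_terms = set(vectorize(query))
    terms = [t for t, _ in cluster["centroid"].most_common() if t not in query_terms]
    return terms[:n]


if __name__ == "__main__":
    # Hypothetical candidate set for the ambiguous query "jaguar cat".
    docs = [
        "jaguar car dealer prices",
        "jaguar car engine review",
        "jaguar cat habitat rainforest",
        "jaguar cat diet prey",
    ]
    query = "jaguar cat"
    for cluster in rank_clusters(cluster_documents(docs), query):
        print(label(cluster, query), cluster["members"])
```

With these sample documents, the two animal pages and the two car pages end up in separate clusters, and the animal cluster is listed first because it is closer to the query. Note that a single-pass scheme like this is sensitive to document order and to the threshold chosen, which is one reason production systems rely on more robust methods.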
A sound clustering system can profoundly improve the user’s ability to locate the documents best suited to their needs. The best example of such a system that I have found to date is Microsoft’s Search Result Clustering (SRC) tool, which is currently in beta. SRC applies a variety of top-level clustering methods to furnish intelligently clustered results. These methods include:
- Query disambiguation
- Sub-topics discovery
- Fact finding
- Relationship finding
Each of these is applied automatically, depending on the type of query entered. The system is well worth a test run for anyone who uses search engines for informational research. SRC also offers a toolbar, which can be downloaded from the SRC Toolbar page.