Ranking is an integral part of many information retrieval problems, such as online advertising, sentiment analysis, collaborative filtering and document retrieval.
Training data consists of queries and the documents matching them, together with a degree of relevance associated with each match. Human assessors may prepare the data manually by checking the results returned for a set of queries and determining the relevance of each result (Liu, 2009). Assessing the relevance of every document is not practical, which is why a method called pooling is used: only the top few documents retrieved by some existing ranking models are assessed. Training data can also be derived automatically, for example from click-through logs and query chains.
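As a rough illustration of pooling, the sketch below merges the top documents returned by several ranking models into a single pool for human assessment. The function name build_pool, the depth parameter and the document identifiers are assumptions made for this example; they do not come from Liu (2009).

```python
def build_pool(rankings, depth):
    """Collect the top `depth` documents from each ranking into one pool.

    `rankings` is a list of ranked lists of document ids, one list per
    ranking model. Each document appears in the pool at most once, in
    the order in which it is first encountered.
    """
    pool = []
    seen = set()
    for ranking in rankings:
        for doc_id in ranking[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool


# Hypothetical output of two ranking models for the same query.
model_a = ["d1", "d2", "d3"]
model_b = ["d2", "d4", "d1"]
pool = build_pool([model_a, model_b], depth=2)
```

Only the documents in the pool would be passed to the assessors; documents outside the pool are left unjudged.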
Boolean systems were first developed and marketed about 30 years ago, when computing power was minimal. These systems therefore required users to supply enough syntactic restrictions in their queries to limit the number of documents retrieved, and the retrieved documents were not ranked in any order with respect to the user's query. Boolean systems offer powerful online search capabilities to librarians and other trained intermediaries, but they serve end users poorly (Xia et al., 2008). End users are familiar with the terminology of the data set they are searching, but they lack the training and practice needed to get good results from a system that demands complicated query syntax. A ranking approach to retrieval is much more oriented towards end users: it lets the user enter a simple query, such as a phrase or sentence, and get back a list of documents ranked by relevance.
The ranking, or natural language, approach is popular because it is effective for end users: every term in the query is used for retrieval, and the results are ranked according to the co-occurrence of the query terms. This approach does away with Boolean syntax and produces results even when a query term is wrong. It also suits complex queries that users would find difficult to express in Boolean logic.
The Vector Space Model
The sample document and query vectors discussed in section 14.2 can be viewed as vectors in an n-dimensional vector space, where n is the number of unique terms in the data set. The similarity between a query and a document can then be measured with a vector matching operation based on the cosine correlation, that is, the cosine of the angle between the two vectors (Bradley, 1997). The following notation is used:
td_ij = the ith term in the vector for document j
tq_ik = the ith term in the vector for query k
n = the number of unique terms within the data set
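The cosine correlation is the dot product of the document and query vectors divided by the product of their lengths. The sketch below shows one way to compute it and to rank documents by it. The function names, the toy vectors and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between a document vector and a query vector.

    Both vectors must have n entries, one per unique term in the data set.
    """
    if len(doc_vec) != len(query_vec):
        raise ValueError("vectors must have the same number of terms")
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    doc_norm = math.sqrt(sum(d * d for d in doc_vec))
    query_norm = math.sqrt(sum(q * q for q in query_vec))
    if doc_norm == 0 or query_norm == 0:
        # Assumption: a vector with no terms is treated as matching nothing.
        return 0.0
    return dot / (doc_norm * query_norm)


def rank_documents(docs, query_vec):
    """Return (document name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(vec, query_vec))
              for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data set with n = 3 unique terms.
docs = {"doc_a": [1, 0, 1], "doc_b": [0, 1, 0], "doc_c": [1, 1, 1]}
ranking = rank_documents(docs, [1, 0, 1])
```

Because the measure depends only on the angle between the vectors, a long document is not favoured over a short one simply for containing more terms.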
This model has been used as the basis for many ranking retrieval experiments, most notably the SMART system experiments carried out under Salton and his associates. These ranking experiments began in 1964 at Harvard University and moved to Cornell University in 1968, and they form a large part of the research in information retrieval. They cover many areas of information retrieval, including phrases, synonyms, clustering and relevance feedback.
Maron and Kuhns proposed and tested a model based on probabilistic indexing in 1960, but the most widely used probabilistic model was developed by Robertson and Sparck Jones in 1976. It is based on the premise that terms appearing in documents previously retrieved and judged relevant for a given query should be weighted more highly than terms that did not appear in those relevant documents. The table below presents the distribution of term t in the relevant and non-relevant documents for query q (Brazdil et al., 2003).
N = the number of documents within the collection
R = the number of relevant documents for query q
n = the number of documents containing term t
r = the number of relevant documents containing term t
This table is used to derive four formulas that reflect the relative distribution of terms in the relevant and non-relevant documents; these formulas are then used for term weighting.
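As a sketch of how such weights can be computed, the code below evaluates the four formulas as they are commonly stated for the Robertson and Sparck Jones model, using the quantities N, R, n and r defined above. It assumes the usual two-by-two layout of the table: r relevant documents contain the term, R - r relevant documents do not, n - r non-relevant documents contain it, and N - n - R + r non-relevant documents do not. The function name and the example counts are hypothetical, and the sketch requires every cell to be non-zero rather than applying any smoothing.

```python
import math


def rsj_weights(N, R, n, r):
    """Return the four relevance weights (F1, F2, F3, F4) for one term.

    N: documents in the collection
    R: relevant documents for the query
    n: documents containing the term
    r: relevant documents containing the term
    """
    cells = (r, R - r, n - r, N - n - R + r)
    if any(cell <= 0 for cell in cells):
        # Assumption: no smoothing, so an empty cell cannot be handled.
        raise ValueError("every cell of the table must be positive")
    f1 = math.log((r / R) / (n / N))
    f2 = math.log((r / R) / ((n - r) / (N - R)))
    f3 = math.log((r / (R - r)) / (n / (N - n)))
    f4 = math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))
    return f1, f2, f3, f4


# Hypothetical counts: 100 documents, 10 relevant, 20 contain the term,
# and 5 of those 20 are relevant.
f1, f2, f3, f4 = rsj_weights(N=100, R=10, n=20, r=5)
```

A term that is concentrated in the relevant documents gets a positive weight under all four formulas; the formulas differ in whether they compare against the whole collection or only against the non-relevant documents.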
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.
Brazdil, P. B., Soares, C., & Da Costa, J. P. (2003). Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3), 251-277.
Liu, T. Y. (2009). Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3), 225-331.
Xia, F., Liu, T. Y., Wang, J., Zhang, W., & Li, H. (2008, July). Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning (pp. 1192-1199). ACM.