1. Field of the Invention
The present invention relates to an information retrieval system and method. More particularly, it relates to a query and document topic category transition analysis system and method that classify the query topic category of a query input by a user as a set of keywords and the document topic category of a document that the user regards as relevant and selects from the information retrieval results, and that analyze the transition between the query topic category and the document topic category. It further relates to a query expansion-based information retrieval system and method that expand a query input by a user using the topic category transition analysis result and retrieve corresponding information or documents using the expanded query.
2. Discussion of Related Art
Conventional techniques for online (Internet) information retrieval services include a document similarity ranking technique for a search engine, a topic category-based document classification technique, and a topic category-based log analysis technique.
A Document Similarity Ranking Technique for a Search Engine (Hereinafter, “Conventional Art 1”)
In conventional art 1, documents relevant to a query input by a user are retrieved based on the similarity between each document and the query. Most commercial information retrieval web portals currently in service use a search engine to rank various kinds of web content, such as blogs, knowledge, images, news, and shopping information, against retrieval queries and provide users with the ranked retrieval results.
To this end, all documents on the web have to be indexed in advance. Using the document indexes, the search engine statistically analyzes the terms of the documents and the links between documents, generates retrieval results suitable for the user's query in the form of a ranked list (a set of links pointing to documents), and provides the results to the user through a web page.
However, information retrieval ranking usually relies on the text and metadata of documents and on relation information between documents (for example, links or topic categories). Methods that give a high rank to content attracting public attention are used only to a limited extent, and there is a problem in that user preference, which depends on the query category of the information retrieval keywords entered by the user, is excluded from the factors that determine information retrieval ranking.
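The index-then-rank flow described above can be illustrated with a minimal Python sketch. It is only an assumption-laden illustration, not the ranking method of any particular search engine: the function names, the sample documents, and the summed TF-IDF scoring are all hypothetical, and link analysis between documents is omitted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, num_docs):
    """Return doc ids ordered by a summed TF-IDF score over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any indexed document
        idf = math.log(num_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy collection, indexed in advance of any query.
docs = {
    "d1": "jaguar car dealer price",
    "d2": "jaguar habitat rainforest animal",
    "d3": "car price comparison car loan",
}
index = build_index(docs)
ranked = rank("car price", index, len(docs))
print(ranked)
```

In this sketch the score depends only on term statistics of the documents, so two users issuing the same keywords always receive the same ranked list.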
Topic Category-Based Document Classification Technique (Hereinafter, “Conventional Art 2”)
In conventional art 2, when an information retrieval system is constructed, each document is classified in advance into one or more predefined topic categories.
For example, a document classification process of conventional art 2 is described below.
First, documents are represented in a form suitable for machine learning; as part of this representation step, appropriate features are selected and weighted.
Then, in order to allocate categories accurately within a reasonable time, a document categorization rule is learned, and newly input documents are classified according to the learning result.
In particular, when an already constructed text-based taxonomy is available, a method is used in which input vectors are extracted from the input documents, their similarity to vectors representing the predefined topic categories is computed, and topic categories are allocated to the documents accordingly.
The document classification process described above may be applied in various fields, such as a voice recognition-based automatic call classification system for a customer center, a topic category classification system for the advertisement content of keyword advertisements, and an automatic classification system for web sites, patents, academic literature, and books.
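The vector-based allocation described in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the bag-of-words vectors, the cosine similarity measure, the two category names, and their seed terms are hypothetical and are not taken from any actual taxonomy or from conventional art 2 itself.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term-frequency vector (a simplifying assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical predefined topic categories, each represented by a vector.
CATEGORY_VECTORS = {
    "Recreation/Autos": to_vector("car engine dealer price vehicle"),
    "Science/Biology": to_vector("animal habitat species rainforest"),
}


def classify(text, category_vectors, top_k=1):
    """Allocate up to top_k categories with non-zero similarity to the text."""
    doc_vec = to_vector(text)
    scored = sorted(
        ((cosine(doc_vec, vec), name) for name, vec in category_vectors.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [name for score, name in scored[:top_k] if score > 0]


labels = classify("jaguar habitat and rainforest animal", CATEGORY_VECTORS)
print(labels)
```

Setting `top_k` above 1 allows a document to be allocated to multiple predefined categories rather than a single one.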
Meanwhile, methods of automatically identifying the topic category of a user query or of a document using a continuously evolving taxonomy such as the Open Directory Project (hereinafter, “ODP”) have been attempted, but no research has been conducted on analyzing the transition between a query topic category and the category of relevant documents.
Topic Category-Based Log Analysis Technique (Hereinafter, “Conventional Art 3”)
In conventional art 3, a user navigation path is detected based on session information in the web log related to a query input by a user, session information in the web log related to the retrieval results for the query, and the topic categories of the user's queries and of the content the user reads; transitions along the navigation path are then analyzed and used in an information retrieval system.
For example, “Analysis of Topic Dynamics in Web Search” by Xuehua Shen et al., Int. Conf. of World Wide Web, 2005, reports an experiment in which topic category transitions between the web pages a user visits after issuing a query were analyzed and learned with a Markov model, by time and by user scope (individual, group, or general public), in order to anticipate the web page the user will visit next. The experimental results showed that user behavior could be anticipated to some extent, and that performance improved when users were grouped by similar behavior and analyzed per group.
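As a rough illustration of Markov-model-based topic transition analysis, the following Python sketch estimates first-order transition probabilities between topic categories from visit sequences and predicts the most likely next category. It is an assumption for illustration only and is not the implementation used in the cited experiment: the session data and category names are hypothetical, and the time and user-group dimensions are omitted.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Estimate P(next category | current category) from visit sequences."""
    counts = defaultdict(Counter)
    for seq in sessions:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[cur] = {n: c / total for n, c in nxt_counts.items()}
    return probs


def predict_next(probs, current):
    """Return the most probable next category, or None if current is unseen."""
    options = probs.get(current)
    if not options:
        return None
    # Iterate in sorted order so ties resolve deterministically.
    return max(sorted(options), key=options.get)


# Hypothetical sessions: each list is the sequence of topic categories
# of the pages one user visited.
sessions = [
    ["Sports", "Sports", "News"],
    ["Sports", "Shopping"],
    ["News", "Sports", "Sports"],
]
probs = fit_transitions(sessions)
print(predict_next(probs, "Sports"))
```

Fitting a separate transition table per user or per user group, instead of one table for all sessions, would correspond to the personal and group settings mentioned above.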
However, the conventional art described above anticipates the topic category of a web page which a user will visit without considering the difference between the query input by the user and the web page visited by the user.
Also, the conventional art described above uses the ODP taxonomy, but has a problem in that it uses only a small number (15) of the highest-level (coarse-grained) topic categories and therefore cannot perform precise (fine-grained) topic category classification based on the ODP taxonomy.
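The difference between coarse-grained and fine-grained classification over a hierarchical taxonomy can be illustrated with a short Python sketch. The slash-separated category paths below are hypothetical examples written in the style of a hierarchical directory, and `truncate_category` is an illustrative helper, not part of any ODP tooling.

```python
def truncate_category(path, depth):
    """Keep only the first `depth` levels of a slash-separated category path."""
    parts = path.split("/")
    return "/".join(parts[:depth])


# Hypothetical fine-grained category paths assigned to three documents.
paths = [
    "Computers/Internet/Searching",
    "Computers/Software/Databases",
    "Recreation/Autos",
]

# Top-level only: distinct subtopics collapse into one coarse category.
coarse = {truncate_category(p, 1) for p in paths}
# Three levels deep: the subtopics remain distinguishable.
fine = {truncate_category(p, 3) for p in paths}

print(sorted(coarse))
print(sorted(fine))
```

With top-level categories only, the two "Computers" documents become indistinguishable, which is the loss of precision described above.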
For the foregoing reasons, there is an urgent need for technology that takes into account the tendency for the topic of interest to a user when entering information retrieval keywords to differ from the topic of interest to the user when selecting a document regarded as relevant from the information retrieval results. Such technology should analyze the transition of the user's topic of interest more precisely and classify the user's intention or topic of interest into more detailed query and document topic categories.
There is also an urgent need for technology which can analyze topic category transition between a user query and a relevant document (a document selected by a user) more precisely based on a query and document topic category classification.
There is also an urgent need for technology which automatically extracts a topic transition tendency from a user log, expands a user query based on query and document topic category transition analysis, and thus provides information retrieval results with high user satisfaction.
There is also an urgent need for technology which can detect a topic category of a document (content) which is attracting public attention or which a user prefers according to a topic and give documents corresponding to the topic category high rankings among retrieval results.