1. Field of the Invention
The present invention generally relates to a search engine system communicating with a full text search engine, and more particularly relates to a search engine system that communicates with a full text search engine to retrieve the most similar documents in response to a query document, using vectors to represent documents.
2. Description of Related Art
Fundamentally, computers are tools for helping people with their everyday activities. Processors may be considered extensions of our reasoning capabilities, and storage devices may be considered extensions of our memories.
Search engines (e.g. internet search engines) allow a user to identify relevant documents in response to a query comprising, for example, one or more search terms or documents. Search engines typically make use of significant computing resources (with regard to both processing power and memory) in order to provide the user with a reliable list of potentially relevant documents.
Various electronic devices (e.g. smartphones, computers, laptops, tablet computers, notebook computers, etc.) allow a user to carry around a large database of text documents (such as electronic books, CVs, marketing reports, internal business documents, emails, SMS messages, calendar database entries, address book entries, downloaded webpages, and others). The user should be able to reliably and efficiently determine relevant text documents from the database of text documents in response to a query.
There are existing ways of storing, searching and retrieving text documents based on different techniques, such as full text keyword search, full text index, inverted index, semantic search, semantic vector analysis, vector index and vector search.
The vector space model of representing documents in high-dimensional vector spaces has been validated by decades of research and development. Extensive deployment of inverted-index-based information retrieval (IR) systems has led to the availability of robust open source IR systems such as Sphinx and Lucene, as well as the popular, horizontally scalable extensions of Lucene, Elasticsearch and Solr.
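By way of illustration only, the vector space model can be sketched as follows. This minimal sketch assumes plain term-frequency vectors, whitespace tokenization and cosine similarity as the similarity measure; the function names `to_vector`, `cosine_similarity` and `most_similar` are hypothetical and are not taken from any of the systems named above.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as a sparse term-frequency vector (assumed weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents, top_k=3):
    """Return up to top_k (score, document index) pairs, best match first."""
    query_vector = to_vector(query)
    scored = [
        (cosine_similarity(query_vector, to_vector(doc)), doc_id)
        for doc_id, doc in enumerate(documents)
    ]
    # Sort by descending score, breaking ties by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]
```

In such a sketch every document is compared against the query, which is acceptable for a handful of documents but illustrates why large collections rely on index structures.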
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
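As a simplified illustration of the idea, a single-column index may be modelled as a sorted list of (value, row number) pairs searched by bisection. The class name `ColumnIndex` and its methods are hypothetical, the table is assumed to be an in-memory list of dictionaries, and real database systems typically use more elaborate structures such as B-trees.

```python
import bisect


class ColumnIndex:
    """Sorted (value, row_id) entries for one column of an in-memory table."""

    def __init__(self, rows, column):
        # Building the index costs extra time and storage up front.
        self._entries = sorted(
            (row[column], row_id) for row_id, row in enumerate(rows)
        )
        self._keys = [key for key, _ in self._entries]

    def lookup(self, value):
        """Row ids whose column equals value, found without scanning every row."""
        lo = bisect.bisect_left(self._keys, value)
        hi = bisect.bisect_right(self._keys, value)
        return [row_id for _, row_id in self._entries[lo:hi]]

    def range(self, low, high):
        """Row ids whose column lies in [low, high], in ascending key order."""
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return [row_id for _, row_id in self._entries[lo:hi]]
```

The sorted order is what supports both point lookups and ordered range access; the trade-off is that the index must be updated or rebuilt whenever the underlying rows change.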
Generally, an inverted index is a type of database index used to optimize the search of indexed documents for the keywords of an input search query. The inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, in a document or in a set of documents.
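The mapping described above can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokenization, lowercase normalization and conjunctive (AND) matching of query keywords; the function names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every keyword in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Unknown terms contribute an empty posting set, so the result is empty.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))
```

Only the posting sets of the query keywords are touched at query time, so the cost of a search depends on those sets rather than on the total number of indexed documents.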
Today, systems based on distributional semantics and deep learning allow the construction of semantic vector space models representing words, sentences, paragraphs or even whole documents as vectors in high-dimensional spaces with accuracy superior to keyword search.
Vectors are a superior representation of documents. To allow searching through documents represented as vectors, a vector search engine is needed. However, implementing, configuring and maintaining a vector search engine is a costly, tedious and complex task. Full text search engines, on the other hand, are not costly, tedious or complex to use. Therefore, there is a need for a search engine system that performs semantic vector search for an input query document using a full text search engine to retrieve the most similar documents.