1. Field
One embodiment hereof relates to a document retrieval system, specifically to such a system which can search documents with images and stock market data, and other non-word-based documents.
2. Description of Prior Art
Word-based information retrieval, especially Internet document search, has advanced much further than non-word-based information retrieval. Non-word-based information includes images in physics, medicine, geology, science, engineering, etc. Non-word-based information also includes stock market information, which is primarily represented by curves. In contrast to word-based information, which contains strings of words, non-word-based information contains data over an n-dimensional space, and each datum comprises a plurality of values from m measurements, where m and n are integers.
With respect to word-based information, a word-based document consists of strings of words. Note that words may include regular words and other “words” such as email addresses, dates, numbers, URLs (uniform resource locators—Internet addresses), etc. For non-word-based information, such as stock market information, the data associated with a stock includes prices and transaction volume, which are usually represented in curves that show such parameters over time. While word-based documents can be quickly searched in a data source or even on the Internet, there is no way to search non-word-based stock market information related to a particular stock in a data source efficiently and systematically, not to mention searching such information on the Internet.
Some US patents disclose methods for stock market analysis and forecasting. U.S. Pat. No. 6,012,042 to Black et al. (Jan. 4, 2000) discloses a method for converting time-series-based data and non-time-series-based data of a stock into a unified format for stock market analysis. U.S. Pat. No. 6,853,991 to Kermani (Feb. 8, 2005) discloses a method for stock market forecasting based on fuzzy logic. U.S. Pat. No. 6,901,383 to Ricketts et al. (May 31, 2005) discloses another method for stock market forecasting by formulating stock purchase indices from last trading data. However, none of these methods is able to efficiently and systematically search and retrieve non-word-based stock market information. In other words, insofar as I am aware, there is no way to efficiently and systematically search non-word-based stock data in a data source or on the Internet in the way a word-based document is searched.
Other non-word-based information comprises images, including photographs and pictures. An image shows a value or a combination of values over a two-dimensional array. A picture can be a regular color picture taken by a camera, an X-ray picture, an infrared picture, an ultrasound picture, etc. Similarly, there is no efficient and systematic way to search a specific image of interest (e.g., an eye) embedded in an image document (e.g., a human face), which is stored in a stack of image documents (e.g., various pictures), not to mention an Internet search of such an image.
Some known searching methods are able to retrieve information from image documents, albeit inefficiently. U.S. Pat. No. 5,010,581 to Kanno (Apr. 23, 1991) discloses a method for retrieving an image document using a retrieval code, which is not an image. U.S. Pat. No. 5,748,805 to Withgott et al. (May 5, 1998) and U.S. Pat. No. 6,396,951 to Grefenstette (May 28, 2002) disclose methods for searching word-based documents by searching for an image of the word-based document, for example, an input from a scanner. The image is interpreted to provide a word-based meaning, for example, using an OCR (optical character reader). However, insofar as I am aware, there is no method for efficiently and systematically searching image data in a data source or on the Internet in the way a word-based document is searched.
In general, non-word-based information contains data comprising a plurality of values obtained from m measurements over n-dimensional space. Stock data mentioned above comprises multiple values (various kinds of prices, transaction volume, etc.) over a one-dimensional space, which is time. A color picture has three values over a two-dimensional space, generally R, G, and B, representing red, green, and blue values over the space. Insofar as I am aware, there is also no efficient and systematic way to search information containing data that comprises m-values over n-dimensional space.
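As an illustration only, the notion of m values over an n-dimensional space can be sketched with nested Python lists. The helper name `dimensions_and_measurements` and all sample values are hypothetical placeholders and are not taken from any real stock or image.

```python
def dimensions_and_measurements(data):
    """Return (n, m) for data stored as n levels of nested lists of m-tuples."""
    n = 0
    node = data
    # Each level of list nesting corresponds to one dimension of the space.
    while isinstance(node, list):
        n += 1
        node = node[0]
    # The innermost element is a tuple holding the m measured values.
    return n, len(node)


# Hypothetical stock data: (price, volume) at successive times, so n = 1, m = 2.
stock = [(10.0, 1200), (10.5, 900), (10.2, 1500)]

# Hypothetical 2x2 color picture: (R, G, B) at each pixel, so n = 2, m = 3.
picture = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

stock_shape = dimensions_and_measurements(stock)
picture_shape = dimensions_and_measurements(picture)
```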
Traditionally, to detect an image of interest having M² pixels in an image document having N² pixels, where M and N can be any integers and N>M, a mathematical process called correlation is required, which includes M²×N² steps of operation. If the data source contains k documents, k×M²×N² steps are needed. Similarly, to scan k documents in an n-dimensional pattern, k×Mⁿ×Nⁿ steps are needed. The number of steps needed grows rapidly as the size of the document increases, and exponentially with the dimension n.
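The step counts above can be checked with a minimal sketch. It assumes that one "step" is one multiply-and-add, that the pattern is tried at every pixel of the document, and that pixels beyond the border count as zero. The function names `correlation_steps` and `naive_correlate` are hypothetical and chosen only for this illustration.

```python
def correlation_steps(k, M, N, n=2):
    """Step count k * M^n * N^n for scanning k documents in n dimensions."""
    return k * (M ** n) * (N ** n)


def naive_correlate(image, pattern):
    """Brute-force correlation of an M x M pattern over an N x N image.

    Returns the score map and the number of multiply-and-add steps performed.
    Pixels outside the image are treated as zero (an assumption of this sketch).
    """
    N, M = len(image), len(pattern)
    steps = 0
    scores = [[0] * N for _ in range(N)]
    for r in range(N):                  # N^2 candidate positions
        for c in range(N):
            total = 0
            for i in range(M):          # M^2 products per position
                for j in range(M):
                    rr, cc = r + i, c + j
                    pixel = image[rr][cc] if rr < N and cc < N else 0
                    total += pixel * pattern[i][j]
                    steps += 1
            scores[r][c] = total
    return scores, steps


# Hypothetical 4x4 document containing a 2x2 block of ones at rows 1-2, columns 1-2.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pattern = [[1, 1], [1, 1]]
scores, steps = naive_correlate(image, pattern)
```

With M = 2 and N = 4, the loop performs 2²×4² = 64 steps, in agreement with `correlation_steps(1, 2, 4)`, and the highest score appears where the pattern overlaps the block exactly.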
On the other hand, methods for searching word-based documents in a data source or on the Internet are widely known in the art and are used in word-based search engines. In principle, a basic way to search word-based documents can be explained as follows.
A data source contains a plurality of word-based documents: for example, Doc 1, Doc 2, Doc 3, Doc 4, . . . Doc n. These documents may be fetched or collected from the Internet.
A document consists of strings of words. For example, Doc 2 may include the words, “. . . He is a computer science professor at XYZ University. You may contact him at prof@xyz.edu. . . . ”
Each document is decomposed, also known in the art as tokenized, into a collection of components or tokens. For example, the tokens of Doc 2 may include: computer, science, professor, xyz, university, contact, prof@xyz.edu, computer science, computer science professor, xyz university, etc.
The tokens of all documents are collected in a master collection of tokens for indexing. A list of documents containing a specific token can then be compiled. Each token has its own list. For example, for the token “university”, a list might be: Doc 2, Doc 3, Doc 6, Doc 15, Doc 22; for the token “prof@xyz.edu”, a list might be: Doc 2, Doc 25; etc.
When a query is presented, the query is also tokenized in the same way. The semantic collection of query tokens is searched over the indexed master collection of tokens. If the token “university” is found in a query, the search engine will return Doc 2, Doc 3, Doc 6, Doc 15, and Doc 22. If the query contains a logic operation among tokens such as “university” AND “prof@xyz.edu”, the result will be Doc 2. And so on.
The order of the documents displayed may follow the matching scores. For example, the matching score may be determined by the frequency of token occurrence in the document, the position of the token in the document, or other criteria.
When the matching document is displayed, the matching tokens, which are words, may be flagged or highlighted.
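The tokenize, index, and query steps described above can be sketched as a small inverted index. This is a minimal illustration and not the implementation of any particular search engine: the function names are hypothetical, the tokenizer emits single-word tokens only (no multi-word tokens such as “computer science”), and the texts of Doc 3 and Doc 25 are invented for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def build_index(docs):
    """Map each token to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search_and(index, query):
    """Return the documents containing every token of the query (logical AND)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Doc 2 uses the sentence from the description; the other texts are made up.
docs = {
    "Doc 2": "He is a computer science professor at XYZ University. "
             "You may contact him at prof@xyz.edu.",
    "Doc 3": "The university library is open.",
    "Doc 25": "Send mail to prof@xyz.edu.",
}
index = build_index(docs)
single = search_and(index, "university")
both = search_and(index, "university prof@xyz.edu")
```

Building the index visits each token once, and answering a query requires only looking up and intersecting the per-token lists, which is why this kind of search avoids position-by-position comparison against every document.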
The main advantage of prior-art word-based document searches, which are based on tokenization, is that they are linear processes and contain no exponential complexity, so that the search can be performed efficiently and systematically.
To summarize, search methods used for word-based information retrieval are linear processes, which are efficient and systematic. However, these methods cannot be directly applied to non-word-based information. Insofar as I am aware, the only method available for non-word-based information retrieval has exponential complexity. Consequently, as far as I am aware, no method is available for efficiently and systematically searching stock data or image data, or other non-word-based data in a data source or on the Internet.