1. Field of the Invention
This invention relates, in general, to retrieval systems and methods of searching for information on the Internet.
2. Description of the Prior Art
At present, the Internet is one of the main sources of information for people, together with television, radio, newspapers, books, magazines and other printed media.
Most information on the Internet takes the form of Web sites, which are stored on numerous network servers. Retrieval systems, such as Google.com, Yahoo.com, Search.com, Rambler.ru and others, are used to search for information on the Internet. Web sites are registered in retrieval systems: URL addresses and key words are specified for each Web site as a whole and for its separate Web pages. This information is stored in a database on the server of the retrieval system.
In order to find needed information, a user enters key words in the designated field of a retrieval system. The search is performed on the basis of said key words by special searching programs, which look up the relevant key words in the databases of the retrieval system and provide corresponding links to the accessible Web sites and/or Web pages on the Internet. The collected information is stored on the server of the retrieval system in the form of a list of URL addresses of Web sites and Web pages corresponding to the key words specified by the user. The user normally sees on the screen of his computer only a portion of the collected information, i.e. a list of 10-20 URL addresses out of the total number of Web sites found by the retrieval system. The user can then access any Web site and/or Web page with the help of a browser by selecting the corresponding URL address provided by the retrieval system.
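The key word lookup described above can be illustrated with a minimal sketch. It is not the implementation of any actual retrieval system; the URLs, the registered key words and the names `REGISTERED_SITES`, `build_index` and `search` are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical registrations: URL -> key words supplied for the site.
REGISTERED_SITES = {
    "http://example.com/jaguar-cars": ["jaguar", "car", "dealer"],
    "http://example.org/big-cats": ["jaguar", "cat", "wildlife"],
    "http://example.net/recipes": ["soup", "recipe"],
}

def build_index(sites):
    """Map each key word to the set of URLs registered under it."""
    index = defaultdict(set)
    for url, keywords in sites.items():
        for word in keywords:
            index[word.lower()].add(url)
    return index

def search(index, query, page_size=10):
    """Return the first page of matching URLs and the total number found."""
    matches = set()
    for word in query.lower().split():
        matches |= index.get(word, set())
    ordered = sorted(matches)  # alphabetical order, for a stable listing
    return ordered[:page_size], len(ordered)

index = build_index(REGISTERED_SITES)
first_page, total = search(index, "jaguar")
```

In this sketch the full result list is kept on the "server" side, and only the first `page_size` URL addresses are returned for display, mirroring the 10-20 addresses the user normally sees.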
Retrieval systems use various algorithms for searching on the basis of key words. The common feature of these algorithms is that, for some requests, extremely long lists with hundreds, thousands and even millions of URL addresses can be returned whenever the retrieval system finds any relation between the requested key words and the URL addresses. Given the amount of information available on the Internet, this situation is not uncommon. In most cases the user is unable to browse all of the offered information. Experience shows that there is no need to browse all the provided URL addresses, because only a few tens or a few hundreds of them are truly related to what the user is looking for. The rest of the information is in most cases irrelevant to the request: it is miscellaneous information from different branches of knowledge, different fields of human activity and so on. Moreover, it is not always certain that the required information will be found even on the first page of the search results. This problem arises because a search via key words is based on mathematical algorithms, such as a comparison of the requested key words with the key words specified for or in Web sites, an estimation of the number of matches between the requested key words and the words in the title or in the text of Web pages, and so on. Search results obtained on the basis of mathematical algorithms do not always reflect the meaning of a site's information. Therefore the user receives a huge amount of unnecessary information in response to his request. As the amount of information on the Internet steadily increases, this problem will worsen. Improving search algorithms that operate on the basis of key words will not solve this problem, because identical key words can occur in sources of information belonging to different branches of knowledge, different fields of human activity and so on.
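The limitation of match counting can be shown with a small sketch. The scoring function below is an assumed, simplified stand-in for the mathematical algorithms mentioned above, and the two pages are invented examples.

```python
import re

def match_score(query, title, text):
    """Count occurrences of the query words in a page title and body."""
    words = re.findall(r"\w+", (title + " " + text).lower())
    return sum(words.count(q) for q in query.lower().split())

# Two hypothetical pages from different branches of knowledge.
pages = {
    "zoology": ("Jaguar habitat",
                "The jaguar is a large cat native to the Americas."),
    "automotive": ("Jaguar dealers",
                   "Find a Jaguar dealer and test drive a new car."),
}

scores = {field: match_score("jaguar", *page) for field, page in pages.items()}
```

Both pages receive the same score for the key word "jaguar", so a count of matches alone gives no indication of which branch of knowledge the user has in mind.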
A flow of unnecessary information slows down the operation of local and global computer networks, increases the demand for extra space on the hard disks of the servers of retrieval systems, places additional requirements on the improvement of searching programs based on the analysis of key words, and causes inefficient usage of other material and human resources.
Special skill is needed in selecting key words in order to find the required information. A change in the order of the key words or in the search phrase often affects the search result. If key words have homonyms, the user receives information for both the intended and the unintended meanings of these key words.
Existing retrieval systems do not provide a possibility of selecting the required data from the obtained search results on the basis of specified criteria.
Existing retrieval systems do not give owners of Web sites any guarantee that their site will appear in the list of search results, even if its content completely corresponds to the specified key words. Some retrieval systems apply mathematical methods for evaluating the specified key words. Some retrieval systems apply mathematical methods for estimating the popularity and ranking of Web sites, which allows the Web sites with the highest rank to appear in the list of the first 10-20 URL addresses. To increase the rating of a Web site artificially, some owners create spam Web sites, which increase the number of references to the target Web site. Some Web design companies develop and offer methods of increasing the rating of Web sites. These measures do not improve the situation for those searching for information.
Some retrieval systems attempt to improve the quality of the search for information by introducing catalogues. Catalogues are available at Google.com, Yahoo.com, Apport.ru and others. These catalogues have a small number of main categories (generally fewer than 20). This is insufficient for the amount of information available on the Internet and does not solve the problem of improving the quality of the search for information on the Internet. These catalogues typically include the following categories: computers, work, education, house, society, entertainment, recreation, sport, manufacture, business, Internet for kids, mass media, inquiries and so on. Evidently, retrieval systems attempt to classify edutainment and entertainment information, as in the opinion of the retrieval systems this kind of information is more popular among users of the Internet. However, all the information available on the Internet must be classified, including information required by scientists, politicians, students and others.
There are a great number of patents devoted to the problem of searching for information on the Internet. The following patents are the most relevant to the subject of the proposed invention.
In U.S. Pat. No. 5,369,763, “Data storage and retrieval system with improved database structure” by Biles, dated November 29, 1994, a system for storing and searching information, based on a modified U.S. Library of Congress Classification System, is proposed for a local computer system. According to this patent, data on numerous topics and subjects are stored in the Subject Database. Descriptor phrases associated with every subject and topic are entered into this database together with identifying information. Data based on a classification system are stored in the Typology Database. The Identification Database facilitates access to the information stored in the Subject Database. Titles of topics, designation numbers, corresponding descriptor phrases and identification information from the Subject Database are stored in the Composite Catalogue. With the help of the stored descriptor phrases related to a specific topic, a user can find the needed information. The search proceeds as follows. The user selects a descriptor phrase. The number of this descriptor phrase is then looked up in the Composite Catalogue. The desired information is retrieved using this number. Alphabetical sorting and sorting by level in the catalogue are proposed in this patent. Only predefined descriptor phrases can be used in said patent; arbitrary descriptor phrases cannot be used for searching. This limits the freedom and capability of searching. Moreover, the proposed retrieval system does not deal with searching for information on the Internet.
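The search sequence summarized above (descriptor phrase, then designation number, then stored information) can be sketched as two dictionary lookups. This is only an illustration of the sequence as described here, not the data structures of the cited patent; the phrases, designation numbers and records are illustrative placeholders.

```python
# Hypothetical Composite Catalogue: descriptor phrase -> designation number.
COMPOSITE_CATALOGUE = {
    "organic chemistry": "QD241",
    "american history": "E151",
}

# Hypothetical Subject Database: designation number -> stored records.
SUBJECT_DATABASE = {
    "QD241": ["Record on organic compounds"],
    "E151": ["Record on United States history"],
}

def find_by_descriptor(phrase):
    """Look up a predefined descriptor phrase, then fetch records by number.

    Returns None when the phrase is not one of the predefined descriptors.
    """
    number = COMPOSITE_CATALOGUE.get(phrase.lower())
    if number is None:
        return None
    return number, SUBJECT_DATABASE.get(number, [])
```

A call with a predefined phrase such as `find_by_descriptor("Organic chemistry")` returns the designation number and its records, whereas an arbitrary phrase such as `"chemistry of carbon"` returns `None`, which illustrates the restriction to specified descriptor phrases noted above.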
In U.S. Pat. No. 5,907,838, “Information search and collection method and system” by Miyasaka et al., dated May 25, 1999, a method of searching for information on the Internet based on object-oriented programming is proposed. According to the proposed method, properties are set for information units for each class category, and a method of data collection is described for each property. A user formulates his request for the required information in terms of key words, and the request is transformed into a format understandable to the system. The request is then classified into a class category, and information units are found according to the properties of the class, which are determined by the request of the user. This method is designed for collecting specific information on the Internet.
In U.S. Pat. No. 6,233,575, “Multilevel taxonomy based on features derived from training documents classification using Fisher values as discrimination values” by Agrawal et al., dated May 15, 2001, a method is proposed for evaluating large text documents on the basis of Fisher values and adding these documents to a hierarchical structure. A topic path of the hierarchical structure is used along with key words for the purpose of improving searching.
Unfortunately, the problem of searching for information on the Internet has not at present found a full solution in existing retrieval systems or in the patent literature. There is a need to determine the characteristic features of information in order to structure it and to store data about the information in a retrieval system in rank-ordered form. A classification of the different directions of human activity and different branches of knowledge can be used for this purpose for information registered in a retrieval system. In this case, a searcher will obtain information not from the whole volume of information on the Internet, but from the part that is of interest to the user. Thus there is a need for a search system and a method of searching for information based on a global classification of information on the Internet. Such a system would be capable of structuring incoming information according to the sections of the classification and of retrieving information according to these sections. This would be a solution for increasing the efficiency of a search for information.
At present, several library classifications of information are available. These classifications have existed for some centuries. Within them, a successful system for classifying a large amount of existing information has been developed. Well-known examples are the U.S. Library of Congress Classification System, the Decimal Classification, the Bibliothecal-bibliographical Classification and others. The amount of information in the largest libraries is comparable with the amount of information on the Internet. Library classifications are convenient and simple to use. They are logical and understandable for users. Library classifications are constantly being improved and accommodate changes happening in the information world. Evidently, some library classifications can be used as a model for the development of a Global Classification of Information on the Internet. Applying such a classification to structuring and searching for information on the Internet can solve the existing problems. The new classification of information on the Internet could be represented as a catalogue, similar to the classifications used in librarianship. Of course, such a classification would have to be adapted to the needs and specifics of the Internet. There is a need to classify additional sources of information available only on the Internet, such as electronic shops, forums and others. Every division and subdivision of the catalogue covers a certain field of information. For the convenience of users, a brief description has to be provided for every division and subdivision of the catalogue. Every division and subdivision of the catalogue must have a specific code. The classification must be capable of evolving and must take into account all possible future changes in the world information system and on the Internet.
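The requirements listed above (a specific code and a brief description for every division and subdivision, and the ability to evolve) can be illustrated by a minimal sketch of such a catalogue. The dotted code scheme, the division titles and the names `Division` and `Catalogue` are assumptions made for illustration only and do not represent any existing classification.

```python
from dataclasses import dataclass, field

@dataclass
class Division:
    code: str          # specific code of the division or subdivision
    title: str
    description: str   # brief description shown to the user
    urls: list = field(default_factory=list)

class Catalogue:
    def __init__(self):
        self.divisions = {}

    def add_division(self, code, title, description):
        """Add a division; new ones can be added at any time as needs change."""
        self.divisions[code] = Division(code, title, description)

    def register(self, code, url):
        """Register a Web site under the division with the given code."""
        self.divisions[code].urls.append(url)

    def urls_in_section(self, code):
        """Collect URLs from a division and all of its subdivisions."""
        return [url
                for c in sorted(self.divisions)
                if c == code or c.startswith(code + ".")
                for url in self.divisions[c].urls]

catalogue = Catalogue()
catalogue.add_division("5", "Natural sciences", "Sciences of nature")
catalogue.add_division("5.9", "Zoology", "Animals and animal life")
catalogue.add_division("6", "Technology", "Applied sciences")
catalogue.add_division("6.2", "Engineering", "Machines and vehicles")
catalogue.register("5.9", "http://example.org/big-cats")
catalogue.register("6.2", "http://example.com/jaguar-cars")
```

With this structure, `catalogue.urls_in_section("5")` returns only the site registered under zoology, so a request is answered from the relevant section of the catalogue and not from the whole volume of registered information.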
Therefore there is a need for a system and a method addressing the abovementioned problems in searching for information on the Internet.