The present invention relates generally to searching a network and particularly to searching a network which includes documents that have a plurality of tags.
Computer networking systems such as the Internet are exploding in popularity all over the world. There are many reasons for this phenomenal growth, not the least of which is the ability to discover and access needed information in an efficient manner. The power of the Internet enables the average person with very little technical training to search for information in minutes instead of days, weeks, or even months of searching libraries, telephone books, directories or other conventional research means. To better understand conventional Internet search technology, refer now to FIG. 1. FIG. 1 represents a flowchart of how an Internet user performs a conventional web search.
First, the Internet user accesses a web search engine, via step 10. Next, the Internet user enters a search term(s) into the web search engine, via step 12. The web search engine then identifies the web pages that contain the search term(s), via step 14. Finally, the web pages containing the search term(s) are listed by the search engine, via step 16.
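The conventional flow of FIG. 1 can be pictured with a minimal sketch. It is an illustration only, not any particular search engine's implementation. The function name `keyword_search`, the in-memory `pages` mapping, and the choice to require every term (rather than any term) are assumptions made for this example.

```python
def keyword_search(pages, terms):
    """Return the URLs of pages whose text contains every search term.

    `pages` is a hypothetical mapping of URL -> page text standing in for
    the engine's collection; matching is case-insensitive substring matching.
    """
    wanted = [term.lower() for term in terms]          # step 12: terms entered
    return [
        url
        for url, text in pages.items()
        if all(term in text.lower() for term in wanted)  # step 14: identify pages
    ]                                                   # step 16: list the results
```

Because the match is made against the raw page text, every page of a site that repeats the same keywords is returned, which is the behavior the following paragraphs describe as a problem.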
However, as more and more information comes online at accelerating rates, today's search engine interfaces and features are not keeping pace. Searches that previously would have produced fewer than a dozen relevant documents now produce hundreds. This makes it very difficult and time consuming for the Internet user to evaluate and investigate the results. More sophisticated searches, sometimes beyond the grasp of a non-professional researcher, are not always the answer, as narrower searches introduce a greater risk of eliminating relevant and useful information. The severity of this problem is growing day by day.
One of the circumstances greatly exacerbating this problem is the tendency of web page developers to add large numbers of keywords to each and every page of their web site as a strategy to boost their standings with the Internet search engines. Thus, a single web site, which an Internet user may decide is not relevant after accessing the web site home page, may produce dozens or even hundreds of pages in the search results. FIG. 2 shows a typical web search results list. The search term(s) 20 appears on multiple web pages of the "www.pinemountainlake.com" 22 and "www.pmlr.com" 24 web sites. Even with enhanced bandwidth and greater network speeds, wading through hundreds of these "hits" to move to the next interesting web site is inefficient, cumbersome and annoying. An Internet user may actually lose patience after viewing dozens of pages of results with redundant information and terminate the search prematurely, missing the relevant page buried deep down in the list.
The Standard Generalized Markup Language (SGML) working group of the W3 Consortium has proposed a new standard, called XML (Extensible Markup Language), which is a subset of SGML. The goal of XML is to provide many of SGML's benefits that are not available with current HTML (Hypertext Markup Language).
One of XML's benefits is its simplicity. FIG. 3 shows a typical XML document. An XML document is a sequence of tags. Data along with the associated tag is referred to as an element. For example, a book has a title, an author, a publisher, and a price. FIG. 4 accordingly illustrates the tag structure associated with a book entitled "Presenting XML".
The only restriction is that tag elements must match, e.g. each <ADDRESS> must have a matching </ADDRESS>, and must nest properly. An XML document that has matching and properly nested tags is called well-formed. The elements in XML loosely correspond to objects in object-oriented or object-relational databases. For example, a <PERSON> ... </PERSON> would correspond to an object of type class PERSON { ... }. Nested XML elements correspond to an object's fields, e.g., <NAME>, <PHONE> and <ADDRESS> elements in <PERSON> would correspond to the name, phone, and address fields of a PERSON object.
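The two ideas in this paragraph, well-formedness and the loose element-to-object correspondence, can be sketched as follows. This is an illustrative example that uses Python's standard `xml.etree.ElementTree` parser; the helper names `is_well_formed` and `person_from_xml` and the `Person` class are hypothetical and are not part of the disclosure.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


def is_well_formed(xml_text):
    """Return True if the tags in xml_text match and nest properly."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for unmatched or improperly nested tags, among other errors.
        return False


@dataclass
class Person:
    # Fields mirror the nested NAME, PHONE and ADDRESS elements.
    name: str
    phone: str
    address: str


def person_from_xml(xml_text):
    """Map a PERSON element onto a Person object, field by field."""
    root = ET.fromstring(xml_text)
    return Person(
        name=root.findtext("NAME", default=""),
        phone=root.findtext("PHONE", default=""),
        address=root.findtext("ADDRESS", default=""),
    )
```

A missing nested element simply yields an empty field here, which reflects the point that XML data can vary in structure without a schema being defined first.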
This simplicity allows users to produce XML data with complex structure without having to first define a schema. It can be useful, however, to have some specification of XML data's structure, especially for a user community to define its own ontology for data exchange. In this case DTDs (Document Type Definitions) can be used to specify the data's known structure. FIG. 5 shows a typical DTD schema. While DTDs are similar to schemas in object-oriented or object-relational databases, they are less restrictive and permit more variation in the data. For example, DTDs can specify that some fields are optional and that others may occur multiple times, and DTDs do not require that the type of a reference be specified.
Given its flexibility, it is likely that XML will facilitate the exchange of huge amounts of data on the Web. Dozens of applications of XML already exist, including a Chemical Markup Language for exchanging data about molecules and the Open Financial Exchange for exchanging financial data between banks or between banks and customers. With huge amounts of XML data available, a problem arises when data must be extracted from these documents. The problem is that conventional search engines, although equipped to search HTML documents, are not able to effectively search XML documents, because they are not equipped to handle documents comprising the element tags that the XML format utilizes.
Accordingly, what is needed is an effective method for searching XML documents. The method should be simple, cost effective and capable of being easily adapted into existing technology. The present invention addresses such a need.
A method and system for conducting a search on a network is disclosed. The network has a plurality of sites. One or more of the sites has a plurality of documents wherein at least one of the documents comprises a plurality of tags. The method and system comprise identifying at least one of the plurality of tags, receiving a query, parsing the query, and matching the parsed query with at least one of the plurality of tags of the at least one of the plurality of documents.
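The four recited steps (identifying tags, receiving a query, parsing the query, and matching the parsed query against the tags) can be illustrated with a minimal sketch. It is one possible reading under stated assumptions, not the disclosed implementation: the `tag:term` query syntax, the function names, and the in-memory `documents` mapping are all hypothetical, and the documents are assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET


def identify_tags(xml_text):
    """Identify the tags of a document: map tag name -> text values under it."""
    index = {}
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:
            index.setdefault(elem.tag.lower(), []).append(text.lower())
    return index


def parse_query(query):
    """Parse a query of hypothetical 'tag:term' tokens into (tag, term) pairs."""
    pairs = []
    for token in query.split():
        tag, sep, term = token.partition(":")
        if sep and tag and term:  # ignore tokens that are not tag:term
            pairs.append((tag.lower(), term.lower()))
    return pairs


def search(documents, query):
    """Return the URLs whose tagged content matches every parsed pair."""
    pairs = parse_query(query)
    if not pairs:
        return []
    hits = []
    for url, xml_text in documents.items():
        index = identify_tags(xml_text)
        if all(
            any(term in value for value in index.get(tag, []))
            for tag, term in pairs
        ):
            hits.append(url)
    return hits
```

Under these assumptions, a query such as `title:xml` returns only documents whose TITLE element mentions the term, while `author:xml` does not return a document merely because the term appears in its title. That restriction of a term to a tag is the kind of added precision the summary refers to.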
Accordingly, through the use of a method and system in accordance with the present invention, the extraction of information from networks comprising XML documents is done in a more precise fashion.