1. Technical Field
This invention relates to information access and finds particular application in locating information contained in documents that have been annotated using a structured markup language.
2. Related Art
To help locate information stored, for example, in a computer-based distributed file store, search engines of various types have been implemented in software to identify data sets that contain information of at least some relevance to a user's search criteria. In doing so, search engines are often able to make use of already constructed indexes to particular fields or domains of information, or to exploit summary or keyword data stored within the data sets themselves.
However, it is often necessary for a search engine to analyse the contents of a data set to try to determine its primary information content and to assess the relevance of that information to the user's requirements. This is a more or less difficult task, according to the way the information is presented and structured.
In the context of a distributed information store such as that provided by the World Wide Web (known as the “web”), a markup language has been developed and standardised to improve identification of, and access to, information contained in web pages. The Hypertext Markup Language (HTML) used to annotate web pages includes a <META> tag for use in identifying a list of keywords provided by the web page author and indicative of the information content of the web page. Search engines may search for a <META> tag within a web page and compare any associated keywords with a user's search criteria to determine whether or not the information in the page is likely to be relevant.
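By way of illustration only, the keyword comparison described above might be sketched as follows. This is a minimal sketch, not the implementation of any particular search engine: the names MetaKeywordParser and page_matches are hypothetical, the keywords are assumed to be a comma-separated list in the CONTENT attribute of a <META NAME="keywords"> tag, and matching is assumed to be a case-insensitive exact comparison.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects author-supplied keywords from <META NAME="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "keywords":
            return
        content = attr.get("content") or ""
        # Assumption: keywords are separated by commas.
        for keyword in content.split(","):
            keyword = keyword.strip().lower()
            if keyword:
                self.keywords.add(keyword)


def page_matches(html, search_terms):
    """Return True if any search term equals one of the page's META keywords."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return any(term.strip().lower() in parser.keywords for term in search_terms)
```

A page whose author listed "XML, search engines, markup" would match a query for "xml" under this sketch, while a page with no <META> keywords would never match, however relevant its body text.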
More recently, a markup language called Extensible Markup Language (XML) has been developed to provide a more flexible and structured means for annotating information. One of the biggest potential benefits of XML is its ability to improve the accuracy of searches through the millions of documents now stored on intranets and the Internet. Exploitation of meta-information provided by XML tagging has the potential to dramatically reduce the number of irrelevant hits returned compared with current HTML-based search engines. However, whereas all tags within the HTML markup language are standardised, XML tags are, but for a small core of standard tags, entirely user-definable. To some extent, the usefulness of XML tagging is therefore subject to the skills of a document author. However, XML does allow user communities, from industry groups to single users, to develop an individual markup language that best suits their needs. In order to coordinate proposals for XML standards, in e-commerce applications for example, the Organisation for the Advancement of Structured Information Standards (OASIS) has created the web portal “XML.org”.
A known XML search engine such as “GoXML” provides a largely conventional keyword-based search facility to locate relevant information in conventional web pages as well as XML tagged documents. Where XML documents are located in a search, GoXML compiles and presents a flat list of the tags that mark up document parts within which search keywords were found, together with a conventional list of references to those documents. The user can then explore this list of “hit” tags by selecting a particular tag, causing the document list to be reduced to only those documents where a search keyword was found to occur in a part marked up by the selected tag. However, GoXML does not carry out further analysis of “hit” tags to enable a user to fully exploit the potential contextual information provided by those tags and to navigate the search results more effectively.
According to a first aspect of the present invention there is provided a method of accessing sets of information stored in an information system, wherein portions of said sets of information are enclosed by tags of a hierarchical tag structure defined according to a structured mark-up language, the method comprising the steps of:
(i) generating a search query comprising specified search criteria;
(ii) identifying portions of said sets of information matching said specified search criteria, and outputting a list of references to said identified sets of information;
(iii) identifying, for each matching portion identified at step (ii), an enclosing tag structure and outputting a list of said identified tag structures;
(iv) receiving a selection signal specifying a tag structure from the list output at step (iii);
(v) adjusting said list of references from step (ii) to comprise references only to said identified sets of information that contain the tag structure selected at step (iv);
(vi) adjusting said list of tag structures to comprise tag structures contained in information sets referenced in said adjusted list at step (v); and
(vii) repeating step (iv) in respect of said adjusted list of tag structures, and step (v) to identify a more specific list of references to sets of information.
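Steps (ii) to (vii) above may be illustrated by the following minimal sketch. It is one possible reading under stated assumptions and not a definitive implementation: the names find_hits, refine, tag_list and DOCS are hypothetical, the sample documents are invented for illustration, a "tag structure" is assumed to be the slash-separated path of element names enclosing a match, the search criteria are assumed to be a single case-insensitive keyword, and only the text directly following an opening tag is examined.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample information sets, keyed by a document reference.
DOCS = {
    "doc1": "<book><title>Java in practice</title>"
            "<summary>A guide to Java</summary></book>",
    "doc2": "<trip><destination>Java</destination><price>900</price></trip>",
    "doc3": "<book><title>Coffee</title>"
            "<summary>History of Java beans</summary></book>",
}


def matching_paths(elem, needle, path=()):
    """Yield the enclosing tag path of every element whose text contains needle."""
    path = path + (elem.tag,)
    if elem.text and needle in elem.text.lower():
        yield "/".join(path)
    for child in elem:
        yield from matching_paths(child, needle, path)


def find_hits(documents, keyword):
    """Steps (ii)-(iii): map each matching reference to its enclosing tag paths."""
    hits = {}
    for ref, xml_text in documents.items():
        paths = set(matching_paths(ET.fromstring(xml_text), keyword.lower()))
        if paths:
            hits[ref] = paths
    return hits


def refine(hits, selected_path):
    """Step (v): keep only references whose hits include the selected tag path."""
    return {ref: paths for ref, paths in hits.items() if selected_path in paths}


def tag_list(hits):
    """Steps (iii) and (vi): the tag paths present in the current reference list."""
    return sorted(set().union(*hits.values())) if hits else []
```

With the sample data, a search for "java" returns all three references and the tag paths book/summary, book/title and trip/destination. Selecting book/summary narrows the references to doc1 and doc3 and the tag list to book/summary and book/title; selecting book/title from that adjusted list then leaves doc1 alone, corresponding to the repetition in step (vii).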
According to preferred embodiments of the present invention, apparatus and methods are provided to enable a user to locate and retrieve sets of information relevant to search criteria specified in a search query submitted by the user. In particular, as in all embodiments of the present invention, the apparatus and methods are designed to enable the user to exploit contextual information provided within documents that have been annotated using tags defined according to a structured markup language such as XML. Besides locating portions of a document that appear to match the user's search criteria, embodiments of the present invention enable the user to use XML or other markup language tags, inserted into a document by the author, to help identify, from a potentially large set of search results, those documents that are most relevant to the original search query or, more particularly, to what the user hoped to find.