1. Field of the Invention
This disclosure relates in general to semantic searching in content management applications, and more particularly to a method, apparatus and program storage device for processing semantic subjects that occur as terms within document content.
2. Description of Related Art
Content management applications manage collections of data and are used to reduce the time spent searching for and retrieving data. In a typical computer application, a client process runs on a local or client computer and accesses and updates databases located, for example, on a remote or server computer running a server process. Client processes and server processes may be connected together through a network or collection of networks, such as the Internet. Examples of a client process include a Web browser or a spreadsheet program, and examples of a server process include a Web server or a database server.
Information is commonly exchanged over the Internet via the hypertext transfer protocol (HTTP). Use of the Internet computer network for commercial and noncommercial purposes is expanding rapidly. Via its networks, the Internet computer network enables many users around the world to access information held in data sources (e.g., content management applications) stored in different locations.
The World Wide Web (i.e., the “WWW” or the “Web”) is a hypertext information and communication system used on the Internet computer network with data communications operating according to a client/server model. Typically, a Web client computer will request data stored in data sources from a Web server computer, at which Web server software resides. The Web server software interacts with an interface connected to, for example, a content management application system connected to other data sources. Computer programs residing at the Web server computer can then retrieve the data and transmit the data to the client computer. Retrieved data can be any type of information, including database data, static data, HTML data, or dynamically generated data.
Accompanying the growing popularity of the Internet and the Web is a fast-growing demand for Web access to databases. Thus, database searches are becoming increasingly important. As the volume of data continues to grow, it becomes more difficult to provide simple menu-based navigation to information, and searching the database directly becomes the more efficient way for the user to find information.
To address this demand, web content is authored in extensible markup language (XML), which provides users the capability to define their own tags. A tag is a keyword that identifies the data associated with it, e.g., whether given text is a heading or a paragraph, and is typically composed of a character string enclosed in special characters. This makes XML a very powerful language that enables users to easily define a data model, which may change from one document to another. XML thus provides a way for an author to create a custom markup language to suit a particular kind of document.
XML can be likened to Hypertext Markup Language (HTML) because both are based on the Standard Generalized Markup Language (SGML) and use tags to convey basic information about the structure of a web document. The style and logic of HTML documents are hardcoded, however, and only a limited number of HTML element tags are available. As a result, HTML tags do not define the meaning of every page element. In XML, each document is an object, and each element of the document is an object. The logical structure of the document typically is specified in a grammar such as a Document Type Definition (DTD), XML Schema Definition, or RELAX NG grammar. A DTD may be used by the author to define a grammar for a set of tags for the document so that a given application may validate the proper use of the tags. A DTD comprises a set of elements and their attributes, as well as a specification of the relationship of each element to other elements. Once an element is defined, it may then be associated with a stylesheet, a script, HTML code or the like. Thus, with XML, an author may define his or her own tags and attributes to identify semantic elements of a document, which may then be validated automatically.
When an application generates XML tags (and corresponding data) for a document according to a particular XML data model and transmits that document to another application that also understands this data model, the XML notation functions as a conduit, enabling a smooth transfer of information from one application to the other. By parsing the tags of the data model from the received document, the receiving application can re-create the information for display, printing, or other processing, as the generating application intended it. In contrast, HTML uses a particular set of predefined tags, and is therefore not a user-extensible language.
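The transfer described above can be illustrated with a minimal sketch, assuming a hypothetical data model with the tags "contact", "name" and "email" (none of these names come from this disclosure). One function stands in for the generating application and another for the receiving application; both use the XML parser in the Python standard library.

```python
import xml.etree.ElementTree as ET


def build_contact(name: str, email: str) -> str:
    """Generating application: serialize data using custom (hypothetical) tags."""
    contact = ET.Element("contact")
    ET.SubElement(contact, "name").text = name
    ET.SubElement(contact, "email").text = email
    # encoding="unicode" makes tostring() return a str rather than bytes.
    return ET.tostring(contact, encoding="unicode")


def read_contact(document: str) -> dict:
    """Receiving application: parse the tags to re-create the original data."""
    root = ET.fromstring(document)
    return {child.tag: child.text for child in root}


xml_text = build_contact("Ada", "ada@example.com")
record = read_contact(xml_text)
print(xml_text)
print(record)
```

Because both sides agree on the tag names, the receiving side can recover each field by tag name without any knowledge of how the generating side stores the data internally.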
XML is a well-formed notation, meaning that all opening tags have corresponding closing tags (with the exception of a special “empty” tag, which is both opened and closed by a single tag, such as “<email/>”), and each tag that nests within another tag is closed before the outer tag is closed. HTML, on the other hand, is not a well-formed notation. Some HTML tags do not require closing tags, and nested tags are not required to follow the strict requirements as described for XML (that is, in HTML a tag may be opened within a first outer tag, and closed within a different outer tag).
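As a hedged illustration of the well-formedness rules just described, the following sketch passes three short strings to the XML parser in the Python standard library, which rejects input that is not well formed. The tag names and the helper function is_well_formed are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags, including an "empty" tag opened and closed at once.
nested_ok = "<note><to>Ann</to><email/></note>"
# Inner tag closed after the outer tag: tolerated by many HTML parsers, not by XML.
overlapping = "<b><i>text</b></i>"
# Opening tag with no corresponding closing tag.
unclosed = "<note><to>Ann</note>"

for sample in (nested_ok, overlapping, unclosed):
    print(sample, is_well_formed(sample))
```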
XML was ideally intended to enable semantic search: the ability to distinguish the different senses of a word (such as the chemical, markup, and programming senses of the word “element”) and thus find precisely the information of interest. This promise contrasts with the behavior of full text search engines such as Google™, which match all occurrences of the lexical string “element” regardless of sense.
XML provides the ability to mark up the semantics of documents. However, the only way to support a semantic search historically has been to write a search implementation that was sensitive to the custom markup.
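The contrast between full text search and a search that is sensitive to custom markup can be sketched as follows. The sketch assumes a hypothetical markup in which each occurrence of a term is wrapped in a "term" tag carrying a "sense" attribute; these names are illustrative and are not taken from any particular data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <term> and its "sense" attribute are assumed names.
DOC = """<doc>
  <p>Oxygen is an <term sense="chemistry">element</term>.</p>
  <p>Each <term sense="markup">element</term> has a start tag.</p>
  <p>Read the first <term sense="programming">element</term> of the array.</p>
</doc>"""


def full_text_search(root, word):
    """Match every paragraph containing the lexical string, regardless of sense."""
    return [p for p in root.iter("p") if word in "".join(p.itertext())]


def semantic_search(root, word, sense):
    """Match only paragraphs where the word is marked up with the requested sense."""
    return [
        p for p in root.iter("p")
        if any(t.text == word and t.get("sense") == sense for t in p.iter("term"))
    ]


root = ET.fromstring(DOC)
all_hits = full_text_search(root, "element")
markup_hits = semantic_search(root, "element", "markup")
print(len(all_hits), len(markup_hits))
```

Note that semantic_search is written against the tag and attribute names of this one markup; a document using different custom tags would need a different search implementation, which is the limitation described above.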
More recently, semantic web technologies such as the Resource Description Framework (RDF) and Topic Maps have introduced standard ways to represent semantic information with structures suitable for databases. Search implementations have been written for these semantic representations. The semantic web technologies do not, however, provide a way to bridge the gap between the markup of document content and these generic semantic representations.
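For illustration only, RDF expresses semantic information as statements of the form subject, predicate, object. The following sketch models such statements with plain Python tuples rather than an RDF library; the "ex:" identifiers and the match helper are hypothetical, while "rdf:type" and "rdfs:label" are written in the style of the standard RDF vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical statements; the "ex:" names are assumptions for this sketch.
TRIPLES = [
    Triple("ex:Oxygen", "rdf:type", "ex:ChemicalElement"),
    Triple("ex:ChemicalElement", "rdfs:label", "element"),
    Triple("ex:XmlElement", "rdfs:label", "element"),
    Triple("ex:XmlElement", "rdf:type", "ex:MarkupConstruct"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Two distinct subjects share the label "element" yet remain separate entries.
labelled_element = match(TRIPLES, predicate="rdfs:label", obj="element")
print([t.subject for t in labelled_element])
```

Such a structure is generic and easy to query, but nothing in it refers to the tags of any particular document, which reflects the gap noted above.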
It can be seen that there is a need for a method, apparatus and program storage device for generating and representing semantic information related to subjects within a knowledge representation.