1. Field of the Invention
The present invention generally relates to a method and computer system for providing semantic information for data. More particularly, the present invention relates to a method and a computer system for annotating a large volume of semi-structured or unstructured data with semantics.
2. Description of the Related Art
Advancements in technology, including computing, network, and sensor equipment, have resulted in large volumes of data being generated. The collected data generally need to be analyzed, and this is traditionally accomplished within a single application. However, in many areas, such as bioinformatics and meteorology, the data produced or collected by one application may need to be further used in other applications. Additionally, interdisciplinary collaboration, especially in the scientific community, is often desirable. Therefore, one key issue is interoperability, in terms of the ability to exchange information (syntactic interoperability) and to use the information that has been exchanged (semantic interoperability) [IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries, IEEE, 1990].
Conventional semantic World Wide Web, or “Web,” technologies involving ontology-based representations of information enable the cooperation of computers and humans and can be used to assist with data sharing and management. Through ontological representation, the modeling of entities and relationships in a domain allows software and computers to process information as never before [www.sys-con.com/xml/article.cfm?id=577, retrieved on Oct. 22, 2004]. Conventional semantic Web technologies are an extension of the World Wide Web and rely on searching Web pages and bringing each Web page to the semantic Web page level. Therefore, conventional semantic Web technologies process Web pages, which, as tagged documents such as hypertext markup language (HTML) documents, are considered fully structured documents. Further, the conventional semantic Web technologies are only for presentation, and not for task computing (i.e., computing device to computing device task processing). WEB SCRAPER software is an example of a conventional semantic Web technology bringing Web pages, as structured documents, to the semantic level. However, adding semantics to semi-structured or unstructured data, such as a flat file, is not a trivial task, and traditionally this function has been performed in a case-by-case (per input data) manner, which can be tedious and error-prone. Even when annotation is automated, such automation targets only a specific domain to be annotated.
Therefore, existing approaches to semi-structured and unstructured data annotation depend completely on user knowledge and manual processing, and are too tedious and error-prone to be suitable for annotating data in large quantities, in any format, and in any domain. For example, existing approaches, such as GENE ONTOLOGY (GO) annotation [www.geneontology.org, retrieved on Oct. 22, 2004] and TRELLIS by University of Southern California's Information Sciences Institute (ISI) [www.isi.edu/ikcap/trellis, retrieved on Oct. 22, 2004], depend completely on user knowledge, are data specific, and operate on a per input data basis, which can be tedious and error-prone. In particular, GENE ONTOLOGY (GO) provides semantic data annotated with gene ontologies, but GO is only applicable to gene products and relies heavily on expertise in gene products (i.e., annotation is generally manual, and if any type of automation is provided, the automation targets only, or is specific to, the gene products domain). Further, in TRELLIS, users add semantic annotation to documents through observation, viewpoints and conclusion, but TRELLIS also relies heavily on users to add new knowledge based on their expertise, and semantic annotation in TRELLIS results in one semantic instance per observed document.
To take full advantage of collected data in semi-structured or unstructured format for successful data sharing and management, easier ways to annotate data with semantics are needed.