This invention relates to managing taxonomic information.
With modern advances in computer technology and network and Internet technologies, vast amounts of information have become readily available in homes, businesses, and educational and government institutions throughout the world. Indeed, many businesses, individuals, and institutions rely on computer-accessible information on a daily basis. This global popularity has further increased the demand for even greater amounts of computer-accessible information. However, as the total amount of accessible information increases, locating specific items of information within the totality becomes increasingly difficult.
The format in which the accessible information is arranged also affects the difficulty of locating specific items of information within the totality. For example, searching through vast amounts of information arranged in a free-form format can be substantially more difficult and time consuming than searching through information arranged in a pre-defined order, such as by topic, date, category, or the like. Due to the nature of certain on-line systems, much of the accessible information is placed on-line in the form of free-format text. Moreover, the amount of on-line data in the form of free-format text continues to grow very rapidly.
Search schemes employed to locate specific items of information among the on-line information content typically depend upon the presence or absence of key words (words included in the user-entered query) in the searchable text. Such search schemes identify those textual information items that include (or omit) the key words. However, in systems such as the World Wide Web (“Web”) or large Intranets, where the total information content is relatively large and free-form, key word searching can be problematic. For example, it can result in the identification of numerous text items that contain (or omit) the selected key words but that are not relevant to the actual subject matter to which the user intended to direct the search.
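The presence-or-absence scheme described above can be sketched in a few lines of Python. The documents and query here are hypothetical illustrations, not part of any actual system; the sketch simply shows that every document containing the query words is returned, regardless of the sense in which the words are used.

```python
# Minimal sketch of key-word matching: a document is returned when it
# contains every query word, regardless of the sense in which the
# words are used. Documents and query are hypothetical illustrations.
documents = {
    "doc1": "the jaguar can reach a top speed of about 80 km/h in short bursts",
    "doc2": "the new Jaguar coupe has a top speed limited to 250 km/h",
    "doc3": "the home team beat the Jaguars despite their speed on defense",
}

def keyword_search(query, docs):
    """Return the ids of documents containing every query word."""
    words = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(w in text.lower() for w in words)]

# All three documents match "jaguar speed", but only doc1 concerns
# the animal -- the relevance problem described above.
print(keyword_search("jaguar speed", documents))
```

All three documents match the query, yet only the first concerns the animal; nothing in the scheme distinguishes the intended topical context.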
As text repositories grow in number and size and global connectivity improves, there is a need to support efficient and effective information retrieval (IR), searching, and filtering. A manifestation of this need is the proliferation of commercial text search engines that crawl and index the Web, and subscription-based information mechanisms.
Common practices for managing such information complexity on the Internet or in database structures typically involve tree-structured hierarchical indices such as the Internet directory Yahoo!™, which is largely manually organized in preset hierarchies. Patent databases are organized by the U.S. Patent Office's class codes, which form a preset hierarchy. Digital libraries that mimic hardcopy libraries support subject indexing inspired by the Library of Congress Catalogue, which is also hierarchical.
Querying or filtering by key words alone can produce unsatisfactory results, since there may be many aspects to, and often different interpretations of, the key words, and many of these aspects and interpretations may be irrelevant to the subject matter that the searcher intended to find.
For example, if a wildlife researcher is attempting to find information about the running speed of the jaguar by submitting the query “jaguar speed” to an Internet search engine, a variety of responses may be generated, including responses relating to Jaguar® cars and a Jaguar sports team, as well as responses relating to the jaguar animal.
If an index such as Yahoo!™ is used, the user can seek documents containing “jaguar” in the topical context of animals. However, it is labor- and time-intensive to maintain such an index as the Web changes and grows.
Biocentric information is information associated with at least one instance of something that is or was alive (“biotic entity” or “organism”), and, as illustrated in FIGS. 1-2, may include human observations recorded to physical media, physical specimens, and other biocentric data items that libraries store and that museums collect, including photographs, slides, and annotations on physical specimens. Libraries house vast collections of publications, many of which refer to the observations and recordings about the natural world (see, e.g., FIG. 3).
As shown by example in FIG. 4, biocentric data items can be electronic objects that represent biocentric information in an electronically accessible way. Biocentric data items can be derived from biocentric information in a variety of formats.
Biocentric files may be served through applications, as illustrated by examples in FIG. 5. Observations may be recorded in tables that are served via database management tools. A suite of software tools may allow table data to be flexibly delivered to the Web. Accordingly, information on specimen collections, bibliographic references, and field observations of organisms can be recorded and retrieved.
Multimedia objects having audio, illustrations, photographs, or video (sometimes referred to as “binary large objects” or “BLOBs”) may be served by many applications. FIG. 6 illustrates an example of a combination of database and image server used to serve photos to the Web. The images are served through the image data server, which communicates with the database server to locate and serve the associated text annotations.
Full-text documents represent a resource of biocentric information. Books, journals, monographs, and manuscripts are historic means of communicating and storing knowledge of the natural world. The recording, parsing, and serving of full-text is a complicated endeavor. Technologies known as Standard Generalized Markup Language (SGML) and Extensible Markup Language (XML) offer a flexible infrastructure for serving full-text data, as diagrammed in FIG. 7.
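As an illustration of the flexibility such markup provides, the following Python sketch parses a small XML fragment of full-text biocentric data using the standard library. The element names (`treatment`, `taxon-name`, `description`) are hypothetical and do not represent any published schema; the point is only that marked-up full text can be queried structurally rather than as a flat stream of characters.

```python
# Sketch of serving marked-up full text: once a document fragment is
# tagged with XML, individual fields can be located and served
# directly. Element names are hypothetical, not a published schema.
import xml.etree.ElementTree as ET

fragment = """
<treatment>
  <taxon-name rank="species">Panthera onca</taxon-name>
  <description>Largest cat of the Americas; spotted coat.</description>
</treatment>
"""

root = ET.fromstring(fragment)
name = root.find("taxon-name")
print(name.text, "-", name.get("rank"))   # Panthera onca - species
print(root.find("description").text)
```

A free-format version of the same text would offer no comparable way to extract the species name or its rank without ad hoc parsing.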
Applications reside on host computers that serve biocentric data through network protocols. Often these hosts are specialized for a particular task or group of tasks, as illustrated by examples in FIG. 8. One server may supply many different services, or the services may reside on more than one machine. The information that is served may reside at a particular location to take advantage of services on the host, or in proximity to a data manager to facilitate management.
Multiple hosts can be organized within logical subnetworks that can be viewed as a logical entity known as a domain, as shown by example in FIG. 9. A domain may represent a collection of hosts, each with its own collection of applications, each with its own collection of biocentric data.
FIG. 10 illustrates by example that relationships exist between domains and applications within domains concerning the biocentric data. A domain is a user-defined arbitrary collection of applications. For example, an institution may have a library catalog application and an on-line encyclopedia application, both of which rely on animal names.
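The domain/application relationship described above can be modeled in a short Python sketch. The class and field names here are hypothetical, chosen only to mirror the library-catalog and encyclopedia example; the sketch shows that because both applications rely on animal names, a shared vocabulary emerges at the domain level.

```python
# Sketch of the relationship described above: a domain is a
# user-defined collection of applications, each holding its own
# biocentric data. Class and field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Application:
    name: str
    animal_names: set = field(default_factory=set)

@dataclass
class Domain:
    name: str
    applications: list = field(default_factory=list)

    def shared_names(self):
        """Animal names relied upon by every application in the domain."""
        sets = [app.animal_names for app in self.applications]
        return set.intersection(*sets) if sets else set()

catalog = Application("library catalog", {"jaguar", "puma", "ocelot"})
encyclopedia = Application("on-line encyclopedia", {"jaguar", "puma", "tapir"})
institution = Domain("institution", [catalog, encyclopedia])

print(sorted(institution.shared_names()))   # ['jaguar', 'puma']
```

The overlap computed here is the kind of cross-application relationship, centered on the biocentric data, that FIG. 10 illustrates.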
Scientific interest in the creation of a unified catalog of the 1.75 million known species of living organisms has been recognized by Species 2000 and North America's Integrated Taxonomic Information System (ITIS).
Any such attempt to organize biocentric information must contend with the large number of species, the variation within species, and the expression of individual and species information along historical and geographical dimensions, at scales ranging from the molecular to the ecosystem, and modified as a function of a myriad of potential biotic and abiotic interactions. Furthermore, much of the known information was collected in a pre-electronic format, and the ranks of the custodians of much of that information (primarily taxonomists) are not being fully replenished as the custodians retire. Bioinformatics tools such as GenBank® are available to deal with molecular data. However, data on biodiversity can be difficult to assemble. The challenge of making biodiversity information available electronically is of such a magnitude that it has been described as requiring a “mega science” response. Federal and intergovernmental programs such as the Global Biodiversity Information Facility (GBIF), Partnerships for Enhancing Expertise in Taxonomy (PEET), the Australian Biological Resources Study (ABRS), and Species 2000 have emerged to address this problem. One strategy is based on assembling large databases.
Facilities that have been proposed include the following. GBIF connects smaller databases and creates a directory of the three billion specimens in museums and seed banks. GBIF is an initiative of the United Nations Environment Programme/Organization for Economic Cooperation and Development (UNEP/OECD) and inter-governmental programs committed to documenting the diversity of life. GBIF includes Species 2000 and the Expert Taxonomy Institute (ETI) as associates.
Species Analyst is a biodiversity site that provides access to natural-history databases to promote taxonomy in the United States. Species Analyst seeks to integrate biodiversity information through the Web.
Species 2000, which aims to index all the world's known species, has data on 250,000 species in a rigid database structure. Species 2000 is a focal point for many biodiversity enterprises.
Deep Green presents data on the genetics and evolution of plants.