Unstructured content, such as multimedia, does not fit well in conventional databases. Conventional databases perform admirably on structured content, but they lack efficient indexes for inserting and querying unstructured content. This presents a problem.
Unstructured content includes, among other things, text, multimedia, and cutting-edge data types such as genomic sequences. Text covers documents, emails, blogs, etc. Multimedia encompasses images, music, voice, video, etc. The absence of robust, scalable indexing techniques distinguishes unstructured content from structured content. Structured content relies heavily on hash-table and tree-based indexes, which make it possible to rapidly search a repository for items that satisfy given criteria, whereas unstructured content uniformly lacks an equivalent kind of indexing. This presents a problem.
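To make the contrast concrete, the following is a minimal sketch of the two index styles mentioned above, applied to structured records. The records, field names, and helper functions are hypothetical, and a sorted list searched with `bisect` stands in for a tree-based index, since both support ordered range lookups.

```python
import bisect
from collections import defaultdict

# Hypothetical structured records: (id, camera, year).
records = [
    (1, "CamA", 2003),
    (2, "CamB", 2005),
    (3, "CamA", 2007),
    (4, "CamC", 2005),
]

# Hash-table index: exact-match lookup on the camera field.
hash_index = defaultdict(list)
for rec_id, camera, _year in records:
    hash_index[camera].append(rec_id)

# Ordered index: a sorted list stands in for a tree-based index
# (an assumption made for brevity) and supports range queries.
ordered = sorted((year, rec_id) for rec_id, _camera, year in records)
years = [year for year, _rec_id in ordered]


def ids_by_camera(camera):
    """Exact-match query answered by the hash index."""
    return hash_index.get(camera, [])


def ids_by_year_range(lo, hi):
    """Range query (inclusive) answered by binary search on the ordered index."""
    start = bisect.bisect_left(years, lo)
    end = bisect.bisect_right(years, hi)
    return [rec_id for _year, rec_id in ordered[start:end]]
```

Both queries avoid scanning every record, which is possible only because each record exposes discrete, comparable fields. Raw image or audio data offers no such fields to key on.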
One stop-gap solution designates certain characteristics of unstructured content as “features” and then applies conventional indexing techniques to those synthetically generated features. For example, for a repository of digital images, one can attach features consisting of the time an image was taken, the camera used, who took the picture, the location, and additional descriptive text. Adding features takes effort, for two reasons. First, when the number of items is large, it is often impractical to apply features manually, a practice commonly referred to as “hand-tagging.” Second, content might be manually tagged once, but it can be impractical to revisit the items to tag them for another purpose. For example, one could plausibly imagine tagging a collection of images of faces with the shape of the nose, eyes, or mouth. However, when a new inquiry arises, it may be impractical to rescan the entire collection of images to annotate each one for a particular mole near the nose or for a scar on the forehead. This presents a problem.
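The feature-based stop-gap described above can be sketched as follows. This is an illustration under stated assumptions, not a particular system's implementation: the file names, feature names, and the `build_index` and `query` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical hand-applied tags for a small image repository.
tags = {
    "img001.jpg": {"camera": "CamA", "photographer": "alice", "nose": "narrow"},
    "img002.jpg": {"camera": "CamB", "photographer": "bob", "nose": "wide"},
    "img003.jpg": {"camera": "CamA", "photographer": "bob", "nose": "narrow"},
}


def build_index(tags):
    """Map each (feature, value) pair to the set of items carrying it."""
    index = defaultdict(set)
    for item, features in tags.items():
        for name, value in features.items():
            index[(name, value)].add(item)
    return index


def query(index, **criteria):
    """Return items whose tags match every given feature=value pair."""
    sets = [index.get((name, value), set()) for name, value in criteria.items()]
    return set.intersection(*sets) if sets else set()


index = build_index(tags)

# A query over features that were tagged in advance works as expected.
matches = query(index, camera="CamA", nose="narrow")

# A new inquiry about a feature nobody tagged finds nothing, even if
# some images do show a scar: the index knows only what was annotated.
untagged = query(index, forehead_scar="yes")
```

Here `matches` contains `img001.jpg` and `img003.jpg`, while `untagged` is empty. Answering the second query would require revisiting every image to add a new tag, which is the re-tagging cost the paragraph above describes.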