Automatic albuming, the automatic organization of photographs either as an end in itself or for use in other applications, has been the subject of recent research. Relatively sophisticated image content analysis techniques have been used for image indexing and organization. For image indexing and retrieval applications, simple text analysis techniques have also been used on text or spoken annotations associated with individual photographs. The recent research has involved a number of techniques and tools for automatic albuming of photographs, including:
        Using date and time information from the camera to perform event segmentation.
        Analyzing image content to perform event segmentation and to identify poor images.
        Analyzing video frames for purposes of browsing.
        Retrieving images or video segments using text keywords.
The work described herein extends the functionality of albuming applications by extracting certain types of information from spoken annotations, or the transcriptions of spoken annotations, associated with photographs, and then using the results to perform:
        Event segmentation: determining how many events are in a roll of film, and which photographs belong to which event.
        Event identification: determining the type (e.g., birthday, wedding, holiday) of each event in a roll of film.
        Summarization: identifying the date, time and location of events, as well as the people, objects and activities involved, and summarizing this information in various ways.
In this case, natural language (or text based on the natural language) is processed to extract the desired information, and the resultant extracted information is used to identify and describe the events.
Broadly speaking, there are currently three different fields that depend on the processing of natural language: information retrieval, information extraction and natural language parsing. In information retrieval, the task involves retrieving specific items from a database, based on a text query. For example, keywords associated with academic papers can be used to retrieve those papers when the user submits a query using those keywords; text associated with images can be used to retrieve images when the same words occur in another text; text found in video sequences can be used to retrieve those sequences when a user clicks on the same text in an article. There is generally very little, if any, text processing involved in these applications; for instance, in copending, commonly assigned U.S. patent application Ser. No. 09/685,112, “An Agent for Integrated Annotation and Retrieval of Images”, word frequency measures are used to identify keywords to search for in an image database. However, some work has shown that, by applying partial parsing techniques to typed queries, retrieval from a database of annotated photographs can be improved.
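By way of illustration only, keyword-based retrieval of the kind described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the cited application: the stop-word list, the function names `keywords` and `retrieve`, and the sample annotations are all hypothetical, and raw word counts stand in for whatever word frequency measure a real system would use.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to", "was", "we"}


def keywords(text, top_n=3):
    """Return the top_n most frequent non-stop words in text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]


def retrieve(query_text, annotated_images, top_n=3):
    """Rank image ids by how many query keywords occur in their annotation."""
    keys = set(keywords(query_text, top_n))
    hits = []
    for image_id, annotation in annotated_images.items():
        overlap = keys & set(re.findall(r"[a-z]+", annotation.lower()))
        if overlap:
            hits.append((len(overlap), image_id))
    # Most shared keywords first; ties broken by image id for a stable order.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [image_id for _, image_id in hits]


# Hypothetical annotated image database.
images = {
    "img1": "fishing trip at the lake",
    "img2": "birthday cake",
    "img3": "lake at sunset",
}
query = "We went fishing at the lake and the lake was calm"
print(retrieve(query, images))
```

Note that no structure of the query is analyzed here; words are simply counted and matched, which is the sense in which such applications involve very little text processing.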
In information extraction (IE), the idea is to extract pre-determined information from a text. Gaizauskas and Wilks (in R. Gaizauskas and Y. Wilks, “Information extraction: Beyond document retrieval”, Computational Linguistics and Chinese Language Processing, 3(2), 1998) put it this way: “IE may be seen as the activity of populating a structured information source (or database) from an unstructured, or free text, information source”. Applications include analysis, data mining, summarization and indexing. There is a long history of research in automatic information extraction from written news reports (see J. Hobbs, D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson. “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text”, in Roche and Schabes, editors, Finite State Devices for Natural Language Processing, MIT Press, Cambridge, Mass., 1996); some more recent work has begun to investigate information extraction from spoken language.
Both information retrieval and information extraction are minimal-processing approaches in that they use only parts of the input text, and ignore any other structure or components that may be involved. Natural language parsing involves the detailed analysis of a piece of text or segment of speech to uncover the structure and meaning of its parts, possibly to identify the intentions behind its production, and possibly to relate it to other parts of a larger discourse. Natural language parsers include linguistically motivated rule-based parsers and statistical parsers. Partial parsers are capable of analyzing the syntactic structure of selected parts of input texts.
While it would be theoretically possible to use full natural language parsing for the present invention, in practice it is both infeasible and unnecessary. No existing parser is sufficiently general to robustly handle general text input in real or near-real time. Very few parsers even attempt to handle the fluidity and variety of spoken language. Furthermore, natural language parsers would produce unneeded information (detailed information about the syntactic structure of the input) without necessarily yielding information that is needed (the semantic classes of items in annotations).
The use of photograph annotations for automatic albuming is an ideal application for information extraction. Typically, there is interest in the information contained in the annotation associated with a photograph, but not in all of it; for instance, the quality of the photograph or the photographer's feelings at the time are generally not of interest, even though the photographer may have chosen to discuss those things. In addition, there would be little interest in all of the rich semantics and pragmatics that may underlie the language used; in other words, often a very simple understanding of the annotations will suffice. Finally, the robustness of information extraction techniques makes them particularly attractive in a situation where the photographer may use incomplete sentences or even just single words or phrases, as in “the fishing trip august nineteen ninety eight adrian mike and Charles”.
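To make the idea concrete, the following minimal sketch fills a few pre-determined slots (month, year, people, event words) from the example annotation above. It is an illustration under stated assumptions, not the method of the present work: the word lists `KNOWN_NAMES` and `EVENT_WORDS`, the function names, and the slot layout are hypothetical, and the year reader only handles two-part spoken years such as “nineteen ninety eight”. The input is lowercased first, so the sketch does not depend on capitalization or punctuation.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}
UNITS = {w: i + 1 for i, w in enumerate(
    ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"])}
TEENS = {w: i + 10 for i, w in enumerate(
    ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
     "sixteen", "seventeen", "eighteen", "nineteen"])}
TENS = {w: (i + 2) * 10 for i, w in enumerate(
    ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"])}

# Hypothetical gazetteers; a real system would use far larger resources.
KNOWN_NAMES = {"adrian", "mike", "charles"}
EVENT_WORDS = {"fishing", "trip", "birthday", "wedding", "holiday"}


def read_number(tokens, i):
    """Read a spoken number 1-99 at tokens[i]; return (value, next_index)."""
    if i >= len(tokens):
        return None, i
    word = tokens[i]
    if word in TEENS:
        return TEENS[word], i + 1
    if word in TENS:
        if i + 1 < len(tokens) and tokens[i + 1] in UNITS:
            return TENS[word] + UNITS[tokens[i + 1]], i + 2
        return TENS[word], i + 1
    if word in UNITS:
        return UNITS[word], i + 1
    return None, i


def read_year(tokens, i):
    """Read a two-part spoken year such as 'nineteen ninety eight'."""
    century, j = read_number(tokens, i)
    if century not in (18, 19, 20):
        return None, i
    rest, k = read_number(tokens, j)
    if rest is None:
        return None, i
    return century * 100 + rest, k


def extract(annotation):
    """Fill pre-determined slots from an unpunctuated annotation."""
    tokens = re.findall(r"[a-z]+", annotation.lower())
    slots = {"month": None, "year": None, "people": [], "event_words": []}
    i = 0
    while i < len(tokens):
        word = tokens[i]
        year, next_i = read_year(tokens, i)
        if year is not None:
            slots["year"] = year
            i = next_i
            continue
        if word in MONTHS:
            slots["month"] = MONTHS[word]
        elif word in KNOWN_NAMES:
            slots["people"].append(word.capitalize())
        elif word in EVENT_WORDS:
            slots["event_words"].append(word)
        i += 1  # words that match no slot (e.g. "the", "and") are ignored
    return slots


print(extract("the fishing trip august nineteen ninety eight adrian mike and Charles"))
```

Words that fill no slot are simply skipped, which is why an incomplete sentence or a bare list of phrases poses no particular difficulty for this style of processing.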
In the past, information extraction techniques have been used mainly on newswire texts. These are written texts, relatively short but nevertheless much longer than the typical photograph annotation. Furthermore, photograph annotations (especially with the increasing use of digital cameras with attached microphones) are not carefully organized texts, and may be spoken rather than written. This means that extraction based on photograph annotations cannot depend on some of the textual clues (punctuation, capitalization) on which certain information extraction techniques rely heavily.