This invention relates to the field of image processing, and more particularly to the processing of photographic data in order to automatically organize photographs into photographic albums.
Automatic albuming (the automatic organization of photographs, either as an end in itself or for use in other applications) has been the subject of recent research. Relatively sophisticated image content analysis techniques have been used for image indexing and organization. For image indexing and retrieval applications, simple text analysis techniques have also been used on text or spoken annotations associated with individual photographs. The recent research has involved a number of techniques and tools for automatic albuming of photographs, including:
Using date and time information from the camera to perform event segmentation.
Analyzing image content to perform event segmentation and to identify poor images.
Analyzing video frames for purposes of browsing.
Retrieving images or video segments using text keywords.
The work described herein extends the functionality of albuming applications by extracting certain types of information from spoken annotations, or the transcriptions of spoken annotations, associated with photographs, and then using the results to perform:
Event segmentation: determining how many events are in a roll of film, and which photographs belong to which event.
Event identification: determining the type (e.g. birthday, wedding, holiday) of each event in a roll of film.
Summarization: identifying the date, time and location of events, as well as the people, objects and activities involved, and summarizing this information in various ways.
In this case, natural language (or text based on the natural language) is processed to extract the desired information and the resultant extracted information is used to identify and describe the events.
Broadly speaking, there are currently three different fields that depend on the processing of natural language: information retrieval, information extraction and natural language parsing. In information retrieval, the task involves retrieving specific items from a database, based on a text query. For example, keywords associated with academic papers can be used to retrieve those papers when the user poses a query using those keywords; text associated with images can be used to retrieve images when the same words occur in another text; text found in video sequences can be used to retrieve those sequences when a user clicks on the same text in an article. There is generally very little, if any, text processing involved in these applications; for instance, in copending, commonly assigned U.S. patent application Ser. No. 09/685,112, "An Agent for Integrated Annotation and Retrieval of Images", word frequency measures are used to identify keywords to search for in an image database. However, some work has shown that, by applying partial parsing techniques to typed queries, retrieval from a database of annotated photographs can be improved.
In information extraction (IE), the idea is to extract predetermined information from a text. Gaizauskas and Wilks (in R. Gaizauskas and Y. Wilks, "Information extraction: Beyond document retrieval", Computational Linguistics and Chinese Language Processing, 3(2), 1998) put it this way: "IE may be seen as the activity of populating a structured information source (or database) from an unstructured, or free text, information source". Applications include analysis, data mining, summarization and indexing. There is a long history of research in automatic information extraction from written news reports (see J. Hobbs, D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson, "FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text", in Roche and Schabes, editors, Finite State Devices for Natural Language Processing, MIT Press, Cambridge, Mass., 1996); some more recent work has begun to investigate information extraction from spoken language.
Both information retrieval and information extraction are minimal-processing approaches in that they use only parts of the input text, and ignore any other structure or components that may be involved. Natural language parsing involves the detailed analysis of a piece of text or segment of speech to uncover the structure and meaning of its parts, possibly to identify the intentions behind its production, and possibly to relate it to other parts of a larger discourse. Natural language parsers include linguistically-motivated rule-based parsers and statistical parsers. Partial parsers are capable of analyzing the syntactic structure of selected parts of input texts.
While it would be theoretically possible to use full natural language parsing for the present invention, in practice it is both infeasible and unnecessary. No existing parser is sufficiently general to robustly handle general text input in real or near-real time. Very few parsers even attempt to handle the fluidity and variety of spoken language. Furthermore, natural language parsers would produce unneeded information (detailed information about the syntactic structure of the input) without necessarily yielding information that is needed (the semantic classes of items in annotations).
The use of photograph annotations for automatic albuming is an ideal application for information extraction. Typically, there is interest in the information contained in the annotation associated with a photograph, but not in all of it; for instance, the quality of the photograph or the photographer's feelings at the time are generally not of interest, even though the photographer may have chosen to discuss those things. In addition, there would be little interest in all of the rich semantics and pragmatics that may underlie the language used; in other words, often a very simple understanding of the annotations will suffice. Finally, the robustness of information extraction techniques makes them particularly attractive in a situation where the photographer may use incomplete sentences or even just single words or phrases, as in "the fishing trip august nineteen ninety eight adrian mike and charles".
In the past, information extraction techniques have mainly been used on newswire texts. These are written texts, relatively short but nevertheless much longer than the typical photograph annotation. Furthermore, photograph annotations (especially with the increasing use of digital cameras with attached microphones) are not carefully organized texts, and may be spoken rather than written. This means that extraction based on photographic annotation cannot depend on some of the textual clues (punctuation, capitalization) on which certain information extraction techniques rely heavily.
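By way of illustration only, the kind of shallow extraction contemplated for such annotations might be sketched as a longest-match lookup against a small gazetteer. The gazetteer entries, the category names and the function name `extract` below are hypothetical and are chosen solely to cover the example annotation; they are not taken from the gazetteer of the invention.

```python
import re

# Hypothetical toy gazetteer; entries and category names are illustrative
# assumptions, not the specialized gazetteer described for the invention.
GAZETTEER = {
    "person": {"adrian", "mike", "charles"},
    "event": {"fishing trip", "birthday", "wedding"},
    "month": {
        "january", "february", "march", "april", "may", "june", "july",
        "august", "september", "october", "november", "december",
    },
}


def extract(annotation):
    """Return {category: [matched phrases]} for one annotation.

    The lookup ignores case and punctuation, so it still works on
    unpunctuated, uncapitalized transcriptions of speech.
    """
    tokens = re.findall(r"[a-z']+", annotation.lower())
    found = {}
    i = 0
    while i < len(tokens):
        matched = False
        for n in (2, 1):  # prefer two-word phrases over single words
            if i + n > len(tokens):
                continue
            phrase = " ".join(tokens[i:i + n])
            for category, entries in GAZETTEER.items():
                if phrase in entries:
                    found.setdefault(category, []).append(phrase)
                    i += n
                    matched = True
                    break
            if matched:
                break
        if not matched:
            i += 1  # token is not in the gazetteer; skip it
    return found
```

Applied to the annotation quoted above, this sketch yields the event "fishing trip", the month "august" and the persons "adrian", "mike" and "charles", while the remaining words (including the spoken year, which this toy gazetteer does not cover) are simply skipped.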
The present invention is directed to overcoming one or more of the problems set forth above. Briefly summarized, according to one aspect of the present invention, a method for automatically organizing digitized photographic images into events based on spoken annotations comprises the steps of: providing natural-language text based on spoken annotations corresponding to at least some of the photographic images; extracting predetermined information from the natural-language text that characterizes the annotations of the images; segmenting the images into events by examining each annotation for the presence of certain categories of information which are indicative of a boundary between events; and identifying each event by assembling the categories of information into event descriptions. The invention further comprises the step of summarizing each event by selecting and arranging the event descriptions in a suitable manner, such as in a photographic album, as well as the utilization of a novel gazetteer in the extraction step that is specialized for consumer images.
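The segmenting and identifying steps may be pictured with the following minimal sketch. It assumes that each photograph has already been reduced to a dictionary of extracted categories, and it adopts one simple, assumed boundary rule: a new event begins when an annotation supplies a value for a boundary category that conflicts with what the current event already records. The function name `segment_events`, the category names and the boundary rule itself are illustrative assumptions rather than the method as claimed.

```python
def segment_events(annotations, boundary_categories=("event", "month", "location")):
    """Group photographs into events and assemble a description for each.

    annotations: list of (photo_id, {category: [values]}) in capture order.
    Assumed boundary rule: start a new event when a boundary category is
    present in the annotation, is already described for the current event,
    and shares no value with that description.
    """
    events = []
    for photo_id, info in annotations:
        current = events[-1] if events else None
        is_boundary = current is None or any(
            cat in info
            and cat in current["description"]
            and not set(info[cat]) & current["description"][cat]
            for cat in boundary_categories
        )
        if is_boundary:
            current = {"photos": [], "description": {}}
            events.append(current)
        current["photos"].append(photo_id)
        # Identification: accumulate the extracted categories into the
        # event description.
        for cat, values in info.items():
            current["description"].setdefault(cat, set()).update(values)
    return events
```

Under this rule, photographs without annotations, or with annotations that mention only non-boundary categories such as people, are attached to the event in progress; each event's accumulated description is then available for summarization.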
The advantage of the invention is that it allows the user's casual spoken annotations to serve as a guide for event segmentation. It has been found possible to use text analysis techniques to extract information from relatively unstructured consumer annotations, with the goal of applying the results to image organization and indexing applications.
These and other aspects, objects, features and advantages of the present invention will be more clearly understood and appreciated from a review of the following detailed description of the preferred embodiments and appended claims, and by reference to the accompanying drawings.