1. Field of the Invention
The present invention is directed toward the field of computer systems that capture digital ink, and more particularly toward automated generation of an index for handwritten notes.
2. Art Background
Some computer systems, including personal digital assistants (PDAs), permit users to enter handwritten material into the computer. Essentially, these computers and PDAs include a user interface that permits a user to write handwritten material onto a surface, and the handwritten material or notes are subsequently sampled into “digital ink.” One application of these computer systems is to permit a user to perform electronic note-taking.
One potential advantage of electronic note-taking over paper note-taking is the ability, in electronic note-taking, to create indexes. In general, indexes provide a means to locate specific information within the handwritten notes. With paper note-taking, such indexes must be created manually. Since this manual process is difficult, paper note-takers tend to mark important items or keywords by underlining, circling, or entering asterisks next to the important material. Although this type of highlighting helps users to locate important information while browsing notes, it does not provide an index.
In electronic document or text systems (i.e., systems where the text is cognitively recognized by the system), techniques exist to create automatic “back-of-the-book” indexes (See H. Schutze, “The Hypertext Concordance: A Better Back-Of-The-Book Index”, Proc. COMPUTERM, ACL Coling, 1998). These back-of-the-book indexes allow users to scan a list of keywords in the index and find occurrences of the index terms in the text. However, these electronic text systems are based on the user entering text, such as from a keyboard, directly into the system.
In other electronic text systems, information retrieval techniques are used to automatically index textual documents. For example, in one such system, index terms are selected for Web pages based on relative frequency of term occurrence (See H. Schutze, “The Hypertext Concordance: A Better Back-Of-The-Book Index”, Proc. COMPUTERM, ACL Coling, 1998). However, these techniques do not apply directly to digital ink, since words in digital ink are not cognitively identified. In theory, digital ink could be converted to text using character recognition. However, character recognition is not accurate on handwritten data. Accordingly, it is desirable to automatically generate indexes from handwritten data entered as digital ink into a computer without character recognition.
Manual indexing by the user is possible in electronic systems that use digital ink rather than text. One example is the application of keywords to sections of electronic notes, as provided by the Dynomite System, developed at FX Palo Alto Laboratory, and as provided by Marquis (See K. Weber and A. Poon, “Marquis: A Tool For Real-Time Video Logging”, CHI 94). However, requiring the user to manually identify keywords to generate the index demands cognitive effort from the user during the note-taking process.
Another application for manual indexing of digital ink by a user is through the development of ink properties in the Dynomite system. An ink property is a data type applied to selected digital ink that allows that ink to be subsequently retrieved by type. Example data types include “name” or “to do” items. Ink index pages for a given ink property are created by a user to subsequently permit quick scanning of all notes that contain that property. In addition, notes on the index page are hyperlinked back to the original location in the notes. One significant problem associated with both the keyword and ink property manual approaches to generating indexes for digital ink systems is that they require significant cognitive effort on the part of the user. As a result, these techniques are not practical because the user is typically not disciplined enough to apply them.
A system for manually indexing historical handwritten document images is described in R. Manmatha, Chengfeng Han, E. M. Riseman and W. B. Croft, “Indexing Handwriting Using Word Matching”, ACM Digital Libraries, 1996. In this technique, images are segmented into words, and word equivalence classes are found by thresholding match scores between words. This technique requires the user to manually input words to specify the word equivalence classes. Index terms are then chosen from the largest word equivalence classes. In addition, stop words are manually eliminated. Since no stroke information on the handwritten data is available, match scores are computed based on the word images alone. Accordingly, it is desirable to automatically create indexes for handwritten digital ink, without user effort.
In A. Poon, K. Weber, T. Cass, “Scribbler: A Tool For Searching Digital Ink”, CHI 95, a technique called scribble matching is described. In general, scribble matching involves finding occurrences of a given word in a handwritten document. This technique is based on using dynamic programming to compute a score between the given handwritten word and the words in the document. A similar method is also described in D. Lopresti and A. Tomkins, “On The Searchability Of Electronic Ink”, Fourth International Workshop on Frontiers of Handwriting Recognition, December, 1994.
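By way of illustration only, a dynamic-programming score between two handwritten words might be sketched as follows. The sketch uses classic dynamic time warping over one-dimensional feature sequences as a stand-in; the cited references' actual features and scoring are not reproduced here, and the function name `dtw_score` and the absolute-difference cost are assumptions of this sketch.

```python
def dtw_score(a, b):
    """Dynamic time warping cost between two numeric feature sequences.

    Assumption: each ink word is reduced to a 1-D sequence of stroke
    features, and the local cost is the absolute difference. A lower
    score means the two words are more similar.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

Under these assumptions, two sequences that differ only in how long a feature value is held (for example, a word written slightly more slowly) align at zero cost, which is why warping-style scores are a natural fit for handwriting.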
As is described fully below, the present invention provides a system for automatically generating indexes for handwritten notes based on the strokes of the digital ink.
A system automatically generates indexes for handwritten notes captured as digital ink in a computer. Ink words, which roughly correspond to words in the notes, are identified. Features of the ink words are computed, and pairwise distances or match scores, which measure the distance in the features between two ink words, are calculated. From the pairwise distances, equivalence classes of ink words are determined by clustering the ink words. Index terms, which appear in the index for the handwritten notes, are selected from the equivalence classes of ink words. The system generates location information for the index terms that identifies a location in the handwritten notes where the index terms appear. An index of the index terms is displayed with the location information. In one embodiment, the notes index contains page numbers, displayed next to the index terms, to identify the page in the handwritten notes where the index term appears. In another embodiment, the index contains hyper-linked index terms.
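The flow from pairwise distances to an index with page numbers might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each ink word is assumed to be already reduced to a numeric feature vector, Euclidean distance stands in for the match score, single-link grouping stands in for the clustering step, and the names `cluster_ink_words`, `build_index`, and `min_occurrences` are hypothetical.

```python
from itertools import combinations
from math import dist


def cluster_ink_words(features, threshold):
    """Group ink words whose pairwise distance is at most `threshold`.

    Assumption: single-link grouping via union-find. Returns a list of
    equivalence classes, each a list of ink word indices.
    """
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(features)), 2):
        if dist(features[i], features[j]) <= threshold:
            parent[find(i)] = find(j)

    classes = {}
    for i in range(len(features)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values(), key=lambda members: members[0])


def build_index(features, pages, threshold, min_occurrences=2):
    """Return (representative ink word index, sorted page numbers) pairs.

    Assumption: a class becomes an index term when it has at least
    `min_occurrences` members; its first member represents it.
    """
    index = []
    for members in cluster_ink_words(features, threshold):
        if len(members) >= min_occurrences:
            index.append((members[0], sorted({pages[i] for i in members})))
    return index
```

For example, with five ink words of which the first two and the next two are near-duplicates, `build_index` yields two index entries, each listing the pages where that ink word occurs; the fifth word, which occurs once, is left out of the index.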
The system includes a novel technique to identify equivalence classes of ink words in handwritten notes. A threshold is generated to identify a maximum pairwise distance for the clustering of ink words. Specifically, a distribution curve, which represents a relationship between a number of occurrences among pairs of the ink words in the handwritten notes versus a pairwise distance, is generated. A knee of the distribution curve, τ, is approximated with a first line of gradient 0 from 0 to τ, and a second line comprising a constant gradient from the knee, τ, throughout pairwise distances on the distribution curve. The knee of the distribution curve, τ, is selected as the threshold for clustering.
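One way to approximate the knee is sketched below, purely as an illustration. It assumes the distribution curve is available as a histogram (bin distances and pair counts), tries each interior bin as a candidate knee, fits a gradient-0 line to the left and a least-squares line of constant gradient to the right, and keeps the candidate with the smallest total squared error. The least-squares criterion and the names `flat_error`, `line_error`, and `estimate_knee` are assumptions of this sketch rather than details taken from the description above.

```python
def flat_error(ys):
    """Squared error of the best gradient-0 (horizontal) line through ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def line_error(xs, ys):
    """Squared error of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return sum(
        (y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in zip(xs, ys)
    )


def estimate_knee(distances, counts):
    """Return the distance at which a flat-then-sloped fit has least error.

    `distances` are histogram bin positions in increasing order and
    `counts` the number of ink word pairs in each bin (assumed inputs).
    The knee bin is shared by both line segments.
    """
    best_tau, best_error = None, float("inf")
    for k in range(1, len(distances) - 1):
        error = flat_error(counts[: k + 1]) + line_error(
            distances[k:], counts[k:]
        )
        if error < best_error:
            best_tau, best_error = distances[k], error
    return best_tau
```

The returned value would then serve as the maximum pairwise distance when clustering ink words. Whether the two line segments should be constrained to meet at the knee is left open here; the sketch fits them independently for simplicity.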