Conversion of paper documents to electronic form speeds and enhances many business processes. Business documents often contain identifying information that allows documents to be routed or stored properly, and there is great value in extracting this information automatically from scanned document images.
Business processes in the office are facilitated by networks of computers and so-called multifunction devices. These devices incorporate printers, faxes, and scanners that, coupled with servers running the proper software, provide the functionality to convert paper documents to electronic form and vice versa. Furthermore, these networked devices can connect to personal digital assistants, cell phones, and other hand-held devices. It may be desirable to extract business information from documents in a networked environment to route, share, store, and/or display the information where it is most useful.
Owing to the expense of paper handling, many businesses, such as banks, law firms, and the like, seek to eliminate paper workflows by scanning mail and converting faxes to electronic form as soon as they are delivered to the mailroom and then routing them electronically. Electronic routing is faster and cheaper than handling hard copy.
Many of today's offices receive and distribute numerous types or genres of business documents each day. For instance, a typical office may receive and distribute business cards, business letters, memoranda, resumes, invoices, and the like.
Conventional document handling systems exist for scanning business documents, processing the scanned document, and storing the scanned document in a desired repository. Similarly, conventional systems exist for receiving a fax of a document as an electronic file, processing the electronic file, and routing the file to a desired location. Also, in conventional systems, an electronic document can be distributed as an attachment to an email, processed, and routed to a desired user or storage location.
These conventional systems are labor-intensive and prone to error. For example, these systems typically require a user to instruct the system as to the type of document being input and the type of routing or distribution that the processed document should follow.
Therefore, it may be desirable to provide an electronic document genre classification system and method that is automated and substantially error-free. Moreover, it may be desirable to provide an electronic document genre classification system that converts various types of business documents to a universal format, for example, a rich text format, so that the text of the documents can be parsed, tokenized, and sequenced. Furthermore, it may be desirable to provide an electronic document genre classification system that can determine the probabilities associated with parsing the processed electronic document with a number of predefined document grammars, classify the genre of the electronic document from those probabilities, and route the document based on the determined genre.
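The desired flow of tokenizing, scoring against per-genre grammars, classifying, and routing can be illustrated with a minimal sketch. It is not the system described here: a simple bigram model over token classes stands in for each predefined document grammar, and the token classes, weights, floor value, and routing destinations (`TOKEN_PATTERNS`, `GENRE_MODELS`, `FLOOR`, `ROUTES`) are all hypothetical, hand-picked values rather than anything estimated from data. The sketch also assumes the document text has already been extracted into a plain string.

```python
import math
import re

# Hypothetical token classes; the text does not specify a token inventory.
TOKEN_PATTERNS = [
    ("EMAIL", re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")),
    ("PHONE", re.compile(r"^\+?[\d().-]{7,}$")),
    ("AMOUNT", re.compile(r"^\$\d[\d,]*(\.\d+)?$")),
    ("SALUTATION", re.compile(r"^(dear|sincerely|regards),?$", re.IGNORECASE)),
    ("NUMBER", re.compile(r"^\d+$")),
]

# Stand-in "grammars": illustrative bigram weights per genre (made up).
GENRE_MODELS = {
    "business_card": {
        ("START", "WORD"): 0.8, ("WORD", "WORD"): 0.4,
        ("WORD", "EMAIL"): 0.3, ("WORD", "PHONE"): 0.2,
        ("EMAIL", "PHONE"): 0.5, ("PHONE", "EMAIL"): 0.5,
    },
    "letter": {
        ("START", "SALUTATION"): 0.7, ("SALUTATION", "WORD"): 0.9,
        ("WORD", "WORD"): 0.8, ("WORD", "SALUTATION"): 0.1,
    },
    "invoice": {
        ("START", "WORD"): 0.6, ("WORD", "NUMBER"): 0.3,
        ("NUMBER", "WORD"): 0.5, ("WORD", "AMOUNT"): 0.3,
        ("WORD", "WORD"): 0.3,
    },
}
FLOOR = 1e-3  # weight assigned to transitions a genre model does not list

# Hypothetical routing table: genre -> destination.
ROUTES = {
    "business_card": "contacts",
    "letter": "correspondence",
    "invoice": "accounts_payable",
}


def tokenize(text):
    """Map each whitespace-separated word to a token class, in order."""
    tokens = []
    for word in text.split():
        for label, pattern in TOKEN_PATTERNS:
            if pattern.match(word):
                tokens.append(label)
                break
        else:
            tokens.append("WORD")
    return tokens


def log_score(tokens, model):
    """Sum the log weights of consecutive token pairs under one genre model."""
    total, prev = 0.0, "START"
    for tok in tokens:
        total += math.log(model.get((prev, tok), FLOOR))
        prev = tok
    return total


def classify(text):
    """Return the best-scoring genre and normalized scores for all genres."""
    tokens = tokenize(text)
    scores = {g: log_score(tokens, m) for g, m in GENRE_MODELS.items()}
    peak = max(scores.values())
    # Log-sum-exp normalization avoids underflow on long token sequences.
    norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
    posteriors = {g: math.exp(s - norm) for g, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


def route(text):
    """Choose a destination from the classified genre."""
    genre, _ = classify(text)
    return ROUTES[genre]
```

For instance, `route("Invoice 1042 total $1,250.00")` tokenizes the text to `WORD NUMBER WORD AMOUNT`, which scores highest under the hypothetical invoice model. A fuller implementation would likely replace the bigram tables with stochastic grammars and learn their probabilities from labeled documents.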