Personal and corporate content is dominated by unstructured data. According to various estimates, 70 to 90 percent of all usable data in organizations is unstructured information. Flexible capturing of unstructured data in a variety of formats has been greatly facilitated by the development of universal content management systems, such as the Evernote software and cloud service developed by the Evernote Corporation of Redwood City, Calif. In addition to typed text entries, documents and web clips, contemporary content collections can include handwritten notes taken on a variety of electronic devices, such as tablets running various operating systems, on regular or special paper via intelligent pen and paper solutions, on conventional whiteboards, smart walls and interactive displays, as well as notes scanned from traditional paper media such as legacy pads, bound notebooks, etc. Similarly, audio notes, including voice transcripts, are increasingly recorded on smartphones, tablets, specialized conferencing systems, home audio systems, wearable devices such as intelligent watches, and other recording hardware. Some models of intelligent pens are also capable of capturing and synchronizing handwritten and voice recordings.
A significant prevalence of unstructured data over organized information (represented by a majority of database content, by forms, tables, spreadsheets and many kinds of template-driven information) poses a major challenge and an impediment to efficient productivity workflows. Mainstream productivity systems used in sales, CRM (Customer Relationship Management), project management, financial, medical, industrial, civil services and many other areas are based on structured data represented by forms and other well-organized data formats. Manual conversion of freeform, unstructured information obtained in the field, in the office, at meetings and through other sources into valid data for productivity systems (for example, entering sales lead data into CRM software) takes significant time for many categories of workers and negatively affects job efficiency.
In response to this challenge, a sizable amount of research and development work has been dedicated to creating methods and systems for automatic and semi-automatic conversion of unstructured data into structured information. NLP (Natural Language Processing) and various flavors of data mining, NER (Named Entity Recognition for detecting personal, geographic and business names, date and time patterns, financial, medical and other “vertical” data) and NERD (NER+Disambiguation), together with other AI and data analysis technologies, have resulted in general purpose and specialized, commercial and free systems for automatic analysis and conversion of unstructured data.
Notwithstanding advances in facilitating unstructured data analysis and conversion into structured information, many challenges remain. For example, automatic recognition (conversion to text, transcription) of handwritten and voice data results in multi-variant answers where each word may be interpreted in different ways (e.g., ‘dock’ and ‘clock’ may be indistinguishable in handwriting, and it may be difficult to tell ‘seventy’ from ‘seventeen’ in a voice note). Even word segmentation may be uncertain (e.g., a particular sound chunk might represent one word or multiple different two-word arrangements). Such ambiguity may prevent or complicate direct application of known data analysis methods, which generally expect a single, unambiguous text as input.
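The multi-variant nature of recognition results may be illustrated with a minimal Python sketch. The data layout, the confidence values and the function name `interpretations` are hypothetical and chosen only for illustration; they do not describe the output format of any particular recognition engine. Each position in the note holds several alternative readings, and an alternative may itself span two words to model uncertain segmentation.

```python
from itertools import product

# Hypothetical recognizer output for a short voice note. Each position holds
# alternative readings with illustrative confidence scores. An alternative may
# contain two words, which models uncertain word segmentation.
voice_note = [
    [("call", 1.0)],
    [("seventeen", 0.55), ("seventy", 0.45)],
    [("clients", 1.0)],
    [("today", 0.6), ("to day", 0.25), ("two day", 0.15)],
]


def interpretations(lattice):
    """Return every (text, score) reading of the lattice, best score first."""
    results = []
    for combo in product(*lattice):
        text = " ".join(reading for reading, _ in combo)
        score = 1.0
        for _, confidence in combo:
            score *= confidence
        results.append((text, score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


ranked = interpretations(voice_note)
for text, score in ranked:
    print(f"{score:.4f}  {text}")
```

Even this four-position note yields six distinct candidate texts, and the two highest-ranked candidates differ only in ‘seventeen’ versus ‘seventy’. A conventional analysis method that accepts a single text would have to commit to one candidate before any entity detection takes place.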
Accordingly, it is desirable to develop methods and systems for automatic conversion of unstructured handwritten and audio data into structured information.