Unstructured information now accounts for the majority of the data collected and used in the professional world. It plays an important role in the conduct of professional processes, but the fact that it is not immediately available and usable within those processes is a major handicap.
Coding the unstructured information according to a model (dimensions or parameters) allows it to be stored in a database management system (DBMS), which makes it available to and usable by the professional processes (decision-making processes and operational processes).
In many fields, companies have to generate, store and manage large quantities of information in electronic form. Accessing and understanding this information can play an important role in decision-making at all levels of the company (marketing strategy, commercial strategy, quality control, control of the customer relationship, etc.). In most cases, this information is in an unstructured form that does not allow its content to be analysed easily. Given the large volume of this information, most of the people involved use automatic text analysis techniques.
Various methods are known from the prior art for resolving the technical problems that arise in the field of automatic text analysis. This automatic analysis relates, for example, to the analysis of sentiments and opinions, to the analysis of risks, etc. One such method is the multi-modal merging technique, which addresses the fact that companies now need methodologies for automatically synthesizing information of different types: text and structured data, speech and structured data, text and speech, etc.
As an example, in the field of customer relationship management, known by the abbreviation “CRM”, companies need to correlate the information on the needs and expectations of customers (obtained from telephone calls, customer correspondence, customer messages or emails, surveys, forums, etc.) with the information obtained from the analysis of behavioural and demographic data. This correlation requires the integration and synthesis of heterogeneous data: unstructured data, such as speech data and textual data, on the one hand, and structured data on the other hand.
Also known are methods for processing heterogeneous information. The heterogeneity of the data to be processed is linked not only to multi-modality but also to the intrinsic heterogeneity of each type of data. As an example, if the interest is in written textual data from which sentiment and opinion information is likely to be extracted, the user is faced with free texts: summaries of correspondence, electronic messages, verbatim customer records of telephone calls, and open responses to opinion surveys. The data to be processed is highly heterogeneous: in terms of source, nature and quality when it comes to structured data, and in terms of source, nature or genre, quality, language register and idiom when it comes to unstructured data.
For automatic analysis, taking this structural heterogeneity into account is a methodological imperative that guarantees the effectiveness and the quality of the results obtained at the end of the analysis, whether the analysis is conducted for decision-making purposes, operational purposes, or both. There are also speech modelling methods. Extracting sentiments and opinions from streams of text or transcribed speech requires speech to be modelled.
U.S. Pat. No. 7,249,312 discloses a method and a system for assigning a score to unstructured data. The method described in this patent uses a maximum probability method to assign a score to parts of a document or of a data stream, and then aggregates the scores obtained to assign a final score to the document or to the data stream.
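The general "score the parts, then aggregate" pattern described above can be illustrated with a minimal Python sketch. It is not the method of the cited patent: the word-weight scorer stands in for the maximum probability method, and the function names, the sentence-based splitting and the choice of the mean as aggregate are all assumptions made for illustration.

```python
import re


def split_into_parts(document: str) -> list[str]:
    """Split a document into sentence-like parts on '.', '!' and '?'."""
    return [p.strip() for p in re.split(r"[.!?]+", document) if p.strip()]


def score_part(part: str, weights: dict[str, float]) -> float:
    """Toy stand-in scorer (assumption): sum of hand-set per-word weights."""
    return sum(weights.get(word, 0.0) for word in part.lower().split())


def score_document(document: str, weights: dict[str, float]) -> float:
    """Score each part, then aggregate the part scores into one final score.

    The mean is used as the aggregate here; this choice is an assumption.
    """
    part_scores = [score_part(p, weights) for p in split_into_parts(document)]
    return sum(part_scores) / len(part_scores) if part_scores else 0.0


# Hypothetical weights, for illustration only.
example_weights = {"good": 1.0, "bad": -1.0}
final_score = score_document("Good service. Bad delivery. Good price", example_weights)
```

Here the three parts receive the scores 1.0, -1.0 and 1.0, and the final score is their mean.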
One aim of the present invention is to provide a method and a system that make it possible, in particular, to process large volumes of data.
The invention relies notably on coding unstructured information as structured information via a “scoring” process, that is, a process of assigning a score to an element or a set of elements, without learning in the normal operation of the method. It also uses steps for modelling and extracting unstructured information, in order to analyse the content of the texts, extract from them the information relevant to the target applications, and represent that information in structured form.
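As an illustration of the idea of coding free text as a structured record by scoring, without a learning step, and then storing it in a DBMS, a minimal Python sketch follows. It is not the claimed method: the hand-written lexicon, the record fields (score, polarity, n_terms), the table name coded_text and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical hand-written lexicon: fixed weights, no training step involved.
LEXICON = {"satisfied": 1, "excellent": 2, "late": -1, "broken": -2}


def code_text(text_id: str, text: str) -> dict:
    """Code one free text as a structured record by summing lexicon weights."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    hits = [t for t in tokens if t in LEXICON]
    score = sum(LEXICON[t] for t in hits)
    polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"text_id": text_id, "score": score, "polarity": polarity, "n_terms": len(hits)}


def store(records: list[dict], conn: sqlite3.Connection) -> None:
    """Store the coded records in a relational table (SQLite as an example DBMS)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coded_text "
        "(text_id TEXT PRIMARY KEY, score INTEGER, polarity TEXT, n_terms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO coded_text VALUES (:text_id, :score, :polarity, :n_terms)",
        records,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
record = code_text("m1", "The parcel arrived late and broken.")
store([record], conn)
row = conn.execute(
    "SELECT score, polarity FROM coded_text WHERE text_id = 'm1'"
).fetchone()
```

Once stored in this form, the coded information can be queried alongside other structured data with ordinary SQL.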