With the advent of the Internet and of the Web, an incredibly large amount of information is available to each user connected thereto. However, a drawback of this huge amount of available information is that it is often difficult and time-consuming to identify the most relevant user posts, comments or reviews. Indeed, each comment, post or review is often linked to many others via direct references or anaphora.
Many search engines have been developed for conducting searches on the Web. For example, search engines are used to locate, by means of keywords, texts, images or videos stored on personal computers, on corporate intranets and on networks such as the World Wide Web. In order to simplify searches, classification is of primary importance. Indeed, a good classification allows users to easily find and discover documents that include the keywords they searched for.
For example, faceted metadata classification is often used to organize and present web content in e-commerce environments, where products can easily be broken down into their respective features for presentation purposes. The respective features are generally represented in vector form, which allows for easily locating a particular product having the features or aspects searched for by the user. Those features or aspects are typically referred to as facets.
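By way of illustration only, the vector representation of facets described above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any particular system: the facet vocabulary, the product names and the functions `to_vector`, `matches` and `search` are all hypothetical and are introduced here solely for the example.

```python
from typing import Dict, List, Set

# Hypothetical facet vocabulary; each position in a vector corresponds
# to one facet. These names are illustrative assumptions only.
FACETS: List[str] = [
    "color:red",
    "color:blue",
    "size:small",
    "size:large",
    "material:cotton",
]

# Hypothetical catalog mapping a product name to the facets it has.
CATALOG: Dict[str, Set[str]] = {
    "shirt-a": {"color:red", "size:small", "material:cotton"},
    "shirt-b": {"color:blue", "size:large", "material:cotton"},
    "scarf-c": {"color:red", "size:large"},
}


def to_vector(facets: Set[str]) -> List[int]:
    """Encode a set of facets as a binary vector over FACETS."""
    return [1 if facet in facets else 0 for facet in FACETS]


def matches(product_vec: List[int], query_vec: List[int]) -> bool:
    """A product matches when it has every facet set in the query."""
    return all(p >= q for p, q in zip(product_vec, query_vec))


def search(query_facets: Set[str]) -> List[str]:
    """Return the sorted names of products having all queried facets."""
    query_vec = to_vector(query_facets)
    return sorted(
        name
        for name, facets in CATALOG.items()
        if matches(to_vector(facets), query_vec)
    )
```

In this sketch, a query for the single facet `color:red` would select both red products, while a query combining `material:cotton` and `size:large` would narrow the result to the one product having both facets. A binary vector is only one possible encoding; weighted or sparse encodings could serve the same purpose.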
Several methods or approaches for automated classification of documents have been attempted based solely on the use of large-scale ontologies, thesauri or taxonomies, which predefine associations between concepts and assist with the determination of the most relevant categories based on the presence of concepts within these knowledge representation structures. It is believed that these approaches face severe limitations due to the difficulty of maintaining these often very large structures, in particular in the context of quickly evolving language (new acronyms and initialisms appearing daily, neologisms, etc.) and also with respect to the localization of the structures. Ensuring that they are equally effective across multiple languages requires very labour-intensive efforts to maintain, translate and update these structures.
Several methods or approaches for automated classification of documents have also been based solely on machine learning (unsupervised, semi-supervised and supervised). Although these approaches can compensate more adequately for concepts that would otherwise be missing from knowledge representation structures, they require extensive corpora for training purposes in order to work with satisfactory accuracy, and they suffer from this dependency on the corpus data itself. Numerous examples can be shown where the associations between concepts derived by these approaches are limited to the training set that was used and therefore do not adapt well as real data changes over time. These approaches require frequent retraining across large corpora of documents, which are again very labour-intensive to collect and organize.
Therefore, there is a need to overcome the above-discussed drawbacks related to classifying and searching user posts on the Web. Accordingly, a search system and method using a faceted classification of text objects for uncovering unexpected links between different concepts related to one or many user posts are sought. In particular, the manner in which the drawbacks of each of the above approaches or methods are overcome is key to the invention and will be further detailed below.
The foregoing and other aspects, advantages and features of the present invention will become more apparent upon reading of the following non-restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.