Technology is advancing and changing at a rapid pace. New infrastructure, products, applications, services and technologies are emerging to meet the needs of the era, and new applications (apps) are being developed with the help of improvements in the underlying hardware. Activities such as analysis, detection, prediction, forecasting, design and planning are essential either for creating new advancements or for deciding on future road maps. These activities depend heavily on the availability of valuable information.
Multi-dimensional data sets are available both online, for example on the Internet, and offline in the form of directory services (pdf, xml, doc, html etc.), and are usually unstructured. It is highly likely that the presentation structure, the keywords used, the acronyms defined, the table formats followed and so on differ from one source to another. Extracting the valuable or meaningful information specific to a particular domain from multi-dimensional data sets in an automated manner is therefore a challenging task. Moreover, collecting data from these kinds of sources is a very large and highly time-consuming task in practice.
In the literature, methods, tools and systems are available that perform crawling and data mining as independent processes. Running the two independently may increase time, memory and processor complexity and may not guarantee the desired accuracy. A joint crawling and data mining approach is therefore required. To evaluate the complexity and accuracy of the joint crawling and data mining approach, we consider the automatic update of a network device repository as a use case, through a tool-based system implementation that can readily be extended to other domains.
In the use case considered above, the repository needs to be refreshed periodically for the tools that rely on it to remain usable. Manual update of such a huge repository is a time-consuming and error-prone process, so automated information extraction becomes imperative. Furthermore, the extraction of information becomes more challenging because different sources follow different documentation standards.
The tools, methods and systems available in the prior art do not provide the flexibility to extract meaningful information by learning, through self-learning systems, the differences in documentation standards used in the public domain. Furthermore, the available systems that do have self-learning capability fail to extract relevant information because of challenges posed by data mining techniques.
Thus, there exists a need for a system that enables automated information retrieval with an ongoing ability to learn such documentation standards, by incorporating a self-learning crawler and by extracting meaningful information from the crawled data set through rule-based data mining techniques.
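As an illustrative, non-limiting sketch of what rule-based extraction across differing documentation standards could look like, the following Python example maps source-specific label variants onto canonical fields of a device record. The field names, label variants, function names and sample documents are all hypothetical and are not taken from any particular repository or system; the self-learning crawler itself is not modeled here.

```python
import re

# Hypothetical rule table (an assumption for illustration): each canonical
# field maps to label variants that different sources might use.
FIELD_RULES = {
    "model": ["model", "model number", "product id"],
    "ports": ["ports", "number of ports", "port count"],
    "throughput": ["throughput", "switching capacity", "bandwidth"],
}


def compile_rules(rules):
    """Turn each list of label variants into one case-insensitive pattern."""
    compiled = {}
    for field, labels in rules.items():
        # Longest labels first so "number of ports" is tried before "ports".
        alternatives = "|".join(
            re.escape(label) for label in sorted(labels, key=len, reverse=True)
        )
        # Matches lines of the form "<label> <separator> <value>".
        compiled[field] = re.compile(
            rf"^\s*(?:{alternatives})\s*[:=\-]\s*(.+?)\s*$",
            re.IGNORECASE | re.MULTILINE,
        )
    return compiled


def extract_fields(text, compiled_rules):
    """Return the first value found for each canonical field in the text."""
    record = {}
    for field, pattern in compiled_rules.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1)
    return record


RULES = compile_rules(FIELD_RULES)

# Two made-up documents that describe similar devices in different styles.
doc_a = "Model: RX-100\nNumber of Ports: 24\nSwitching Capacity: 48 Gbps"
doc_b = "product id = RX-200\nport count - 48\nbandwidth: 96 Gbps"

record_a = extract_fields(doc_a, RULES)
record_b = extract_fields(doc_b, RULES)
```

In this sketch, both documents yield records with the same three keys even though their labels and separators differ. A learning component could, under the same assumptions, extend `FIELD_RULES` with newly observed label variants over time.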