1. Technical Field
The present teaching relates to methods, systems, and programming for processing information. Particularly, the present teaching is directed to methods, systems, and programming for processing information using de-duplication.
2. Discussion of Technical Background
The advancement of the Internet has made a tremendous amount of information accessible to users located anywhere in the world. With this explosion of information, new issues have arisen. First, much effort has been put into organizing the vast amount of information so that it can be searched more effectively and systematically. Along that line, different techniques have been developed to automatically or semi-automatically categorize content on the Internet into different topics and to organize those topics in, e.g., a hierarchical fashion. Some techniques involve the creation of a grid-based system to facilitate large-scale clustering of data and the de-duplication of redundant data objects within the system. Imposing organization and structure on content has led to more meaningful search and has promoted more targeted commercial activities. For example, associating a piece of content with a designated topic identifier often greatly facilitates the presentation of information that is more on point and relevant. However, the categorization of new content for incorporation into an existing database is a relatively slow process.
New, time-sensitive content relevant to existing content is constantly being created, and existing solutions fail to incorporate such changes into existing categorization systems in a timely manner. An important issue is how to quickly categorize useful information out of massive amounts of available content so that the information can be made available to users within a matter of minutes. For example, certain processing and enriching systems tasked with identifying relationships between pieces of content take in source objects from various feeds, find duplicates, and merge them to create a composite object. These processes may be performed periodically at specific times and may require several hours or even days to fully integrate newly generated content into a searchable database, grid, index, or other system. Because certain types of content, such as limited-time offers or auctions, are time-sensitive, existing processing methods and systems are simply too slow to categorize and index this information such that it can be provided to users within the pertinent timeframe.
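By way of illustration only, the batch flow described above (taking in source objects from feeds, finding duplicates, and merging them into a composite object) might be sketched as follows. The field names (`feed`, `title`), the title-based duplicate key, and the "first non-empty value wins" merge rule are assumptions made for this sketch; the present discussion does not specify how existing systems detect or merge duplicates.

```python
from collections import defaultdict


def dedup_key(obj):
    """Hypothetical duplicate key: the title, lower-cased with whitespace collapsed."""
    return " ".join(obj["title"].lower().split())


def merge_duplicates(source_objects):
    """Group source objects by duplicate key and merge each group into one composite.

    Assumption for this sketch: each source object is a dict carrying at least
    a 'feed' name and a 'title'. Real systems may use very different matching.
    """
    groups = defaultdict(list)
    for obj in source_objects:
        groups[dedup_key(obj)].append(obj)

    composites = []
    for key, members in groups.items():
        # Record which feeds contributed to this composite object.
        composite = {"key": key, "feeds": sorted({m["feed"] for m in members})}
        for member in members:
            for field, value in member.items():
                # Keep the first non-empty value seen for each field.
                if field != "feed" and value not in (None, ""):
                    composite.setdefault(field, value)
        composites.append(composite)
    return composites
```

Calling `merge_duplicates` on a full batch of objects mirrors the periodic processing described above: every object must be collected before any composite is produced, which is one reason such a flow can lag behind newly created, time-sensitive content.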