1. Technical Field
The present teaching relates to methods, systems, and programming for data processing. In particular, the present teaching is directed to methods, systems, and programming for characterizing heterogeneous aspects of data and systems incorporating the same.
2. Discussion of Technical Background
The advancement of the Internet has made it possible to make a tremendous amount of information accessible to users located anywhere in the world. With this explosion of information, new issues have arisen. Faced with all the information available, efficiently and effectively identifying data of interest poses a serious challenge. Much effort has been put into organizing the vast amount of information to facilitate searching for information in a more systematic manner. Along that line, different techniques have been developed to classify content into meaningful categories in order to facilitate subsequent searches or queries. Imposing organization and structure on content has made it possible to achieve more meaningful searches and has promoted more targeted commercial activities. For example, categorizing a piece of content into a class with a designated topic or interest often greatly facilitates the selection of advertisement information that is more on point and relevant.
Categorizing data appropriately requires that the data be represented in a way that accurately characterizes it. In general, each piece of data can have properties that reflect the multi-faceted nature of the data. For example, an image can be characterized based on colors present in the image (e.g., bright red color), individual objects present in the image (e.g., Tiger Woods appearing in the image), or a central theme conveyed by the entire image (e.g., the golf tournament in England featuring Tiger Woods with a sunset background). A data set can thus be characterized by heterogeneous sets of features, some highly semantic (e.g., the golf tournament scene) and some associated with non-semantic aspects of the data (e.g., bright red color in an image). Different aspects of a data set can be useful for different purposes. For instance, although the feature of bright red color does not seem to have any semantic meaning, it can be very descriptive when a user is searching for a sunset scene. In that case, a semantic feature of the image, such as the golf tournament scene, is not as helpful. Fully describing the different aspects of a data set is not an easy task.
Traditionally, various aspects of a data set are characterized using heterogeneous sets of features, as shown in FIG. 1(a) (Prior Art), where data 125 can be characterized using feature set 1 110, feature set 2 115, feature set 3 120, . . . , feature set K 105. Each feature set can have more than one feature, and each feature in any feature set can take different values. This is shown in FIG. 1(b) (Prior Art), which depicts multiple feature sets: feature set 1 155, feature set 2 160, feature set 3 165, . . . , feature set K 167. Feature set 1 155 has multiple features, e.g., F11, F12, . . . , F1,N1, and each feature can take one of multiple values. As illustrated, feature F11 may take any value from a set of possible values for that feature, [V11,1, V11,2, . . . , V11,m11].
Different features often have inherently very different types of feature values. For instance, the color red can be represented using a (numerical) color code, but an estimated theme of an image, e.g., “golf tournament in England,” may be represented by a text string. Because of this, different feature sets are traditionally processed differently. For example, to match data set 1 with data set 2, features for each may be extracted first. Such extracted features frequently fall within different feature sets and have different types of feature values. To determine whether data set 1 is similar to data set 2, corresponding feature sets are conventionally compared. For example, the color feature of data set 1 is compared with the color feature of data set 2 to determine whether the two data sets are similar in color. Comparing color codes likely involves numerical processing. In addition, a feature characterizing the central theme of data set 1 is compared with the corresponding feature of data set 2 to see whether they have a similar underlying theme. Comparing such a feature likely involves text processing, which may be very different from color processing. As a result, different algorithms and processing modules often need to be developed both for extracting features from data and for matching data based on their features.
Therefore, there is a need to develop a representation scheme that provides a uniform way to characterize different aspects of a data set so that processing associated with the data set, such as archiving or searching, can accordingly be made more uniform.
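For illustration only, the conventional per-feature-set processing described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation from the present teaching: the names DataSet, color_similarity, theme_similarity, COMPARATORS, and compare are hypothetical, the color feature is assumed to be an RGB triple, and the theme feature is assumed to be a plain text string compared with a generic string-similarity ratio.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, Tuple


@dataclass
class DataSet:
    """Hypothetical data set described by two heterogeneous features."""
    color: Tuple[int, int, int]  # numerical feature (assumed RGB color code)
    theme: str                   # textual feature (assumed estimated theme)


def color_similarity(a: Tuple[int, int, int], b: Tuple[int, int, int]) -> float:
    """Numerical processing: 1.0 for identical colors, 0.0 for black vs. white."""
    max_dist = (3 * 255 ** 2) ** 0.5
    dist = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return 1.0 - dist / max_dist


def theme_similarity(a: str, b: str) -> float:
    """Text processing: character-level similarity ratio of two theme strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Each feature set needs its own, separately developed comparison routine.
COMPARATORS: Dict[str, Callable] = {
    "color": color_similarity,
    "theme": theme_similarity,
}


def compare(d1: DataSet, d2: DataSet) -> Dict[str, float]:
    """Compare corresponding features, dispatching on the feature's kind."""
    return {
        name: fn(getattr(d1, name), getattr(d2, name))
        for name, fn in COMPARATORS.items()
    }


if __name__ == "__main__":
    image_1 = DataSet(color=(255, 0, 0), theme="golf tournament in England")
    image_2 = DataSet(color=(250, 10, 5), theme="golf tournament at sunset")
    print(compare(image_1, image_2))
```

In this sketch, every additional feature set would require another type-specific comparator to be written and registered, which is the kind of non-uniformity that a uniform representation scheme is intended to avoid.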