Clustering plays an important role in knowledge discovery and data mining, which are useful in domains ranging from biology to astronomy and from medicine to Web mining. Traditionally, clustering is performed on data sets in which the underlying relation R is defined between any two data points in the data set, or between any two points in the space containing the data points. By default, the relation between two points is considered to be symmetric. However, there exist many data sets in which the relation between two points need not be symmetric. For example, consider a set of English sentences in which the relation between two sentences reflects the closeness of their meaning. One sentence may subsume the meaning of the other, may be equal to the other in meaning, or may not relate to the meaning of the other at all. (Another example of an asymmetric relation is that between products: R(electronic toys, batteries) != R(batteries, electronic toys); R(Levi's jeans, Wrangler jeans) != R(Wrangler jeans, Levi's jeans).) Thus, there are many cases in which the relation is not symmetric, and the data must be mined under such circumstances.
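As an illustration only, the asymmetry of a relation between sentences can be sketched with a simple word-inclusion score. The function name `inclusion` and the scoring rule (the fraction of the distinct words of one sentence that occur in the other) are assumptions made for this example; they are not a relation defined in the text above.

```python
def inclusion(a: str, b: str) -> float:
    """Hypothetical asymmetric relation R(a, b): the fraction of the
    distinct words of sentence b that also occur in sentence a."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_b)


long_sentence = "the cat sat on the mat near the door"
short_sentence = "the cat sat"

# The longer sentence covers every word of the shorter one ...
r_long_short = inclusion(long_sentence, short_sentence)
# ... but the shorter sentence covers only part of the longer one.
r_short_long = inclusion(short_sentence, long_sentence)
```

Here R(long, short) and R(short, long) differ, so any clustering procedure that assumes a symmetric relation would discard the information that one sentence subsumes the other.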
Text summarization tools work on a single document or a collection of documents and generate a shorter text that summarizes the input. These summaries are useful for learning the contents of the documents without reading them in full, which in turn may help in judging the relevance of a document or a collection of documents. Most text summarization methods rank individual blocks of the input text, such as sentences and paragraphs, according to different criteria and summarize the text using the highly ranked blocks. The number of blocks of text output by these methods is either fixed or may be specified by the user. The blocks are ranked based on various criteria, viz., the block's position in the document, the semantic content of the block, and the block's similarity with the entire document. See, for example, US 05867164, “Interactive Document Summarization”; US 05638543, “Method and apparatus for automatic document summarization”; and US 05963969, “Document abstraction system and method thereof.” Some tools are specific to the domain of the documents; they construct a summary by finding occurrences of certain phrases in the document.
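The block-ranking approach described above might be sketched as follows. This is a minimal example under stated assumptions, not the method of any of the cited patents: blocks are taken to be sentences, the only ranking criterion is cosine similarity between a sentence's word counts and those of the whole document, and the names `summarize`, `cosine` and `_bag` are hypothetical.

```python
import math
import re
from collections import Counter


def _bag(text: str) -> Counter:
    """Bag of lowercase words for a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summarize(text: str, k: int = 2) -> list[str]:
    """Return the k sentences most similar to the whole document,
    kept in their original order (k plays the role of the
    user-specified number of output blocks)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    document = _bag(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(_bag(sentences[i]), document),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]
```

A position-based or phrase-based criterion could be combined with the similarity score by replacing the sort key; the sketch keeps a single criterion for brevity.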
Information retrieval and filtering systems most often work with a set of keywords called a dictionary. These keywords are generally obtained from a set of documents related to the system's domain of application. The performance of such systems depends critically on the selection of keywords and their usage. Organization of the keywords, viz., vocabulary organization, involves imposing a hierarchical structure on the keywords. The structure of the hierarchy depends on how the organization is to be used, and an organized vocabulary has many uses. Vocabulary organization can be done such that the keywords corresponding to the children of a node are conceptually independent of one another, yet all related to their parent node. This kind of organization addresses a problem commonly encountered by information systems, viz., the one caused by representing closely related concepts as independent concepts [Foltz 1990, Deerwester et al. 1988, Savia et al. 1998, Frakes et al. 1992, Chapter 9]. In [Foltz 1990, Deerwester et al. 1988], a set of orthogonal combinations of keywords is extracted and used for retrieval and filtering. In [Frakes et al. 1992], a hierarchy is formed based on the frequencies and density functions of various keywords. In [Savia et al. 1998], a hierarchical structure on the keywords is assumed and is used to represent user profiles as well as documents for information retrieval. Another use of vocabulary organization is to summarize a collection of documents using a hierarchy of keywords that reflects their distribution among the documents.
The arrangement of products in a physical or electronic store largely determines how easily a visiting customer finds the products she needs, and hence whether she visits the store again. Products in an ideal physical or electronic store should therefore be arranged such that any customer visiting the store finds the products she needs with minimal search effort. Designing a store involves arranging the various products within it. Traditionally, store design is done by experts who understand the needs of the customers and know all the products in the store. As the number of products and the number of customers increase (as is typically the case for electronic stores), it becomes difficult for an expert to design a store. The sources of information that can be used to automate store design are the data on the attribute values of the products in the store and the data on the purchase history of the customers. Data mining methods are employed to derive association rules between various products in the store from the customer purchase history [Agrawal et al. 1996, Srikant et al. 1995]. These methods yield associations among a subset of products, for example, “people buying wine also buy milk and bread.” Store designers use these association rules to design a store. The rules can be particularly useful for physical stores, where it is sufficient to mine the data for some major categories of items. In an electronic store, where there is no restriction on physical layout, one can mine for rules containing any number of items and use them to design the store. One of the embodiments of the present invention can be used to automatically generate a hierarchy of important products in a store and thus design the store at the item level rather than at the product category level. Apart from store design, these hierarchies are also useful in analyzing the key selling items in the store.
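For illustration, rules of the form "customers who buy A also buy B" can be derived from purchase history with a brute-force count of support and confidence. This sketch is not the mining algorithm of the cited works; the function name `pair_rules`, the restriction to pairs of items, the threshold values and the sample baskets are all assumptions made for the example.

```python
from itertools import permutations


def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return rules (a, b, support, confidence) meaning "a implies b".

    support    = fraction of baskets containing both a and b
    confidence = fraction of baskets containing a that also contain b
    """
    sets = [set(basket) for basket in baskets]
    n = len(sets)
    items = sorted(set().union(*sets))
    rules = []
    for a, b in permutations(items, 2):
        has_a = sum(1 for s in sets if a in s)
        both = sum(1 for s in sets if a in s and b in s)
        support = both / n
        confidence = both / has_a if has_a else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical purchase history: one list of items per customer visit.
baskets = [
    ["toy", "batteries"],
    ["toy", "batteries"],
    ["batteries"],
    ["batteries", "bread"],
]
rules = pair_rules(baskets)
```

With these baskets the rule "toy implies batteries" passes both thresholds, while the reverse rule does not, because only half of the baskets containing batteries also contain a toy. Confidence is thus itself an asymmetric relation between products.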
Store design can also be done using various attributes of the products in the store. To automatically find a hierarchy of products based on this data, one needs to know the similarity or dissimilarity between any two products in the store. Typically, these similarity relations are asymmetric.