With the rapid spread of the Internet, which has grown exponentially over the last two decades, nearly every part of human life and the activities surrounding it are now carried out online. The same is true of business and online trading. Previously, when buying and selling items, people browsed through huge paper catalogs containing thousands of records before making a decision. To search for a product of interest, a person first had to find, from the index or contents page, the probable topics or categories in which a product with that description might occur. The person then had to browse through each of the entries on that page to find the product needed, and repeat the procedure for new topics if no satisfactory result was found.
To make catalogs easier to search, more and more companies are turning to electronic catalogs. The user can search through such catalogs quickly and place an order for a product immediately. This saves a lot of time and money.
Today nearly every commodity of business or daily life is available online. One can buy everyday food items, medicines, machinery parts, and even cars or bikes on the Internet. When a person goes shopping in a physical market, he or she finds many different shops or outlets, each selling different items. The person can easily choose the shop of interest, go inside, and pick up the product needed. Another scenario is that of a supermarket, where a large variety of different products are stocked together in one place. Here the items are arranged according to their type: for example, food items at one end and, within that area, cereals in one section, vegetables in another, and a separate section for each other food type.
Likewise, for hardware and machinery parts, there will be one section displaying nuts of various kinds, another displaying bolts of various kinds, and so on. When multiple items are stocked in the same place, they are arranged according to type and category. An online store is comparable: here too the items need to be placed in different sections so that they can be distinguished from one another. However, different items come from different sources and therefore do not always carry a proper, standardized categorization. Moreover, the supplier simply provides the catalog information and does not provide any categorization for it. For such a catalog to be displayed online, a category must be attached to it. There is therefore a need for a system that can classify catalogs into the relevant categories so that they can be put to further use or processing.
This is where catalog classification comes into play. Classifiers can be parametric or non-parametric. Two well-known classes of non-parametric classifiers are decision trees and neural networks. For such classifiers, feature sets larger than 100 features are considered extremely large, whereas document classification may require more than 50,000 features.
The most mature ideas in information retrieval (IR) systems and text databases, which have also been successfully integrated into commercial text search systems, involve processing at a relatively syntactic level, e.g., stopword filtering, tokenizing, stemming, building inverted indices, computing heuristic term weights, and computing similarity measures between documents and queries in the vector-space model. More recent work includes statistical modeling of documents, unsupervised clustering (where documents are not labeled with topics and the goal is to discover coherent clusters), supervised classification, and query expansion. Singular value decomposition of the term-document matrix has been found to cluster semantically related documents together even if they do not share keywords.
Further, the classification system may be rule-based or machine-learning-based. In some instances, textual content must be classified with absolute certainty, based on certain accepted logic. A rule-based system may be used to effect such types of classification. Basically, rule-based systems use production rules of the form:
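The syntactic-level processing listed above can be illustrated with a minimal sketch. It assumes a tiny hypothetical stopword list, a smoothed TF-IDF weighting, and cosine similarity; stemming is omitted, and all function names are illustrative rather than taken from any particular system.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"a", "an", "and", "for", "of", "the", "with"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def tfidf_vector(text, index, n_docs):
    """Weight each term by its frequency times a smoothed inverse document frequency."""
    counts = Counter(tokenize(text))
    return {
        term: tf * (math.log((1 + n_docs) / (1 + len(index.get(term, ())))) + 1.0)
        for term, tf in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document ids ordered from most to least similar to the query."""
    index = build_inverted_index(docs)
    q_vec = tfidf_vector(query, index, len(docs))
    scores = {d: cosine(q_vec, tfidf_vector(t, index, len(docs))) for d, t in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because the weights depend only on token counts, a sketch like this cannot relate documents that share no keywords, which is the gap that singular value decomposition is reported to address.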
IF condition, THEN fact.
The conditions may include whether the textual information includes certain words or phrases, has a certain syntax, or has certain attributes. For example, if the textual content has the word “close”, the phrase “nasdaq” and a number, then it is classified as “stock market” text.
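A production rule of this kind can be sketched as follows. The helper names and the rule table are hypothetical; the single rule encodes the "stock market" example above, treating each condition as a predicate over the text.

```python
import re


def has_word(word):
    """Build a condition that is true when the text contains the given whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None


def has_number(text):
    """Condition that is true when the text contains at least one digit."""
    return re.search(r"\d", text) is not None


# Each rule is (list of conditions, resulting class): IF all conditions, THEN class.
RULES = [
    ([has_word("close"), has_word("nasdaq"), has_number], "stock market"),
]


def classify(text, rules=RULES):
    """Return the class of the first rule whose conditions all hold, else None."""
    for conditions, label in rules:
        if all(condition(text) for condition in conditions):
            return label
    return None
```

For instance, `classify("Nasdaq close: 4,021.5 points")` satisfies all three conditions, whereas `classify("Please close the door")` matches no rule and returns `None`.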
Unfortunately, in many instances, rule-based systems become unwieldy, particularly in instances where the number of measured or input values (or features or characteristics) becomes large, logic for combining conditions or rules becomes complex, and/or the number of possible classes becomes large. Since textual information may have many features and complex semantics, these limitations of rule-based systems make them inappropriate for classifying text in all but the simplest applications.
Over the last decade or so, other types of classifiers have been used increasingly. Although these classifiers do not use static, predefined logic, as do rule-based classifiers, they have outperformed rule-based classifiers in many applications. Such classifiers typically include a learning element and a performance element. Such classifiers may include neural networks, Bayesian networks, and support vector machines.
Most present-day document classification systems classify a document into the single most relevant category. In real life, however, a document often needs to be classified into more than one category. This need becomes more pressing for catalog data. For example, a certain product in a catalog may be a medical instrument for measuring blood pressure. A doctor will look for this product in the medical domain. A mechanical or electrical engineer manufacturing this product will look for it and similar products in the measuring instruments section. Likewise, many other people from varying backgrounds may look for the same product. A major disadvantage of present-day classification systems is that none of them allows a catalog to be classified into more than one category.
Present-day classification systems are based on statistical machine learning techniques. These systems have to be trained with adequate training data to produce good output. However, a system that is not properly trained does not report this fact; it simply makes decisions based on whatever training it has received. Whether such systems classify correctly or wrongly, they do so with full confidence and reveal nothing as to whether the training was inadequate or whether the classification task at hand is very new and different to the learner. When the output of the classifier is put directly on online display, it is therefore very risky for the end user to rely completely on the classification. Hence there is a need to provide the user with a confidence value for each classification. Based on this value, the end user can decide whether to check the result manually or to use it directly. Given such a confidence value, the user can accept automatically only those items classified with at least a certain confidence and keep the rest, for which the system is not quite sure of the exact class, for manual classification.
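As a rough illustration of multi-category assignment, the sketch below assumes the classifier already yields a score per category and simply keeps every category above a cut-off. The function name, threshold, and limit are hypothetical choices, not part of any existing system.

```python
def assign_categories(scores, threshold=0.3, max_categories=3):
    """Return every category scoring at least `threshold`, best first.

    `scores` maps category name to a score in [0, 1]; the threshold and the
    cap on the number of categories are assumptions of this sketch.
    """
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [category for category, score in ranked if score >= threshold][:max_categories]
```

With hypothetical scores of 0.55 for a medical category and 0.40 for a measuring-instruments category, the blood pressure instrument described above would be listed under both.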
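The use of a confidence value described above can be sketched as a simple routing step. The function name and the default threshold are hypothetical, and the confidence values are assumed to be supplied by the classifier.

```python
def route_by_confidence(predictions, threshold=0.8):
    """Split predictions into auto-accepted results and items for manual review.

    `predictions` is an iterable of (item_id, category, confidence) tuples.
    The threshold is a user-chosen value, assumed here to lie in [0, 1].
    """
    accepted, manual = [], []
    for item_id, category, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item_id, category))
        else:
            manual.append(item_id)
    return accepted, manual
```

Raising the threshold sends more items to manual classification and lowers the risk of displaying a wrong category; the right trade-off depends on the user's tolerance for errors.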
Often, when the user is not quite sure of the classification process and wants an overall idea of how well the classification has been performed, he or she is left with no option other than to go through all the catalogs again and check each of them manually. This wastes a lot of time. Moreover, if the user has to check all the catalogs manually, he or she may as well classify all of them manually rather than use automatic classification software, and the software becomes completely redundant. In such a scenario it would be very useful if the user were given only a very small subset of the catalogs such that manually checking this small set yields a very good estimate of the overall accuracy.
The existing classification systems are very rigid in their framework, i.e., they take a document as input and return the most relevant category as output. If an experienced user wants to give the system some information to help it classify better, he or she is unable to do so because the system allows no interaction with the user. For example, the user may have a rough idea of a product catalog from knowing the supplier it came from, and may wish to convey to the system the possible categories or segments of the hierarchy in which the catalog may lie. None of the present classification systems provides this feature.
The existing classification systems classify the content into one of the leaf-level categories of the category hierarchy. There may, however, be cases where a catalog item is not quite appropriate for any of the child categories below a certain parent category, and it would be more appropriate to classify the item at that non-leaf category instead of any of its child categories. Such functionality is not supported by any of the existing classification systems.
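One way to picture such sample-based checking is the sketch below. It assumes simple random sampling and a normal-approximation margin of error, neither of which is prescribed by the text above; the function names are hypothetical.

```python
import math
import random


def draw_sample(item_ids, sample_size, seed=0):
    """Pick a simple random sample of catalog ids for manual checking."""
    rng = random.Random(seed)
    items = list(item_ids)
    return rng.sample(items, min(sample_size, len(items)))


def estimate_accuracy(checked, z=1.96):
    """Estimate overall accuracy from manually checked results.

    `checked` is a list of booleans, True where the manual check agreed with
    the classifier. Returns (accuracy, margin), where the margin uses the
    normal approximation (z = 1.96 corresponds to roughly 95% confidence).
    """
    n = len(checked)
    if n == 0:
        raise ValueError("at least one manually checked item is required")
    accuracy = sum(checked) / n
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, margin
```

Under these assumptions, checking 100 sampled catalogs of which 90 are correct gives an estimated accuracy of 0.90 with a margin of about 0.059, regardless of how many catalogs were classified in total.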
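Classification at a non-leaf category could, for example, be approached by descending the hierarchy and stopping when no child fits well enough. The sketch below assumes a hypothetical scoring function and threshold; it illustrates the idea only and is not a description of any existing system.

```python
def classify_top_down(item, tree, score, root="ROOT", min_score=0.5):
    """Walk down the category tree, stopping at a parent if no child fits well.

    `tree` maps each category to a list of its child categories, and
    `score(item, category)` is a hypothetical function returning how well the
    item fits the category. `min_score` is an assumed cut-off.
    """
    node = root
    while tree.get(node):
        best_child = max(tree[node], key=lambda child: score(item, child))
        if score(item, best_child) < min_score:
            break  # no child is appropriate enough; stay at the parent
        node = best_child
    return node
```

With this approach an item that fits "Hardware" well but neither "Nuts" nor "Bolts" is reported at "Hardware" rather than being forced into a leaf.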
A classification system typically classifies the whole content by assigning equal weight to all the terms in the content. However, certain terms are of little importance to the catalog content, whereas other terms may be decisive, and based on them the system can determine the category into which the catalog should be classified. Such terms ought to be given more weight than the less important terms. Some form of feature selection procedure is therefore an immediate requirement for any classification system. Such a procedure should ideally distinguish the more important terms in a catalog from those of lesser importance and, based on this distinction, assign greater weight to the more important terms.
A variety of algorithms and methods are available for text and catalog classification. It has been observed that rule-based methods give better results on certain catalog data, while on other catalog data sets statistical methods give far better results than rule-based ones. An ideal classifier for catalog classification would therefore combine the good qualities of both rule-based and statistical techniques. The present-day classification systems, however, are either statistical or rule-based, and none is based on a combination of the two.
Usually catalogs come in more than one field, such as long description, short description, supplier name, dimensions, etc. If a present-day classification system is used to classify a catalog split into multiple fields, it simply merges all the information into a single field and sends that for classification. The user is aware that certain fields, such as supplier name and dimensions, are of much lesser importance than the description fields, but is unable to convey this very valuable information to the system, as it accepts all the information as one unified field. It would be very convenient if the user could input the different pieces of information in different fields and also assign a numerical value to each field as a measure of the importance of its contents. For example, the user may assign a high weight to the description fields as compared to the supplier or dimension fields.
The statistical model is built on the given input training catalogs. After the model is built, the user may feel that certain categories have not been adequately trained relative to the classification data that he or she may receive. The user may therefore wish to tweak the computed values of some terms in certain categories. None of the present-day classification systems allows the user the flexibility to tweak or change the built training model. The addition of such a feature would be very valuable and useful for the user.
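The field-level weighting described above can be sketched as weighted term counting. The field names, weight values, and function name below are hypothetical and serve only to illustrate the idea.

```python
import re
from collections import Counter


def weighted_term_counts(record, field_weights, default_weight=1.0):
    """Accumulate term counts across fields, scaling each field by its weight.

    `record` maps field name to text; `field_weights` maps field name to a
    user-assigned importance (fields not listed get `default_weight`).
    """
    totals = Counter()
    for field, text in record.items():
        weight = field_weights.get(field, default_weight)
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            totals[token] += weight
    return totals
```

For example, with a weight of 3.0 on a description field and 0.5 on the supplier name, a term in the description contributes six times as much as the same term in the supplier name.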
In most classification systems, the statistical model, once built from the training catalogs, is stored either in a database or in flat files. If new catalogs are added to the training data, or if any of the existing catalogs are changed, the user is left with no option other than to delete the old model and build a new model in its place. This is very time-consuming, as the system has to rebuild the whole model from scratch and repeat the processing for catalogs that were already processed. A utility that builds the training model incrementally would therefore be very useful and convenient. With it, if only a few catalogs are added or deleted, the system need only process the newly added or deleted catalogs. Likewise, if certain catalogs are changed, the system should process only the changed catalogs rather than all the catalogs.
At times, different categories may contain similar kinds of catalog data, yet in the training data provided they are split into many different categories. Training on such a category schema makes the training model quite weak. There may also be cases where certain categories in the hierarchy need to be mapped to a different category for better training and to strengthen the training model. A situation may also arise in which the training catalogs have been provided in a given category hierarchy but the hierarchy has since changed, so that the system needs to report the output of the classification in another category hierarchy. None of the present-day classification systems supports this functionality. Functionality that allows the user to map the category hierarchy to a different hierarchy for internal classification would therefore be very useful. With it, the user could perform the classification on one hierarchy and report the results in another.
A user may need to classify catalogs in various languages, ideally with a single system trained to handle all of them. Present-day classifiers, however, are specific to one particular language. A classifier built for English will not be able to classify catalogs in a different language, say German or Japanese, because it understands only English characters and can extract only English tokens. Since such a system cannot fulfill the purpose, the user is left with no alternative but to use a different classifier for each language. This brings further difficulties: each classifier may require input in a different format, so the user has to supply input specific to each language. To tackle this multilingual issue, the user has to bear considerable extra overhead in cost, time, and resources. This is mainly because no single present-day classification system is able to handle classification in more than one language.
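Incremental maintenance of a count-based model can be sketched as follows. The sketch assumes the model consists only of per-category term counts and catalog counts, which is a simplification; the class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class IncrementalCounts:
    """Per-category term counts that can be updated one catalog at a time."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # category -> term -> count
        self.doc_counts = Counter()              # category -> number of catalogs
        self._docs = {}                          # catalog id -> (category, tokens)

    def add(self, doc_id, category, tokens):
        """Add a catalog; if the id already exists, treat this as a change."""
        if doc_id in self._docs:
            self.remove(doc_id)
        tokens = list(tokens)
        self._docs[doc_id] = (category, tokens)
        self.term_counts[category].update(tokens)
        self.doc_counts[category] += 1

    def remove(self, doc_id):
        """Remove a catalog by subtracting only its own contribution."""
        category, tokens = self._docs.pop(doc_id)
        self.term_counts[category].subtract(tokens)
        self.doc_counts[category] -= 1
        # Unary plus drops terms whose count has fallen to zero.
        self.term_counts[category] = +self.term_counts[category]
```

Each operation touches only the terms of the affected catalog, so the cost of an update is proportional to the size of the change rather than to the size of the training set.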
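The hierarchy mapping described above can be pictured as two lookup tables, one applied to training labels and one applied to results. The category names, table names, and functions below are hypothetical examples.

```python
# Hypothetical mapping that merges similar training categories into one internal class.
TRAINING_TO_INTERNAL = {
    "Nuts/Hex": "Fasteners",
    "Nuts/Wing": "Fasteners",
}

# Hypothetical mapping from internal classes to the hierarchy used for reporting.
INTERNAL_TO_REPORT = {
    "Fasteners": "Hardware > Fasteners",
}


def remap(category, mapping):
    """Translate a category through a mapping, leaving unmapped names unchanged."""
    return mapping.get(category, category)


def remap_training_data(labelled_items, mapping=TRAINING_TO_INTERNAL):
    """Relabel (item, category) training pairs into the internal hierarchy."""
    return [(item, remap(category, mapping)) for item, category in labelled_items]


def report_category(internal_category, mapping=INTERNAL_TO_REPORT):
    """Translate an internal classification result into the reporting hierarchy."""
    return remap(internal_category, mapping)
```

Keeping the two mappings separate means a change to the reporting hierarchy requires editing only the second table, without retraining.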
U.S. Pat. No. 6,223,575 describes a multilevel taxonomy based on features derived from training documents classification using Fisher values as discrimination values. This patent uses tokens from the catalog to rapidly build and update the classification models. The hierarchical model built helps in efficient, context-sensitive classification. The drawback, however, is that a user cannot know the efficiency of the classification achieved by the system.
U.S. Pat. No. 6,192,360 describes a text classifier and the building of the text classifier by determining appropriate parameters for it. Although this patent describes an efficient method for parameter extraction from training catalogs, it is inefficient in the classification phase and the subsequent testing phase.
Another drawback of both of the above classifiers is that they are essentially for document classification and do not tackle the issues specific to catalog classification.