Information overloading in computer networks and in data sources is a widely recognized yet largely unsolved problem. Sufferers include knowledge workers searching for the “right” information, analysts evaluating and summarizing customer concerns and feedback in the light of ever-changing products and services, managers trying to make predictions and decisions based on monitoring complex business processes, customers looking for help, and so on.
When faced with a large volume of data, a natural human approach is to “divide and conquer”, i.e. to look for similarities and group together closely equivalent items, in a taxonomic or hierarchical fashion. Frequently, this analysis can lead to generally accepted hierarchical classification schemes. For example, data items in a data source relating to items in a supermarket or web-based retailer, or to product catalogues, topics in an online discussion forum, software component libraries, or scientific literature, inter alia, use “divide and conquer” style taxonomies.
FIG. 1 shows an example of data items arranged in a hierarchical classification scheme. The structure is tree shaped, with a top level node 100 and a number of second level nodes 110, 120, 130 linked to the top level node 100 by branches. Second level node 110 is connected to third level nodes 112 and 114, second level node 120 is connected to third level nodes 122, 124, 126, and second level node 130 is connected to third level nodes 132, 134. Each node has a node name. Each node at each level of the hierarchy is associated with a data item in a database. In the present example, the database relates to electronic products, and the top level node 100 is named “Electronic Products”. The second level nodes are named “Video”, “Audio”, and “Telecommunication”. The third level nodes are named as subsets of the second level nodes, and are named “Digital Camera”, “Camcorder”, “Cassette Player”, “CD Player”, “MP3 Player”, “Telephone”, and “Mobile Phone”. It is noted that it is not essential that all bottom level nodes are at the same level in the hierarchy.
Data items in a database are generally organized in conceptual structures that define the conceptual relationships among data items to allow data to be expressed in a useful format. As an example, the collection of electronic products described in the preceding paragraphs may be considered as the objects of interest, namely: “Digital Camera”, “Camcorder”, “Cassette Player”, “CD Player”, “MP3 Player”, “Telephone”, and “Mobile Phone”. The attributes of these objects are identified as “Audio”, “Video”, and “Telecommunication”. The identification of objects and attributes allows a context to be determined. Essentially, the context defines the relationships between the objects and the attributes.
As illustrated in FIG. 2, a context is represented as a table arranged in rows and columns. However, it will be appreciated by the person skilled in the art that a context may be represented in other formats. In the table, the rows are labelled by objects, and the columns are labelled by attributes. As shown in the table of FIG. 2, a “1” indicates that an object possesses the attribute, whereas a “0” indicates absence of the attribute. For example, a digital camera is associated with the attribute “Video”. In another example, a mobile phone is associated with the attributes “Video”, “Audio”, and “Telecommunication”.
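By way of illustration only, a context of this kind could be held in memory as a mapping from each object to a row of 0/1 values. The following Python sketch is not taken from FIG. 2 itself: the rows for “Digital Camera”, “Camcorder” and “Mobile Phone” follow the text, while the remaining rows are assumed from the hierarchy of FIG. 1, and the names `ATTRIBUTES`, `CONTEXT`, `attributes_of` and `objects_with` are hypothetical.

```python
# Column labels (attributes) of the context table.
ATTRIBUTES = ["Video", "Audio", "Telecommunication"]

# Row labels (objects) mapped to their 0/1 cells, one cell per attribute.
# Assumption: rows other than Digital Camera, Camcorder and Mobile Phone
# are inferred from the FIG. 1 hierarchy, not quoted from FIG. 2.
CONTEXT = {
    "Digital Camera":  [1, 0, 0],
    "Camcorder":       [1, 0, 0],
    "Cassette Player": [0, 1, 0],
    "CD Player":       [0, 1, 0],
    "MP3 Player":      [0, 1, 0],
    "Telephone":       [0, 0, 1],
    "Mobile Phone":    [1, 1, 1],
}


def attributes_of(obj):
    """Return the set of attributes whose cell is 1 in the object's row."""
    return {a for a, bit in zip(ATTRIBUTES, CONTEXT[obj]) if bit == 1}


def objects_with(attr):
    """Return the set of objects whose cell is 1 in the attribute's column."""
    col = ATTRIBUTES.index(attr)
    return {o for o, row in CONTEXT.items() if row[col] == 1}
```

Reading a row gives the attributes of an object, and reading a column gives the objects possessing an attribute; both views are needed when the lattice is derived from the context.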
The context in FIG. 2 is commonly represented by a lattice diagram generated using Formal Concept Analysis (FCA). A lattice diagram generated from the context of FIG. 2 is illustrated in FIG. 3. Each node in the lattice diagram represents a set of zero, one or more of the attributes from the context and the set of zero, one or more of the objects from the context which has that set of attributes.
The lattice diagram in FIG. 3 begins from a common node 200, to which a number of first level nodes 202, 204, 206 are linked by branches. Each of the first level nodes is in turn connected to the second level node 208. In FIG. 3, each of the first and second level nodes is associated with a product (or set of products) (for example: digital camera (DC), telephone, or mp3 player (mp3)), and each product (or set of products) associated with a node has all of the attributes associated with the same node (e.g. node 202 is associated with the sole attribute Video (V) and the products Digital Camera (DC) and Camcorder (C)). As shown in FIG. 3, the lower half of each node is shaded to represent an associated object (or set of objects). The upper half of a node is “cross hatched” to represent the highest node corresponding to an attribute. The second level node 208 in this example is associated with all three attributes (Video (V), Telecommunication (T), and Audio (A)) and with the single device having all three attributes (the Mobile Phone MP).
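The nodes of such a lattice can be enumerated directly from a crisp context. The sketch below uses a brute-force approach chosen here for brevity (it is not stated to be the method used to produce FIG. 3): for every subset of attributes it collects the objects having all of them, then the attributes shared by all of those objects. The context values are the same illustrative assumptions drawn from FIG. 1, and all function names are hypothetical.

```python
from itertools import combinations

ATTRIBUTES = ("Video", "Audio", "Telecommunication")

# Illustrative context: object -> set of attributes it possesses.
# Assumption: values are inferred from FIG. 1, not quoted from FIG. 2.
CONTEXT = {
    "Digital Camera": {"Video"},
    "Camcorder": {"Video"},
    "Cassette Player": {"Audio"},
    "CD Player": {"Audio"},
    "MP3 Player": {"Audio"},
    "Telephone": {"Telecommunication"},
    "Mobile Phone": {"Video", "Audio", "Telecommunication"},
}


def extent(attrs):
    """Objects that possess every attribute in attrs."""
    return frozenset(o for o, has in CONTEXT.items() if set(attrs) <= has)


def intent(objs):
    """Attributes shared by every object in objs."""
    shared = set(ATTRIBUTES)
    for o in objs:
        shared &= CONTEXT[o]
    return frozenset(shared)


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every attribute subset."""
    concepts = set()
    for size in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, size):
            objs = extent(attrs)
            concepts.add((objs, intent(objs)))
    return concepts
```

For this context the enumeration yields five concepts, matching the five nodes 200 to 208. Each extent computed here lists every object having the node's attributes, so the Mobile Phone also appears in the extents of the single-attribute nodes, whereas the example given for node 202 names only the Digital Camera and the Camcorder.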
However, categorization of data items may be based on subjective criteria, rather than objective definitions. For example, in categorizing data relating to films, there is no clear definition as to whether a film falls in a specific category, such as comedy or drama. Fuzzy set theory that applies graded membership to objects in a context is generally used to categorize this type of data. The graded membership simply assigns “0” and “1” to correspond to non-membership and full membership respectively, and any number in the range of 0 to 1 to represent an ordered scale of intermediate membership.
An example of fuzzy set theory applied to data relating to films is illustrated in FIG. 4. As shown in the table of FIG. 4, a graded membership is used to indicate the degree to which an object possesses an attribute, and a “0” indicates absence of the attribute. For example, a romantic comedy (Movie 1) may be graded as “0.8” under comedy, and “0.7” under romance.
Similarly, the context in FIG. 4 can be represented by a lattice diagram generated using fuzzy FCA. A lattice diagram generated from a context determined based on fuzzy set theory is illustrated in FIG. 5. However, as shown in FIG. 5, a fuzzy formal concept lattice can be very complex, due to a large number of concepts which differ slightly in membership. The complex appearance of the lattice diagram is unlikely to provide useful information to a user. Furthermore, it will be appreciated by a skilled person that a complex and large lattice requires more memory and is computationally less efficient. Thus, it is desirable to reduce the size of a lattice diagram such that it is represented in a useful and manageable form. It is also desirable to retrieve data from a data source in an efficient manner.
“Fuzzy Concept Lattice Based Classifier” by Wen Zhou et al (published in the proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2007 on 1 Aug. 2007 at pages 598-602, XP031192478, IEEE, Piscataway, N.J., US ISBN: 978-0-7695-2874-8) proposes a method for compressing a lattice formed using Formal Concept Analysis (FCA) in order to address the problem that the use of FCA generates very large and complex lattices. However, although it wishes to address fuzzy memberships, it actually proceeds in its method by firstly crispening the lattice before then reducing it (see especially Definition 2 of the paper, where it selects the extents and intents of the concepts using a windowing technique which selects crisp members only, based on whether their membership falls within or without a window).
The invention in certain example embodiments provides a method for processing data items in a data source, and an equivalent apparatus. More specifically, the method reduces a lattice representation of data items in a data source. The method comprises the steps of retrieving data items from the data source and determining a context defining relationships between data items and one or more properties associated with the data items. An initial lattice data structure is generated based on the determined context. The size of the lattice is reduced by dividing the initial lattice data structure into a plurality of subgroups of lattice (or lattice subgroups), reducing the size of each of the lattice subgroups, and combining them to form a reduced lattice data structure. The invention in certain example embodiments provides a data structure that is less complex and represents more relevant, useful information. The data also occupy less memory space and subsequent operations on or using the data are computationally more efficient. For example, in an online shopping recommendation system, “outlier” purchases would tend to be ignored by the recommendation system, thereby providing the shopper with more relevant recommendations. The reduced data can also be stored on a handheld device without occupying significant memory space.
In a first aspect of the invention there is provided a method for processing data items in a data source, wherein each of the data items comprises one or more properties associated therewith, the method comprising determining a context defining relationships between the data items and the associated one or more properties, generating an initial lattice data structure based on said context, wherein said initial lattice comprises a plurality of nodes, a node representing one of said one or more properties, and one or more data items are associated with nodes of said plurality of nodes such that data items that are associated with the same node are associated with the same property, and reducing the initial lattice by removing nodes within the initial lattice such that the number of nodes in the reduced lattice is less than the number of nodes in the initial lattice.
The step of reducing the initial lattice may further comprise the step of dividing said initial lattice into a plurality of subgroups of lattice such that a subgroup of lattice comprises a subset of said plurality of nodes, and applying lattice reduction to reduce one or more of said subgroup of lattice.
Preferably the initial lattice and each subgroup are fuzzy lattices (inasmuch as one or other of the extent or intent, or both, of at least some of the nodes of the lattice have non-crisp values; e.g. a node may have an intent of Comedy and an extent of Movie_1/0.8 and Movie_2/0.6). Equivalently, the lattices may be considered as being fuzzy lattices if one or both of the total intent and total extent of the lattice are fuzzy sets.
In a preferred embodiment, the lattices are fuzzy lattices for which each node in the lattice represents a concept and each concept can be defined as a fuzzy formal concept comprising a pair X, Y where X is a fuzzy set of objects and Y is a crisp set of attributes such that X↑=Y and Y↓=X where we adopt the usual definition of equality for fuzzy sets (any element has identical membership in both sets) and
$$X^{\uparrow} = \{\, y \in Y \mid \forall x \in X : \mu_R(x, y) \ge \mu_X(x) \,\} \tag{1}$$

$$Y^{\downarrow} = \{\, x/\mu_X(x) \mid \mu_X(x) = \min_{y \in Y} \left( \mu_R(x, y) \right) \,\} \tag{2}$$
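As a minimal sketch of how operators (1) and (2) behave, the following Python code applies them to a small fuzzy relation. The memberships of Movie_1 in Comedy (0.8) and Romance (0.7) and of Movie_2 in Comedy (0.6) follow the examples in the text; every other value, the attribute “Drama”, and the names `R`, `up`, `down` and `is_concept` are assumptions introduced for illustration. In `up`, the candidate attributes are taken from the whole attribute set of the context, and an empty attribute set is given membership 1.0 by convention.

```python
# Fuzzy relation mu_R: object -> attribute -> membership in [0, 1].
# Assumption: only the 0.8, 0.7 and 0.6 entries come from the text.
R = {
    "Movie_1": {"Comedy": 0.8, "Romance": 0.7, "Drama": 0.0},
    "Movie_2": {"Comedy": 0.6, "Romance": 0.0, "Drama": 0.5},
}
ALL_ATTRIBUTES = ("Comedy", "Romance", "Drama")


def up(X):
    """Operator (1): crisp set of attributes y such that
    mu_R(x, y) >= mu_X(x) for every object x in the fuzzy set X."""
    return {
        y for y in ALL_ATTRIBUTES
        if all(R[x][y] >= mu for x, mu in X.items())
    }


def down(Y):
    """Operator (2): fuzzy set of objects in which each membership is
    the minimum of mu_R(x, y) over the attributes y in Y."""
    return {x: min((R[x][y] for y in Y), default=1.0) for x in R}


def is_concept(X, Y):
    """A pair (X, Y) is a fuzzy formal concept when X-up = Y and Y-down = X."""
    return up(X) == set(Y) and down(Y) == X
```

For instance, `down({"Comedy"})` gives the fuzzy extent Movie_1/0.8, Movie_2/0.6, and applying `up` to that extent returns `{"Comedy"}`, so the pair satisfies both conditions.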
Preferably, within each lattice subgroup we keep at least one maximal and at least one minimal node.
The nodes in said subgroup of lattice are within a predetermined threshold distance of one another, the predetermined threshold distance representing a limit on the extent to which data items associated with nodes in the same subgroup of lattice differ.
Thus, the objects represented in each node in a subgroup of lattice are closely related to the objects represented by the remaining nodes in the same subgroup. Effectively, this allows objects that are closely related to be grouped together. This provides an advantage that when a search is carried out on the lattice, the search can be focused on a portion of the lattice rather than the global lattice to allow the search to be conducted more efficiently.
Said subgroup of lattice may be reduced by extracting a minimal node of said subgroup, the minimal node having one or more data items that associate with properties represented by nodes in said subgroup.
Since the minimal node of a subgroup represents the objects that have all the attributes in that subgroup, the remaining nodes in the subgroup can be removed. This results in the data occupying less memory space.
The method may further comprise the step of combining said extracted minimal node of said subgroups of lattice to form the reduced lattice.
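The dividing, extracting and combining steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the text does not define the distance measure, so the largest difference in object membership between two nodes is assumed here; the greedy grouping strategy, the example node values, and the names `Node`, `distance`, `is_below`, `split_into_subgroups`, `minimal_node` and `reduce_lattice` are likewise hypothetical.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Node(NamedTuple):
    intent: FrozenSet[str]     # crisp set of attributes
    extent: Dict[str, float]   # fuzzy set of objects: object -> membership


def distance(a: Node, b: Node) -> float:
    """Assumed metric: largest membership difference over all objects."""
    objs = set(a.extent) | set(b.extent)
    return max(
        (abs(a.extent.get(o, 0.0) - b.extent.get(o, 0.0)) for o in objs),
        default=0.0,
    )


def is_below(a: Node, b: Node) -> bool:
    """True if the fuzzy extent of a is contained in the fuzzy extent of b."""
    return all(mu <= b.extent.get(o, 0.0) for o, mu in a.extent.items())


def split_into_subgroups(nodes: List[Node], threshold: float) -> List[List[Node]]:
    """Greedily group nodes so that all members of a subgroup are
    pairwise within the threshold distance of one another."""
    groups: List[List[Node]] = []
    for node in nodes:
        for group in groups:
            if all(distance(node, member) <= threshold for member in group):
                group.append(node)
                break
        else:
            groups.append([node])
    return groups


def minimal_node(group: List[Node]) -> Node:
    """Return a node of the subgroup that has no other node strictly below it."""
    for cand in group:
        if not any(
            other is not cand and is_below(other, cand) and not is_below(cand, other)
            for other in group
        ):
            return cand
    raise ValueError("empty subgroup")


def reduce_lattice(nodes: List[Node], threshold: float) -> List[Node]:
    """Keep one minimal node per subgroup and combine them."""
    return [minimal_node(g) for g in split_into_subgroups(nodes, threshold)]


# Illustrative nodes (assumed values, not taken from FIG. 5).
A = Node(frozenset({"Comedy"}), {"Movie_1": 0.8, "Movie_2": 0.6})
B = Node(frozenset({"Comedy", "Romance"}), {"Movie_1": 0.7, "Movie_2": 0.5})
C = Node(frozenset({"Drama"}), {"Movie_1": 0.0, "Movie_2": 0.1})
REDUCED = reduce_lattice([A, B, C], threshold=0.2)
```

With a threshold of 0.2, nodes `A` and `B` fall into one subgroup and `C` into another; `B` lies below `A`, so the reduced structure retains `B` and `C`, two nodes in place of three.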
The data items and the associated properties in each node may be, respectively, objects and attributes of a fuzzy set.
The context may be represented as a table in an array of rows and columns, the rows being labelled by objects and the columns labelled by attributes.
The method of the above aspect may further comprise receiving user input including a search query term, searching through said reduced lattice to select at least one node from the reduced lattice based on said search query term and providing an output representing at least one data item that is associated with said selected at least one node.
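One possible way of carrying out such a search is sketched below. The matching rule is an assumption made for illustration, since the text does not specify how a node is selected: here a node is selected when the query term equals one of its attributes, ignoring case, and the associated data items are output in descending order of membership. The node layout, the example values and the names `REDUCED_LATTICE` and `search` are hypothetical.

```python
# Assumed node layout: each node of the reduced lattice is a dict holding a
# crisp intent (attributes) and a fuzzy extent (object -> membership).
REDUCED_LATTICE = [
    {"intent": {"Comedy", "Romance"}, "extent": {"Movie_1": 0.7, "Movie_2": 0.5}},
    {"intent": {"Drama"}, "extent": {"Movie_1": 0.0, "Movie_2": 0.1}},
]


def search(reduced, term):
    """Select nodes whose intent contains the query term and return the
    associated data items as (object, membership) pairs, best match first."""
    term = term.strip().lower()
    selected = [
        node for node in reduced
        if any(term == attr.lower() for attr in node["intent"])
    ]
    results = {}
    for node in selected:
        for obj, mu in node["extent"].items():
            if mu > 0:  # skip objects that do not belong to the node at all
                results[obj] = max(mu, results.get(obj, 0.0))
    # Sort by descending membership, then by name for a stable order.
    return sorted(results.items(), key=lambda item: (-item[1], item[0]))
```

Because only the nodes of the reduced lattice are visited, the cost of the lookup depends on the size of the reduced lattice rather than on the size of the initial lattice.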
In a second aspect of the invention there is provided a method for retrieving at least one data item relating to a search query term, the method comprising receiving user input including a search query term, retrieving data items from a data source, wherein each data item comprises one or more properties associated therewith, determining a context defining relationships between the data items and the associated one or more properties, generating an initial lattice data structure based on said context, wherein said initial lattice comprises a plurality of nodes, a node representing one of said one or more properties, and one or more data items are associated with nodes of said plurality of nodes such that data items that are associated with the same node are associated with the same property, reducing the initial lattice by removing nodes within the initial lattice such that the number of nodes in the reduced lattice is less than the number of nodes in the initial lattice, searching through said reduced lattice to select at least one node from the reduced lattice based on said search query term, and providing an output representing at least one data item that is associated with said selected at least one node.
In a third aspect of the invention there is provided a method for retrieving at least one data item relating to a search query term, the method comprising receiving user input including a search query term, searching through a data source storing data items represented by a reduced lattice data structure generated by removing nodes within an initial lattice data structure such that the number of nodes in the reduced lattice is less than the number of nodes in the initial lattice, selecting at least one node from the reduced lattice based on said search query term, and providing an output comprising at least one data item that is associated with said selected at least one node.
In a fourth aspect of the invention there is provided an apparatus for processing data items in a data source, wherein each of the data items comprises one or more properties associated therewith, the apparatus comprising a processor operable to:
determine a context defining relationships between the data items and the associated one or more properties;
generate an initial lattice data structure based on said context, wherein said initial lattice comprises a plurality of nodes, a node representing one of said one or more properties, and one or more data items are associated with nodes of said plurality of nodes such that data items that are associated with the same node are associated with the same property; and
reduce the initial lattice by removing nodes within the initial lattice such that the number of nodes in the reduced lattice is less than the number of nodes in the initial lattice.
The apparatus may further comprise a user input for receiving user input including a search query term, a searching module for searching through said reduced lattice to select at least one node from the reduced lattice based on said search query term, and a data output operable to output at least one data item that is associated with said selected at least one node.
Preferably, the apparatus stores the full initial lattice as well as the reduced lattice and permits, via the user interface, a user to navigate through the reduced lattice and to expand sub-groups to permit navigation through a sub-group where the sub-group is expanded to be as it is in the initial lattice prior to any reduction having taken place to generate the reduced lattice.
In a fifth aspect of the invention there is provided a system for retrieving at least one data item relating to a search query term, the system comprising a data source for storing data items represented by a reduced lattice data structure generated by removing nodes within an initial lattice data structure such that the number of nodes in the reduced lattice is less than the number of nodes in the initial lattice, a user input device for receiving user input including a search query term, a searching module for searching through said reduced lattice to select at least one node from the reduced lattice based on said search query term, and a data output operable to output at least one data item that is associated with said selected at least one node.
The reduced lattice data structure may comprise a plurality of nodes, a node representing one or more properties associated with one or more said data items, wherein the reduced lattice is reduced from said initial lattice by dividing said initial lattice into a plurality of subgroups of lattice such that a subgroup of lattice comprises a subset of said plurality of nodes, extracting a minimal node of said subgroup, the minimal node having one or more data items that are associated with properties represented by nodes in said subgroup, combining said extracted minimal node of each of said subgroups of lattice to form the reduced lattice.
According to a sixth aspect of the invention there is provided a method for processing data items in a data source, wherein each of the data items comprises one or more properties associated therewith, the method comprising determining a context defining fuzzy relationships between the data items and the associated one or more properties, generating an initial fuzzy lattice data structure based on said context, wherein said initial lattice comprises a plurality of nodes, a node representing one of said one or more properties, and one or more data items are associated with nodes of said plurality of nodes such that data items that are associated with the same node are associated with the same property, and reducing the initial fuzzy lattice by removing nodes within the initial lattice such that the number of nodes in the reduced fuzzy lattice is less than the number of nodes in the initial fuzzy lattice.
The step of reducing the initial fuzzy lattice may further comprise the step of dividing said initial fuzzy lattice into a plurality of fuzzy lattice subgroups such that a fuzzy lattice subgroup comprises a subset of said plurality of nodes, and applying lattice reduction to reduce one or more of said fuzzy lattice subgroups.
One embodiment provides a computer program product comprising computer executable instructions which, when executed by a computer, cause the computer to perform a method as set out above. The computer program product may be embodied in a carrier medium, which may be a storage medium or a signal medium. A storage medium may include optical storage means, or magnetic storage means, or electronic storage means.
The described embodiments can be incorporated into a specific hardware device, a general purpose device configured by suitable software, or a combination of both. Aspects can be embodied in a software product, either as a complete software implementation, or as an add-on component for modification or enhancement of existing software (such as a plug in). Such a software product could be embodied in a carrier medium, such as a storage medium (e.g. an optical disk or a mass storage memory such as a FLASH memory) or a signal medium (such as a download). Specific hardware devices suitable for the embodiment could include an application specific device such as an ASIC, an FPGA or a DSP, or other dedicated functional hardware means. The reader will understand that none of the foregoing discussion of embodiment in software or hardware limits future implementation of the invention on yet to be discovered or defined means of execution.