With the continued proliferation of content items available on the Internet, techniques that sort or otherwise categorize content are increasingly beneficial and increasingly relied upon. One technique for classifying content items relies on user-generated or user-supplied content, such as tag information that users enter for existing data sets or content items.
A classic example is an online photograph database that includes voluminous amounts of user-posted photographs, such as the web site “www.flickr.com.” Users can enter tag data relating to a given photograph, for example an event associated with the photograph (e.g., Fourth of July Parade, Christmas, etc.), the location of the photograph (e.g., San Francisco, Golden Gate Bridge, Fisherman's Wharf, etc.), or any other suitable information that describes the photograph, which is but one example of a content item.
Through the posting of content items on the Internet, there are vast amounts of content structured in accordance with various disparate formats. It can be extremely beneficial to provide techniques for categorizing or otherwise organizing this information. Using the example of tag data, one current technique is to make the tag data available as searchable or otherwise categorical content, such as in the form of metadata that describes a given content item. In this example, a person may perform keyword searching of metadata. This technique, however, is highly restrictive and fails to provide any type of large scale content recognition with the overall data set. For example, a static search may only return items having the exact search terms associated with the content item, while overlooking potentially vast amounts of relevant data.
The current challenge facing tagging systems is to extract structured information from the unstructured, user specified tags. Unlike category or ontology based systems, tags result in unstructured knowledge having no a-priori semantics. This unstructured nature, however, allows greater user-flexibility in entering tag data, as well as allowing data to naturally evolve to reflect emergent properties of the data. Despite the lack of ontology and semantics, patterns and trends can emerge that allow for the extraction of some amount of structured information from tag-based systems. One technique is tag-based determination over spatial and temporal patterns, such as geo-referenced or geo-tagged content items.
There are existing techniques relating to the extraction of patterns or trends from tag data associated with photographic collections. These techniques provide for recognizing semantic information, but are limited to a single collection of images, or stated another way, are based on a single camera. Some techniques include time-based event detection, while other techniques may include GPS data to assist in geographic information, but again are limited to single data sets. Related techniques are also recognized in the field of GeoIR, such as attempting to extract geographic information from content items on the basis of links to or from the content item, network properties, and geographic terms on the site itself that hosts the content item.
Existing data sets may include two basic elements, the data itself (content item) and tag data, such as the example of a photograph and tag data associated with the photograph. For ease of understanding, the following techniques are described relative to photographs, but it is understood that these techniques are equally applicable to any suitable type of content item having metadata associated therewith. A given geotagged photo has, in addition to other metadata, location data (“lp”) and time data (“tp”) associated therewith. Tags associated with the photos are a second element type in the dataset, using the variable x to denote a tag, where a given photograph can have multiple tags associated therewith and a given tag can be associated with more than one photo.
Based on the locations and times associated with photographs, the location and time usage distributions for a given tag x are:
$$L_x \overset{\Delta}{=} \{\, l_p \mid p \text{ is associated with } x \,\} \qquad \text{(Equation 1)}$$

$$T_x \overset{\Delta}{=} \{\, t_p \mid p \text{ is associated with } x \,\} \qquad \text{(Equation 2)}$$

With this information, the existing techniques can attempt to derive time and place semantics from the tag location Lx and time Tx usage distributions.
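As an illustration of Equations 1 and 2, the following minimal Python sketch collects the location and time usage distributions for every tag in a small collection. The `Photo` record, the `tag_distributions` function, and the sample coordinates and timestamps are hypothetical names and values introduced only for this example. Lists are used instead of sets, on the assumption that repeated values should be retained so that usage counts can be derived later.

```python
from collections import defaultdict
from typing import NamedTuple


class Photo(NamedTuple):
    """Hypothetical record for one geotagged photo."""
    location: tuple[float, float]  # l_p as (latitude, longitude)
    time: float                    # t_p as a numeric timestamp
    tags: frozenset[str]           # tags x associated with the photo


def tag_distributions(photos):
    """Return (L, T): L[x] lists the locations and T[x] the times of photos tagged x."""
    locations = defaultdict(list)
    times = defaultdict(list)
    for photo in photos:
        for tag in photo.tags:
            locations[tag].append(photo.location)
            times[tag].append(photo.time)
    return dict(locations), dict(times)


# Illustrative data only; the values are made up for this sketch.
photos = [
    Photo((37.8199, -122.4783), 1183593600.0,
          frozenset({"goldengatebridge", "fourthofjuly"})),
    Photo((37.8080, -122.4177), 1183597200.0,
          frozenset({"fishermanswharf", "fourthofjuly"})),
]
L, T = tag_distributions(photos)
```

Here `L["fourthofjuly"]` holds both photo locations, while `L["goldengatebridge"]` holds only the first.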
Two different types of scan methods are known to determine semantic information, both with associated shortcomings. A first technique is referred to as the naïve scan method, which uses standard burst detection methods utilized in signal processing. This method computes the frequency of usage for each time segment at each scale and identifies a burst when the frequency of data in a single time segment is larger than the average frequency of data over all segments, plus a multiplier applied to the standard deviation of segment frequencies, e.g., two times the standard deviation.
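The burst test described above can be sketched for a single scale as follows. This is a minimal illustration, not a definitive implementation: the function names, the use of the population standard deviation, and the sample times are assumptions made for the example. A multi-scale version would repeat the same test with a different segment length per scale.

```python
from statistics import mean, pstdev


def segment_counts(times, start, segment_length, num_segments):
    """Count tag occurrences in each fixed-length time segment."""
    counts = [0] * num_segments
    for t in times:
        index = int((t - start) // segment_length)
        if 0 <= index < num_segments:
            counts[index] += 1
    return counts


def naive_bursts(counts, multiplier=2.0):
    """Return indices of segments whose count exceeds mean + multiplier * std."""
    threshold = mean(counts) + multiplier * pstdev(counts)
    return [i for i, count in enumerate(counts) if count > threshold]


# Illustrative usage times for one tag (arbitrary units).
times = [0.5, 3.1, 3.2, 3.4, 3.8, 7.9]
counts = segment_counts(times, start=0.0, segment_length=1.0, num_segments=10)
bursts = naive_bursts(counts)
```

With these sample times, only the segment covering the four clustered occurrences exceeds the threshold.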
One problem with the naïve scan method is that tags may have sparse usage distributions, which results in low average frequencies and low standard deviations. Therefore, this method suffers from too many false positives. One solution is to compute the average and standard deviation values from aggregate data and relax the condition that the number of tag occurrences be larger than the mean plus a multiplier applied to the standard deviation. A partial computation for each tag may be defined by:
$$\frac{T_r(x, i)}{\mu_N + 2\,\sigma_N} \qquad \text{(Equation 3)}$$

where $\mu_N$ is the mean of $\{N_r(i) \mid i = 1 \ldots\}$ and $\sigma_N$ is the standard deviation of $\{N_r(i) \mid i = 1 \ldots\}$. Using this technique, segments of time corresponding to an event are identified by simply recording the segments that pass a significance test. The significance test includes aggregating the partial computation statistics for each time segment at each scale to determine whether a given tag denotes an event.
An alternative approach, referred to as Naïve Scan II, compares the individual tag occurrences to the total number of tag occurrences, instead of the number of photograph occurrences. This technique is based on the assumption that if tag x captures the important aspects of a photo, then that photo will require few tags in addition to tag x. This partial computation may be defined by:
$$\frac{T_r(x, i)}{\mu_T + 2\,\sigma_T} \qquad \text{(Equation 4)}$$

where $\mu_T$ is the mean of $\{N_r(i) \mid i = 1 \ldots\}$ and $\sigma_T$ is the standard deviation of $\{N_r(i) \mid i = 1 \ldots\}$.
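The two partial computations differ only in the aggregate series used for the mean and standard deviation. The sketch below makes that difference concrete under stated assumptions: the per-segment photo count is taken as the aggregate for Equation 3, the per-segment total of all tag occurrences is taken as the aggregate for Equation 4 (following the prose description of Naïve Scan II), and the population standard deviation is used. All function names and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def per_segment_counts(photos, tag, num_segments):
    """photos: iterable of (segment_index, tags) pairs for one scale."""
    tag_counts = [0] * num_segments      # occurrences of the given tag per segment
    photo_counts = [0] * num_segments    # photos per segment (assumed aggregate, Eq. 3)
    all_tag_counts = [0] * num_segments  # all tag occurrences per segment (assumed, Eq. 4)
    for segment, tags in photos:
        photo_counts[segment] += 1
        all_tag_counts[segment] += len(tags)
        if tag in tags:
            tag_counts[segment] += 1
    return tag_counts, photo_counts, all_tag_counts


def partial_scores(tag_counts, aggregate_counts, multiplier=2.0):
    """Divide each per-segment tag count by mean + multiplier * std of the aggregate."""
    denominator = mean(aggregate_counts) + multiplier * pstdev(aggregate_counts)
    if denominator == 0:
        raise ValueError("aggregate counts are all zero")
    return [count / denominator for count in tag_counts]


# Illustrative data only: (segment index, tags on one photo).
photos = [
    (0, {"beach"}),
    (1, {"parade", "fourthofjuly"}),
    (1, {"parade"}),
    (1, {"parade", "flags", "crowd"}),
    (2, {"sunset", "beach"}),
]
tag_counts, photo_counts, all_tag_counts = per_segment_counts(photos, "parade", 3)
scan_one = partial_scores(tag_counts, photo_counts)    # photo-based normalization
scan_two = partial_scores(tag_counts, all_tag_counts)  # tag-based normalization
```

In this sample the tag-based denominator is larger because the photos in the busy segment carry several tags each, so the same tag count yields a lower score under the second normalization.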
Another technique is referred to as the spatial scan method, which includes a standard application of the spatial scan statistic technique used in epidemiology. This technique is a burst detection method and assumes an underlying probability model of observing some phenomenon over some domain. The method then tests whether the number of occurrences of a phenomenon is abnormal relative to the underlying probability model.
The spatial scan statistic technique, like the naïve scan methods, suffers from the basic problem of operating on predefined time segments. That is, these methods fix the segments of time for each scale independently of the actual usage distribution of the tags. More specifically, these methods consider only the a-priori time segments as candidate event times and may hide the actual time of an event by splitting its usage occurrences across adjacent segments, none of which rises above the defined threshold.
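The splitting problem can be reproduced with a small, made-up example. The sketch below applies a mean-plus-two-standard-deviations test on a fixed grid of segments. The function name, the sample times, and the shifted grid used for comparison are all assumptions introduced for illustration; the shifted grid is not a proposed remedy, only a way to show that the result depends on where the a-priori boundaries fall.

```python
from statistics import mean, pstdev


def fixed_grid_bursts(times, start, segment_length, num_segments, multiplier=2.0):
    """Bucket times into fixed segments and return indices passing the burst test."""
    counts = [0] * num_segments
    for t in times:
        index = int((t - start) // segment_length)
        if 0 <= index < num_segments:
            counts[index] += 1
    threshold = mean(counts) + multiplier * pstdev(counts)
    return [i for i, count in enumerate(counts) if count > threshold]


# Four occurrences cluster around t = 50, straddling the boundary between
# the segments [40, 50) and [50, 60); two background occurrences lie elsewhere.
times = [2, 47, 48, 52, 53, 75]

# Grid starting at 0: the cluster is split 2 + 2 and no segment passes the test.
aligned = fixed_grid_bursts(times, start=0, segment_length=10, num_segments=10)

# Same segment length, grid shifted by half a segment: the cluster falls
# into the single segment [45, 55) and is detected.
shifted = fixed_grid_bursts(times, start=-5, segment_length=10, num_segments=11)
```

The same occurrences are reported as a burst or missed entirely depending only on the placement of the segment boundaries.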
Therefore, there exists a need for a technique for extracting semantic information from tag data that can depend on or account for multiple scales, which may include accounting for the multiple scales in a simultaneous or substantially simultaneous fashion.