The present invention is directed to systems, apparatus and methods for correcting data or resolving ambiguities in data, and more specifically, to correcting metadata used to describe content that is submitted by a relevant community or network of users.
The growth of the Internet and the availability of high speed communications networks have contributed to a growing interest among users in the creation and sharing of digital content. This content may take the form of images, books, videos, or sound files (such as music), for example. Some of the content may be created by a user and shared with others in a network of users, while some content may be commercially available and uploaded to the Internet for access and distribution to others via a downloading process. In either case, the content is typically described using metadata such as title, artist or creator, source, version, description of content, etc. A person seeking the content typically enters examples of such metadata as keywords into a search engine and then examines the results of the search to find the content they desire.
The metadata used to characterize the content typically takes the form of one or more strings of characters which form a data n-tuple, where the characters may include letters, numbers, and standard typographic symbols (e.g., #, &, -). A problem may arise because, while the categories of metadata may be standardized (e.g., title and artist for music content), some or all of the metadata used to describe the content may be ambiguous, contain spelling errors, or represent one of several generally accepted or understood ways of describing the same information. For example, even a well-known musical group such as “The Beatles” may be entered as an artist name in multiple forms: “Beatles”, “Beetles”, “Beatles, The”, etc. While this example is rather simple, it illustrates that there are multiple possible forms that metadata may take for even well-known and familiar categories of metadata. Although this example involves only a recognizable spelling error and a limited amount of ambiguity as to what is meant, it may still cause problems for certain types of consumer applications. Such applications include, for example, those whose reliability or utility to a user is highly dependent upon exact matching of metadata search terms.
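By way of illustration only, the following minimal sketch shows how an application that depends on exact matching may miss content whose artist field is entered in one of the variant forms noted above. The `TrackMetadata` tuple, the `exact_artist_search` function, and the sample library entries are hypothetical and are not taken from any particular product.

```python
from typing import List, NamedTuple


class TrackMetadata(NamedTuple):
    """A hypothetical two-field metadata n-tuple (title, artist)."""
    title: str
    artist: str


# Illustrative library: the same song described by four users,
# each of whom entered the artist name in a different form.
library = [
    TrackMetadata("Hey Jude", "The Beatles"),
    TrackMetadata("Hey Jude", "Beatles"),
    TrackMetadata("Hey Jude", "Beetles"),
    TrackMetadata("Hey Jude", "Beatles, The"),
]


def exact_artist_search(entries: List[TrackMetadata], artist: str) -> List[TrackMetadata]:
    """Return only the entries whose artist string matches exactly."""
    return [entry for entry in entries if entry.artist == artist]


hits = exact_artist_search(library, "The Beatles")
print(f"{len(hits)} of {len(library)} entries matched")
```

In this sketch only one of the four entries is returned, even though all four are intended to describe the same recording.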
One of the problems created by variations in the metadata used to describe content is that of efficiently enabling the search, retrieval, and distribution of digital content such as images, videos, books, and music when that content is described by potentially incorrect, misspelled, or ambiguous metadata. This problem is made even greater by the growth and use of social networks for sharing content. This is because such networks enable users to search content in the content libraries of multiple users, where each such user may have contributed incorrect, misspelled or ambiguous metadata to the description of that content. As a result, as the number of users increases, both the potential variations in the way that content is characterized and the resulting metadata errors are likely to increase.
Thus, from one perspective, “dirty metadata” is or may become a problem in many consumer internet multimedia applications, where in the context of the present invention, “dirty metadata” includes, but is not limited to, incorrect, misspelled, ambiguous, or simply confusing data that is meant to characterize or describe some aspect of content or other data. As discussed, in music-related services, misspelled, ambiguous, or missing metadata information may cause confusion about the identity of an artist, song or album that is being searched for or recommended. In video or image related searches, finding a desired video or image is made more difficult in the situation where multiple videos or images are returned as a match to the search parameters (where the search parameters are typically used to identify the subject matter or characteristics of the desired content, or as data used to otherwise describe the image or video). In this type of search (as well as others) a user would prefer to reduce the number of unique “hits” that result from the search by having incorrect metadata corrected so that search results are more relevant and not duplicative, and metadata more accurately reflects the actual content.
As a further example, most consumers with digital music libraries (e.g., iTunes™ distributed by Apple™) have “dirty” metadata in their music libraries—misspelled artist names or track names, missing or incorrect track, album, or genre information, etc. The flawed metadata can cause problems when trying to manage one's music, for example in finding a favorite song whose name is misspelled in the library, creating a playlist (manually or automatically) when the tracks in the playlist aren't labeled properly, or trying to receive recommendations for new music from other members of a network when the original music isn't properly labeled or identified (or at least identified in an unambiguous manner).
As noted, these problems are particularly troublesome in social networks or Internet services involving posting or sharing of multimedia content, because such networks combine the dirty metadata of multiple users in one place. This means that not only the volume of incorrect metadata is greater, but also the number of possible variations in that metadata. As an example, on the video service YouTube (now part of Google™), a popular video may be uploaded to the YouTube service many times by different consumers, and a single search for the video may yield multiple, possibly equivalent results. This creates a burden on the user to determine which video is the best to watch to satisfy their interests, or to otherwise filter the search results. Within the music space, there are numerous potential social networking applications or services that add value for users but that are either impossible to implement or cannot be implemented in an optimal or desirable way without proper metadata. For example, sharing music playlists between two users is only realistic and desirable if the person sharing the playlist has proper metadata (to describe the music in the playlist), and the person receiving the playlist has similarly proper metadata (to know which songs from the playlist he/she already owns).
Although the problem of using incorrect or ambiguous metadata to describe content has been recognized, at present there is not a satisfactory solution to the problem. One possible solution that has been suggested is to compare the “dirty” metadata to a “known good database” of metadata, and then to correct misspellings based on the accepted “known good” spellings. In this regard, the most widely used application/service that performs this correction based on accepted good metadata is the Windows Media Player (WMP) from Microsoft™, coupled with a service known as the Windows Media Internet Service (WMIS) metadata service. This combination can be used to correct metadata because the Windows Media Player has the capability of looking at every media item in a consumer's digital library. The WMP then transfers the relevant metadata to the WMIS service, where a server-side spellchecking algorithm compares that metadata to a large database of “known good metadata”. The output of the service is properly-spelled metadata which is returned to the user and used to correct misspellings in the user's media library. For example, a consumer may have a digitized song labeled with the artist name “Brittany Spears”; in this case the WMP will send the artist name to the WMIS service, which identifies it as a misspelling by comparing it to a database which provides the proper spelling, i.e., “Britney Spears”.
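By way of illustration only, the comparison of “dirty” metadata against a “known good database” may be sketched as follows. The actual server-side spellchecking algorithm of the WMIS service is not described here; this sketch substitutes the approximate string matching of the Python standard library (`difflib.get_close_matches`) as a stand-in. The `KNOWN_GOOD_ARTISTS` list, the `correct_artist` function, and the 0.8 similarity cutoff are assumptions made for the example.

```python
import difflib
from typing import Sequence

# Hypothetical stand-in for a "known good database" of artist names.
KNOWN_GOOD_ARTISTS = ["Britney Spears", "The Beatles", "Beyoncé"]


def correct_artist(
    name: str,
    known_good: Sequence[str] = KNOWN_GOOD_ARTISTS,
    cutoff: float = 0.8,
) -> str:
    """Return the closest known-good spelling, or the input unchanged.

    The cutoff is an assumed similarity threshold; names with no
    sufficiently similar known-good entry are left as submitted.
    """
    matches = difflib.get_close_matches(name, known_good, n=1, cutoff=cutoff)
    return matches[0] if matches else name


print(correct_artist("Brittany Spears"))
print(correct_artist("Some Unknown Band"))
```

Such a sketch can correct a name only when a sufficiently similar entry is already present in the known-good list, so its usefulness depends entirely on the completeness and accuracy of that list.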
However, as recognized by the inventors of the present invention, there are at least two primary problems with such a solution to the dirty or incorrect metadata problem:
(1) It is very difficult, perhaps even unrealistic or impossible, to build a comprehensive, up-to-date, and correct “known good database” of metadata. This is at least partly because building such a database requires licensing data from multiple providers: music labels, music publishers, and 3rd party data aggregators. Each of these sub-providers may have errors in their own data, and they may provide different, conflicting spellings for the same media or content. Furthermore, keeping a “known good” database up-to-date in a timely fashion requires regular (if not continuous) freshening of data, again in reliance on multiple 3rd parties. Thus, building such a database and keeping the data fresh in a scalable fashion in the growing world of multimedia is very difficult and may be an unrealistic goal; and
(2) In the world of music, movies, and video, a given author or work may have multiple, commonly known spellings or versions of the artist's name or title of a particular work. For example, “Beatles”, “The Beatles”, “beatles”, and “Beatles, The” are all common ways to attribute music to the band most properly called “The Beatles”. Similarly, the artist “Beyoncé” is often spelled as “Beyonce” without the accent, and the group “Mötley Crüe” is often spelled as “Motley Crue”, without the umlaut. The use of multiple forms for the metadata describing the same underlying data can create further uncertainty as to what information or content is being referenced, and thereby complicate or in some situations frustrate a user's ability to find, retrieve, or share content.
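By way of illustration only, the variant forms listed in item (2) differ in letter case, placement of the leading article, and diacritical marks. The following minimal sketch reduces each form to a common comparison key to make those differences concrete. The `comparison_key` function and its rules are hypothetical, are not presented as the method of the present invention, and would not handle misspellings such as “Beetles”.

```python
import unicodedata


def comparison_key(name: str) -> str:
    """Reduce an artist name to a key that ignores case, accents,
    and the position of a leading or trailing article "The"."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = stripped.casefold().strip()
    # Treat "Beatles, The" and "The Beatles" as the same name.
    if key.endswith(", the"):
        key = key[: -len(", the")]
    elif key.startswith("the "):
        key = key[len("the "):]
    return key


variants = ["Beatles", "The Beatles", "beatles", "Beatles, The"]
print({comparison_key(v) for v in variants})
print(comparison_key("Beyoncé") == comparison_key("Beyonce"))
print(comparison_key("Mötley Crüe") == comparison_key("Motley Crue"))
```

A fixed rule set of this kind addresses only the variations it anticipates; any form outside those rules still produces a distinct key.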
What is desired is a system, apparatus and method for correcting metadata used to characterize content that is submitted by a community of users, where such system, apparatus, and method overcome the noted disadvantages of existing approaches.