Web sites for sharing digital media, such as digital music and video files, have become commonplace. The web sites may be accessed to upload music and video files, and to find music and video files to download, listen to or view, as authorized. Often, users of such web sites provide descriptive terms for the music and video files when uploading the files to the web sites. The descriptive terms can facilitate appropriate categorization and storage of the uploaded files, for example.
There may be no particular restrictions placed on the descriptive terms, and so the descriptive terms can be as varied as the users who think them up. For example, one user might describe an uploaded music or video file in terms of a full title and artist name, while another user might describe an uploaded file in terms of a partial title and a venue where a live performance took place, and so on. Because the descriptive terms can be free-form and may not occur in any particular pattern, the descriptive terms may be considered “unstructured.”
In contrast, databases that store media files are typically highly organized and structured, to facilitate efficient storing, searching and retrieval of data. Thus, a database might organize music or video files according to fixed, pre-defined patterns. For example, a particular database might use a particular indexing system or systems with strict rules about how files are to be described. Such a database may be thought of as “structured.”
Correctly identifying newly-uploaded files can assist in storing the files within a structured database in such a way that efficiencies in database space utilization, search and data retrieval can be realized. For example, accurately matching unstructured descriptive terms corresponding to a newly-uploaded file to organizing information or a predefined field in a structured database can open up possibilities for clustering files that all match the same reference. This can make database navigation and data retrieval more efficient, and can promote more efficient content identification.
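By way of illustration only, the following minimal sketch shows one way unstructured descriptive terms might be matched to structured reference records and the uploads then clustered by matched reference. It is not a method taken from this description: the reference records, the names `REFERENCES`, `tokenize`, `best_reference` and `cluster_uploads`, the token-overlap (Jaccard) score and the threshold value are all assumptions chosen for the example.

```python
import re
from collections import defaultdict

# Hypothetical structured reference records: reference id -> fixed fields.
REFERENCES = {
    "ref-001": {"title": "Blue Sky Morning", "artist": "The Example Band"},
    "ref-002": {"title": "Night Train", "artist": "Sample Quartet"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_reference(description, references=REFERENCES, threshold=0.25):
    """Return the id of the best-matching reference, or None if none is close.

    The score is the Jaccard overlap between the tokens of the free-form
    description and the tokens of the reference's fields. The threshold
    is an arbitrary value picked for this example.
    """
    desc_tokens = tokenize(description)
    best_id, best_score = None, 0.0
    for ref_id, fields in references.items():
        ref_tokens = tokenize(" ".join(fields.values()))
        union = desc_tokens | ref_tokens
        score = len(desc_tokens & ref_tokens) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = ref_id, score
    return best_id if best_score >= threshold else None


def cluster_uploads(descriptions, references=REFERENCES):
    """Group free-form descriptions by the reference each one matches.

    Descriptions that match no reference are grouped under the key None.
    """
    clusters = defaultdict(list)
    for description in descriptions:
        clusters[best_reference(description, references)].append(description)
    return dict(clusters)


if __name__ == "__main__":
    uploads = [
        "Blue Sky Morning - The Example Band",   # full title and artist
        "blue sky morning live at city hall",    # partial title and venue
        "Night Train (Sample Quartet)",
        "my cat video",
    ]
    for ref_id, group in cluster_uploads(uploads).items():
        print(ref_id, group)
```

In this sketch, a description giving a full title and artist and a description giving a title and a venue both resolve to the same hypothetical reference, so the two uploads fall into one cluster. A practical system would likely need a more robust similarity measure than simple token overlap.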