1. Field of the Invention
The present invention relates generally to extracting information from a block of text, and more particularly to extracting artist names and song titles.
2. Related Art
Music service providers that stream content to their customers have become a major component of the music industry. The streamed content often includes descriptive material about the artists and songs, such as biographical information and current events information. For this material to stay current, relevant information must be continually acquired.
Such information, however, can come from a variety of sources. For example, a local or national news organization may choose to run a story on a particular artist or song. This commonly occurs when an artist plays in a city or town covered by the news organization. With the advent of the Internet, these stories are commonly published online. In addition, some news media organizations, such as VH1, MTV, and Rolling Stone, are dedicated to the music industry and also provide coverage of artists and songs.
These traditional news providers, however, are not the only sources of relevant information on artists and songs. In fact, the growing use of social media has dramatically increased the number of potential sources of information. For example, concert-goers can provide commentary via blogs, feeds (e.g., Twitter feeds), posts (e.g., Facebook or Google+ posts), and other social media venues. Often, this information is available long before a traditional news provider provides any information about the song, artist, or related events. In addition, the pervasive use of smartphones for instant access to the Internet and social media has greatly increased the number of sources and correspondingly increased the amount of information available. While almost all of this information is available over the Internet, it is in a highly decentralized form, which creates an obstacle to efficient retrieval and analysis.
Relevant information may also be combined with other information which is not related to the artist or song. For example, the average social media page, such as a Facebook page, contains only a small amount of information, if any, relating to artists or songs. A Twitter feed may contain only a few tweets relating to an artist or song. A web log may contain only one post directed to an artist or song out of hundreds of posts.
Automated recovery of information on artists and songs from the Internet can therefore be advantageous. One significant technical challenge to accomplishing this is recognizing that a particular set of data refers to an artist or song. Many common words in the English language also correspond to an artist's name, for example, the band “Queen.” Thus, a system which can distinguish between common English words and named entities is advantageous. Furthermore, webpages can be in any language. Thus, a system which can identify an artist or song name regardless of the language the webpage is written in is also desirable. Still a further technical complication is that artists and songs often have aliases or abbreviations which are used instead of their formal or legal names. For example, Dave Matthews Band may be referred to as either “Dave Matthews” or “DMB.” Thus, recognizing aliases and abbreviations is also advantageous. In addition, artist and song names are often misspelled. The information that is being reported may nonetheless be relevant, so it is also advantageous to be able to recognize misspellings of artist or song names.
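As an illustration only of the alias and misspelling problems described above, the following minimal Python sketch resolves a candidate string to a canonical artist name. It is not the method of the invention. The catalog contents, the function names build_alias_index and resolve_artist, and the 0.8 similarity cutoff are all assumptions made for this example, and the approximate matching relies on the Python standard library's difflib module.

```python
import difflib

# Hypothetical catalog: canonical artist name -> known aliases and abbreviations.
CATALOG = {
    "Dave Matthews Band": ["Dave Matthews", "DMB"],
    "Queen": [],
}


def build_alias_index(catalog):
    """Map every case-folded name and alias to its canonical artist name."""
    index = {}
    for canonical, aliases in catalog.items():
        for name in [canonical, *aliases]:
            index[name.casefold()] = canonical
    return index


def resolve_artist(candidate, index, cutoff=0.8):
    """Return the canonical name for a candidate string, or None if no match.

    An exact (case-insensitive) alias lookup is tried first. If that fails,
    the closest known name above the assumed similarity cutoff is used, which
    tolerates small misspellings.
    """
    key = candidate.strip().casefold()
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


INDEX = build_alias_index(CATALOG)
```

In this sketch, resolve_artist("DMB", INDEX) and resolve_artist("Dave Mathews Band", INDEX) both resolve to "Dave Matthews Band", while a name that is not in the catalog resolves to None. A lookup of this kind does not by itself tell the band “Queen” apart from the ordinary word “queen”; that distinction needs surrounding context, which the sketch does not attempt.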