The Internet, World Wide Web, and other types of data networks may be used to find information. Specific information is typically sought using these sources by conducting a search. Searches are conducted for various reasons such as research, education, personal interest, rights management, and others. However, while a large amount of information is available from various sources and services on these networks, the approach used by search service providers and the amount of data (either raw or returned in searches) render conventional search techniques problematic with regard to accuracy, efficiency, and latency.
Conventional search techniques are problematic because information is identified and found by analyzing text associated with a file. “File” may refer to a physical or logical grouping of data and as such, the file may or may not exist physically. Files may also refer to directory structures or data. A file can have text associated with it such as a reference on a web page (e.g., link, in-line image, and the like), metadata attached to the file, or another resource with text in proximity to or associated with the file reference. If a search is performed using keywords that correspond to the associated text of the file, then the file or file location is delivered as a search result.
This conventional approach is used when searching for files (such as an image file) on the Internet. The service provider's search engine has no knowledge of the contents of the file searched for. Instead, numerous results are returned based on text associated with the file, with the intent of returning files that accurately match a search request. However, the file itself is neither analyzed nor checked to ensure that it matches a user's desired search.
For example, if an intellectual property rights management organization (e.g., law firm, agency) is determining whether a particular image of a popular singer such as Madonna has been copied illegally, the organization may use a conventional search engine to search a network such as the Internet for the image in question. Conventional techniques typically associate the word “Madonna” with an image file. If text is found, automatic search solutions then attempt to analyze the text to determine whether the text indicates the image is similar to the image being sought. The analysis of text associated with a file (image or otherwise) is neither accurate nor efficient. With each search result returned, a user must download the file in its entirety and manually evaluate the file. In the example cited, this approach forces the user to wade through thousands of pictures of other Madonnas such as the biblical Mary. When images of the pop singer Madonna are found, the image files often require additional manual review to determine which image files match a protected image of the popular singer. If a match is determined, then the image is identified as a copy and rights may be enforced. However, there may be additional copies of the protected image online, and if the indicated text is not found associated with those files, then a match cannot be determined and rights may not be enforced.
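The limitation in the example above can be illustrated with a minimal sketch. It assumes a hypothetical index of file references, each carrying only its associated text; the `FileReference` type, the `keyword_search` function, and the example.com locations are illustrative assumptions and do not describe any particular search engine.

```python
from dataclasses import dataclass


@dataclass
class FileReference:
    """Hypothetical index entry: a file location plus the text found near it."""
    location: str
    associated_text: str


def keyword_search(index, query):
    """Return locations whose associated text contains every query term.

    The file contents are never opened or inspected; only the text matters.
    """
    terms = query.lower().split()
    matches = []
    for ref in index:
        words = ref.associated_text.lower().split()
        if all(term in words for term in terms):
            matches.append(ref.location)
    return matches


# Hypothetical index. Suppose c.jpg is a copy of the protected image that
# happens to be posted without any mention of the singer's name.
index = [
    FileReference("http://example.com/a.jpg", "Madonna in concert"),
    FileReference("http://example.com/b.jpg", "Renaissance Madonna and child painting"),
    FileReference("http://example.com/c.jpg", "holiday photo"),
]

results = keyword_search(index, "madonna")
```

Under these assumptions, `results` contains `a.jpg` and `b.jpg`. The painting is returned although it is unrelated to the singer, and `c.jpg` is never returned because no matching text accompanies it, so every returned file would still have to be downloaded and reviewed manually.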
In another example, a company may be trying to determine if its computer program is being distributed illegally on a network. Leveraging conventional solutions, the company would search based on text possibly associated with the computer program (e.g., “Get ABC's computer program here for free”). Once again, the files returned in the search are neither analyzed nor checked by the search engine to ensure that they match a user's desired search. There may be copies of the computer program that are never returned in the search results because the copies are not associated with text or because the associated text does not match the search request. For returned search results, manual review of a large amount of data is again required to determine if the files found in a search match the proprietary computer program.
Further, conventional solutions that identify files based on content are inefficient for all but comparatively small file sizes (e.g., HTML text, extremely small programs, pictures, or data files) because downloading larger files (e.g., picture files, music files, movie files, executables, and others) requires prohibitive amounts of bandwidth, data storage space, and processing power, which can be expensive and difficult to scale for implementation. Even if the required resources were obtained, the systems on the other side of the network providing the data would quickly become overloaded and may also exceed their allotted data transfer limits. Conventional solutions are also inefficient because analysis of the complete file is required, thus requiring large data storage facilities (e.g., data warehouses, arrays, and the like) and prohibitive amounts of processing power.
Conventional hashing algorithms or “hashing” techniques use an algorithm to generate a hash value that is intended to be unique for a file. However, this technique is problematic for the reasons discussed above and because conventional solutions must first process an entire file to assign a hash value to the file. Each file in the search results must likewise be processed completely in order to generate a comparable hash value. If the hash values are the same, the files are determined to match. However, using conventional techniques, the same hash value could be calculated for two different files (i.e., collisions may occur), leading to error-prone results. Other conventional hashing solutions require pre-processing of the entire data file, which demands large amounts of storage, processor capability, and bandwidth and is unduly burdensome, slow, and expensive. Conventional solutions are therefore inefficient, inaccurate, labor- and time-intensive, and expensive.
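The whole-file hashing approach described above can be sketched as follows. This is a minimal illustration only: the choice of SHA-256, the `whole_file_digest` function, and the in-memory byte strings standing in for files are assumptions made for the example and are not taken from any particular conventional system.

```python
import hashlib
import io


def whole_file_digest(stream, chunk_size=65536):
    """Hash every byte of a stream and report how many bytes were consumed.

    SHA-256 is an assumed stand-in for a conventional hashing algorithm.
    No digest is available until the final byte has been read.
    """
    hasher = hashlib.sha256()
    bytes_read = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
        bytes_read += len(chunk)
    return hasher.hexdigest(), bytes_read


# In-memory stand-ins for the file being sought and two search results.
sought = b"protected image bytes" * 1000
candidate_copy = bytes(sought)
candidate_other = b"unrelated image bytes" * 1000

sought_digest, sought_read = whole_file_digest(io.BytesIO(sought))
copy_digest, copy_read = whole_file_digest(io.BytesIO(candidate_copy))
other_digest, other_read = whole_file_digest(io.BytesIO(candidate_other))

is_copy_match = sought_digest == copy_digest
is_other_match = sought_digest == other_digest
```

In this sketch the copy matches and the unrelated file does not, but `copy_read` and `other_read` both equal the full length of their inputs: each candidate had to be transferred and processed in its entirety before any comparison could be made, which is the cost the passage above describes.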
Thus, what is needed is a solution for searching for data without the limitations of conventional techniques.