Field of the Invention
The present invention relates to a system and method for searching large amounts of data, and more particularly to searching very large data files using the Pigeonhole Principle with fuzzy criteria.
Background of the Related Art
Big Data is the term used for the exponential growth of data on the Internet. The importance of Big Data lies not in how large it is, but in the information that can be obtained by analyzing it. Such analysis helps businesses make smarter decisions and reduce time and cost. To perform such analysis, it is necessary to search the large files that make up Big Data. At this scale, sequential search is prohibitively inefficient in terms of time and energy. Therefore, any new technique that allows very efficient search in very large files is in high demand.
Information Retrieval (IR) systems exist in many applications that require obtaining information resources related to specified information. Web search engines such as Google and Yahoo are the best-known Information Retrieval applications. These information retrieval systems require fast and efficient searching and processing, which presents a challenge when data management systems handle large collections of information items. This issue becomes extremely important with the availability of huge amounts of data through the Internet, that is, “Big Data”.
An interesting issue with the interactive techniques of most existing IR applications is that they support only substring matching and do not consider approximate searching. Approximate searching of information items in very large data files is therefore an even more challenging Computer Science problem. Usually, the solution to this problem relies on a brute-force approach, which results in a sequential look-up of the file. In many cases, this substantially undermines system performance and consumes a lot of time and energy. The good news is that sequential processes can be easily parallelized; however, in a very large information system, this would be a very expensive solution. Therefore, a fast algorithm that solves this problem is in high demand.
The problem of “approximate” pattern matching is well studied and has received a lot of attention, in view of the fact that several applications require approximate rather than exact matching of the pattern [3]. Typical applications include information retrieval, pattern recognition, computational biology and others [4]. “Therefore, a fast algorithm for approximate pattern matching is highly demanded” [5]. One paper [6] indicates the importance of approximate matching. In this kind of matching, the goal is to find the closest solution, which depends on the type of errors considered. “Mismatch is the one of the most common errors and the number of mismatches between two equal length strings is called the Hamming distance. Approximate pattern matching with Hamming distance refers to the problem of finding all the substrings with Hamming distance less than specified distance from the pattern” [7].
The pattern matching literature contains many different algorithms for approximate pattern matching within a specified Hamming distance. A well-known algorithm is the Shift-Add algorithm for both exact and approximate string matching [8]. Another algorithm, based on convolutions, is given in [9]. Others use trees in their search model, as in [10], [11], and [12]. Further works consider searching very large data files, such as [13], [14], [15], and [16]. An interesting method of matching string patterns in large textual files is presented in [17]. It is based on a hash transformation that maps string segments onto key numbers, thus locating the matching pattern faster. The segments are also used to detect errors: the pattern is divided into fixed-length segments that together represent the searched pattern, and these segments are then looked up in a hash table. In [18] and [19], the authors used the Pigeonhole Principle for pattern matching in string search. They divided the pattern into a specific number of segments such that, when the number of errors is at most one less than the number of segments, at least one segment must match the pattern exactly.
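By way of illustration only, the Hamming distance and the segment-based use of the Pigeonhole Principle described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the implementation of [18], [19], or of the present invention: the function names hamming, split_segments and pigeonhole_search are hypothetical, the pattern is split into k + 1 nearly equal segments, and a match is taken to mean at most k mismatches.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def split_segments(pattern: str, parts: int) -> list[tuple[int, str]]:
    """Split pattern into `parts` contiguous segments as (offset, segment)."""
    base, extra = divmod(len(pattern), parts)
    segments, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        segments.append((start, pattern[start:start + size]))
        start += size
    return segments


def pigeonhole_search(text: str, pattern: str, k: int) -> list[int]:
    """Start positions where text matches pattern with at most k mismatches.

    Assumption for this sketch: with k + 1 segments and at most k
    mismatches, at least one segment is error-free, so only positions
    where some segment occurs exactly need to be verified.
    """
    m = len(pattern)
    if k < 0 or k + 1 > m:
        raise ValueError("k must satisfy 0 <= k < len(pattern)")
    candidates = set()
    for offset, seg in split_segments(pattern, k + 1):
        pos = text.find(seg)
        while pos != -1:
            start = pos - offset  # where the whole pattern would begin
            if 0 <= start <= len(text) - m:
                candidates.add(start)
            pos = text.find(seg, pos + 1)
    # Verification step: keep candidates within the allowed distance.
    return sorted(s for s in candidates
                  if hamming(text[s:s + m], pattern) <= k)


if __name__ == "__main__":
    text = "abcdefgh_abXdefgh_aXcdeYgh"
    print(pigeonhole_search(text, "abcdefgh", 1))
```

In this sketch the exact search for each segment acts as a filter, so the more expensive Hamming comparison is performed only at candidate positions rather than at every position of the text.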
The algorithms in the literature cited above mostly support substring matching in string search, whereas the algorithm of the present invention provides a novel technique for fuzzy search in very large data files. Also, we use the Pigeonhole Principle for searching, not for pattern matching in string search as in [18] and [19]. Furthermore, in our new searching technique, we expand the Pigeonhole searching capabilities by having the basic search make use of an intrinsically approximate search method, such as the FuzzyFind method [1][2].