The present invention relates to the field of storing and retrieving computerized data through the use of a Search Engine and, more particularly, to the indexing of data in system(s) as it relates to the Internet, the World Wide Web, an intranet of local systems and/or any combination thereof. Computers are widely used to store and retrieve information. If the number of stored records is of any significant size, the records are typically stored in a computer database. Given a collection of multiple systems, a Search Engine may be used to locate, find, compare, and track data as it relates to documents (including files, images, objects, programs and other data in various forms, referred to herein as a document) in the system(s). The Search Engine can read the documents through program(s) commonly referred to as a Web-walker, Web-crawler, Spider, Browser or Robot, which acts similarly to a user and notes the words in a document, the sequence of the words, and the size of the document. The program may also note whether changes have occurred since a prior scan of the document, the date of the document, the file name, the computer or server containing the document, the directory of the document, whether the document has a URL (uniform resource locator), pictures, objects (video, sound, etc.), attributes (color, font, etc.), links to other documents, meta-tags and any other attribute (spread sheets, graphs, computer code, programs, addresses of other documents and their associated attributes, etc.) that could be placed in or relate to the document.
Present Search Engines such as Google, Excite, and Alta Vista perform the following common functions:
browsing of the documents by a program or system of programs to scan the documents for content and attributes;
parsing of the documents to separate out words, information and attributes;
indexing some or all of the words, information and attributes of the documents into a database;
querying the index and database through a user interface (for live users and/or programs and systems) through local and remote access;
maintaining the information, words and attributes in an index and database through data movement and management programs, as well as re-scanning the systems for documents, looking for changed documents, deleted documents, added documents, moved documents and new systems, files, information, connections to other systems and any other data and information;
retrieving a Web page from the Internet, including a URL associated with the retrieved Web page;
indexing data items from the Web page according to the present invention, the indexing further comprised of the steps of:
associating each of the unique references to Web pages containing the unique data items associated with the unique reference, wherein the unique reference may be a unique numerical value assigned to each unique word, and wherein that word is used to look up all URLs of Web pages containing the unique word;
a.) providing locations in the index array which may be as small as 1 bit in length; and
b.) merely storing an "on" or "off" (i.e., "full" or "empty") indicator in the location;
c.) counter locations and pointer locations of predetermined length are also preferably added to the "compressed" index array structure for counting the number of locations that are full and pointing to the locations of the next level index arrays.
Google represents a typical Search Engine. The Internet currently contains over one hundred million documents, each containing on average over 100 unique words, of which on average more than one is unique to that document (the URL is usually also unique). This results in an extremely large database of words (over 100 million) and over 10 billion entries in a database that tracks words in referenced documents. As the Internet grows to more than a billion documents, these databases will grow correspondingly. In typical Internet Search Engine designs, Hash techniques, B-tree Indexes, sorted lists, and variations thereon are the accepted approaches.
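By way of illustration only, the size estimates above can be reproduced with simple arithmetic. The variable names below are hypothetical, and the inputs are the lower bounds stated in the text rather than measured values.

```python
# Lower-bound figures taken from the text above (not measured values).
documents = 100_000_000              # "over one hundred million documents"
unique_words_per_document = 100      # "over 100 unique words" per document
words_unique_to_one_document = 1     # "over one" word found in no other document

# One entry per (word, document) pair in the database that tracks words.
word_document_entries = documents * unique_words_per_document

# Words that occur in a single document alone already fill the word database.
distinct_words = documents * words_unique_to_one_document

print(word_document_entries)  # 10000000000
print(distinct_words)         # 100000000
```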
The B-tree approach is one approach to an index and a database. Due to the enormous size of the database, and given changes and growth, the present invention utilizes a new approach for structuring databases for Search Engines. The Search Engine of the present invention uses an indexing method that provides advantages in organizing the tremendous amount of information on the Internet and in searching such information in a fast and efficient manner. As discussed in more detail in the following detailed description, the Search Engine of the present invention using the described indexing method provides significant practical advantages over known Search Engines.
For example, the present invention significantly increases the speed with which database records can be stored and retrieved. The present invention increases the speed of data retrieval through unique and efficient programming steps rather than through hardware configuration.
In recent years the term database has been used rather loosely, and as a result, has lost some of its usefulness. To some a database is just a collection of data items. Others define the term more strictly. However, for the purposes of the present invention, a database is a self-describing collection of integrated records. A database is self-describing in that it contains a description of its own structure. This is called meta-data. The database is integrated in that it includes the relationships among data items as well as the data items themselves. Accordingly, the term database as used in the present application is not limited to merely bits of stored data or information.
A database is made up of both data and meta-data. Meta-data is data about the structure of the data in a database. This meta-data is stored in a part of the database called the data dictionary, which describes the tables, columns, indexes, constraints, and other items that make up the database.
Databases come in all sizes, from a simple collection of a few records to millions of records. A personal database is designed for use by a single person on a single computer. It tends to be rather simple in structure and small in size. A database is a structure that holds data. The database is structured to operate on the data contained within it.
There are many database management systems (DBMS) on the market today. Some run only on mainframe computers, some only on minicomputers, some only on personal computers, and some on most systems. However, there is a strong trend for such products to work on multiple platforms or on networks that contain all three classes of machines. A database that will run on different operating systems such as OS/2, Windows 3.1, Windows 95, or UNIX is said to be portable. Databases are also scalable if they can make use of added or removed computers to gain or give up processing speed or power.
Regardless of the size of the computer that hosts the database, and regardless of whether it is connected to a network, the flow of information between the database and the user is the same. The DBMS masks the physical details of the storage so that the application only has to know about the logical characteristics of the data, not how it is stored.
Database structure makes it possible to interpret seemingly meaningless data. The structure brings to the surface patterns, trends, and tendencies in the data. Unstructured data, like uncombined atoms, has little or no value.
Databases may vary in size and structure. However, they are generally structured as a hierarchical, network, or relational model. The hierarchical database assigns different types of data to different levels of a data structure. The links between the data item on one level and data items on a different level are simple and direct. A major advantage of the hierarchical model is the simplicity of the relationships among data items. However, its rigid structure is a disadvantage.
The opposite of a hierarchical structure is one in which any node has direct access to any other. There is no need to duplicate nodes since they are all universally accessible. The network model is based on this concept.
The relational database model was first formulated by E. F. Codd of IBM in 1970, and started appearing in products about a decade later. Relational databases have attributes that distinguish them from databases built according to other models. In a relational database, the database structure can be changed without requiring changes to applications that were based on the earlier structure. For example, if one or more new columns are added to a database table, older applications that processed the table would not require alteration because the columns they deal with are unaltered.
As discussed, a database is usually more than a collection of tables. Additional structures, on several levels, help maintain the integrity of the data. A database's schema provides an overall organization to the tables. The domain of a table column tells us what values may be stored in the column. You can apply constraints to a database table to prevent invalid data from being stored in it. A view is a way of looking at only part of the database at one time. In relational tables, primary and foreign keys are used to connect tables.
A database is usually used in a client/server environment. In a client/server environment, a client requests information and/or services from a server, which provides services back to the client. The client usually consists of some graphical interface such as Microsoft Windows (Microsoft Windows is a registered trademark of Microsoft Corporation). The server usually consists of one or more large computers connected to provide very fast response time to clients. A client/server environment can also exist on a single system, usually as two or more separate processes running simultaneously. Most databases are servers giving results to requests made by different clients.
Databases provide a way of storing information without much work by the applications. Data is stored as data and meta-data in files. The way the data is stored is transparent to the application, which allows multiple programs to access the same data using a given database. The database does not even have to be on the same machine as the application. To some extent, databases can change without affecting the application.
Records are made up of a collection of fields which usually comprise a group of bytes. A group of bytes is usually called a string. Bytes are used to represent some character set or, as a collection, some number. A string which is a group of characters from some character set will vary in length. Numbers are usually represented by a fixed number of bytes.
Information or data may be stored as a collection of records on some type of magnetic device so that it is retained while the machine that reads and/or writes the information upon request is without power, or simply off. These magnetic devices come in several different varieties. The most common in the personal computer world are the floppy disk (or diskette) and the hard drive. These will be referred to simply as disks hereinafter. Other forms of storage include memory which is dynamic (usually RAM memory), meaning that it must be loaded every time the given machine boots up or is turned on. Operating systems control the way in which data is read from and written to these disks. The way operating systems use RAM memory and disks is a concern of the present invention, but instead of going into great detail of how storage may be implemented, it is assumed that optimal disk and RAM memory usage is available. Groups of data are referred to as files, objects, libraries, etc.
Files are used to store programs, which are a series of commands to a computer. Computers execute machine language instructions represented by bytes of the program. A data file is a collection of information, usually separated into records, that a program will use. Databases are used today to allow a higher level of transparency in how information is stored and/or retrieved. Databases consist of a program that works as a message broker between the programs which need data and the data files. This database program can enforce all kinds of rules concerning security, referential integrity, and duplication of records.
As the computer industry continues to boom, the speed of the hardware keeps getting faster. The size of storage available on these machines has also grown at enormous rates. Even with all the speed and storage available on machines today, many databases tend to be very slow. The present invention is intended to increase the speed at which data is accessed and stored.
Computers are math machines. This means that they work very well with numbers. Strings are a series of numbers, usually of an indeterminate size. Operations on strings tend to be very slow in comparison to operations on single numbers. Sorting strings is one of the slowest operations a computer can perform. Computers are good with numbers and not with strings because there are no existing machine language instructions to support strings directly. Instead, strings must be converted to a series of numbers and instructions must be executed on each of those numbers. Data files are a series of fixed or variable length records. Inserting records in the middle of these files is an expensive operation because, in doing so, other data must be shifted. The present invention allows the placement of new records at the end of the file.
The present invention comprises a method for storing and retrieving large amounts of data, and a system and a network wherein the method can be performed. The density of information stored may become larger than with traditional methods of storing and retrieving data, yet the computational power needed will become significantly smaller. This system does not require a big system/small system architecture in which the big system stores all of the information in an efficient manner and then the small system is used for retrieval purposes. Although this is the traditional approach, many problems arise from such an architecture. New records added to the database after the big system stores the information will either be inserted in a very long and tedious process or will not be accessible in an efficient manner. The present invention, on the other hand, increases the speed of processing data largely independently of the size of the data.
Records are stored in a container (see FIG. 1) at some location. That location could then be used to identify the record in the container. The location could be a relative address of the record or the actual address of the record. It makes no difference as long as the location can identify the record within the container. The location hereinafter will be referred to as an address, or record location. The way the records are stored in the container also is of no concern with respect to storage and/or retrieval. The records could be sorted, partially sorted, or not sorted at all. If speed is of a primary concern, the best place to store the record within the container is usually at the end. Storing a record in the middle of a container usually means that another record (or, in most cases, a group of records) must be compared and moved.
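By way of illustration only, a container that places every new record at its end and returns the record location might be sketched as follows. The class name `RecordContainer`, the length-prefixed layout, and the use of a byte offset as the address are assumptions made for this sketch and are not taken from the description.

```python
import io
import struct


class RecordContainer:
    """Hypothetical append-only container; a record's address is its byte offset."""

    def __init__(self, stream=None):
        # An in-memory stream stands in for a file on disk.
        self._stream = stream if stream is not None else io.BytesIO()

    def append(self, record: bytes) -> int:
        """Store the record at the end of the container and return its address."""
        self._stream.seek(0, io.SEEK_END)
        address = self._stream.tell()
        # A 4-byte length prefix lets variable-length records be read back.
        self._stream.write(struct.pack("<I", len(record)))
        self._stream.write(record)
        return address

    def read(self, address: int) -> bytes:
        """Return the record stored at the given address."""
        self._stream.seek(address)
        (length,) = struct.unpack("<I", self._stream.read(4))
        return self._stream.read(length)


container = RecordContainer()
first = container.append(b"alpha")
second = container.append(b"beta")
print(first, second)           # 0 9
print(container.read(second))  # b'beta'
```

No existing record is compared or moved when a record is appended, which is the property the paragraph above relies on.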
A container is stored physically on the system as a file or object on a mechanism that will maintain information when the system is off. Magnetic disks utilize a common notation for storing, organizing, and retrieving data into groups, containers, files, or objects. The terms disk and file will be used hereinafter for a mechanism of storing information and the container respectively. Data and records will also be used interchangeably.
One embodiment of the current invention uses three entities (FIG. 2) to store and retrieve records. The container, or file, is used to store and retrieve records. The indexes are preferably used to store the addresses, or locations, of the records. Records are usually a collection of fields. These fields comprise a string of characters from a given character set (e.g., numbers). The fields are preferably not stored in the indexes; however, they are used to shape the index. The index is initially a list or array of empty or NULL index locations. These index locations are used to store addresses of record locations or pointers to other indexes. This list initially includes an empty location for all important characters (that is, characters to be indexed), a location for other unknown characters (UKN), and a location for a terminating character. FIG. 2 describes an index for numbers, indexing characters `0` through `9`.
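By way of illustration only, an initially empty index for the characters `0` through `9` might be laid out as below. The names `new_index`, `slot_for`, `UKN_SLOT` and `END_SLOT`, the ordering of the locations, and the choice of the NUL character as the terminating character are assumptions made for this sketch.

```python
DIGITS = "0123456789"        # the "important" characters to be indexed
UKN_SLOT = len(DIGITS)       # location 10: any other (unknown) character
END_SLOT = len(DIGITS) + 1   # location 11: the terminating character
TERMINATOR = "\0"            # assumed terminating character
EMPTY = 0                    # value of an empty (NULL) index location


def new_index():
    """Return an index whose locations are all empty."""
    return [EMPTY] * (len(DIGITS) + 2)


def slot_for(ch):
    """Map a single character to its location in the index."""
    if ch == TERMINATOR:
        return END_SLOT
    position = DIGITS.find(ch)
    return position if position >= 0 else UKN_SLOT


index = new_index()
print(len(index))      # 12
print(slot_for("7"))   # 7
print(slot_for("x"))   # 10
print(slot_for("\0"))  # 11
```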
A duplicate segment stores multiple occurrences of a given string from a given field. It may be a wide linked-list structure containing a count of occurrences and several addresses per segment. The number of addresses per segment may vary. A pointer to the end of the list could also be included in the first duplicate segment. Additional fields may be added as found necessary (e.g., a flags character to show if a duplicate segment is sorted).
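By way of illustration only, a duplicate segment list might be sketched as a linked list of segments, each holding several addresses. The class names, the default capacity of four addresses per segment, and the placement of the occurrence count and end-of-list pointer in a list header are assumptions made for this sketch.

```python
class DuplicateSegment:
    """One segment: several record addresses plus a link to the next segment."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.addresses = []
        self.next = None


class DuplicateList:
    """Hypothetical header holding the count, first segment and end-of-list pointer."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.head = DuplicateSegment(capacity)
        self.tail = self.head   # pointer to the end of the list
        self.count = 0          # count of occurrences

    def add(self, address):
        """Record one more occurrence, starting a new segment when the last is full."""
        if len(self.tail.addresses) == self.capacity:
            segment = DuplicateSegment(self.capacity)
            self.tail.next = segment
            self.tail = segment
        self.tail.addresses.append(address)
        self.count += 1

    def __iter__(self):
        segment = self.head
        while segment is not None:
            yield from segment.addresses
            segment = segment.next


duplicates = DuplicateList(capacity=2)
for address in (5, 9, 12):
    duplicates.add(address)
print(list(duplicates))  # [5, 9, 12]
print(duplicates.count)  # 3
```

Keeping the end-of-list pointer with the first segment means a new occurrence can be added without walking the list.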
After a record is placed into the record file and an address is returned, the record is then broken up into fields which can be used to represent strings of characters. These strings of characters should be terminated with some terminating character. For example purposes, the terminating character may be the normal C programming language style terminating character `\0` (the NUL character, whose value is zero). Strings of characters can then be broken into their individual characters to shape the index (as will be shown in the detailed description of the invention). The first character is used as a reference to the first index. In one embodiment of the present invention, the address retrieved from a location in the index is checked for one of three possibilities. The location can be empty, or it may contain an address to a record, or it may contain an address to another index. The type of the value can be specified within the address by way of signed integers (e.g., a positive integer signifies a FULL state, a zero signifies an EMPTY state, and a negative number signifies an ADDRESS TO INDEX state). Other combinations are possible, but these are the values that are used hereinafter.
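By way of illustration only, the three states of an index location might be exercised as in the following sketch, which should not be read as the method of the detailed description. The class `SignedIndex`, its in-memory record list, the 1-based record addresses, and the rule for pushing an existing record one level down when a location is already FULL are assumptions; duplicate strings are simply left in place here rather than moved to a duplicate segment.

```python
ALPHABET = "0123456789"
UKN = len(ALPHABET)          # location for unknown characters
END = len(ALPHABET) + 1      # location for the terminating character
WIDTH = len(ALPHABET) + 2
TERMINATOR = "\0"


def slot(ch):
    if ch == TERMINATOR:
        return END
    position = ALPHABET.find(ch)
    return position if position >= 0 else UKN


class SignedIndex:
    """0 = EMPTY, positive = record address (FULL), negative = ADDRESS TO INDEX."""

    def __init__(self):
        self.records = [None]         # address 0 is reserved for EMPTY
        self.arrays = [[0] * WIDTH]   # arrays[0] is the first index

    def insert(self, key):
        self.records.append(key)      # the record goes at the end of the container
        address = len(self.records) - 1
        chars, array_no, depth = key + TERMINATOR, 0, 0
        while True:
            array = self.arrays[array_no]
            s = slot(chars[depth])
            value = array[s]
            if value == 0:            # EMPTY: store the record address
                array[s] = address
                return address
            if value < 0:             # ADDRESS TO INDEX: follow the pointer
                array_no, depth = -value, depth + 1
                continue
            other = self.records[value] + TERMINATOR
            if s == END or other == chars:
                return address        # same string: duplicate handling omitted
            # FULL: create a new index and move the existing record into it.
            self.arrays.append([0] * WIDTH)
            child = len(self.arrays) - 1
            array[s] = -child
            self.arrays[child][slot(other[depth + 1])] = value
            array_no, depth = child, depth + 1

    def lookup(self, key):
        array_no = 0
        for ch in key + TERMINATOR:
            value = self.arrays[array_no][slot(ch)]
            if value == 0:
                return None
            if value > 0:             # confirm against the stored record
                return value if self.records[value] == key else None
            array_no = -value
        return None


index = SignedIndex()
for key in ("123", "125", "12", "9"):
    index.insert(key)
print(index.lookup("125"))  # 2
print(index.lookup("12"))   # 3
print(index.lookup("124"))  # None
```

In this sketch a new index is created only when two different strings reach the same location, so short or rare prefixes consume no additional indexes.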
The indexing schema of the present invention may also be combined with other known database technologies (e.g. B-tree, Hash, Direct Access, Sorted Set, Unsorted Set, etc.) to work as a "hybrid" database indexing system. This may be accomplished by having the index locations of the present invention point to any of the other known database technologies at predetermined levels of the index structure. Alternatively, the other known database structures, at predetermined levels of their structure, may point to the indexing structure of the present invention, or any combination thereof. For example, the indexing schema of the present invention may be used to index the first five characters of all words in a database while resorting to other known database technologies to index the remaining characters of the words, if any. Accordingly, the indexing schema of the present invention, by indexing the first 5 characters, narrows the set (or the amount of further indexing needed) for the other existing technology. This hybrid indexing system is advantageous in that it combines the high speed indexing and retrieval of the present invention with the other known technologies which do not require as much memory for storing indexes.
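By way of illustration only, a hybrid arrangement might index the first five characters character by character and hand the remaining characters to a hash table. The class names are hypothetical, a Python dictionary stands in both for the per-character index arrays and for the other known technology, and the prefix length of five follows the example above.

```python
class Node:
    def __init__(self):
        self.children = {}   # next character -> Node (stands in for an index array)
        self.tail = {}       # remaining characters -> record address (hash table)


class HybridIndex:
    """Character-by-character index for a fixed prefix, hash table for the rest."""

    def __init__(self, prefix_len=5):
        self.prefix_len = prefix_len
        self.root = Node()

    def insert(self, word, address):
        node = self.root
        for ch in word[:self.prefix_len]:
            node = node.children.setdefault(ch, Node())
        # Whatever follows the prefix is delegated to the hash table.
        node.tail[word[self.prefix_len:]] = address

    def lookup(self, word):
        node = self.root
        for ch in word[:self.prefix_len]:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.tail.get(word[self.prefix_len:])


hybrid = HybridIndex()
hybrid.insert("indexing", 1)
hybrid.insert("indexes", 2)
hybrid.insert("index", 3)
print(hybrid.lookup("indexes"))  # 2
print(hybrid.lookup("index"))    # 3
print(hybrid.lookup("indexed"))  # None
```

All three words share one path through the prefix levels, so the hash table reached at the fifth level only has to distinguish the few suffixes that follow `index`.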
In one embodiment of the present invention specific to an Internet Search Engine, the indexing process is comprised of the steps of:
a. creating an index structure by creating a plurality of index arrays, each of the index arrays having locations relating to a predetermined list of characters, wherein the locations of the plurality of index arrays are adapted to store either a pointer to another index array or a second data element;
b. associating a unique reference for each unique data item in the index structure;
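By way of illustration only, the association of a unique reference with each unique data item, and of that reference with Web pages, might be sketched as follows. The class `WordDirectory` and its methods are hypothetical, a Python dictionary stands in for the index structure of step a, whitespace splitting stands in for parsing, and the `example.com` URLs are placeholders.

```python
class WordDirectory:
    """Assigns a numerical reference to each unique word and lists URLs per reference."""

    def __init__(self):
        self.word_ids = {}   # word -> unique numerical reference
        self.pages = []      # pages[reference] -> URLs of pages containing the word

    def add_page(self, url, text):
        # Each distinct word on the page is counted once for that page.
        for word in sorted(set(text.lower().split())):
            reference = self.word_ids.get(word)
            if reference is None:
                reference = len(self.pages)
                self.word_ids[word] = reference
                self.pages.append([])
            self.pages[reference].append(url)

    def urls_for(self, word):
        reference = self.word_ids.get(word.lower())
        return [] if reference is None else list(self.pages[reference])


directory = WordDirectory()
directory.add_page("http://example.com/a", "search engine index")
directory.add_page("http://example.com/b", "index array")
print(directory.urls_for("index"))
# ['http://example.com/a', 'http://example.com/b']
```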
Furthermore, the Search Engine according to the present invention may be configured with a "compressed" index array structure for the purpose of saving memory space. "Compressed" as the term is used in the present invention means that the index array structures are smaller in length than the "uncompressed" index array structures. The "compressed" index arrays are condensed by:
The "compressed" index array structure saves a significant amount of memory space as there are fewer bits (e.g., 1 bit) for each location assigned to a character, as opposed to 8 or more bits.
The present invention significantly increases the speed with which database records can be stored and retrieved. It accomplishes this, not through an improvement to computer hardware, but through efficient programming which utilizes the commands the computer can execute the fastest. The need to sort records and/or indexes for rapid retrieval may be eliminated. In fact, records do not need to be stored in any order for retrieval. They also do not have to be grouped as traditionally known. Searching is independent of the records, which can be compressed and/or put into variable length formats. This in turn allows the computer database to grow to a size restricted only by what the computer hardware can handle, without much degradation in performance.
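By way of illustration only, an index array with one bit per location, a counter of full locations, and pointers to next level index arrays might be sketched as below. The bitmap-plus-rank layout shown is a common technique chosen for this sketch and is not asserted to be the layout of the invention; the class name and the 12-location width are likewise assumptions.

```python
class CompressedIndexArray:
    """One bit per location, a counter of full locations, and a compact pointer list."""

    def __init__(self, width=12):
        self.width = width
        self.bits = 0        # bit i is "on" when location i is full
        self.count = 0       # counter of full locations
        self.pointers = []   # one pointer per full location, in location order

    def _rank(self, location):
        """Number of full locations that come before the given location."""
        return bin(self.bits & ((1 << location) - 1)).count("1")

    def is_full(self, location):
        return bool((self.bits >> location) & 1)

    def set(self, location, pointer):
        rank = self._rank(location)
        if self.is_full(location):
            self.pointers[rank] = pointer
        else:
            self.bits |= 1 << location
            self.pointers.insert(rank, pointer)
            self.count += 1

    def get(self, location):
        if not self.is_full(location):
            return None
        return self.pointers[self._rank(location)]


array = CompressedIndexArray()
array.set(7, "next-level array for '7'")
array.set(2, "next-level array for '2'")
print(array.get(2))      # next-level array for '2'
print(array.get(3))      # None
print(array.count)       # 2
```

An empty location costs a single bit here, and pointer storage is consumed only by locations that are full.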
In addition to the features mentioned above, objects and advantages of the present invention will be readily apparent upon a reading of the following description.