1. Field of the Invention
The present invention relates generally to storing/caching data directly on transmission media and network transmission hardware. More particularly, the present invention relates to systems and methods for searching, accessing, querying, and performing computations on locally or globally distributed data, stored/cached in the form of data packets, protocol data units (PDUs), or protocol payloads, etc., continuously transmitted on a telecommunication network, and/or on a microprocessor, data bus, or electronic circuit for the life of the data packet(s) and/or data stream(s).
2. Problems in the Art
It is estimated that between 1 and 2 billion gigabytes of unique information are created each year, and a high percentage of this information is created in a digital format. Of all that data, 90 percent is expected to be stored/cached digitally. Contributing to this is the rapid growth of digitized books, magazines, videos, music, and other ‘rich content’ and ‘mass access’ data. This growing volume of data is often located remotely, must be transported between computing devices, and must be stored/cached in a highly secure, accessible manner. The exponential growth rate of generated data is expected to outpace improvements in communication bandwidth and storage/caching capacity in the near future. Consequently, there is an urgent need to store/cache digital information in new ways that make it accessible at high speeds on a storage/caching medium, and where the medium may be exponentially improved.
The access speed to digital information is ultimately controlled by the input/output (I/O) capacity of any electronic device. I/O is the lifeblood of computing, getting relevant information into and out of the processor, compute device, or appliance to the end-user on a timely basis. This has always been true, but never so much as in a networked computing environment.
Many associated I/O problems impede high speed access to remotely stored/cached data. Ethernet and TCP/IP are widely accepted but inefficient protocols that are used to drive LANs, WANs, and ultimately the Internet. The TCP/IP protocol suite has proven itself a basic foundation for communications of all kinds over essentially unreliable networks. However, the mechanisms that make it robust on unreliable networks also make it inefficient and create network latency issues. TCP/IP-based protocols have a complex, layered design, with many inter-layer dependencies that can easily demand extensive processing and significant buffer memory to implement. Open Systems Interconnection (OSI) is a worldwide communications standard that defines a networking framework for implementing protocols in seven layers. Handling gigabit-class network traffic, servicing interrupts, moving data through long code-paths, and performing numerous kernel-to-application context switches are all expensive operations. Together, they yield long message latencies and use up a significant percentage of available processor power.
Another problem that impedes high speed access to data is the venerable Peripheral Component Interconnect (PCI) shared data bus, which is one of the most prevalent I/O architectures for compute devices. For example, the bandwidth for board-level transfer and for processor-to-cache transfer on a typical PC is much higher than the bandwidth from the PC to a peripheral network device via the PCI bus.
Data storage/caching centers have developed a variety of specialized networks, such as SANs (Storage Area Networks), specialized cluster links, NAS (Network Attached Storage), and RAID (Redundant Array of Independent Disks) systems in order to improve access to local and remotely distributed data. However, as RAID, NAS, Fast and Gigabit Ethernet, SANs, and SCSI (Small Computer System Interface) links are usually implemented with PCI adapter cards or PCI components, all of the data traffic on these network devices is ultimately throttled by the low-speed I/O devices.
I/O problems are further complicated by the architecture of a typical data storage/caching center. For example, a data storage/caching center for Web applications or Business-to-Business (B2B) exchanges may have hundreds of servers all requiring shared access to terabytes of file storage/caching. The workload is defined by server requests coming in through networked routers, switches, firewalls, load balancers, caching appliances, and the like. Since file sharing by multiple servers is a fundamental requirement of this environment, storage/caching is usually aggregated into shared storage/caching pools, accessed by the servers using a file access protocol such as Network File System (NFS). The result is a complex and sophisticated infrastructure that has exploded in importance in just the past few years.
Consequently, there are challenges surrounding how individual servers fulfill growing client requests and connections from ‘the outside world’, and how these challenges impact organization of complex and discrete files, data, databases, and storage. PCI-X and Infiniband are two solutions that will greatly improve I/O performance, and therefore increase broadband access to remotely stored/cached digital information.
Infiniband is an architecture and specification for data flow between processors and I/O devices that promises greater bandwidth and almost unlimited expandability in tomorrow's computer systems. In the next few years, Infiniband is expected to gradually replace the existing Peripheral Component Interconnect (PCI) shared-bus approach used in most of today's personal computers and servers. Offering throughput of up to 2.5 gigabytes per second and support for up to 64,000 addressable devices, this architecture also promises increased reliability, better sharing of data between clustered processors, and built-in security. Infiniband is the result of merging two competing designs, Future I/O, developed by Compaq, IBM, and Hewlett-Packard, with Next Generation I/O, developed by Intel, Microsoft, and Sun Microsystems. For a short time before the group came up with a new name, Infiniband was called System I/O.
PCI-X (Peripheral Component Interconnect Extended) is a new computer bus technology (the “data pipes” between parts of a computer) that increases the clock rate at which data can move within a computer from 66 MHz to 133 MHz. Developed jointly by IBM, HP, and Compaq, PCI-X doubles the speed and amount of data exchanged between the computer processor and peripherals. With the current PCI design, one 64-bit bus runs at 66 MHz and additional buses move 32 bits at 66 MHz or 64 bits at 33 MHz. The maximum amount of data exchanged between the processor and peripherals using the current PCI design is 532 MB per second. With PCI-X, one 64-bit bus runs at 133 MHz with the rest running at 66 MHz, allowing for a data exchange of 1.06 GB per second. PCI-X is backwards-compatible, meaning that one can, for example, install a PCI-X card in a standard PCI slot but should expect a decrease in speed to 33 MHz. Both PCI and PCI-X cards can also be used on the same bus, but the bus will run at the speed of the slowest card. PCI-X is more fault tolerant than PCI. For example, PCI-X is able to reinitialize a faulty card or take it offline before computer failure occurs.
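The bandwidth figures quoted above can be approximated by multiplying the bus width by the clock rate. The following is a minimal sketch of that arithmetic, assuming one transfer per clock cycle and nominal clock values; the function name `peak_bandwidth_mb_per_s` is illustrative only and does not come from any specification. With nominal clocks the results are 528 MB per second and 1064 MB per second, close to the 532 MB per second and 1.06 GB per second cited; the small difference likely reflects the exact clock frequency used in the quoted figure.

```python
def peak_bandwidth_mb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak transfer rate in MB/s (1 MB = 10**6 bytes).

    Assumption: exactly one bus-wide transfer per clock cycle, with no
    protocol overhead. Real sustained throughput is lower.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


# Configurations mentioned in the text (nominal clock values assumed).
pci_64_66 = peak_bandwidth_mb_per_s(64, 66)      # 64-bit PCI at 66 MHz
pci_32_66 = peak_bandwidth_mb_per_s(32, 66)      # 32-bit PCI at 66 MHz
pci_64_33 = peak_bandwidth_mb_per_s(64, 33)      # 64-bit PCI at 33 MHz
pcix_64_133 = peak_bandwidth_mb_per_s(64, 133)   # 64-bit PCI-X at 133 MHz

for label, value in [
    ("PCI 64-bit @ 66 MHz", pci_64_66),
    ("PCI 32-bit @ 66 MHz", pci_32_66),
    ("PCI 64-bit @ 33 MHz", pci_64_33),
    ("PCI-X 64-bit @ 133 MHz", pcix_64_133),
]:
    print(f"{label}: {value:.0f} MB/s")
```

Under these assumptions the PCI-X figure is roughly double the fastest PCI figure, which matches the doubling described in the text.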
PCI-X was designed for servers to increase performance for high bandwidth devices such as Gigabit Ethernet cards, Fibre Channel, Ultra3 Small Computer System Interface, and processors that are interconnected as a cluster. Compaq, IBM, and HP submitted PCI-X to the PCI Special Interest Group (PCI SIG) in 1998. PCI SIG approved PCI-X, and it is now an open standard that can be adapted and used by all computer developers. PCI SIG controls technical support, training and compliance testing for PCI-X. IBM, Intel, Microelectronics and Mylex plan to develop chipsets to support PCI-X. 3Com and Adaptec intend to develop PCI-X peripherals.
To accelerate PCI-X adoption by the industry, Compaq offers PCI-X development tools at its Web site.
When remotely storing digital information the following criteria should be considered: the frequency of read access, frequency of write access, size of each access request, permissible latency, permissible availability, desired reliability, security, etc. Some data is accessed frequently, yet rarely changed. Other data is frequently changed and requires low latency access. These factors should be taken into account, but are often compromised in the “one size fits all” design and operation of conventional data storage/caching systems.
Preferably, a data storage/caching system should be designed to be scalable so a user can purchase only the capacity needed at any particular time. High reliability and high availability are also considerations, as data users want remote access to data and have become increasingly intolerant of lost, damaged, and unavailable data. Unfortunately, current conventional data storage/caching architectures compromise these factors, and no single data storage/caching architecture provides a cost-effective, highly reliable, highly available, and dynamically scalable solution.
Today the end-user can have high-speed access to streaming and non-streaming data in the form of websites, electronic text documents, graphic images, or spreadsheets stored/cached remotely by purchasing telecommunication bandwidth in the form of a T-1 or a fractional T-3 line, a Digital Subscriber Line (DSL), or through a cable TV provider using a cable modem. However, no conventional digital information storage/caching system addresses the end-user's desire for widespread, low latency access to streaming and non-streaming multi-media data in the form of music, TV shows, movies, radio broadcasts, web casts, etc.
Advances in fiber optic transmission technology and its declining cost have enabled upgrades in front-end network systems such as cable TV network trunk and feeder systems. Traditionally, these systems have increased the bandwidth of a telecommunication network sufficiently to provide each subscriber his own dedicated channel to the head-end for receiving compressed digital video. In addition, direct broadcast satellite technology and other emerging wireless communication technologies also provide dedicated multimedia and video channels between a large number of end-users and the server systems. Personal computers and set top boxes for the end-user are also emerging, which enable networked multimedia applications.
The above mentioned improvements may typically improve the overall performance of current video server systems by a factor of only two or four times, whereas the current need in the industry requires improvements in the range of 100 to 1000 times to make interactive streaming video services economically feasible.
While the end-user (client) system and the front-end network system infrastructure are evolving rapidly to meet the requirements of non-streaming and interactive multimedia services, current server architectures remain expensive and impractical for delivering these services. Current server systems are unable to process the large number of data streams that are required by streaming multimedia and video services. The current choices of servers are typically parallel computing systems based on off-the-shelf mainframe or workstation technology. The hardware and software in both cases are optimized for computation intensive applications and for supporting multiple concurrent users (time-sharing), with very limited emphasis on moving data to and from a telecommunication network interface and the Input/Output (I/O) device.
Another key to acceptable multimedia audio and video streaming is the concept of Quality of Service (QoS). Quality of Service generally refers to a technique for managing computer system resources, such as bandwidth, by specifying user visible parameters such as message delivery time. Policy rules are used to describe the operation of data packet(s) to make these guarantees. Relevant standards for QoS in the IETF (Internet Engineering Task Force) are the RSVP (Resource Reservation Protocol) and COPS (Common Open Policy Service) protocols. RSVP allows for the reservation of bandwidth in advance, while COPS allows routers and switches to obtain policy rules from a server.
A major requirement in providing Quality of Service is the ability to deliver multi-media frame data at a guaranteed uniform rate. Failure to maintain Quality of Service may typically result in an image that is jerky or distorted.
Traditional server system architectures have not been equipped with the functionality necessary for providing Quality of Service on a large scale. With an increasing load on server systems to provide streaming multimedia applications, an increased volume of users (end clients), and the above mentioned deficiencies in current server system technology, a need exists for a server system architecture or a new data storage/caching system with enhanced search and access capabilities that is able to address the need for low latency, high-speed access to data.
U.S. Pat. No. 5,758,085, assigned to the International Business Machines (IBM) Corporation, partially addresses the above-named problems by providing a plurality of intelligent switches in a Storage Area Network (SAN). When the end-user (client) makes a request to receive video and multimedia data, a request is sent to the host processor, which in turn sends a request to a plurality of intelligent switches on the SAN. The intelligent switches include a cache for storing the requested data. The data is relayed directly from these switches to the end-user (client) requesting the multimedia data.
While the IBM system described above provides for the storage/caching of data onto switches, it does not allow the individual switches to cooperate as a distributed architecture in order to pool bandwidth to supply the backbone network, nor does it allow for the data to reside directly on a telecommunication network medium. Current technology allows for only a 1-2 gigabyte data stream coming out of a single peripheral device such as an array of disks, whereas a telecommunication network backbone may accommodate a 10 gigabyte or higher data stream. Also, in the above referenced patent, the individual switches are not capable of working together to distribute a delivery request over multiple switches for load balancing and streaming of the requested data.
United States Patent Application 20010049740, filed by Karpoff, addresses many of the shortcomings of the previously referenced IBM U.S. Pat. No. 5,758,085, by describing various systems and methods for delivering streaming data packets to a client device, over a telecommunication network in response to a request for the data packets from the client device. The client request is received by a server or a controller device that is typically located on a network switch device. If received by a server, the server sends a request to the intelligent network controller device for the transfer of the requested data to the client.
In addition to the data storage network architecture and bus problems discussed above, rapid access to and intelligent searching of data is impeded by the requirements of traditional relational database structures.
It has been 16 months since terrorists attacked the United States, and federal agencies are struggling to find the best way to share information to prevent future acts of terrorism. The key to fighting terrorism is the real-time free flow of information between federal agencies as well as with state and local governments. More than ever before, successful interdiction is dependent upon collecting, analyzing, and appropriately sharing information that exists in different databases, transactions, and other data points. The effective use of accurate information from diverse sources is critical to success in the fight against terrorism. There is no lack of desire to share information in a cooperative way; however, there is no easy and inexpensive solution for sharing data stored in traditional database structures.
Recently, the FBI has chosen to pursue “investigative data warehousing” as a key technology for use in the war against terrorism. This technology uses data mining and analytical software to sift through vast amounts of digital information to discover patterns and relationships that point to potential criminal activity. The same technology is also widely used in the commercial sector to track consumer activity and predict consumer behavior.
The FBI plans to build a data warehouse that receives information from multiple FBI databases and sources. Eventually, this warehouse might exchange warehoused data with other law enforcement and intelligence agencies. In the war against terrorism, new information technology is critical to analyzing and sharing information on a real-time basis. The FBI is also working to focus on analytical capabilities far more than it has in the past. For example, the FBI might want to put in a request for information on flight schools, access all the reports the FBI has written on flight schools from various FBI databases, and then analyze them using artificial intelligence software. However, the FBI is far from having this capability, which is known as enterprise data warehousing in the business world.
Data warehousing and data analysis/modeling tools are used extensively in the commercial sector to monitor sales in stores and automatically order new stock when inventories run low, and to monitor individual customer buying habits and try to influence consumer buying. The FBI is considering applying the same analysis techniques/tools currently used by the private sector to search vast collections of data to identify suspicious trends. For example, analyzing data collected in various FBI databases and by the Immigration and Naturalization Service, the CIA, and other agencies could indicate suspicious activity that now is overlooked. Add to that data from credit card companies, airlines, banks, phone companies, and other commercial entities, and actions and events that previously seemed innocent when considered separately begin to trigger alarms when considered in context with other activities.
Most business executives make critical decisions based on data that's been cut and sliced for them by information managers. If executives could get closer to their core business data, they would increase their odds of making better-informed decisions. The promise of business-intelligence software is to make existing enterprise databases accessible through easy-to-use analytical and reporting functions. Business-intelligence software quickly and cheaply allows organizations to extract additional value from existing data warehouses and enterprise systems.
Business-intelligence software is nearly useless for companies that have “dirty data.” Before any quality feedback can be produced, databases must have consistent categories, language, and maintenance. Unfortunately, for most organizations, mergers and acquisitions push the tendency toward chaos. They end up using incompatible databases and forcing new data into legacy information systems. Consequently, uniform data-entry protocols are lacking or ignored, making it difficult to implement changes. For example, a data field such as a supplier's name can be entered any number of ways by employees. Cleanup of such dirty data can be costly and can take from a few months to a few years.
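The supplier-name problem described above can be illustrated with a short sketch. The records, the supplier names, and the `normalize` rule below are all hypothetical and chosen only for illustration; real cleanup rules are typically far more involved. The sketch shows how inconsistent entry of one field fragments a simple per-supplier total, and how a normalization step consolidates it.

```python
from collections import Counter

# Hypothetical purchase records: (supplier name as typed, amount).
# The first three rows are meant to refer to the same supplier.
records = [
    ("Acme Corp.", 100),
    ("ACME Corporation", 250),
    ("acme corp", 75),
    ("Globex, Inc.", 300),
]


def total_by_supplier(rows):
    """Sum amounts per supplier key, exactly as the key is given."""
    totals = Counter()
    for name, amount in rows:
        totals[name] += amount
    return dict(totals)


def normalize(name: str) -> str:
    """Illustrative cleanup rule (an assumption, not a standard):
    lowercase, strip commas and periods, drop common company suffixes."""
    words = name.lower().replace(",", " ").replace(".", " ").split()
    suffixes = {"corp", "corporation", "inc", "incorporated"}
    return " ".join(w for w in words if w not in suffixes)


raw_totals = total_by_supplier(records)
clean_totals = total_by_supplier((normalize(n), a) for n, a in records)

print(raw_totals)    # one entry per spelling variant
print(clean_totals)  # variants of the same supplier merged
```

Without normalization the report treats each spelling as a separate supplier; with it, the three variants collapse into a single total.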
Although the government has a huge effort underway for implementing data mining, business intelligence, and on-line analytical processing (OLAP) of transactions stored in traditional, structured data sources, intelligence agencies are in need of unstructured textual analysis to find patterns in unstructured data.
One company, Maya Viz, combines various elements of collaboration, knowledge management, and business intelligence to bring data into a visual form that can be manipulated and shared. This technology was first deployed in military command situations. The company's component architecture aims to transform relational database information pieces into nuggets, visualize them, and then through peer-to-peer connections, allow people to share the information with anyone.
Tacit Knowledge System's software automatically discovers expertise and activity across large organizations, and connects people and information. This software taps into existing content sources such as document repositories and e-mail archives to discover individual expertise and activity, and then makes end-users aware of relevant colleagues and data.
The U.S. has approximately 170,000 people working together to prevent attacks on the United States. This is an incredibly complex process, using multiple information technology systems to record information about case research, various memos, etc. A system that could tap into all of those information repositories and determine who is working on what would be phenomenally valuable for making critical connections between different agencies, departments, and analysts.
However, current data mining, warehousing, and business intelligence technologies are expensive and become difficult to implement, particularly when multiple federal, state, and local agencies become involved, all of which use their own proprietary technologies and data formats. A key constraint of these current data technologies is that the storage format (for example, the table structure) must be determined in advance, which requires upfront analysis.
There is therefore a need for a method of storing/caching, searching, accessing, querying, and performing computations on data in the form of data packet(s) and/or data streams continuously transmitted for the life of the data on a telecommunication network, and/or a microprocessor, data bus, or electronic circuit. The resulting solution needs to be cost effective, avoid traditional I/O problems, overcome the limitations of traditional relational database structures, and avoid other problems.