A client program or user utilizing a large distributed computing system typically issues queries, search requests, data selection requests, and so forth, and collects results from a large number of servers in the distributed computing system. The large distributed computing system may be any environment that comprises data that is horizontally partitioned across many servers. A continuing effort has been made to make the process of collecting the information from the servers as efficient as possible with regard to both time and resources. The need for efficient collection of information from large distributed computing systems has become more critical as more systems adopt a web services approach to interfacing with clients.
One conventional approach to issuing queries and collecting results is a sequential processing approach 600, illustrated by the diagram of FIG. 6. A client 605 sequentially issues a query to and receives a result from server 1, 610, server 2, 615, server 3, 620, and server 4, 625 (collectively referenced as servers 630). For example, client 605 issues a query 635 to server 1, 610, and receives a result 640. Client 605 then issues a query 645 to server 2, 615, and receives a result 650, etc. This sequential process is repeated until all the queries have been issued and all the results returned. Although this technology has proven to be useful, it would be desirable to present additional improvements.
The sequential processing approach 600 has the advantage of requiring a single thread to process the results. Utilizing a single thread is efficient with respect to resources, but not time. The sequential processing approach 600 is relatively slow; a delay by one of the servers 630 delays the overall response to the query. Each of the servers 630 may take a reasonable amount of time such as, for example, 10 ms to respond to the query. However, for a large number of servers 630, the overall response time to the query becomes unacceptably slow. The time required to respond to the query becomes the sum of the time required for each of the remote procedure calls.
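The cost of the sequential processing approach 600 can be illustrated with a minimal sketch. The names make_server and sequential_collect are hypothetical and do not appear in the figures; each server is simulated by a local function that reports an assumed latency of 10 ms instead of performing a real remote procedure call.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a remote server: returns (result, latency in ms).
Server = Callable[[str], Tuple[str, float]]


def make_server(name: str, latency_ms: float) -> Server:
    """Build a simulated server that answers a query with a fixed latency."""
    def handle(query: str) -> Tuple[str, float]:
        return f"{name}:{query}", latency_ms
    return handle


def sequential_collect(servers: List[Server], query: str) -> Tuple[List[str], float]:
    """Query each server in turn on a single thread.

    The simulated elapsed time is the sum of the individual latencies,
    because no query is issued until the previous result has returned.
    """
    results: List[str] = []
    elapsed_ms = 0.0
    for server in servers:
        result, latency_ms = server(query)
        results.append(result)
        elapsed_ms += latency_ms
    return results, elapsed_ms


# Four simulated servers, each assumed to respond in 10 ms.
servers = [make_server(f"server{i}", 10.0) for i in range(1, 5)]
results, elapsed_ms = sequential_collect(servers, "q")
print(elapsed_ms)  # 40.0
```

Under these assumptions, four servers cost 40 ms, and the total grows linearly with the number of servers.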
Another conventional approach for issuing queries and collecting results is a parallel processing approach 700, illustrated by FIG. 7. A client 705 comprises a thread 1, 710, a thread 2, 715, a thread 3, 720, and a thread 4, 725 (collectively referenced as threads 730). Client 705 issues in parallel a query to and receives results from server 1, 735, server 2, 740, server 3, 745, and server 4, 750 (collectively referenced as servers 755). The parallel processing approach 700 utilizes one of the threads 730 for each of the servers 755 to manage input/output communication with each of the servers 755. For example, thread 1, 710, is dedicated to input/output communication with server 1, 735. Thread 2, 715, is dedicated to input/output communication with server 2, 740, etc. Although this technology has proven to be useful, it would be desirable to present additional improvements.
The parallel processing approach 700 has the advantage of quickly processing the results. Utilizing one of the threads 730 for each of the servers 755 is efficient with respect to time, but not resources. Each of the threads 730 consumes a substantial amount of computing resources. Further, network packets are typically 1.5 Kbytes. If the result of the query is much larger than 1.5 Kbytes, each of the threads 730 becomes active each time data is ready to be read, resulting in a large number of context switches. As the number of servers 755 increases, the parallel processing approach 700 becomes even less efficient.
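The thread-per-server arrangement of the parallel processing approach 700 can be sketched as follows. This is an illustration under stated assumptions, not an implementation taken from FIG. 7: the names make_server and parallel_collect are hypothetical, and each server is simulated by a local function that sleeps for an assumed 10 ms.

```python
import threading
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a remote server.
Server = Callable[[str], str]


def make_server(name: str, latency_s: float) -> Server:
    """Build a simulated server that blocks for latency_s, then answers."""
    def handle(query: str) -> str:
        time.sleep(latency_s)  # stands in for network and server delay
        return f"{name}:{query}"
    return handle


def parallel_collect(servers: List[Server], query: str) -> Tuple[List[Optional[str]], int]:
    """Dedicate one thread to each server, as in the parallel approach.

    Returns the results in server order and the number of threads used,
    which always equals the number of servers.
    """
    results: List[Optional[str]] = [None] * len(servers)

    def worker(index: int, server: Server) -> None:
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = server(query)

    threads = [
        threading.Thread(target=worker, args=(i, s))
        for i, s in enumerate(servers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, len(threads)


servers = [make_server(f"server{i}", 0.01) for i in range(1, 5)]
results, threads_used = parallel_collect(servers, "q")
```

The waits overlap, so the elapsed time is close to that of the slowest server, but the thread count grows one-for-one with the number of servers.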
With both the sequential processing approach 600 and the parallel processing approach 700, the client 605 and client 705 are required to wait until sufficient information is accumulated to provide results. Several useful techniques have been developed for managing the collection of results provided in structured formats from a large distributed computing system.
However, the use of semi-structured formats such as XML is proliferating on the Internet and on other networks that are based on a web service model, requiring new approaches for managing bulk XML querying and semi-structured results streams. Structured data informs the client in advance how much data to expect so that the client can know when all the information has arrived and then process the information. Semi-structured data simply arrives at the client as a byte stream. The client then has to interpret the byte stream as it arrives by parsing the byte stream. Consequently, it is difficult to use one thread to process parallel streams of semi-structured data.
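The difference between the two kinds of results can be illustrated with a short sketch. The framing and the element names are assumptions chosen for illustration: the structured result is assumed to carry a 4-byte big-endian length header, and the semi-structured result is assumed to be an XML document containing hypothetical item elements. The functions read_structured and parse_semi_structured are likewise hypothetical.

```python
import struct
import xml.etree.ElementTree as ET
from typing import Iterable, List, Optional


def read_structured(message: bytes) -> bytes:
    """Read a length-prefixed result.

    The header states in advance how many bytes to expect, so the client
    knows exactly when the whole result has arrived.
    """
    (length,) = struct.unpack(">I", message[:4])
    return message[4:4 + length]


def parse_semi_structured(chunks: Iterable[bytes]) -> List[Optional[str]]:
    """Parse an XML byte stream incrementally as chunks arrive.

    No length is known up front. The parser must keep state between
    chunks, because an element may be split across chunk boundaries.
    """
    parser = ET.XMLPullParser(events=("end",))
    items: List[Optional[str]] = []

    def drain() -> None:
        # Collect every element that has been completed so far.
        for _event, elem in parser.read_events():
            if elem.tag == "item":
                items.append(elem.text)

    for chunk in chunks:
        parser.feed(chunk)
        drain()
    parser.close()
    drain()  # pick up any events released only when the stream ends
    return items


payload = b"hello"
framed = struct.pack(">I", len(payload)) + payload

# The same XML document, split at arbitrary points as a network might.
chunks = [b"<results><item>a</it", b"em><item>b</item></res", b"ults>"]
items = parse_semi_structured(chunks)
```

Because each stream needs its own parser state between chunks, a single thread handling many such streams would have to hold one partially advanced parser per server.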
What is therefore needed is a system, a computer program product, and an associated method for bulk processing of semi-structured results streams from many different resources. The need for such a solution has heretofore remained unsatisfied.