Video teleconferencing generally involves a meeting between remote parties, whereby each remote party is able to see and hear at least one other remote party. This generally requires the rapid transmission of synchronized audio and video data. Typically, the data that is to be transmitted is first captured and encoded by a sending computer, and then communicated via an electronic data channel to a receiving computer, where the data is received, decoded, and rendered or otherwise manifested to the receiving conference party. Existing schemes for carrying out the aforementioned steps include a hardware device to capture the information to be sent and a separate software encoder to encode the information. Typically, the capture hardware itself is not network aware, and cannot responsively tailor its processing or output to the needs of the receiving computer. Additionally, the encoder is typically a software module running on the host computer, consuming host memory and processor time. At the remote receiving location, the decoder is typically "dumb," in the sense that it does not interpret or analyze the data stream, but merely performs a decoding function. Additionally, as with the typical encoder, the typical decoder is implemented as a software module running on the host.
This existing system of exchanging audio and video teleconferencing data gives rise to many inefficiencies. During a video teleconference, the computational resources of a computer are often fully utilized, and occasionally exhausted, while the resources of the sending capture device or the receiving video data processor are typically under-utilized. This inefficient allocation of computing resources sometimes leads to deterioration or loss of the video or audio data being transmitted. Additionally, the output of the capture device must sometimes be further processed prior to encoding, causing further inefficiencies. Finally, because existing capture devices are not able to directly access communication channels to their counterparts at a remote location during a teleconference in order to optimize the encoding/decoding process, data transmission is often of less than optimal quality and accuracy.
A system implementing the invention preferably conforms to the appropriate International Telecommunication Union (ITU) standards for multimedia communications over packet-based networks. In particular, the Telecommunication Standardization Sector of the ITU (ITU-T) has published a set of standards under the H.323 designation, which includes standards for data channels, monitoring channels, and control channels. According to the H.323 group of standards, audio and video data streams to be transmitted are encoded (compressed) and packetized in conformance with a real-time transport protocol (RTP) standard. The packets thus generated include both data and header information. The header information facilitates synchronization, loss detection, and status detection. Within the H.323 recommendation, video applications may use the H.261, H.262, or H.263 protocols for data transmissions, while audio applications may use the G.711, G.722, G.723.1, G.728, or G.729 protocols. Any class of network that utilizes TCP/IP will generally support H.323 compliant teleconferencing. Examples of such networks include the Internet and many LANs.
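By way of illustration only, the following minimal Python sketch shows how the header information mentioned above can support loss detection. It packs and parses the 12-byte fixed RTP header as laid out in the RTP specification (RFC 3550), without a CSRC list or header extension. The function names are hypothetical and are not part of the described system; payload type 31 is the static assignment for H.261 in the RTP audio/video profile. The loss counter assumes packets arrive in order and without duplicates.

```python
import struct

RTP_VERSION = 2  # version field value defined by the RTP specification


def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Build the 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet):
    """Parse the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def count_lost(sequences):
    """Count missing packets from gaps in 16-bit sequence numbers.

    Assumption: packets arrive in order and without duplicates.
    The mask handles wraparound from 65535 back to 0.
    """
    lost = 0
    for prev, cur in zip(sequences, sequences[1:]):
        lost += ((cur - prev) & 0xFFFF) - 1
    return lost


header = pack_rtp_header(payload_type=31, seq=65535, timestamp=90000,
                         ssrc=0x1234ABCD, marker=True)
fields = unpack_rtp_header(header)
```

The sequence number serves loss detection, while the timestamp allows the receiver to synchronize audio and video playback.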
An H.323 compliant terminal generally initiates and conducts a communications session via a gatekeeper. Accordingly, although a gatekeeper is not required, a typical teleconference may have a gatekeeper at each of the transmitting and receiving ends. The gatekeeper may perform address translation and bandwidth management, and may serve to map LAN aliases to IP addresses.
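The gatekeeper functions described above can be pictured with a short sketch. The class below is purely hypothetical: it does not implement the H.323 registration, admission, and status signaling, and the alias, address, and bandwidth figures are invented for illustration. It shows only the two ideas named in the text, namely mapping an alias to an IP address and admitting calls against a bandwidth budget.

```python
class Gatekeeper:
    """Illustrative model of alias translation and bandwidth management."""

    def __init__(self, total_bandwidth_kbps):
        self.available_kbps = total_bandwidth_kbps
        self.aliases = {}  # LAN alias -> IP address

    def register(self, alias, ip_address):
        """Record the IP address at which an alias can be reached."""
        self.aliases[alias] = ip_address

    def resolve(self, alias):
        """Translate an alias to an IP address, or None if unknown."""
        return self.aliases.get(alias)

    def admit(self, requested_kbps):
        """Grant a call only if enough bandwidth remains in the budget."""
        if requested_kbps > self.available_kbps:
            return False
        self.available_kbps -= requested_kbps
        return True


gatekeeper = Gatekeeper(total_bandwidth_kbps=512)
gatekeeper.register("conf-room-a", "10.0.0.5")  # hypothetical alias and address
address = gatekeeper.resolve("conf-room-a")
first_call = gatekeeper.admit(384)
second_call = gatekeeper.admit(384)  # exceeds the remaining 128 kbps
```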
Additionally, in order to allow for the exchange of status information between the transmitter and receiver, a real-time transport control protocol (RTCP) channel is opened.
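As one example of the kind of status information such a channel can carry, an RTCP receiver report includes a "fraction lost" field, an 8-bit fixed-point value giving the share of packets lost since the previous report. The sketch below follows the calculation given in the RTP specification (RFC 3550); the function names are hypothetical, and the surrounding report packet format is omitted.

```python
def fraction_lost(expected_interval, received_interval):
    """Return the 8-bit fixed-point loss fraction for one reporting interval.

    expected_interval: packets expected since the previous report
    received_interval: packets actually received in the same interval
    The result is clamped to 0 when nothing was expected or when
    duplicates make the apparent loss negative.
    """
    lost = expected_interval - received_interval
    if expected_interval == 0 or lost <= 0:
        return 0
    return (lost << 8) // expected_interval


def as_percentage(fraction):
    """Convert the 8-bit fixed-point fraction to a percentage."""
    return fraction / 256 * 100


report_value = fraction_lost(expected_interval=100, received_interval=75)
```

A sender that receives such a value could, for instance, lower its bit rate when the reported loss rises, although the policy for doing so is outside the scope of this sketch.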
In order to provide control functions, an H.245 control channel is established. This channel supports the exchange of capability information, the opening and closing of data channels, and other control and indication functions.
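The outcome of a capability exchange can be sketched as follows. This is a simplified illustration under stated assumptions, not the H.245 message syntax: each endpoint is assumed to advertise a plain list of codec names, and the hypothetical helper picks the first codec in the local preference order that the remote endpoint also supports.

```python
def select_common_codec(local_preference, remote_capabilities):
    """Return the most preferred local codec that the remote also supports.

    local_preference: codec names ordered from most to least preferred
    remote_capabilities: codec names advertised by the remote endpoint
    Returns None when the endpoints share no codec.
    """
    remote = set(remote_capabilities)
    for codec in local_preference:
        if codec in remote:
            return codec
    return None


video_codec = select_common_codec(["H.263", "H.261"], ["H.261"])
audio_codec = select_common_codec(["G.723.1", "G.711"], ["G.711", "G.723.1"])
```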
Although the preferred embodiment will be described in the context of the Microsoft brand Windows Driver Model (WDM), one of skill in the art will appreciate that the invention is not limited to this implementation. The Windows Driver Model is a common set of services which allow the creation of drivers having compatibility between the Microsoft brand Windows 98 operating system and the Microsoft brand Windows 2000 operating system. Each WDM class abstracts many of the common details involved in controlling a class of similar devices. WDM utilizes a layered approach, implementing these common tasks within a WDM "class driver." Driver vendors may then supply smaller "minidriver" code entities to interface the hardware of interest to the WDM class driver.
WDM provides, among other functionalities, a Stream class driver to support kernel-mode streaming, allowing greater efficiency and reduced latency over user mode streaming. The stream architecture utilizes an interconnected filter organization, and employs the mechanism of "pins" to communicate to and from the filters, and to pass data. Both filters and pins are Component Object Model (COM) objects. The filter is a COM object that performs a specific task, such as transforming data, while a pin is a COM object created by the filter to represent a point of connection for a unidirectional data stream on the filter. Input pins accept data into the filter while output pins provide data to other filters. Filters and pins preferably expose control interfaces that other pins, filters, or applications can use to configure the behavior of those filters and pins. The interface "IBaseFilter" is an example of a filter configuration interface. An embodiment of the invention will be described by reference to the filters and pins of the WDM model hereinafter. For further information regarding the Windows Driver Model, please see WDM Kernel Streaming Architecture, available on the Internet at http://www.microsoft.com/Devonly/tech/hardware/desinit/csal.htm, or Windows Driver Model (WDM) Technology, available on the Internet at http://www.microsoft.com/Devonly/tech/hardware/WDM/default.htm.
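The filter and pin organization described above can be modeled in a few lines. The following sketch is a conceptual illustration only: it uses plain Python classes rather than COM objects, it is not the DIRECTSHOW or kernel streaming API, and all class and variable names are hypothetical. It shows a unidirectional connection from an output pin to an input pin and data flowing through a chain of filters.

```python
class Pin:
    """A connection point for a unidirectional data stream on a filter."""

    def __init__(self, owner, direction):
        self.owner = owner
        self.direction = direction  # "in" or "out"
        self.peer = None

    def connect(self, other):
        """Connect this output pin to another filter's input pin."""
        if self.direction != "out" or other.direction != "in":
            raise ValueError("connect an output pin to an input pin")
        self.peer = other
        other.peer = self


class Filter:
    """A processing stage with one input pin and one output pin."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.input = Pin(self, "in")
        self.output = Pin(self, "out")
        self.produced = []  # results kept so the flow can be inspected

    def receive(self, sample):
        """Transform a sample and pass it downstream if connected."""
        result = self.transform(sample)
        self.produced.append(result)
        if self.output.peer is not None:
            self.output.peer.owner.receive(result)


# Hypothetical three-stage graph: capture -> encoder -> sink.
capture = Filter("capture", lambda sample: sample)
encoder = Filter("encoder", lambda sample: sample.upper())
sink = Filter("sink", lambda sample: sample)

capture.output.connect(encoder.input)
encoder.output.connect(sink.input)
capture.receive("frame")
```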
As illustrated in FIG. 6, to control and access the kernel mode streaming data of the WDM architecture, a module such as Microsoft brand Telephony Application Programming Interface 3.0 (TAPI 3.0) running in user mode may be utilized by an application 610. The TAPI 3.0 COM API is implemented as a suite of COM objects, chiefly Call Control 600, Media Stream Control 602, and Directory Control 604. A Telephony Service Provider (TSP) 606 is responsible for resolving the protocol-independent call model of TAPI into protocol-specific call-control mechanisms, while a Media Stream Provider (MSP) 608 implements Microsoft brand DIRECTSHOW interfaces for a particular TSP. Microsoft brand DIRECTSHOW, part of the WDM, is an architecture which facilitates the control of multimedia data streams via modular components. TAPI 3.0 employs a kernel streaming proxy module such as KSProxy, a Microsoft DIRECTSHOW filter, to control and communicate with kernel mode filters. KSProxy provides a generic method of representing kernel mode streaming filters as DIRECTSHOW filters. Running in user mode, KSProxy accepts existing control interfaces and translates them into input/output control calls to the WDM streaming drivers. TAPI 3.0 may automatically create the WDM filter graph by invoking the appropriate filters and connecting the appropriate pins. For more information regarding TAPI 3.0, see IP Telephony With TAPI 3.0, available at http://msdn.microsoft.com/library/backgrnd/html/msdn_tapi_3.0.htm.
FIG. 2a is a high-level schematic of a prior art teleconferencing architecture. According to the illustrated architecture, a capture device 200 receives input from a video or audio data source such as a microphone 202 or a video camera 204, and further processes the data, yielding PCM or RGB data respectively. Software video encoding 206 or audio encoding 208 modules receive the appropriate data and encode it into a form in which it can be transmitted by a network sink 210 over a data channel 222 to a remote endpoint.
At the remote endpoint, a software video decoder 212 or audio decoder 214 module receives the appropriate data from the network source 216. After decoding the encoded data, the software video decoder 212 or audio decoder 214 module outputs video or audio data, respectively, usable by a video display device 218 or an audio speaker device 220. Typically, in addition to the data channel 222, there also exists a control channel 224 for exchanging capability information at the outset of the call, and a status channel 226 for passing status messages during the call. These channels link the encoder 206, 208 and decoder 212, 214, but typically do not pass to the capture device 200. One shortcoming of this architecture is that the capture function is not dynamically adaptable; the capture module 200 is not network aware, and the decoder modules 212, 214 are not capable of analyzing incoming data and responsively communicating with the capture module 200. A further drawback is that the computationally expensive encoding and decoding tasks consume critical resources of the host computers.
A method is needed whereby the computational resources of a hardware capture device and a hardware decoding device are utilized to perform certain encoding and decoding tasks currently allocated to the host computers in order to more efficiently process teleconference data. It is further desirable that the two functional units that perform capture, encoding, and decoding at remote sites during a teleconference be communicably interconnected in order to facilitate more efficient and accurate data flow responsive to changing system capabilities and network conditions.