With the emergence of 3G mobile telephony, new packet-based communication technologies have been developed for communicating multimedia content. For example, GPRS (General Packet Radio Service) and WCDMA (Wideband Code Division Multiple Access) technologies support wireless multimedia telephony services involving packet-switched communication of data representing images, text, documents, animations, audio files, video files, etc., in addition to traditional circuit-switched voice calls. The term “multimedia content” will be used in this description to represent any predefined data communicated by means of packet-switched transport.
Mobile networks will be designed to handle multimedia sessions that are divided into a circuit-switched (CS) part for voice transport and a packet-switched (PS) part based on IP technology for the transport of other data (typically multimedia content). In this way, the high performance associated with the traditional full duplex channels is obtained for voice, whereas any other data involved in multimedia services can be adequately supported by packet-switched transport, since it is normally not equally delay-sensitive. This arrangement can also reduce the costs for network operators by utilising existing resources for circuit-switched transmission, as e.g. in GPRS networks.
This solution is schematically illustrated in FIG. 1, where two exemplary mobile terminals A and B are engaged in a multimedia session involving both voice and multimedia content. Terminal A is connected to an access network 100A and terminal B is connected to another access network 100B, by means of conventional radio interfaces. Typically, each access network 100A, 100B has separate architectures and logic systems for circuit-switched transport and packet-switched transport, respectively, as indicated in the figure by a dashed line dividing the networks into a CS domain and a PS domain.
Voice is thus communicated in a separate circuit-switched call session 102, whereas multimedia content is communicated in a separate packet-switched multimedia session 104. In each session 102 and 104, various other intermediate networks and links may of course be involved, although not shown here for simplicity. The sessions 102 and 104 are basically independent of each other in terms of call management and transport channels, and each session may be started and terminated regardless of the other one. Typically, a CS-based voice call is established first, and then at some point during the call, multimedia may be introduced to the conversation by establishing a PS-based session. Simultaneous CS and PS sessions will become possible for access based on, e.g., WCDMA or GSM with DTM (Dual Transfer Mode) capability.
Recently, a network architecture called “IP Multimedia Subsystem” (IMS) has been developed by the 3rd Generation Partnership Project (3GPP) as an open standard, to provide multimedia services in the packet domain. IMS is a general platform for enabling services based on IP transport, more or less independent of the access technology used, and is basically not restricted to any limited set of specific services.
A protocol for session setup called “SIP” (Session Initiation Protocol, according to the standard IETF RFC 3261 et al.) has been defined. SIP is an application-layer control (signalling) protocol for creating, modifying and terminating sessions over packet-switched networks. SIP is generally used by IMS service networks for establishing multimedia sessions, such as session 104. In the case of FIG. 1, an IMS network may be integrated into the PS part of each network 100A, 100B.
Since many different types of terminals are now available on the consumer market, two terminals about to communicate multimedia may have different capabilities, and each terminal initially has no knowledge of the capabilities of the other. In order to establish a multimedia session, session parameters must therefore first be selected and determined in a session setup procedure, in which the terminals exchange information regarding their specific capabilities and preferences. In SIP, a method called “INVITE” is defined to initiate a session during a call setup when the terminals exchange their capabilities.
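The selection of session parameters from exchanged capabilities can be sketched as follows. This is a minimal illustration only, not the SIP or SDP procedure itself: the function name `negotiate`, the dictionary layout and the listed media formats are assumptions made for the example. It keeps, for each media type offered by the calling terminal, only those formats that the called terminal also supports, in the calling terminal's order of preference.

```python
def negotiate(offer, supported):
    """Return the session parameters both terminals can use.

    offer:     media type -> formats offered by the calling terminal,
               in order of preference (hypothetical layout).
    supported: media type -> formats supported by the called terminal.
    """
    agreed = {}
    for media, formats in offer.items():
        # Keep the caller's preference order, drop unsupported formats.
        common = [f for f in formats if f in supported.get(media, [])]
        if common:
            agreed[media] = common
    return agreed


# Illustrative capabilities for terminals A and B (assumed values).
offer_a = {"audio": ["AMR", "PCMU"], "video": ["H263", "H264"], "image": ["JPEG"]}
caps_b = {"audio": ["PCMU", "AMR"], "video": ["H264"]}

agreed = negotiate(offer_a, caps_b)
print(agreed)
```

In this sketch, a media type for which the terminals share no format is simply left out of the session, which mirrors the point that the terminals can only use what both of them support.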
During a traditional circuit-switched voice call-setup between a calling party A and a called party B, a simple ring-back tone is emitted at the calling terminal to indicate that a ringing signal has been activated at the called terminal. Although B can pre-select any type of ringing signal at the called terminal, such as a piece of music, vibration or any recorded sound, the ring-back tone given at the calling terminal normally consists of a single repeated tone which can be somewhat tiresome to hear if the called person's answer is delayed.
Today, a popular service for circuit-switched telephone calls called “music ring back tone” is often used to entertain the calling party while he/she is waiting for an answer from the called party. This service is often used by telephone exchanges at authorities and enterprises where the answer can be greatly delayed, e.g. when placed in a telephone queue. A pre-recorded piece of music or information is then played for the calling party until the called party answers.
FIG. 2a illustrates a signalling diagram for a conventional circuit-switched call-setup between a calling terminal A connected to an access network 1, and a called terminal B connected to another access network 2. The following steps are then basically executed:
200: As a user of terminal A enters the telephone number of terminal B, terminal A sends a call-setup message including the called B number to Network 1.
202: Network 1 identifies Network 2 based on the received B-number and sends a call-setup message including the A-number to Network 2.
204: Network 2 sends a call-setup message to terminal B. If a calling number presentation service is applied, this call-setup message also includes the A-number. Terminal B then starts to ring or vibrate to alert its user. Also, a circuit-switched channel between terminals A and B is now being reserved for the call.
206: Terminal B responds by sending an alerting message to Network 2, indicating that a ringing signal has been activated at the called terminal B.
208: Network 2 then sends an alerting message to Network 1.
210: At the same time, while waiting for the user to answer, Network 2 also generates a ring-back “sound” over the reserved channel, i.e. “in-band”. The ring-back sound is typically a simple repeated tone, but may also be any pre-recorded piece of audio, such as music or a spoken message, that has been preselected by the subscriber of terminal B.
212: The alerting message now reaches the calling terminal A indicating that terminal B emits the ringing signal. In response thereto, terminal A connects to the reserved channel to listen to the ring-back sound “in-band”.
214: After a while, the user of terminal B answers the call.
216: Terminal B sends a connect message to Network 2, indicating that the user of terminal B has answered the call, and Network 2 therefore stops generating the ring-back sound.
218: The connect message is sent from Network 2 to Network 1.
220: The connect message is sent from Network 1 to the calling terminal A.
222: The call-setup is now completed and the actual call session may begin.
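The order of the messages in steps 200 to 222 can be traced with a short simulation. The sketch below is illustrative only: the class `CallSetupTrace`, the function `run_call_setup` and the message labels such as `SETUP` and `ALERTING` are hypothetical names chosen for the example and do not correspond to any particular signalling protocol. It records who sends what to whom, and shows that the in-band ring-back sound starts when Network 2 has been alerted and stops when the connect message arrives.

```python
from dataclasses import dataclass, field


@dataclass
class CallSetupTrace:
    events: list = field(default_factory=list)
    ring_back_active: bool = False

    def log(self, step, sender, receiver, message):
        self.events.append((step, sender, receiver, message))


def run_call_setup(a_number, b_number, present_a_number=True):
    """Replay steps 200-222 of the conventional CS call-setup."""
    t = CallSetupTrace()
    t.log(200, "A", "Network 1", f"SETUP(B={b_number})")
    t.log(202, "Network 1", "Network 2", f"SETUP(A={a_number})")
    # The A-number is passed on only with calling number presentation.
    setup_to_b = f"SETUP(A={a_number})" if present_a_number else "SETUP"
    t.log(204, "Network 2", "B", setup_to_b)
    t.log(206, "B", "Network 2", "ALERTING")
    t.log(208, "Network 2", "Network 1", "ALERTING")
    # Network 2 generates the ring-back sound on the reserved channel.
    t.ring_back_active = True
    t.log(210, "Network 2", "A", "RING-BACK (in-band)")
    t.log(212, "Network 1", "A", "ALERTING")
    # Step 214: the user of terminal B answers (no message is sent).
    t.log(216, "B", "Network 2", "CONNECT")
    t.ring_back_active = False  # Network 2 stops the ring-back sound
    t.log(218, "Network 2", "Network 1", "CONNECT")
    t.log(220, "Network 1", "A", "CONNECT")
    # Step 222: call-setup completed, the call session may begin.
    return t


trace = run_call_setup("A-number", "B-number")
for step, sender, receiver, message in trace.events:
    print(f"{step}: {sender} -> {receiver}: {message}")
```

Note that the ring-back sound in this trace ends as a side effect of the connect message in step 216, independently of how much of it has been played.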
According to the above-described conventional call-setup procedure, a ring-back sound or audio piece is always transferred “in-band” to the calling terminal A by means of a connection in the CS domain, i.e. the CS channel that has been reserved for the call being set up. This is basically illustrated in FIG. 2b where a dashed block 222 represents a conventional call-setup procedure in the CS domain between terminal A and terminal B, being connected to CS-based networks 1 and 2, respectively. The dashed arrow 224 within block 222 represents a ring-back audio piece being transmitted to terminal A during the call-setup 222 by means of a CS channel reserved for the call being set up. Hence, the ring-back mechanism is wholly integrated as a part of the CS-based call-setup procedure 222.
However, conventional call-setup procedures impose a limitation, since the CS nature of the connection only enables users to present pre-recorded pieces of audio, such as music or voice, to waiting callers. In the present environment of mobile multimedia communication, it would be desirable to extend the current limited range of audio-based ring-back presentations by introducing other types of media as well, in particular visual media.
Moreover, a calling user may want to finish listening to a presented piece of audio, especially if some important information is presented. In conventional call-setup procedures, any ring-back audio sequence being played is automatically interrupted at the moment the called party answers, i.e. picks up the phone, irrespective of how much of the sequence has been played. Thus, it would also be desirable to enable a caller to continue listening to a ring-back audio sequence even if the called party has answered. On the other hand, a caller may find a played audio piece disturbing and may want to stop listening to it even before the called party answers, which is only possible today by taking the phone away from the ear. When doing so, the caller will naturally not be able to hear whether the called party answers either, making this an impractical solution.