4.22.1.1. Protocol elements

The devices involved in SIP communication can have several different roles, and a single device can fill more than one role at the same time. The most important roles are briefly summarized below:

  • User-agent: The phone itself. In the traditional model, this would be called the client.

  • Registrar: The registration service. The address where a particular user-agent can be reached is registered here. It acts as a sort of name service for the protocol.

  • Proxy: This device forwards the requests of the user-agents. It has nothing to do with, and is not to be confused with, a proxy firewall or a web cache proxy.

  • Presence server: Similar to the registrar, this device stores information about the availability of the user-agents. Users can monitor whether the VoIP devices of their contacts (friends, business partners, etc.) are active (i.e. on-line) via the presence server.

  • Back2back user-agent: This is a special proxy implementing the functions of two user-agents. On one side of a connection it acts as the caller, on the other side as the called party.

SIP handles only the signaling part of a communication session, and relies on other protocols to perform the actual data transfer. SIP communication therefore takes place in two kinds of channels: the signaling channel, and the data channel used to transmit the voice and/or video data. The data channel is opened dynamically according to parameters negotiated in the signaling channel. The negotiation uses a separate, embedded protocol called Session Description Protocol (SDP), which describes the channel and the type of media used in a session (for example, the IP addresses and ports, codecs, and so on). It is essential for the firewall to understand and inspect SDP, since it contains all the information required to allow the VoIP traffic to pass the firewall. The SDP data also has to be modified if network address translation is performed. To transfer the actual voice, video, or other data, SIP uses the Real-time Transport Protocol (RTP). RTP defines a standardized packet format for delivering audio and video over the Internet, and is frequently used in audio/video streaming and conferencing solutions.
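
To illustrate what inspecting SDP involves, the following is a minimal Python sketch, not the method used by any particular firewall. It reads the connection address (c= line) and the media ports (m= lines) from an SDP body, and rewrites the connection address the way a simplified address translation step might. The sample body, the addresses, and the names parse_sdp and rewrite_sdp_address are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical SDP body; the addresses and ports are illustrative only.
SAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 192.168.1.10",
    "s=-",
    "c=IN IP4 192.168.1.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0 8",
    "a=rtpmap:0 PCMU/8000",
    "a=rtpmap:8 PCMA/8000",
]) + "\r\n"


@dataclass
class MediaStream:
    kind: str                 # "audio", "video", ...
    address: str              # connection address that applies to the stream
    port: int                 # port announced for the media
    payload_types: List[str]  # format list from the m= line


def parse_sdp(body: str) -> List[MediaStream]:
    """Collect the address and port of every media stream in an SDP body."""
    session_addr: Optional[str] = None
    streams: List[MediaStream] = []
    for line in body.splitlines():
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            addr = line[2:].split()[2]
            if streams:
                streams[-1].address = addr  # media-level c= overrides session-level
            else:
                session_addr = addr
        elif line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            media, port, _proto, *fmts = line[2:].split()
            # The port field may be written as <port>/<count>; keep the base port.
            streams.append(
                MediaStream(media, session_addr or "", int(port.split("/")[0]), fmts)
            )
    return streams


def rewrite_sdp_address(body: str, old: str, new: str) -> str:
    """Replace the connection address in c= lines (simplified NAT fix-up)."""
    out = []
    for line in body.splitlines():
        if line.startswith("c="):
            parts = line.split(" ")
            if parts[-1] == old:
                parts[-1] = new
            line = " ".join(parts)
        out.append(line)
    return "\r\n".join(out) + "\r\n"
```

Calling parse_sdp(SAMPLE_SDP) yields one audio stream at 192.168.1.10, port 49170. Note that rewriting the address can change the length of the body, so a real implementation would also have to update the Content-Length header of the enclosing SIP message; the sketch leaves that out.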

From the signaling point of view, it is important to note that there is no client/server hierarchy between the user-agents, only a caller and a called party. The signaling traffic is usually not transmitted directly between the user-agents; generally, proxies and back2back user-agents are also involved. Consequently, signaling messages (for example, a request and the corresponding response) can take very different routes between two user-agents, which greatly complicates securing the protocol. The RTP session, on the other hand, is built directly between the user-agents without the involvement of proxies, though back2back user-agents may still take part in transmitting the audio/video data. Therefore, special care must be taken when creating the access control rules for the SIP signaling and data traffic.
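
Because the media flows directly between the endpoints negotiated in the signaling channel, the access rules for the data traffic can be derived from those endpoints. The Python sketch below shows one way to express this, under stated assumptions: the media is carried over UDP, RTCP uses the port directly above the RTP port (a common convention, not a guarantee), and the names Endpoint, Flow, and media_pinholes are hypothetical and do not belong to any product's API.

```python
from typing import NamedTuple, Set


class Endpoint(NamedTuple):
    address: str  # media address announced by one party
    port: int     # media (RTP) port announced by that party


class Flow(NamedTuple):
    src: str
    dst: str
    dst_port: int
    proto: str


def media_pinholes(caller: Endpoint, callee: Endpoint,
                   with_rtcp: bool = True) -> Set[Flow]:
    """Return the flows to permit for one negotiated media stream."""
    flows: Set[Flow] = set()
    # Media travels in both directions, so each party is a destination once.
    for src, dst in ((caller, callee), (callee, caller)):
        flows.add(Flow(src.address, dst.address, dst.port, "udp"))
        if with_rtcp:
            # Assumption: RTCP runs on the next higher port.
            flows.add(Flow(src.address, dst.address, dst.port + 1, "udp"))
    return flows
```

For a caller at 192.0.2.10 port 49170 and a called party at 198.51.100.20 port 30000, the function returns four flows: RTP and RTCP in each direction. When a back2back user-agent relays the media, the same derivation would have to be applied separately on each side of that device.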