SIP Guide

A tutorial on the IETF's Session Initiation Protocol: Why is it important? How does it work? What remains to be done?

July 25, 2005


It’s pretty widely accepted now that telecom value is moving away from centralized network/service combinations (such as the traditional PSTN) towards distributed multimedia applications running over a common (converged) IP-based hardware/software infrastructure that includes both fixed and mobile elements and broadband transmission.

A key technology for this transformation is the IETF's Session Initiation Protocol (SIP). SIP is already making waves in the burgeoning VOIP world, where it is beginning to displace the earlier (and, in the views of many, overcomplex) ITU H.323 standard for call setup and control. But SIP goes much deeper than this, because it is the basis on which a lot of future sophisticated application and service software will rest. In particular, it will support the IP Multimedia Subsystem (IMS), the likely future platform for multimedia services in both fixed and mobile networks.

So SIP is beginning to loom large, and this tutorial aims to provide a timely, quick, and easy guide to some questions that are commonly asked about SIP. Answers to a number of basic questions are given. But readers may ask further questions on the message board attached to this article. If you want to send a private message, please email us at [email protected], and include "SIP Guide" in the subject field. Frequently posed questions will be answered in updates to this report.

For a basic understanding of SIP, here are the key starting points:

  • SIP will have a major impact on the telecom industry – including telcos, mobile operators, service providers, vendors, and others – because it allows users, devices, and software applications to set up and control end-to-end multimedia sessions independently of the media types involved or the underlying transport protocols.

  • The fundamental standard is well established, but there is a huge ongoing effort to build on and around it to address a large number of issues and requirements. SIP represents the biggest standards effort in the IETF’s history.

  • SIP implementations are increasing, and the industry is working hard to address interoperability, which remains a major issue. But there is still a lot of work to be done.


— Tim Hills is a freelance telecommunications writer and journalist. He's a regular author of Light Reading reports.

SIP is Session Initiation Protocol. And, for once with a telecom acronym, it does exactly what it says on the tin: It’s a protocol (system of formal rules) for initiating (setting up and starting) communications sessions. Actually, it does a lot more – like turning them off when they are no longer needed, to take a trivial example. And, since SIP is consuming a huge amount of development effort in the telecom industry (and naturally producing oceans of hype as well), it is not difficult to guess that it has something to do with the Internet Protocol (IP).

What is less obvious is that it is quite hard to find something in telecom that it hasn’t anything to do with. Voice, broadband, mobile, video, data, multimedia – you name it and SIP (along with its various extensions) is going to be in there in spades as next-generation networks (NGNs) really begin to arrive. For example, SIP is fundamental to the IP Multimedia Subsystem (IMS), which is creating the software architecture for NGN applications and services (see the Light Reading IMS Guide).

SIP is one of the fundamental parts of the new software face of telecom. It provides the basic signaling protocol needed to initiate, manage, and terminate multimedia (voice, data, video, and audio, for example) communications sessions. To state the blindingly obvious, services and applications cannot be accessed or used over telecom networks unless appropriate sessions between communicating entities – which can be people or software processes – are set up. And these sessions have to be managed in various ways and eventually ended.

SIP does not do this job unaided, nor does it operate in a vacuum. Session Description Protocol (SDP), for example, is needed to handle multimedia sessions, defined as those involving streams of several different types of media. And SIP is surrounded by extensions and further protocols and standards to support essential communications functions (such as call control) and to enable interworking with non-SIP networks – e.g., the traditional public switched telephone network (PSTN).

In industry dollar terms, SIP matters because incumbent telcos and other service providers worldwide are throwing in their respective PSTN towels and beginning a long and massive migration to converged IP-based networks and a colorful multimedia future. Although there are other protocols that can be used for session control in such networks (such as the established H.323), no one seriously believes that the standard technology will be other than SIP (or a future SIP derivative) – if only for the reason that most of the development effort is going into SIP. Without session control, these networks will not work. And without a sophisticated form of session control like SIP they will not be able to offer novel and complex multimedia services. End of story.

Well. Almost, but not quite. There is the little matter of the spreading 3G mobile networks, which, being packet based and multimedia, have also turned to SIP. In short, in both the fixed and mobile worlds, the networks are moving to SIP.

And there is the knock-on effect in the enterprise world, where the deployment of converged IP voice and data networks is increasing rapidly. Many IP PBXs now support SIP, and the ultimate convergence of enterprise networks around a common external and internal use of SIP seems inevitable.

SIP is important in another, purer sense. It is one of the small band of IP application protocols that allow IP networks to do something fundamentally different and useful. Others are HTTP (which enabled the World Wide Web) and SMTP (which enabled email). A result is that SIP APIs are popping up all over in network-aware programming languages, such as Java, so that developers can build SIP components for service providers and enterprises.

SIP is the largest standard in the IETF's history of standards work – and it is continuing to grow.

The basic SIP standard is the IETF’s RFC 3261 of June 2002, which makes obsolete the earlier RFC 2543. However, SIP is a tightly focused protocol that provides only one component of a complete multimedia communications system, so it is expected to be used with other IETF protocols. These include Real-time Transport Protocol (RTP – RFC 3550, which obsoletes RFC 1889), Real-Time Streaming Protocol (RTSP – RFC 2326), the Megaco media gateway control protocol (RFC 3015, published by the ITU as H.248), and Session Description Protocol (SDP – RFC 2327).

There is also SIP-T (SIP for Telephony). This supports the interconnection of Media Gateway Controllers to emulate a network of Class-4 PSTN trunk switches. Essentially, it is SIP with a specific payload for carrying PSTN ISUP (ISDN User Part) messages, thus allowing call setup between SS7-based PSTNs and SIP-based IP telephony networks.

The IETF has produced several standards to reinforce RFC 3261, in particular RFCs 3262, 3263, 3264, and 3265. These address technical things like message reliability, location of proxy servers, and event notification. RFC 3263 covers using DNS to locate SIP servers, which helps make large deployments more stable.
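As a rough illustration of the DNS-based server location covered by RFC 3263, the sketch below builds the SRV query names for a SIP domain and puts a set of candidate records in order. The record values, the `SrvRecord` type, and both helper names are invented for the example. The ordering is also a simplification: real resolvers choose among equal-priority records at random in proportion to their weights rather than simply sorting them.

```python
from dataclasses import dataclass

# SRV service labels used for SIP over each transport (RFC 3263).
SRV_LABELS = {
    "udp": "_sip._udp",
    "tcp": "_sip._tcp",
    "tls": "_sips._tcp",
}


@dataclass(frozen=True)
class SrvRecord:
    """Hypothetical container for one SRV answer."""
    priority: int
    weight: int
    port: int
    target: str


def srv_query_name(domain: str, transport: str) -> str:
    """Return the DNS name to query for SIP servers in a domain."""
    return f"{SRV_LABELS[transport]}.{domain}"


def order_candidates(records):
    """Lowest priority value first; within a priority, heaviest weight first.

    Simplification: a real client selects randomly by weight.
    """
    return sorted(records, key=lambda r: (r.priority, -r.weight))


if __name__ == "__main__":
    answers = [
        SrvRecord(20, 0, 5060, "backup.example.com"),
        SrvRecord(10, 40, 5060, "sip2.example.com"),
        SrvRecord(10, 60, 5060, "sip1.example.com"),
    ]
    print(srv_query_name("example.com", "udp"))
    for record in order_candidates(answers):
        print(record.target, record.port)
```

Spreading the answer across several prioritized servers is what lets a large deployment fail over or share load without reconfiguring every client.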

SIP and SIP-related standards continue to stream from the various standards bodies, principally the IETF, but also from the 3rd Generation Partnership Project (3GPP), 3rd Generation Partnership Project 2 (3GPP2), and Mobile Wireless Internet Forum (MWIF) for third-generation wireless network requirements. Five new IETF RFCs appeared in January 2005. Bodies such as the Session Initiation Protocol (SIP) Forum are heavily involved in promoting SIP interoperability profiles and trials.

The IETF has three main Working Groups developing SIP: the SIP Working Group, the SIMPLE (SIP for Instant Messaging and Presence Leveraging Extensions) Working Group, and the SIPPING (Session Initiation Proposal Investigation) Working Group. But there are extensive links with other Working Groups: the MMUSIC (Multiparty Multimedia Session Control) Working Group, the IPTEL (IP Telephony) Working Group, and the PINT (PSTN-Internet Interworking) Working Group, for example.

Roughly speaking, the SIP Working Group looks after SIP basics; SIMPLE is concerned with SIP for instant messaging and presence; and SIPPING handles application-specific matters and developments for telephony and multimedia. IPTEL is the IP Telephony Working Group, which developed the Call Processing Language (CPL), which is relevant to SIP. PINT is the PSTN and Internet Interworking Working Group, and MMUSIC is the Multiparty Multimedia Session Control Working Group, which developed SDP and is now working on a successor.

The Working Group work programs are extensive: the SIP Working Group alone has a list of 15 deliverables and has 32 RFCs in various stages of production or acceptance; SIPPING has 15 RFCs to its name, and SIMPLE has four RFCs now as standards.

In common with many other modern telecom technologies (such as IMS, and even earlier ones like the Advanced Intelligent Network that introduced more sophisticated call control into the PSTN), SIP is an abstraction of a real network and its use. So SIP specifies a series of logical entities and functions to represent the participants and processes involved in setting up communications sessions within a network.

In terms of the ISO’s Open Systems Interconnection classification of communications protocols, SIP resides in Layer 7 (the Application Layer). It is independent of the lower-layer protocols, so it does not depend on the type of transport used or even on the type of session being established. SIP also exemplifies the modern trend to develop protocols in the higher OSI layers (for example, XML switching, also in Layer 7 - see Telecom Startups Play in XML) to handle network, service, and application convergence as telecommunications become more sophisticated and software oriented.

It’s worth noting that some people classify SIP as belonging to Layer 5 (the OSI Session Layer), which would seem logical, since the “S” stands for “Session.” However, this seems to be a question of philosophical outlook on which Light Reading is reluctant to pontificate. The distinction seems to be that, as far as an IP network is concerned, SIP is just something that runs over the complete thing (like applications such as HTTP, FTP, or Telnet) for a particular external purpose. However, from the viewpoint of the end service application (say, VOIP), SIP does control sessions and is therefore in Layer 5.

In the jargon of network protocols, SIP’s logical entities and functions provide service primitives, not services that an end user might recognize and use, or a telco sell to someone. Service primitives are small, general-purpose building blocks that can be combined and used to support a wide variety of end services and applications.

There are five logical SIP entities, each with different functions:

  • User Agent

  • Proxy Server

  • Redirect Server

  • Registrar Server

  • Back-to-Back User Agent (B2BUA)

Each entity can act as a client, a server, or both. Clients make (initiate) requests of various kinds, and servers respond to them. So a User Agent (client) may request a Proxy Server (server) to do something, and the Proxy Server (now client) in turn requests some information from a Registrar Server (server) as part of fulfilling the User Agent’s request.

SIP makes no assumptions or requirements as to how vendors and telcos implement these logical entities in their physical network devices. So there is nothing to stop a single network server acting as both a SIP Proxy Server and a SIP Registrar Server, for example, if this is felt to be appropriate.

The basic idea of SIP is to allow User Agents (UAs) – which represent the communicating endpoints (the caller and the called party) – to contact each other to set up, modify, and finally end various types of communications sessions, such as a voice call or a videoconference.

SIP User Agents can act as both User Agent Clients (UACs) and User Agent Servers (UASs). UACs generate requests, and UASs respond to incoming requests. So an IP phone, for example, contains User Agent software that acts as a UAC when the human user starts to make an outgoing call, and as a UAS when someone makes an incoming call to that phone (notifying the user and returning a reject, accept, or redirect message). It’s important to realize that the User Agent software switches between UAC and UAS modes on a message-by-message basis, depending on what is going on – they are not persistent entities.

To make User Agent communications work, SIP has to provide five key things:

  1. A method of finding the called party (user location)

  2. Settling whether the called party wants to, or can, participate (user availability)

  3. Determining what media and media parameters to use (user capabilities)

  4. Establishment of the communications session (session setup)

  5. Ending the session and also modifying it in various ways, such as invoking new services, transferring to new users, or changing certain session parameters (session management)

SIP offers other capabilities also, such as encryption and security.

The main mechanism for all this is simple and easily extensible plain-text request/response message passing between entities acting as clients and servers as required. This is very much in the Internet tradition of such protocols as HTTP and SMTP, and contrasts with the generally highly complex and monolithic control protocols used in the PSTN. In particular, SIP takes a very different approach from the earlier, ITU-specified H.323 VOIP standard.

This messaging is largely to and from Proxy Servers. These are network hosts, acting as both clients and servers to other entities, that either process and respond to requests internally, or – more usually – forward them (after translation, if necessary) to other entities. It is their function as intermediaries that gives Proxy Servers their name.

Proxy Servers are a crucial part of a SIP infrastructure and play a role similar to routing in an IP infrastructure, as their job is to ensure that requests are routed to the appropriate entity, identified by a SIP Uniform Resource Identifier (URI). To do this, Proxy Servers interpret requests, and may rewrite parts of the request message before forwarding it. Typical tasks include handling registrations and invitations to sessions from User Agents, and applying call policies governing whether a given user can make particular calls.

Redirect Servers help Proxy Servers route requests through the SIP network by supplying routing information (if available) back to Proxy Servers in response to requests. Unlike Proxy Servers, Redirect Servers do not forward messages themselves: It is up to the Proxy Servers to resend requests to the URI(s) received. Also, Redirect Servers themselves do not issue any SIP requests. The idea is to improve the robustness and scaleability of SIP messaging, and to allow SIP Proxy Servers to direct SIP requests to external domains.

Registrar Servers maintain databases that contain location information about all User Agents registered with a particular SIP domain. This information is used to respond to User Agent and Proxy Server requests so that SIP messages can be sent to the appropriate location for a given called User Agent.
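The Registrar's location database can be pictured as a simple table of bindings. The following is a minimal sketch, not a real implementation: the `Registrar` class, its method names, and the addresses are hypothetical, and the one-hour default lifetime is an assumption added to show that bindings in a real system go stale and must be refreshed.

```python
import time


class Registrar:
    """Toy location service: maps a user's public SIP address to where
    that user can currently be reached."""

    def __init__(self):
        # public address -> (current contact address, expiry timestamp)
        self._bindings = {}

    def register(self, address, contact, expires=3600, now=None):
        """Record that `address` is reachable at `contact` for `expires` seconds."""
        now = time.time() if now is None else now
        self._bindings[address] = (contact, now + expires)

    def lookup(self, address, now=None):
        """Return the current contact, or None if unknown or expired."""
        now = time.time() if now is None else now
        binding = self._bindings.get(address)
        if binding is None or binding[1] <= now:
            return None
        return binding[0]


if __name__ == "__main__":
    registrar = Registrar()
    registrar.register("sip:y@b.example.com", "sip:y@192.0.2.20:5060")
    print(registrar.lookup("sip:y@b.example.com"))
    print(registrar.lookup("sip:nobody@b.example.com"))
```

A Proxy Server for the domain would call something like `lookup` before forwarding a request to the called User Agent.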

Back-to-Back User Agents (B2BUAs) are a combination of User Agent Client and User Agent Server. The basic point is to make the response generated by the UAS part to an incoming request depend on the response received by the associated UAC part to a further request that the UAC generates. Such dependency allows a B2BUA to use dialog state information, a dialog being a peer-to-peer SIP relationship between two UAs that persists for a period. Thus a B2BUA maintains information on the state of a dialog, and participates in all requests sent on the dialogs it has established. Proxy Servers, for example, do not do this.

SIP gets a lot of its generality, flexibility, and power from the structure of the messages used. So a request message, for example, will contain a codeword (or method) specifying the action that the requesting client wants the server to perform, together with various message headers that provide further information about the message for identification and other purposes (for example, the SIP URI of the sender). Further information can be held in the body of the message, and such information uses a completely different protocol and is not specified by SIP. So an INVITE message, which is sent by a UA to start setting up a SIP session, will use the message body to specify, say, the type of media, codec, and sampling rate to be used in the session. A common protocol for this purpose is Session Description Protocol (SDP), which is independent of SIP.
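To make the message structure concrete, here is a hedged sketch that assembles a plain-text INVITE with an SDP body and then splits it back into method, headers, and body. The addresses, tag, branch value, and Call-ID are made up for illustration, and the function names are not from any SIP library. The parser is deliberately naive: it ignores repeated headers, compact header names, and line folding.

```python
# Illustrative SDP body: one audio stream, PCMU codec, 8000 Hz sampling rate.
EXAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=x 2890844526 2890844526 IN IP4 pc.a.example.com",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
]) + "\r\n"


def build_invite(caller, callee, call_id, sdp):
    """Assemble a minimal INVITE request as text (all values hypothetical)."""
    lines = [
        f"INVITE {callee} SIP/2.0",                      # method, target, version
        "Via: SIP/2.0/UDP pc.a.example.com;branch=z9hG4bK776asdhds",
        "Max-Forwards: 70",
        f"From: <{caller}>;tag=1928301774",
        f"To: <{callee}>",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <{caller}>",
        "Content-Type: application/sdp",                 # body is SDP, not SIP
        f"Content-Length: {len(sdp.encode('utf-8'))}",
    ]
    return "\r\n".join(lines) + "\r\n\r\n" + sdp


def parse_message(raw):
    """Naive split of a SIP request into (method, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line.split(" ", 1)[0], headers, body


if __name__ == "__main__":
    message = build_invite("sip:x@a.example.com", "sip:y@b.example.com",
                           "a84b4c76e66710@pc.a.example.com", EXAMPLE_SDP)
    print(message)
```

Because everything is readable text, a message like this can be inspected in a packet trace without special tooling, which is part of SIP's appeal for troubleshooting.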

It’s worth pointing out that the network paths taken by the SIP signaling messages are usually different from those taken by the media packets used to transport the resulting voice call, videoconference, or whatever has been established through SIP.

Figure 1 shows in very simplified form how some of these SIP logical entities use messages to interact – in this case to set up a voice call from a PC (softphone) to a hardware SIP VOIP phone.

The message and action sequence is roughly:

  1. User Agent: ‘X in SIP domain A wants to call Y in SIP domain B’

  2. Proxy Server: ‘Where do call setup requests for domain B go?’

  3. Redirect Server: ‘Send call setup requests to domain B Proxy Server at address enclosed in this response message’

  4. Proxy Server: ‘Call setup request for B’

  5. Proxy Server: ‘Where is Y?’

  6. Registrar Server: ‘Y is at address enclosed in this response message’

  7. Proxy Server: ‘Call notification’

  8. Response

  9. Response

  10. Response

If the call setup is successful (Y is free to take the call), a media path using RTP is established between X and Y and the connected parties can start to talk.
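The ten-step sequence above can be traced with a toy simulation. This is only a sketch under stated assumptions: the function name, the dictionaries standing in for the Redirect and Registrar Servers, and all addresses are hypothetical; the three responses are assumed to travel back along the request path; and the status strings (200 OK, 486 Busy Here, 404 Not Found) are standard SIP response codes chosen for the example rather than taken from the figure.

```python
def setup_call(caller, callee, redirect_table, locations, busy=frozenset()):
    """Trace a simplified call setup and return (trace, final_status).

    redirect_table: domain -> address of that domain's Proxy Server
    locations:      user   -> current contact address (Registrar data)
    busy:           users who cannot take a call right now
    """
    trace = []
    callee_domain = callee.split("@", 1)[1]

    trace.append(f"1. UA -> Proxy A: {caller} wants to call {callee}")
    trace.append(f"2. Proxy A -> Redirect: where do requests for {callee_domain} go?")
    proxy_b = redirect_table.get(callee_domain)
    if proxy_b is None:
        trace.append("3. Redirect -> Proxy A: domain unknown")
        return trace, "404 Not Found"
    trace.append(f"3. Redirect -> Proxy A: send requests to {proxy_b}")

    trace.append(f"4. Proxy A -> {proxy_b}: call setup request for {callee}")
    trace.append(f"5. {proxy_b} -> Registrar: where is {callee}?")
    contact = locations.get(callee)
    if contact is None:
        trace.append(f"6. Registrar -> {proxy_b}: no registration found")
        return trace, "404 Not Found"
    trace.append(f"6. Registrar -> {proxy_b}: {callee} is at {contact}")

    trace.append(f"7. {proxy_b} -> {contact}: call notification")
    status = "486 Busy Here" if callee in busy else "200 OK"
    trace.append(f"8. {contact} -> {proxy_b}: {status}")
    trace.append(f"9. {proxy_b} -> Proxy A: {status}")
    trace.append(f"10. Proxy A -> UA: {status}")
    return trace, status


if __name__ == "__main__":
    steps, result = setup_call(
        "x@a.example.com", "y@b.example.com",
        redirect_table={"b.example.com": "proxy.b.example.com"},
        locations={"y@b.example.com": "192.0.2.20"},
    )
    print("\n".join(steps))
    print("Result:", result)
```

Note that nothing in the trace carries voice: the simulation ends where the media path would begin.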

A key point – but a reality check, rather than a problem – is that SIP is not a complete multimedia solution in itself. So in real multimedia networks SIP is going to be working with other software and devices to offer real multimedia services. For example, SIP can set up a videoconference session that uses a separate videoconferencing application, but SIP alone cannot create and control a videoconference.

This means that it is vital to understand that SIP does not control the operation of the resulting end-to-end media stream that it sets up to carry the call. So there is still a lot of development work to be done before multimedia really becomes commonplace – and issues at the media level, which include aspects of security, QOS, and interoperability, are still being addressed.


Security

Security is a pressing issue in services that use SIP, such as VOIP, largely because SIP is IP-based and SIP messages can be monitored, used, interfered with, or spoofed like any other IP communication. Put more technically, SIP uses in-band signaling, meaning that the signaling messages that control the system are transported by the same mechanism (IP packets) that transports the service media (the voice channel for VOIP). So the separation between signaling and media streams is logical, not physical. This makes for an open architecture – high on flexibility, but not inherently secure.

The modern PSTN is very different, as it uses a closed architecture of out-of-band signaling, where the SS7 signaling messages are transported by a packet network and the service media are separate circuit-switched channels. This is inherently more secure than in-band signaling.

It’s worth recalling that the PSTN used to use in-band signaling and that so-called “phone phreaks” achieved notoriety by using tone generators (or even simple human whistles) to imitate the network’s in-band signaling tones and obtain free toll calls.

Threats facing VOIP calls, for example, include:

  • Denial of service (DOS): An attacker mimics a SIP user identity and cancels all its incoming INVITE requests, thereby effectively shutting off that user’s phone.

  • Call hijacking: As above, but the attacker responds to all incoming INVITE requests with a redirection message indicating that the called party has moved and will supply a forwarding address. So calls are redirected to a new destination of the attacker’s choosing. A particularly insidious version is to hijack someone’s voicemail by redirecting it to a ghost mailbox set up by the attacker, as it could take the compromised user some time to realize what is going on.

  • Code stealing: An attacker monitors SIP messages to extract account codes that govern access to various types of call, such as toll or international.

  • False call termination: The attacker fakes SIP messages that the service provider’s billing system understands to mean that the call has terminated, although the media path still remains open, so the call can continue, but free of further charge.

  • Direct User Agent call setup: It may be possible for two User Agents to bypass the Proxy Servers and set up a direct call between themselves, thereby bypassing the service provider’s billing system.

  • Call spamming: Unwanted incoming calls can be mass produced by a spammer in much the same way as email spam.

There are also well-known firewall/NAT-traversal issues with VOIP, although these are overcome by VOIP-aware firewalls and are not necessarily SIP specific. However, if users are careless and settle for kludges (such as leaving open the firewall’s UDP/TCP 5060 port used for VOIP signaling), they can compromise their network security.

Also, the media stream set up by SIP flows end-to-end and is independent of any SIP security mechanisms at the Proxy Servers, for example. So eavesdropping and transport disruption are possible, although a VPN will provide encryption and other security.

The IETF is making big efforts to devise standards that will address these and other security problems, but much can be done with existing standards and technologies. For example, SIP over Transport Layer Security (TLS) is a mechanism for secure signaling in VOIP networks. TLS (derived from, and very similar to, the widely used Secure Sockets Layer – SSL – and specified in RFC 2246) provides endpoint authentication and encryption security for communicating SIP entities.
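As a small, hedged illustration of what SIP over TLS means for a client, the sketch below works out where to connect for a secure SIP URI and prepares a certificate-verifying TLS context using Python's standard `ssl` module. The `sips:` URI scheme and its conventional port 5061 come from the SIP specification rather than from this article; the helper names, the example addresses, and the choice of a TLS 1.2 floor are assumptions made for the example. No network connection is opened.

```python
import ssl

DEFAULT_SIPS_PORT = 5061  # conventional port for SIP over TLS


def tls_target(uri):
    """Extract (host, port) from a sips: URI; a deliberately simple parser
    that does not handle IPv6 literals or passwords."""
    scheme, _, rest = uri.partition(":")
    if scheme != "sips":
        raise ValueError("expected a sips: URI")
    hostport = rest.split("@", 1)[-1].split(";", 1)[0]  # drop user and params
    host, _, port = hostport.partition(":")
    return host, int(port) if port else DEFAULT_SIPS_PORT


def make_client_context():
    """TLS context that authenticates the server certificate and hostname."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy choice
    return context


if __name__ == "__main__":
    print(tls_target("sips:y@b.example.com"))
    context = make_client_context()
    print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A client would then wrap its TCP socket with `context.wrap_socket(sock, server_hostname=host)` before sending any SIP messages, so that the signaling is encrypted and the server is authenticated on that hop.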

The March 2004 RFC 3711 – Secure Real-time Transport Protocol (SRTP) – provides secure media. It forms a profile of the Real-Time Transport Protocol (RTP) used to transport call media, and can provide confidentiality, message authentication, and replay protection to the RTP traffic, and also to the RTP control traffic – the Real-Time Transport Control Protocol (RTCP).

Quality of Service

A lot of effort has gone – and continues to go – into quality of service (QOS) for IP-based networks, generally because QOS is not inherent in the basic IP mechanisms and therefore has to be added through various workarounds of increasing sophistication and complexity. The point for SIP is that many of the services that will use it – voice is an obvious example – require very tight QOS guarantees on transport parameters such as packet delay, loss, and jitter. SIP itself does not provide QOS, and has to work with other protocols, such as Resource Reservation Protocol (RSVP), to provide voice QOS.
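To give a feel for the transport parameters involved, the sketch below computes two of them from packet timestamps. The jitter estimate follows the running interarrival-jitter formula defined in the RTP specification (RFC 3550), in which each new transit-time difference moves the estimate by one-sixteenth of the gap. The function names and the sample timestamps are invented for the example, and times are assumed to be in milliseconds.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate: J += (|D| - J) / 16 for each packet pair,
    where D is the change in transit time between consecutive packets."""
    if len(send_times) != len(recv_times):
        raise ValueError("need one receive time per send time")
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


def loss_fraction(expected, received):
    """Share of expected packets that never arrived."""
    if expected == 0:
        return 0.0
    return max(0, expected - received) / expected


if __name__ == "__main__":
    sent = [0, 20, 40, 60]        # one packet every 20 ms
    steady = [50, 70, 90, 110]    # constant 50 ms delay
    bumpy = [50, 86, 90, 125]     # delay varies from packet to packet
    print(interarrival_jitter(sent, steady))
    print(interarrival_jitter(sent, bumpy))
    print(loss_fraction(expected=100, received=97))
```

A constant delay, however long, produces zero jitter; it is variation in delay that forces receivers to buffer and that QOS mechanisms try to bound.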

A very big issue for telcos and service providers here is scaleability – ensuring that the QOS mechanisms will be able to handle the millions of busy-hour call attempts typical of existing PSTNs, for example. The Multiservice Switching Forum (MSF) has started a program to define a solution to the scaleability issue, and this will involve working on the interaction between SIP services and QOS mechanisms such as MPLS-TE.


Interoperability

Testing for SIP interoperability is crucial, and there is a lot of activity going on in this area. The SIP Forum, for example, hosts SIPit, a series of twice-yearly, week-long test fests; in July 2004 the Forum released v1.0 of its SIP Forum Test Framework (SFTF), an open-source test suite for SIP developers (see SIP Forum Releases Test Framework). This is intended to allow SIP device vendors to test for common protocol errors, and thus improve the interoperability of devices.

Roughly, the SIP Forum says its recent tests suggest that interoperability can now often be achieved fairly easily at the basic level of VOIP call setup and transfer, and that interoperability issues now lie more with the use of multiple media – such as audio, video, and instant messaging – within SIP sessions. With more than one type of media stream involved in a session, it is crucial that the various end devices know which streams are involved, and their characteristics.

Equally, all the new SIP and related standards being introduced create new areas for interoperability issues. The mid-2005 SIPit saw testing of security aspects (SIP over TLS and SRTP) and RFC 3263 for DNS location of SIP Servers, for example. The SIP Forum also runs a parallel series of test fests for the emerging presence and instant messaging standards – SIMPLEt.

In short, SIP interoperability, while steadily progressing, is inevitably going to be an ongoing issue as the technology continues to develop. Further, being only one protocol among the many involved in IP-based networks, it is inevitably intertwined in commercial products with much wider issues of equipment interoperability. It is no secret in the VOIP world, for example, that it is still a big issue whether a softswitch from vendor X will work with a media gateway from vendor Y. Despite SIP’s central importance, SIP interoperability alone is not the whole story.

The MSF is developing and testing interoperability agreements that address such practical issues. For example, it has developed a scenario to examine the interactions among protocols such as SIP, MGCP, and H.248, while providing basic calling features such as call forwarding, call waiting, and three-way calling.

SIP applications are much wider than just vanilla VOIP. SIP is a fundamental capability that enables such things as:

  • Voicemail and unified messaging

  • Context-aware communications, such as presence and IM, and location services

  • Integration of communications and applications

  • Internet conferencing and collaboration

A basic point about SIP applications is that SIP enables people, devices, and software to interact in a wide variety of ways as independent peers running their own endpoint applications. This means, for example, that X can discover whether Y is available to communicate and what form that communication should take. So a user might make himself available for instant messaging, but not to receive phone calls, which would be diverted to a voicemail box. But only certain callers might be presented with the IM option, giving a further level of screening.

This is an example of presence, which is the notion that the current state of a peer can be characterized in particular ways and made available to other peers in various ways. The PSTN does something similar in a very crude and limited way with the busy tone generated when a dialed phone is off the hook. But SIP-enabled presence is much more general and powerful and can signal things such as equipment state (such as off-hook, switched off), user disposition (do not disturb, at lunch), activity (working on the PC, talking on cell phone), physical location (office, meeting room, traveling off site), access devices available (cell phone, softphone, PDA), and so on.
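The screening scenario described earlier (available for instant messaging but not for calls, with the IM option shown only to certain callers) might be modeled along the following lines. This is a minimal sketch: the `Presence` record, its field names, and the `contact_options` function are all hypothetical and do not correspond to any SIMPLE data format.

```python
from dataclasses import dataclass, field


@dataclass
class Presence:
    """Hypothetical snapshot of one user's published state."""
    user: str
    accepts_im: bool = True
    accepts_calls: bool = False
    disposition: str = "do not disturb"
    im_allowed_callers: set = field(default_factory=set)


def contact_options(presence, caller):
    """Return the ways `caller` is currently allowed to reach the user."""
    options = []
    # Only screened callers are even shown the instant-messaging option.
    if presence.accepts_im and caller in presence.im_allowed_callers:
        options.append("instant message")
    # Calls either ring through or are diverted to a voicemail box.
    options.append("voice call" if presence.accepts_calls else "voicemail")
    return options


if __name__ == "__main__":
    y = Presence("y@b.example.com", im_allowed_callers={"x@a.example.com"})
    print(contact_options(y, "x@a.example.com"))
    print(contact_options(y, "stranger@c.example.com"))
```

The same record could carry location or device information, which is what lets applications make routing decisions on the user's behalf.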

Further software applications can use this information in many ways, and network-aware languages such as Java already have SIP application programming interfaces (APIs). A simple example is that a hospital IP PBX could automatically route an emergency call to the nearest available doctor with a particular expertise. This takes SIP directly into the IMS area that provides a framework for creating distributed software applications that are integrated with, and run over, telecom networks. SIP is essential for IMS.

On the pure communications side, SIP will really come into its own in supporting multimedia in a unified way. It will support a wide range of call types, such as instant text messaging, instant voice message, push to talk, and personal videoconferencing, that can be mixed and matched as required.

It is probably fair to say that just about every telecom player concerned with services or applications in some sense will be affected by SIP to some degree, simply because it is so fundamental to the development of modern IP-based networks. So telcos, mobile operators, service providers of many types, ISPs, equipment vendors, and software houses will use SIP. And the permeation of SIP throughout telecom generally will make SIP a key protocol within the enterprise network.

In principle, SIP promises many advantages and benefits to telecom players, including:

  • Ability of service providers to develop new services quickly and cost-effectively, and in a more open equipment environment than for traditional networks.

  • Simplification of programming and troubleshooting through the use of text-style control messages within an open standard. SIP follows many of the principles and conventions made familiar by other Internet protocols and applications (for example, HTTP and DNS), thus easing learning and training.

  • Ability to define SIP extensions for new applications while retaining operation of earlier unmodified SIP equipment (which just ignores unrecognized extensions).

  • Independence from the underlying transport network and protocols, and support for mobility.

  • Support for multidevice feature leveling and negotiation, so that two dissimilar end devices can negotiate a reduced service that both can support – for example, a videophone can establish a voice call with an audio-only phone.

  • A lightweight protocol that is inherently scaleable and suited for a highly distributed and versatile network environment.
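The feature-leveling point in the list above (a videophone settling for a voice call with an audio-only phone) can be sketched as a simple intersection of capabilities. This is a loose illustration of the idea, not the actual SDP negotiation procedure: the `negotiate` function and the capability dictionaries are hypothetical, and only the codec names are real.

```python
def negotiate(offer, capabilities):
    """Pick, for each offered media type, the first offered codec that the
    answering device also supports; media with no common codec are dropped.

    Both arguments map a media type to a list of codecs in preference order.
    """
    agreed = {}
    for media, offered_codecs in offer.items():
        supported = capabilities.get(media, [])
        common = [codec for codec in offered_codecs if codec in supported]
        if common:
            agreed[media] = common[0]
    return agreed


if __name__ == "__main__":
    videophone_offer = {"audio": ["PCMU", "G729"], "video": ["H263"]}
    audio_phone = {"audio": ["G729", "PCMU"]}
    print(negotiate(videophone_offer, audio_phone))
```

The video stream simply falls away and the call proceeds as audio, which is the reduced service that both devices can support.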

Already, large numbers of players within (and even beyond) the telecom industry are using SIP. Telcos, carriers, and service providers of many types are offering SIP-based services such as local and long-distance telephony, presence and instant messaging, IP Centrex, voice messaging, push-to-talk, and multimedia conferencing. And all 3G mobile networks will use SIP.

Perhaps a notable seal of approval [ed. note: or kiss of death?] is that Microsoft has implemented SIP in Windows XP, Pocket PC, and MSN Messenger, and future versions will include a SIP-based VOIP application interface layer.
