An overview of reliable multicasting


David A. Carr

Centre for Distance-spanning Technology Department of Computer Science

Lulea University of Technology Lulea, Sweden

e-mail: David.Carr@sm.luth.se

Lenka Motyckova†

Department of Computer Science Masaryk University

Brno, Czech Republic e-mail: lenka@.muni.cz January 28, 1998

Abstract

Increases in Internet bandwidth have spurred a growing interest in using it to solve problems in coordinating geographically distributed project teams. Supporting these teams requires applications such as real-time video conferencing and group editing tools. Other organizations are interested in delivering entertainment, group games, and lectures. All of these applications need to be able to transmit simultaneously and efficiently to many hosts. In order to preserve network bandwidth, the routers in the Internet backbone provide a capability to send a message to many hosts without duplicating the message while it travels on the backbone. This service is called IP-multicasting. However, IP-multicasting is a best-effort service without guarantees about delivery or correctness. These services are left to the transport protocol. Therefore, it is up to the multicast transport protocol to implement the reliability.

This paper surveys the two issues of reliability and routing. It begins by discussing the terms used in multicast protocols and discussing several experimental protocols. It then gives an overview of the theoretical performance limits for the various types of reliable multicast protocols. Next, it discusses current multicast routing algorithms.

Finally, a survey of current and experimental multicast routing protocols is given.

1 Introduction

Increases in Internet bandwidth have spurred a growing interest in using it to solve problems in coordinating geographically distributed project teams. Supporting these teams requires applications such as real-time video conferencing and group editing tools. Other organizations are interested in delivering entertainment, group games, and classes or lectures. All of these applications need to be able to transmit simultaneously and efficiently to many hosts. In order to preserve network bandwidth, the routers in the Internet backbone provide a capability to send a message to many hosts without duplicating the message while it travels on the backbone. This service is called IP-multicasting.

Supported by the MATES project, ESPRIT 20598.

† Supported by the Heinz Nixdorf Institute, the University of Paderborn project SFB 376.

However, IP-multicasting is a best-effort service with no guarantees about delivery or correctness. These services are left to the transport layer protocol. Therefore, it is up to the multicast protocol to implement the reliability. IP constructs a delivery tree for routing, and packet loss may occur in any node of this tree. The main cause of data loss is buffer overflow at the routers or receiving hosts. A reliable protocol delivers an exact copy of the transmitted data to all receivers.

The wide variety of applications gives rise to different requirements for reliability and may add other requirements for message delivery. For example, video and audio conferencing applications can tolerate some loss, but benefit from timely delivery of lost packets [PSA]. On the other hand, a group support tool may require that all messages are delivered to each host in a specific order.

These requirements are described with various terms. A protocol is said to be receiver-reliable if only the receivers can detect lost packets. The receivers then request repair. Some authors will classify a protocol as receiver-reliable if there is a small probability that the replacement (repair) packet cannot be obtained. Protocols with a small probability of losing a packet are sometimes called semi-reliable while those guaranteeing delivery are called fully- or totally-reliable. In a receiver-reliable protocol, the sender cannot determine if all receivers have correctly received the message.

A protocol is said to be sender-reliable if the sender receives confirmation that all receivers have received a message. Who is responsible for initiating the repair sequences gives rise to receiver-initiated and sender-initiated protocols. In the former, the receiver requests retransmission when it notices a missing packet. In the latter, control messages are sent in response to requests by the sender.

The order in which messages are delivered gives rise to its own set of terms. If messages are delivered in any order, they are said to be unordered. If all receivers receive the messages in the same order that they were sent, then the protocol is ordered. If in addition no one receives message n + 1 until all have received message n, the protocol is totally-ordered.

In addition to providing reliability, multicast protocols must perform well with respect to other criteria. A protocol should carefully manage network bandwidth to avoid bottlenecks on congested links, to keep the load on receivers manageable, and to limit the memory load on senders. One important metric of performance is transfer time, which depends on the average throughput seen by a session. This is again dependent on the available bandwidth and packet loss characteristics. Another measure is scalability, or how an increasing number of receivers affects throughput.

Reliable multicast protocols are sometimes classified according to how lost packets are signaled. This occurs by one of two methods, an ACK-based scheme or a NAK-based scheme. For the ACK scheme a receiver sends either cumulative ACKs (the number of the latest packet received without a gap), selective ACKs (every packet received), or block ACKs (a group of packets). ACKs may be sent spontaneously or in response to a polling message. In the NAK scheme receivers send negative responses for lost packets. A lost packet is recognized by a missing sequence number or by a timeout between two successive messages. To prevent long delays in detection, senders must fill idle time with control packets that contain the highest sequence number. Positive acknowledgment is safer but more expensive. It is safer because NAK packets may be lost. However, if most packets are delivered correctly, the ACK-based scheme generates many more control messages. In many cases both techniques are combined, especially when delay in replacing lost messages affects the application.
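The difference between the ACK styles and NAK-style gap detection can be illustrated with a minimal Python sketch. The function names, the use of a set of integer sequence numbers, and the reading of a block ACK as a contiguous run of packets are assumptions made for illustration; they are not taken from any of the protocols surveyed here.

```python
def cumulative_ack(received, first_seq=0):
    """Highest sequence number received without a gap (None if first_seq is missing)."""
    seq = first_seq
    while seq in received:
        seq += 1
    return seq - 1 if seq > first_seq else None


def selective_ack(received):
    """Every sequence number received, in increasing order."""
    return sorted(received)


def block_acks(received):
    """Contiguous runs of received packets as (first, last) pairs (assumed encoding)."""
    blocks = []
    for seq in sorted(received):
        if blocks and seq == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], seq)   # extend the current run
        else:
            blocks.append((seq, seq))           # start a new run
    return blocks


def missing(received, highest_sent, first_seq=0):
    """Sequence numbers a NAK-based receiver would report as lost.

    highest_sent plays the role of the highest sequence number that the
    sender advertises in its control packets.
    """
    return [s for s in range(first_seq, highest_sent + 1) if s not in received]


received = {0, 1, 2, 4, 5, 8}
```

With this receiver state and a highest advertised sequence number of 9, the cumulative ACK stops at the first gap while the NAK list names every hole, including the trailing packet that only the advertised sequence number reveals.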

A potential bottleneck is the control traffic path back to a sender, especially for sender-initiated multicasting. As the size of the receiver set grows from tens to hundreds, the sender spends an ever increasing time processing ACKs and NAKs. This is called ACK-, NAK-, or packet-implosion. Even with a receiver-initiated protocol, control traffic can overwhelm the sender. Another problem for sender-initiated protocols is obtaining and maintaining knowledge of all receiver identities and their state.

The solution is distributed control. In the case of sender-initiated protocols, receivers can be organized into local groups managed by local group controllers or designated receivers that process ACKs and perform retransmissions.

For receiver-initiated protocols NAK-suppression can greatly reduce control traffic. With a NAK-suppression scheme, NAKs are multicast and each receiver delays before transmitting a NAK. If it sees another NAK during the delay, it increases the delay. The NAK is cancelled when the receiver gets the lost packet, or transmitted when the delay ends.
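The suppression rule can be sketched as a small state machine for one pending NAK. This is only an illustration: the class name, the uniform initial delay, and the doubling back-off are assumptions, since the text says only that the delay is increased.

```python
import random


class PendingNak:
    """One receiver's pending NAK for a single missing packet (illustrative sketch)."""

    def __init__(self, seq, now, base_delay, rng=random):
        self.seq = seq
        # Assumption: initial delay drawn uniformly from [base_delay, 2 * base_delay].
        self.delay = rng.uniform(base_delay, 2 * base_delay)
        self.deadline = now + self.delay
        self.state = "waiting"

    def on_other_nak(self, seq, now):
        """Another receiver multicast a NAK for the same packet: increase the delay."""
        if self.state == "waiting" and seq == self.seq:
            self.delay *= 2                      # assumed back-off factor
            self.deadline = now + self.delay

    def on_packet(self, seq):
        """The missing packet arrived: cancel the NAK."""
        if self.state == "waiting" and seq == self.seq:
            self.state = "cancelled"

    def on_tick(self, now):
        """Return True exactly when the delay has ended and the NAK must be multicast."""
        if self.state == "waiting" and now >= self.deadline:
            self.state = "sent"
            return True
        return False
```

A receiver would create one such object per detected gap and feed it the NAKs and data packets it overhears on the multicast group.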

Receiver-initiated methods have problems as well. They increase the memory load of senders by the need to buffer data for a long time. End-to-end delay might be increased too, as detection of lost packets is delayed. However, no receiver-initiated transport protocol alone can be fully-reliable without buffering all packets [LeGA] as there exists some possibility that a NAK will arrive after the sender has released the buffer. The most efficient method seems to be combining NAKing with periodic polling for ACKs.

Also, retransmission of lost packets for a few hosts by multicasting increases the load and wastes network bandwidth. How packets are retransmitted can also affect load. There are basically two methods, go-back-N or selective-repeat. The first strategy repeats the lost packet and all successive packets as well. The second strategy resends only lost packets.
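The two retransmission strategies differ only in which sequence numbers are resent. A minimal Python sketch follows; the function names and the `highest_sent` parameter are chosen here for illustration.

```python
def go_back_n(lost, highest_sent):
    """Resend the earliest lost packet and every packet sent after it."""
    if not lost:
        return []
    return list(range(min(lost), highest_sent + 1))


def selective_repeat(lost):
    """Resend only the packets reported lost."""
    return sorted(lost)
```

For example, if packets 3 and 6 are lost out of packets 0 through 9, go-back-N resends seven packets (3 through 9) while selective-repeat resends two.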

In order to meet the different requirements a number of reliable-multicast transport protocols have been developed. In the next section, this paper gives a brief description of a few recent ones: RMP (Reliable Multicast Protocol), SRM (Scalable Reliable Multicast), MTP-2 (Multicast Transport Protocol version 2), LBRM (Log-Based Reliable Multicast), XTP (Xpress Transport Protocol), RMTP (Reliable Multicast Transport Protocol), and TMTP (Tree-based Multicast Transport Protocol).

With so many different protocols one naturally asks which is the best. However, this question is not easily answered. The answer depends on the application and the number of receivers. No two protocols have been compared head-to-head. However, some theoretical work has been done analyzing generic protocols. Section three summarizes this work.

How multicast protocols are routed is also an important issue. Current methods use sender-rooted trees, but as the number of multicast groups on the Internet grows these methods will have trouble scaling. In particular, they load the routers as a function of senders × groups. In addition, scalable reliable multicast protocols are built by organizing the receivers in groups in order to limit the impact of control messages on the network and the sender. Using routing information would make these groups easier to construct. Sections 4-6 discuss current routing protocols and new research on replacement protocols.

2 An Overview of Existing Protocols

Reliable multicast protocols can be classified as to how they distribute the bookkeeping for retransmission and assuring that all group members have received a message. The first reliable protocols were designed to operate on a local area network and used a logical token-ring topology [ChMa]. Generally, the token-ring protocols distribute the bookkeeping by passing a token to each member of the multicast group. RMP [WhKM] extends this scheme to wide-area networks and is a member of this family of protocols. SRM [FetA] takes a different approach to bookkeeping. Every member of the group simultaneously participates in error repair. It is a NAK-based protocol that is receiver-reliable. MTP-2 [BetA], LBRM [HoSC], and XTP [Ho97] use designated hosts for the bookkeeping activities. RMTP [LiPa] and TMTP [YaGS] also use designated hosts. However, they organize the hosts in a hierarchy to achieve further scalability. These protocols are described in more detail below.

[Figure 1: RMP - Token holder responsibilities. Diagram labels: Sender (Multicasts Data), Token Holder, Next Token Holder, Lost Packet, Passes Token, ACKs Packet, NAK & Repair.]

Reliable multicast protocols have evolved from the ring-based protocols described by Chang and Maxemchuk [ChMa] in their fundamental paper, which considered reliable broadcast in LANs. Early experiments in WAN-based protocols used a single point for reliability control. This may be the sender (early XTP) or a separate process (MTP and MTP-2). Problems with packet implosion led to experiments with receiver-reliable protocols (SRM). However, other researchers did not want to give up full reliability. They experimented with ways to reduce the load on a single control process. The two solutions are either to use the original ring-based method for migrating the reliability server (RMP) or to organize a hierarchy. Some protocols rely on manual organization (RMTP and LBRM) while others use a dynamically-constructed hierarchy (TMTP). More recently a protocol based on forward error correction has been proposed [NoBT]. However, it has not been implemented and will be discussed in the theoretical section.

2.1 Reliable Multicast Protocol (RMP)

RMP [WhKM] was designed to provide reliable multicast communication for software buses. It supports a range of reliable delivery options. An application can request reliable delivery with unordered, ordered, or totally-ordered messages as well as unreliable delivery. RMP distributes the load of packet repair and synchronization by requiring each member of the multicast group to participate. This is accomplished by passing a token from member to member. This token does not control the right to transmit.

The member with the token is responsible for the following (see Figure 1):

- ACKing to the packet sender,
- repairing any errors,
- serializing the messages for the ordered and totally-ordered modes, and
- passing the token to the next token site.

Reliability is assured by requiring that a site has correctly received the last N packets before it accepts the token, where N is the size of the multicast group. Once the token has travelled around the entire group and back to the original holder, it is assured that all but the last N packets have been correctly received. It may then discard the correctly received packets from its buffers. In addition to ACKs, RMP uses NAKs. When a receiver notices that it has missed a packet, it multicasts a NAK and names the last known token holder as the site to supply the missing packet. After a certain time, if the packet has not been received, the NAK is retransmitted naming a different recovery site. After a set number of NAKs without success in retransmission, the receiver transmits the NAK requesting any site to respond. Retransmissions are unicast to the receiver requesting lost packets.

In order to minimize the number of messages many functions are piggybacked onto the ACK message. The ACK message:

- is multicast,
- signals to the sender that the packet has been correctly received,
- assigns a time stamp to the message so that it can be sequenced,
- passes the token to the next member, and
- acknowledges to the previous token holder that the token has been received.

Membership in the group is dynamic and known to all members. Newly joined members receive the token as soon as their status is recognized. In addition, RMP specifies procedures for reforming the group in the event of a network partition, failure of a member, or loss of the token.

RMP is quite efficient. Tests with eight workstations on a local area network achieved utilization of about 85% of the network capacity. Doing the same work via unicast would require transmitting the data seven times. Even if unicasting used 100% of the network, RMP transferred the data about six times as fast as is possible by unicasting.
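The "about six times" figure follows directly from the numbers quoted above. The short calculation below reproduces it; the variable names are illustrative, and the 100% unicast utilization is the best case assumed in the text.

```python
receivers = 7               # eight workstations: one sender, seven receivers
rmp_utilization = 0.85      # fraction of network capacity reported for RMP
unicast_utilization = 1.0   # best case assumed for unicasting

# Transfer times in units of (data size / network capacity).
rmp_time = 1 / rmp_utilization               # data crosses the network once
unicast_time = receivers / unicast_utilization  # data crosses the network seven times

speedup = unicast_time / rmp_time
print(round(speedup, 2))    # 5.95, i.e. about six times as fast
```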

2.2 Scalable Reliable Multicasting (SRM)

Scalable Reliable Multicast (SRM) [FetA] is a protocol which solves the NAK implosion problem by using NAK-suppression. SRM's goal is to provide receiver-reliable multicast, so that a receiver in the group is able to receive all messages broadcast to the group. SRM is implemented in the Lawrence Berkeley Laboratory conferencing white board tool, wb.

In SRM all nodes participate in error correction. When a node notices that it is missing a packet it waits a random time and then multicasts a repair request (NAK) to the group as a whole. However if some other node has NAKed the same packet during the wait period, the waiting node increases its wait time. If a repair packet arrives before the wait expires, the NAK is cancelled. Otherwise, the NAK is transmitted.

Similarly, a node with the packet waits a random time and then multicasts the packet in response to a NAK. The random delay is scaled proportionally to the distance to the node that transmitted the repair request. All repair requests and repairs are multicast to the entire group. In addition, a host that has sent a repair refrains from sending another for a time dependent on the estimated distance to the requesting node. (It should be noted that the distance calculation is not used in wb. A fixed interval is used instead.)

The idea is that a node close to the point of failure will probably time out first. Its message will then suppress the NAKs of all nodes farther from the point of failure. In addition, the node multicasting the repair will probably be close to the node reporting the error, but just upstream from the failure. This should work to minimize the number of transmissions of both repair requests and repairs.
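The effect of scaling the random repair delay by distance can be sketched as follows. The function names and the constants `d1` and `d2` are hypothetical values chosen for illustration, and the uniform distribution is an assumption; the text states only that the delay is random and proportional to the distance to the requester.

```python
import random


def repair_delay(distance_to_requester, d1=2.0, d2=2.0, rng=random):
    """Random wait before multicasting a repair, scaled by distance to the requester.

    d1 and d2 are hypothetical tuning constants, not values from the text.
    """
    low = d1 * distance_to_requester
    high = (d1 + d2) * distance_to_requester
    return rng.uniform(low, high)


def first_responder(distances, rng=random):
    """Return (node, delay) for the holder of the packet whose timer expires first.

    distances maps each node that has the packet to its estimated distance
    from the requesting node. All other holders would see this node's
    repair and stay silent.
    """
    delays = {node: repair_delay(d, rng=rng) for node, d in distances.items()}
    node = min(delays, key=delays.get)
    return node, delays[node]
```

With these constants, a holder at distance 1 always answers before a holder at distance 10 because their delay intervals do not overlap. Holders at similar distances may still produce duplicate repairs.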

In order for receivers to discover that they have missed the trailing packet(s) in a message, each member of a group is required to transmit session messages which are used to identify participants, report state on the page that the member is viewing, and to estimate host-to-host distances.

Receiver-initiated protocols such as SRM are generally not reliable because the sender has no way of knowing if the packet was correctly received by all. However, the wb application participates in the repair process as well. Since the application has a complete object list, every multicast group member has the complete state of the application and can supply any missing message.

SRM's authors have simulated the behavior of SRM for several topologies. For fixed error correction parameters on random trees with every node in the multicast group, the protocol performs well. However, when the network is sparsely populated the number of duplicate repairs or repair requests is large. This can also occur when the congested link is near the sender. Simulations also show that extending the timing parameters decreases the number of duplicates at the expense of increased delay.

SRM's authors suggest that by dynamically adjusting the delay intervals better performance can be obtained. This is accomplished by decreasing the delay after each unduplicated repair request. When duplicates occur the delay is increased depending on the number and distance of duplicate requests. With the dynamic algorithm, they were able to decrease the number of duplicates compared with the fixed parameter algorithm.

Studies of packet loss patterns [YaKT] suggest that multicast packet losses will occur outside of the router "backbone". This will tend to localize packet losses. To limit the impact of local loss on the entire group SRM's authors suggest that loss neighborhoods might be identified and repair requests limited in scope. How local loss is determined is not stated. The authors suggest administrative scoping, separate multicast groups, or TTL-based (time-to-live) scoping as possibilities.

2.3 Multicast Transport Protocol (MTP & MTP-2)

MTP-2 is a revised version of MTP which was used in Xy, an X-windows application sharing tool. MTP is a NAK-based protocol and assumes that arrival of messages is the normal case. Transfer is based on dividing time into heartbeats, and message senders are only obliged to hold a packet for a limited time after transmission. NAKs are unicast back to the sender and repairs are multicast. Therefore, MTP is not fully reliable, as a member of a group can possibly lose a message, or it can fail without other group members being aware of the problem.

Each multicast group has a master who is responsible for message ordering and synchronization. This is accomplished by granting tokens to senders. All messages using MTP are divided into one or more packets. One packet is transmitted every heartbeat, and messages must be of a minimum number of heartbeats. Shorter messages are padded with idle packets. Also if there are no senders, the master must transmit idle packets.

Ordering is enforced by message number which must be requested from the master as a token. The master maintains a record of the status of messages. A message may be accepted, rejected, or pending. All members must conform to the master's view. This provides atomicity or a common view of communication status. The master also controls membership. Membership is by admission and is only allowed when there are no messages pending.

MTP-2 was designed to correct several deficiencies in MTP: missing address management, the master as a single point of failure, the loading of the master node, poor congestion control, no high priority traffic, no subgroups, no unicasting, and minimum message size. MTP-2 attempts to remedy these problems. Most significantly it adds procedures to choose a new master in case of failure and to move the master node, and it splits the communication into subchannels to allow for subgroups.

2.4 Log-Based Reliable Multicast (LBRM)

LBRM [HoSC] is designed for simulations where a large number of objects are present and each is represented by a process. Examples are battle simulations or multiuser games. LBRM has three important features: heartbeat packets are sent in variable time intervals, logging is performed in a distributed manner, and statistical acknowledgements are used to select a retransmission strategy (multicast or unicast). LBRM uses a logging server to log transmitted packets for the time during which a retransmission might be requested.

2.4.1 Variable Heartbeat

Each sender has an application-dependent inter-heartbeat interval. It is reset to a minimum value h_min for each gap in a data transmission. For each subsequent heartbeat packet the inter-heartbeat interval is doubled up to some given maximum value h_max. So, an isolated packet loss is detected within h_min because the lost packet must be followed by another packet or a heartbeat within h_min. Longer burst errors are discovered in a worst-case time of min(t_burst, h_max) because they are either followed by a packet after t_burst or the heartbeat interval passes without a message.
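The doubling schedule can be made concrete with a minimal Python sketch. The function name and the `idle_time` parameter are illustrative; the rule itself (start at h_min, double after each heartbeat, cap at h_max) is the one described above.

```python
def heartbeat_times(h_min, h_max, idle_time):
    """Times, measured from the last data packet, at which heartbeats are sent.

    The interval starts at h_min and doubles after every heartbeat until it
    reaches h_max. Heartbeats stop at idle_time, when new data is assumed to
    arrive and the interval is reset to h_min.
    """
    times = []
    interval = h_min
    t = interval
    while t <= idle_time:
        times.append(t)
        interval = min(2 * interval, h_max)
        t += interval
    return times


print(heartbeat_times(1, 8, 40))   # [1, 3, 7, 15, 23, 31, 39]
```

The first heartbeat follows the data packet after h_min, which bounds the detection time for an isolated loss, while a long idle period costs only one heartbeat per h_max.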

2.4.2 Distributed Logging

Secondary logging servers are designated in sites. A site is defined by a multicast application and is supposed to be a topologically localized part of the network. It may consist of one or more hosts. Secondary servers request lost packets from a primary logging server. Other recipients ask for recovery at their secondary server. The source receives acknowledgements only from a primary logging server. Criteria for designation of a secondary logging server might be the same as in the case of a file server: fast network connection, large memory, and disk. An alternative approach is to rotate the role of a logging server [ChMa]. Each host uses a series of scoped multicast discovery queries to locate a nearby logging service. If no convenient server is found then the requesting host becomes a server or some other machine in the neighborhood is asked to become a server. Another possibility is a static configuration of a logging server. Secondary servers either multicast or unicast lost packets. A local scope of retransmission may be achieved by setting a TTL field in the retransmission. NAKs that are sent to a secondary server inside of a site do not cause a packet implosion because a site is assumed to be reasonably small.

For better reliability, duplication of the primary log server is suggested. Packets are transmitted from the primary log server to replication servers. The replication servers send ACKs to the primary log server. The primary log server then ACKs packets to the sender. These ACKs contain a primary logger sequence number and a replicated logger sequence number. The data is kept by the sender until it is reliably transferred to the replication servers. Then in case of a primary server failure, all data is available at the replication servers.

2.4.3 Statistical Acknowledgements

When ACKs are sent by secondary servers to the primary sender, the primary server selects a small fraction (a random set) of secondary servers, and they are required to acknowledge packets. If all selected servers don't confirm receipt of a packet, the packet is retransmitted via multicast. Otherwise, it is unicast to other secondary servers that send NAKs. The time interval of a multicast is divided into epochs. For each epoch a new set of responding servers is chosen. The sender waits t_wait after each packet so that servers have enough time to send their ACK. In order to decide on the number of acknowledging secondary servers, the source must know the number of servers in the group so that the selected servers constitute the desired fraction of the whole group.
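The per-epoch selection and the multicast/unicast decision can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the `threshold` parameter (how many designated ACKs must be missing before a multicast retransmission) is an assumption, because the text does not give the exact rule.

```python
import random


def choose_ackers(servers, fraction, rng=random):
    """Pick the random subset of secondary servers that must ACK during one epoch.

    The source must know len(servers) so that the subset is the desired
    fraction of the whole group.
    """
    count = max(1, round(fraction * len(servers)))
    return set(rng.sample(sorted(servers), count))


def retransmission_mode(ackers, acks_received, threshold=1):
    """Decide how to repair a packet once t_wait has passed.

    threshold is an assumed parameter: multicast if at least this many
    designated ACKs are missing, otherwise unicast to servers that NAK.
    """
    missing = len(ackers - acks_received)
    return "multicast" if missing >= threshold else "unicast"
```

A new call to `choose_ackers` at the start of each epoch spreads the acknowledgement duty over the group, so the sender handles a bounded number of ACKs per packet regardless of group size.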

2.5 Xpress Transport Protocol (XTP)

The XTP protocol is designed as a general protocol to replace TCP [StDW]. As part of its basic functionality XTP provides multicast layered on top of an unreliable datagram service (usually IP-multicast). It is designed to allow the application to select its transport policies independent of each other. For example, the choice of reliable or unreliable service is independent of the choice of multicast or unicast. XTP provides both unreliable (best-effort) and reliable multicast. XTP reliable multicast uses an ACK-based model. The ACKs are sent on request by the sender for a group of packets, and the receiver responds by reporting what packets have been received in a single control message.

NAKing either uses the XTP fast-NAK mechanism or does not occur. Fast-NAKing means that a receiver NAKs after receiving an out-of-order packet instead of waiting until it is polled for an ACK. Because it is sender-initiated, XTP allows the application to select who is admitted to a group as a fully-reliable receiver, and therefore, XTP can support mixed groups of reliable and unreliable receivers. XTP version 4.0 [ACFS] has all control traffic unicast to the sender. This can overload the sender in large groups and limit throughput.

In order to remedy this problem, Hofmann [Ho97] proposes adding Local Group Controllers (LGCs). The LGC collects the acknowledgments for some number of machines in its local neighborhood and passes this on to the sender. The sender is only aware of LGCs being in the reliable multicast group. LGCs repair the lost packets within the local group by retransmitting packets that they have correctly received.

This is done by the following basic algorithm:

- If the LGC has the packet, it unicasts it or multicasts it to the entire group via a special local group multicast address. The decision is based on recovery mode and number of receivers who have not received the packet correctly.
- If the LGC doesn't have the packet but it is available from some other member of the local group, then the LGC requests that the packet be multicast by the member who has it.
- If the packet isn't available from the local group, a NAK will eventually be sent to the LGC's parent, who is either another LGC or the sender. If the parent is another LGC, then the process is repeated until the packet is supplied or the NAK eventually reaches the sender, who retransmits.
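The three cases of the basic algorithm can be sketched as a single decision function. The function name, the returned action labels, the `multicast_threshold` parameter, and the choice of which member is asked to retransmit are all assumptions made for illustration; the text says only that the unicast/multicast decision depends on the recovery mode and the number of receivers missing the packet.

```python
def lgc_repair_action(lgc_has_packet, members_with_packet, requesters,
                      multicast_threshold=2):
    """Return (action, target) for one lost packet at a Local Group Controller.

    multicast_threshold is a hypothetical parameter: with at least this many
    requesters the LGC multicasts to the local group, otherwise it unicasts.
    """
    if lgc_has_packet:
        if len(requesters) >= multicast_threshold:
            return ("multicast_to_local_group", None)
        return ("unicast_to_requesters", sorted(requesters))
    if members_with_packet:
        # Arbitrary assumed choice: ask the member with the smallest name.
        return ("ask_member_to_multicast", min(members_with_packet))
    # Nobody in the local group has the packet: escalate to the parent,
    # which is either another LGC (same procedure) or the sender.
    return ("nak_to_parent", None)
```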


In fact, there are two error repair algorithms in [Ho97], load-sensitive and delay-sensitive. In load-sensitive mode, the LGC delays after receiving a repair request. At the end of the delay it counts the number of retransmission requests and makes a unicast/multicast decision. If the LGC itself detects a packet error, it delays. If an ACK from a group member is received during the delay, the packet is requested locally. Otherwise, it is NAKed. Note that repair retransmissions by group members other than the LGC are always multicast.

In delay-sensitive mode, the LGC immediately retransmits a NAKed packet via multicast. If the LGC is missing the packet or notices that it has not received a packet, it immediately multicasts a repair request to the local group. Each group member that has the packet waits a time proportional to its distance from the LGC and then multicasts the packet if it has not already been multicast. NAKing to the LGC's parent only occurs after a suitable delay.

Management of local groups is defined by the Dynamic Configuration Service protocol (DCS). DCS specifies that all LGCs periodically advertise their presence by sending a message to a group-specific multicast address. The message includes: a smoothed error probability estimate, size of the local group, the local group multicast address, and the original time-to-live (TTL) of the message. This TTL value is dynamically varied on different advertisements. Optional metrics such as round trip time, cost, bandwidth, throughput, error probability, and security restrictions are also possible.

The exact selection algorithm is not specified but left to the application. Some kind of weighted-distance measure based on the advertisements is suggested. In addition, some suitability measure is suggested. If no suitable LGC is found, a new receiver forms a new local group and appoints itself LGC.

Local groups are dynamically reconfigurable. This is done by each receiver periodically evaluating all advertisements and reselecting its LGC if a reasonably better LGC is found. Receivers which determine themselves to be reasonably better than any current LGC can appoint themselves LGC for a new local group. The metric for these decisions is not specified, but Hofmann recommends the algorithm used by TMTP [YaGS]. The dynamic reconfiguration ability provides fault tolerance. Group members notice that their current LGC is not advertising and therefore has infinite metrics. So, they will switch to a new LGC.

2.6 Reliable Multicast Transport Protocol (RMTP)

RMTP [LiPa] organizes the receivers into a tree in a manner similar to XTP with Local Group Controllers (LGCs). In RMTP the LGCs are called designated receivers. At the time a multicast session is created, the designated receivers are manually created and organized into a hierarchy.

The designated receivers and the sender then periodically broadcast a special "offering" packet with TTL set to some constant value. Receivers then attach themselves to the designated receiver whose offer arrives with the largest remaining TTL value. This ensures that they choose the designated receiver closest and least upstream from them. Whenever a receiver gets a packet with a better offer (higher TTL value), it changes its designated receiver.

The TTL value is saved for a specified time that is reset with each new offering packet from the designated receiver. This way, if a designated receiver fails, its children will in a short while reset the offer value to zero and begin the search for a new designated receiver.
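The selection and expiry rules can be sketched from the receiver's point of view. The class and method names and the `hold_time` parameter are assumptions for illustration; the behavior (adopt the offer with the highest remaining TTL, refresh on each offer from the current designated receiver, fall back to an offer value of zero on expiry) follows the description above.

```python
class DesignatedReceiverChoice:
    """A receiver's view of its current designated receiver (illustrative sketch)."""

    def __init__(self, hold_time):
        self.hold_time = hold_time   # how long a saved offer stays valid
        self.best = None             # current designated receiver, if any
        self.best_ttl = 0            # saved offer value; zero means no offer
        self.expires = 0.0

    def _expire(self, now):
        if self.best is not None and now >= self.expires:
            self.best, self.best_ttl = None, 0   # designated receiver presumed failed

    def on_offer(self, sender, ttl_remaining, now):
        """Process an offering packet that arrived with ttl_remaining hops left."""
        self._expire(now)
        if sender == self.best or ttl_remaining > self.best_ttl:
            # Either a refresh from the current choice or a strictly better offer.
            self.best, self.best_ttl = sender, ttl_remaining
            self.expires = now + self.hold_time

    def current(self, now):
        self._expire(now)
        return self.best
```

Because every offer starts with the same TTL, a higher remaining TTL means fewer hops between the offering node and the receiver.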

RMTP is an ACK-based protocol with receivers sending periodic ACKs for a block of messages. Missing messages are indicated in the ACK. ACKs and repairs are processed by the designated receivers. The designated receiver may either unicast or multicast repairs based on how many of its children have not received a message.

What makes RMTP unique is how multicast repairs are handled. RMTP has its own routing processes which are a modification of the standard IP multicast routing process. The modified routers implement a special IP message type called subtree multicast. When a designated receiver multicasts a repair packet it sends a subtree multicast packet to the nearest routing process. This process then multicasts the packet downstream. This limits the repair traffic to the designated receiver's children.

Finally, RMTP allows the receivers to join the session any time and still get the entire data transfer. In order to support this the data is cached during the entire session by the designated receivers. Part of this cache is in memory and part on disk.

2.7 Tree-based Multicast Transport Protocol (TMTP)

TMTP [YaGS] uses an expanding-ring search to organize receivers into a hierarchical, dynamic control-tree. (An expanding-ring search sends messages with increasing Time-to-Live (TTL) until an answer is received.) The tree is used for disseminating restricted negative acknowledgments with NAK suppression and periodic positive acknowledgements for flow and error control. Control is distributed (i.e., not centralized in the sender), may be local, and parallel at different branches of the tree.

A single multicast sender together with its set of receivers is called a dissemination group. All the group members in the same subnet belong to one domain. The group is organized into a hierarchy of domains, each represented by a domain manager. The first node in a subnet becomes the domain manager for that subnet. Domain managers also accept at most k other domain managers as children. This creates a control tree of degree k + 1. So, protocol overhead grows proportionally to log_k(#receivers). Each interior node in the tree handles only the errors reported by its child domain managers who send ACKs to it. The sender receives ACKs only from its k domain manager children and its subnet children. Each domain manager sends ACKs to its parent immediately after a packet reception and doesn't wait for its children's ACKs. In the case of a packet omission, a NAK with limited Time-to-Live (TTL) is multicast. The TTL localizes the error recovery so that only sites that are likely to have the same domain manager receive the NAK. (This technique is combined with a NAK-suppression algorithm.) A new manager joining a tree uses an expanding-ring search while sending search-for-parent messages. These messages contain a TTL which is increased each time that no potential parent is found with less than k domain-manager children.
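The search-for-parent procedure can be sketched as a loop over increasing TTL values. The function name, the representation of candidates as (hop distance, name, child count) tuples, the TTL step of one hop, and the `max_ttl` limit are assumptions for illustration.

```python
def expanding_ring_search(candidates, k, max_ttl=32):
    """Find a parent domain manager for a new domain manager.

    candidates: list of (hop_distance, name, domain_manager_children).
    A candidate can answer if the search message reaches it (hop_distance
    <= ttl) and it has fewer than k domain-manager children.
    Returns (parent_name, ttl_used), or (None, max_ttl) if nothing is found.
    """
    for ttl in range(1, max_ttl + 1):        # assumed step: one extra hop per round
        reachable = [(dist, name) for dist, name, children in candidates
                     if dist <= ttl and children < k]
        if reachable:
            return min(reachable)[1], ttl    # nearest eligible manager
    return None, max_ttl


managers = [(2, "m1", 3), (4, "m2", 1), (7, "m3", 0)]
print(expanding_ring_search(managers, k=3))   # ('m2', 4): m1 is nearer but full
```

Since each manager accepts at most k manager children, the search naturally pushes new managers deeper into the tree once the nearby managers are full.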

In the event of a domain manager failing or leaving the group, its orphans must look for a new parent. Because the domain manager ACKs before receiving an ACK from all its children, it is possible to lose data when a domain manager fails.

3 Theoretical Performance Limits

When one asks the question "Which protocol is best?" one finds no clear answer.

While the authors of the various protocols have simulated and tested their protocols they have not done so under the same assumptions and conditions. However, there has been some work comparing protocols. Towsley et al. [ToKP] analyzed three theoretical protocols and calculated the theoretical load on both the sender and the receiver as a function of the number of receivers. For the analysis they assumed that:

The probability of packet loss is the same for all receivers.

All packet losses were independent.

ACKs and NAKs were never lost.

The first protocol is a sender-initiated, ACK-based protocol. This protocol repeats only lost packets, multicasts both original and repair packets, requires receivers to ACK each packet, and retransmits packets after a timer expires without all ACKs being received. The analysis of this protocol shows that the sender is the bottleneck and that the load on the sender is $O(R \lg(R))$, where $R$ is the number of receivers. The primary reason is the need for the sender to process the ACKs. It should be noted that even if no errors occur, this protocol loads the sender at $O(R)$.

The second protocol is a receiver-initiated, NAK-based protocol. This protocol repeats only lost packets, multicasts both original and repair packets, requires receivers to unicast NAKs when they notice a missing packet, and uses timers to detect when retransmissions are lost. Again the sender is the most loaded node. Its load grows as $O(R)$. As the chance of an error approaches zero, the load function becomes $O(1)$.

The third protocol is also a receiver-initiated protocol. It is similar to the second except that NAKs are multicast and NAK suppression is employed. For this protocol both the receiver and the sender are loaded approximately equally. Both have load functions of $O(\lg(R))$ that become $O(1)$ as the error probability approaches zero.

Levine and Garcia-Luna-Aceves [LeGA] extended this analysis for three more protocols:

a ring-based protocol with unicast retransmissions (similar to RMP),

a tree-based protocol using ACKs and local group controllers to process ACKs for part of the receiver group (similar to RMTP),

a tree-based protocol that is NAK-based with NAK suppression.

The analysis found that all three protocols loaded systems at $O(1)$ with an error probability of zero. With non-zero error probability the results were different. The ring-based protocol had a load function on all nodes of $O(R)$. The ACK-tree protocol loaded group controllers with a function $O(B \lg(B))$, where $B$ is the size of the local group. This is $O(1)$ in the number of receivers. The NAK-suppression, tree-based protocol was best with a load function of $O(\lg(B))$.

These studies do not consider protocols like SRM. SRM is in many respects similar to the theoretical receiver-initiated protocol with NAK suppression (RINS). However, in SRM all members of the multicast group are required to process NAKs and multicast repair packets. We will call this property global repair and analyze the effects of this property on the sender and receiver processor load.

Let $E[\mathit{msg}]$ be the expected processing load to transmit a message, $E[M]$ be the expected number of transmissions required to successfully transfer a message, $E[\mathit{RT}] = E[M] - 1$ be the expected number of retransmissions, and $E[\mathit{NAK}]$ be the expected processing load for a NAK. The load on the sender is the load to transmit messages, retransmit messages, and process NAKs. From [ToKP] we have that the expected load for transmitting and retransmitting in RINS is $E[\mathit{msg}]\,E[M]$ or, dividing it into its components, $E[\mathit{msg}](1 + E[\mathit{RT}])$. Using the assumption from Towsley's analysis that NAK suppression reduces the number of NAKs to one before each retransmission, we have $E[\mathit{NAK}]\,E[\mathit{RT}]$ as the expected load for NAK processing.

Now, RINS with global repair requires both the sender and the receiver to buffer the message and retransmit it. Under the assumption of independent errors, the expected number of receivers who correctly receive the message is $R(1 - p)$, where $p$ is the probability of an error. Also under the independent-errors assumption, all holders of the message are equally likely to retransmit. So,

$$\mathit{Load}_{\mathit{sender}} = E[\mathit{msg}] + \frac{E[\mathit{msg}]\,E[\mathit{RT}]}{R(1 - p) + 1} + E[\mathit{NAK}]\,E[\mathit{RT}]$$

From [ToKP] we have that $E[\mathit{RT}]$ is $O(\lg(R))$. So, $\mathit{Load}_{\mathit{sender}}$ is $O(\lg(R)/R + \lg(R))$ or $O(\lg(R))$. Similarly, the load on the receiver is the load for RINS plus the load required to process the NAKs and retransmissions. So,

$$\mathit{Load}_{\mathit{receiver}} = E[\mathit{RINS}_{\mathit{receiver}}] + \frac{E[\mathit{msg}]\,E[\mathit{RT}]}{R(1 - p) + 1} + E[\mathit{NAK}]\,E[\mathit{RT}]$$

From [ToKP] we have that $E[\mathit{RINS}_{\mathit{receiver}}]$ is $O(\lg(R))$, and we have that the rest of the equation is $O(\lg(R))$ from the work above. Therefore, the load on the sender and receiver is $O(\lg(R))$. As the error probability approaches zero, the loads approach $O(1)$.
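The logarithmic growth of the retransmission count can be checked numerically. The sketch below relies only on the three assumptions listed at the start of this section (equal, independent loss probability $p$ and lossless feedback), which give $P(M \le m) = (1 - p^m)^R$ and hence $E[M] = \sum_{m \ge 0} [1 - (1 - p^m)^R]$. This closed form is a standard consequence of those assumptions rather than a formula quoted from [ToKP], and the unit costs `e_msg` and `e_nak` are placeholders.

```python
def expected_transmissions(R, p, tol=1e-12):
    """E[M] = sum over m >= 0 of P(M > m), where
    P(M > m) = 1 - (1 - p**m)**R under independent, equal loss."""
    if p == 0:
        return 1.0
    total, m = 0.0, 0
    while True:
        tail = 1.0 - (1.0 - p ** m) ** R
        if tail < tol:          # remaining terms are negligible
            return total
        total += tail
        m += 1


def sender_load_global_repair(R, p, e_msg=1.0, e_nak=1.0):
    """Sender load for RINS with global repair:
    E[msg] + E[msg] E[RT] / (R (1 - p) + 1) + E[NAK] E[RT]."""
    e_rt = expected_transmissions(R, p) - 1.0
    return e_msg + e_msg * e_rt / (R * (1 - p) + 1) + e_nak * e_rt


if __name__ == "__main__":
    for R in (10, 100, 1000, 10000):
        print(R, round(expected_transmissions(R, 0.1), 3),
              round(sender_load_global_repair(R, 0.1), 3))
```

With $p = 0.1$, each hundredfold increase in $R$ adds roughly two expected transmissions, which matches the $O(\lg(R))$ behaviour; with $p = 0$ the load collapses to the single original transmission.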

The final theoretical protocols are also based on the receiver-initiated protocol with NAK suppression. However, instead of retransmitting lost packets, Forward Error Correction (FEC) is employed. FEC groups packets into transmission groups and computes a number of parity packets for each transmission group. Each parity packet can be used to replace any one lost data packet. So, a transmission group with seven data packets and one parity packet would not require any retransmissions as long as no more than one packet was lost at any receiver. Nonnenmacher et al. [NoBT] analyze and simulate two different protocols based on a Reed-Solomon erasure FEC code. The first protocol assumes an FEC layer between the network layer and the reliable multicast protocol. With this protocol $k$ data packets and $h$ parity packets form a transmission group. The entire group is transmitted. As long as the number of lost packets does not exceed $h$, no retransmissions are required. However, the simulations of Nonnenmacher et al. showed that, for group sizes small enough to avoid latencies, the layered FEC protocol was worse than no FEC when burst errors occurred.
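The seven-plus-one example can be illustrated with a single XOR parity packet. This is a deliberately simpler code than the Reed-Solomon erasure code used in [NoBT]: it handles only $h = 1$, and it assumes all packets in a transmission group have equal length. The function names are illustrative.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Parity packet = XOR of all data packets in the transmission group."""
    return reduce(xor_bytes, packets)


def recover(received, parity):
    """Rebuild a transmission group in which at most one packet is None."""
    missing = [i for i, pkt in enumerate(received) if pkt is None]
    if not missing:
        return list(received)
    if len(missing) > 1:
        raise ValueError("one parity packet can repair only one loss")
    present = [pkt for pkt in received if pkt is not None]
    # XOR of the parity with every surviving packet leaves the lost packet.
    repaired = reduce(xor_bytes, present, parity)
    result = list(received)
    result[missing[0]] = repaired
    return result
```

Because the same parity packet repairs whichever single packet a receiver lost, one multicast parity transmission can serve receivers that lost different packets.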

The second protocol assumes that FEC is integrated within the reliable multicast protocol. Here the h parity packets are computed by the sender, but not automatically transmitted. Instead after the data packets are transmitted the sender sends a control packet to the receivers. Receivers who have not received all packets in the transmission group then wait a random time which is shorter for those who have lost more packets.

The first receiver to time out is guaranteed to be one with the most packets lost. This receiver transmits a NAK with the number of packets lost, and its NAK suppresses all others. The sender then transmits parity packets corresponding to the number of lost packets. The NAK sequence is repeated to cover the possibility that one of the parity packets is lost. Nonnenmacher et al. simulated this protocol under the assumption that no transmission group experienced more errors than it had parity packets. Under these conditions the FEC-based protocol performed better than the receiver-initiated protocol with NAK suppression.
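One way to obtain the "most losses times out first" guarantee is to give each loss count its own time slot and randomize only within the slot. The slot layout, the tick unit, and the function names below are assumptions made for illustration; [NoBT] may define the timers differently. The sketch also assumes, as in the simulation, that no receiver loses more than $h$ packets.

```python
import random


def nak_timeout_ticks(lost, h, slot_ticks=1000, rng=random):
    """Timer for a receiver missing `lost` packets (1 <= lost <= h).
    Receivers with more losses draw from an earlier, disjoint slot."""
    if not 1 <= lost <= h:
        raise ValueError("lost must be between 1 and h")
    return (h - lost) * slot_ticks + rng.randrange(slot_ticks)


def first_nak(losses, h, rng=random):
    """Return (receiver, packets_requested) for the NAK that fires first,
    or None if nobody lost anything. All later NAKs are suppressed."""
    timers = {r: nak_timeout_ticks(n, h, rng=rng)
              for r, n in losses.items() if n > 0}
    if not timers:
        return None
    winner = min(timers, key=timers.get)
    return winner, losses[winner]
```

Since the slots do not overlap, the winning NAK always asks for the maximum number of parity packets needed by any receiver, so a single round of parity packets can repair everyone.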

To determine the load function for the integrated FEC protocol, consider $k > 1$ and the optimistic assumption that no receiver ever misses more than one data packet. In this case, transmitting the $k$th packet of the transmission group and then processing NAKs is just like transmitting one packet using the RINS protocol. So, the amount of work for the sender to process the NAKs is amortized over $k$ messages and is $O(\lg(R)/k)$ as a lower bound. Similarly, if one sets $k = 1$ and retransmits the data packet as the parity packet, one has a RINS-like protocol as an upper bound. Remember that RINS's load is $O(\lg(R))$. For the FEC protocol, transmission and receiver loads are similar to those in RINS within a constant factor. So, its load is $O(\lg(R))$. As errors drop to 0, no retransmissions are necessary and the protocol load is $O(1)$ in the number of receivers.

If we look at the protocols surveyed earlier in this paper and classify them into theoretical protocols, we can get an approximate idea of how they will scale. (See Table 1.) MTP and XTP without local groups are basically sender-initiated protocols. RMP is a ring-based protocol. SRM is receiver-initiated with NAK suppression and global repair. LBRM is sender-initiated among the sender, the primary logging server, and any replacement logging servers. It is also sender-initiated between the primary logging server and the secondary logging servers. (Even though only a fraction of the secondary logging servers are polled.) However, it is receiver-initiated between the secondary logging servers and the local site. So, one could expect it to perform similarly to a sender-initiated protocol, with the load on the primary logging server growing as a function of the number of secondary logging servers, $O(S \lg(S))$. The load on the secondary logging servers would be $O(B)$. Finally, RMTP, TMTP, and XTP with local groups are tree-based protocols.

| Protocol Type | Load, $P(\mathit{Error}) > 0$ | Load, $P(\mathit{Error}) = 0$ |
|---|---|---|
| Sender-initiated | $O(R \lg(R))$ | $O(R)$ |
| Receiver-initiated | $O(R)$ | $O(1)$ |
| Receiver-initiated w/ NAK suppression | $O(\lg(R))$ | $O(1)$ |
| Receiver-initiated w/ NAK suppression and global repair | $O(\lg(R))$ | $O(1)$ |
| Receiver-initiated w/ NAK suppression and FEC | $O(\lg(R))$ | $O(1)$ |
| Ring-based | $O(R)$ | $O(1)$ |
| Tree-based | $O(B \lg(B))$ | $O(1)$ |
| Tree-based w/ NAK suppression | $O(\lg(B))$ | $O(1)$ |

Table 1: Protocol load functions for the most heavily loaded member. $R$ is the number of receivers and $B$ is the size of the local group.

The above analysis makes the assumption that packet losses at receivers are independent. However, this is not necessarily the case. As one can see from the routing methods described in the next section, a packet can be lost on a shared link. This can cause more than one downstream receiver to lose the same packet. Bhagwat et al. [BhMT] studied the problem of how network topology affects reliable multicast. The study was done in the context of a sender-initiated reliable protocol with block ACKs. Retransmitted packets were multicast.

The throughput seen by a connection is parameterized by block size, receiver buffer size, topology of the multicast tree, and the loss probabilities at the routers and hosts. First, Bhagwat et al. compute the number of transmission attempts needed before a packet is delivered to the entire group. The total time consists of the average time taken to transmit all packets and the waiting time for acknowledgements. Two different types of multicast trees were studied. The first tree is built so that receivers joining a group are added to the tree by a shortest-path computation without rebuilding the whole tree. The other tree is built so that minimal bandwidth is used.

For both trees, increasing overlap between paths increases the probability of a cumulative loss. The dependence of the optimal transfer time on topology is parameterized by individual loss probabilities at the routers. This means that the tree should be constructed according to the reliability (cache memory capacity) of the routers. Thus, there should be a tradeoff between the number of links through a node and its buffer size. Analysis shows that the optimal (fastest) tree topology need not be exactly the one using the minimal bandwidth or the shortest-path joining mechanism. It also shows that out-of-order delivery of packets increases the throughput. The authors propose using a core-based tree approach, where so-called core nodes are responsible for the reliable transmission in the subtree, as a more efficient way of performing multicast routing.

4 Current IP Multicast Routing Protocols

The reliable protocols described so far in this paper are built on best-effort multicast delivery protocols. The basic architecture and multicast routing protocols were introduced by Deering (1990) [DeCh] in his pioneering paper. His routing trees are source-rooted and are not shared by all senders. Source-based (sender-specific) trees are built either by a distance-vector protocol (DVMRP) or a link-state protocol (MOSPF). MOSPF is based on the underlying unicast routing tables. Deering's ideas were used to implement a multicast subnetwork called the MBone. The MBone is the multicast-capable part (a collection of connected multicast-capable routers) of the Internet. It spans routers with multicast capability either directly or virtually. The MBone consists mostly of a tree structure but contains several meshes.

The MBone uses the DVMRP or MOSPF protocols for routing. MOSPF is used as an intra-domain protocol. A router monitors up-to-date group membership on its attached links and sends multicast traffic over these links. Each subnetwork may run any multicast routing protocol. The inter-domain routers on the MBone use the Distance Vector Multicast Routing Protocol (DVMRP). Inside different address domains any protocol may be used, but border routers must use the standard MBone protocol, DVMRP.

4.1 Distance Vector Multicast Routing Protocol (DVMRP)

This algorithm is a variant of the reverse-path-forwarding technique. A packet arriving at the port used for routing traffic to the sender of the packet is broadcast over all other connected links (except a leaf subnet). Otherwise, the packet is discarded. In order to stop the traffic to those parts of the network where no receivers reside, prune messages are sent from leaf routers towards the source. Prune messages are sent as a reply to the first multicast packet received (data-driven activity). If a router receives prune messages from all its children (downstream routers), then it sends a prune message upstream. This technique is called broadcast and prune. A simpler and older version of this technique, which truncates only leaves, is also known as truncated broadcasting. The prune information must be refreshed at specific intervals; otherwise packets start to be sent again over the branches previously truncated. This state that must be refreshed is called soft state. The biggest disadvantage of this protocol is the initial broadcast of the first packets throughout the entire network before uninterested receivers manage to prune multicast traffic. It also requires periodic prune messages to be sent in order to prevent reflooding. Other protocols avoid the broadcast of the first packets, as this imposes an unnecessary load on the network.
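The reverse-path check and the upstream propagation of prunes can be sketched as follows. The `Router` class, its field names, and the string interface labels are hypothetical, and the sketch leaves out the soft-state timers that would eventually expire a prune; it is not DVMRP's actual message processing.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    interfaces: set                      # all attached interfaces
    route_to: dict                       # source -> interface used to reach that source
    pruned: dict = field(default_factory=dict)   # (source, group) -> pruned interfaces

    def forward(self, source, group, arrival_iface):
        """Return the interfaces a packet is copied to (empty set = discard)."""
        if self.route_to.get(source) != arrival_iface:
            return set()                 # fails the reverse-path check
        blocked = self.pruned.get((source, group), set())
        return self.interfaces - {arrival_iface} - blocked

    def prune(self, source, group, iface):
        """Record a prune received on `iface`. Returns True when every
        downstream interface is pruned, i.e. a prune should go upstream."""
        self.pruned.setdefault((source, group), set()).add(iface)
        return not self.forward(source, group, self.route_to[source])
```

For a router with interfaces `a`, `b`, and `c` whose route to the source uses `a`, packets arriving on `a` are copied to `b` and `c`, while copies arriving on `b` or `c` are dropped. Once both `b` and `c` have been pruned, the router itself has nothing left to serve and prunes upstream.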

4.2 Link-State Multicast Algorithm (MOSPF)

The algorithm extends a standard protocol for collecting routing information, Open Shortest Path First (OSPF). Every router periodically broadcasts (floods) a list of its directly attached neighbors and a list of group-membership changes on incident links. Based on this global knowledge of the topology, every router computes a shortest-path tree (SPT) rooted at the sender using Dijkstra's algorithm [Dijk]. The tree is built inside of each domain. A computation of the SPT is started after reception of the first multicast packet from a new sender (i.e., data-driven computation). OSPF extended with this multicast ability is called MOSPF.
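The per-sender computation each router performs can be illustrated with a plain Dijkstra implementation. The adjacency-dictionary format and the function name are assumptions for this sketch; a real MOSPF router would work from its link-state database and also record group membership when pruning the tree.

```python
import heapq


def shortest_path_tree(graph, source):
    """graph: {node: {neighbor: link_cost}}.
    Returns {node: parent} describing the SPT rooted at `source`."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                     # stale heap entry
        done.add(u)
        for v, cost in graph[u].items():
            nd = d + cost
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent
```

Every router runs this computation for each (sender, group) pair it sees traffic for, which is the per-sender cost the scalability discussion refers to.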

Like DVMRP, MOSPF is not very scalable. The shortcoming of MOSPF is in the rather complex computation of shortest-path trees and in the broadcast of a list of neighbors over the entire network which is unreasonably expensive in WANs.

4.3 Other Related Protocols

Internet Group Management Protocol (IGMP) is used in the scope of one subnetwork for joining hosts to a router [Deer]. Multicast routers learn about new members of a group by accepting a host membership report packet. This operation scales well as receivers join the closest router possible. The IGMP protocol is the underlying mechanism that monitors the presence of group members at subnetworks so that routers propagate packets over appropriate links to the subnetworks and group members.

5 Proposed New IP Multicast Routing Protocols

Increasing use of the MBone has led to concerns that current multicast routing protocols may not be efficient enough and that continued increases in multicast traffic could overload the routers. The primary reason for this concern is that current protocols construct a routing tree for each sender in the group and, therefore, require each router to maintain a table entry for every sender in every group.

In order to solve this problem, three new protocols have recently been proposed: CBT - Core-Based Trees multicast architecture [Ball], PIM - Protocol-Independent Multicast [Deta], and MIP - Multicast Internet Protocol [PaGA]. All of these protocols are based in part on shared trees instead of shortest-path trees that are rooted on each sender. CBT constructs only a single shared tree per group. The shortcoming of this approach is that the routing may not be fast enough, especially for delay-sensitive applications. PIM allows for both a shared tree and shortest-path trees that are rooted on high-speed senders. On the other hand, it sends periodic control messages to dynamically maintain tree topology. These messages may overload links. Transient loops may occur in both CBT and PIM. Even though both protocols are presently drafts, PIM has already been implemented and is used in some multicast domains. The most recent protocol, MIP, claims to remedy the shortcomings of PIM and CBT. Its authors have also proven the correctness of this protocol.

These protocols differ in how strongly they depend on unicast routing tables, and new protocol designs propose independent multicast routing topologies. Scalability and the possibility of choosing a convenient protocol according to the requirements of the application and the properties of a multicast group (sparse/dense mode) are other leading factors in the design of multicast protocols.

5.1 Source-Based Trees Versus Shared Trees

Basic IP routing is now based on the reverse shortest-path tree method and Dijkstra's shortest-path tree algorithm. A tree rooted at a sender is constructed separately for each sender, and the number of control messages increases with every newly built tree. Memory needed for routing tables is $O(S \times G)$, where $S$ is the number of senders and $G$ is the number of groups. Shared trees scale better as they need only $O(G)$ space. Multicast routing should be independent of the underlying unicast routing algorithm, as changes in routing tables may cause problems in multicast routing.
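The difference between the two table sizes is easy to see with a small count. The sketch below assumes the worst case in which a router lies on every tree, and the group names and sender counts are made up for illustration.

```python
def source_tree_entries(senders_per_group):
    """One (source, group) entry per sender of every group: O(S x G)."""
    return sum(senders_per_group.values())


def shared_tree_entries(senders_per_group):
    """One entry per group, regardless of the number of senders: O(G)."""
    return len(senders_per_group)


if __name__ == "__main__":
    groups = {"conference": 20, "lecture": 1, "game": 50}
    print(source_tree_entries(groups), shared_tree_entries(groups))
```

With these example figures a router on every tree holds 71 source-based entries but only 3 shared-tree entries, and the gap widens with every additional sender.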

Source-based tree algorithms are also referred to as dense-mode algorithms, which means they are only efficient in subnetworks where the group members are densely populated. However, this is not the case in WANs. Source-based trees are time optimal. They are suitable for time-sensitive applications such as image or voice transmission, as they deliver packets over optimal paths. Construction of a source-rooted tree is data-driven. The shortest-path tree (SPT) rooted at a sender is computed at a network node after the receipt of the first multicast data packet.

Shared trees save bandwidth and storage of link-state information. They are constructed to originate from a central router called the core router, core, rendezvous point, or root. Shared trees span only the receivers in a group. (Packets from non-member senders can be encapsulated and unicast to the core of the tree.) Algorithms that use shared trees scale better than those using source-rooted trees (router state maintenance is an important scaling factor), are simple, are robust provided that a core-router failure is detected quickly and the core router is replaced, and are able to interoperate with other protocols (DVMRP). They don't broadcast routing information over the entire network as MOSPF does. Broadcast in WANs is expensive, so SPT-based protocols are not as efficient in WANs. Shared trees always use the same set of links for every source. This concentrates traffic on a subset of links that may become congested. A route used for packet delivery from a source to a destination need not be a shortest path, which implies that packet delivery is relatively slow. Although it has been shown that the delay grows by only a constant factor, this holds only if the sender is in a network center. This assumption does not hold in most cases. Core routers can be placed either manually or by a bootstrap mechanism. As shared trees scale better than source-based ones, they are more suitable for sparse-mode multicast, where receivers sparsely populate a wide area.

A weak point for robustness in shared-tree protocols is the core of the tree. The failure of a core node has impact on the whole routing structure. In order to make the protocol robust, a failed core must be replaced as rapidly as possible.

5.2 Core Based Trees Multicast Routing Architecture (CBT)

The CBT protocol uses a shared tree which spans all members of a specific group regardless of the location of the sender. (I.e., there is only one tree for each group.) This gives the protocol better scaling characteristics and makes it suitable for sparsely distributed group members in a WAN. On the other hand, the shared tree concentrates traffic over a small subset of links, and the routing is not always optimal. So, CBT can increase delivery delays and congestion for a group. CBT builds shared trees before any multicasting starts. The protocol operations for CBT are summarized below.

Tree Construction The following summarizes a tree construction for CBT:

A receiver wanting to join a multicast group must first multicast an IGMP host membership report. This causes a transient multicast entry for the "interested router" to be added to the routing table at the closest CBT router. By an interested router we mean a router through which a receiver asks for the connection to a multicast group.

A CBT router receiving the report sends a JOIN_REQUEST message towards the core. (See Figure 2.)

The JOIN_REQUEST message sets up a "transient join state" in the routers that it traverses before it is acknowledged. The join state records the group identifier, the incoming interface, and the outgoing interface. The incoming interface is the previous hop (IP source address) and the outgoing interface is the next hop (found in the routing table) towards the core.

The JOIN_REQUEST message is acknowledged either by the core router itself or by another router on the path to the core that has already been attached to the multicast tree.

The transient state expires unless it is acknowledged by a JOIN_ACK message from upstream. After receiving this message the router is connected to the multicast tree.
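The join steps above can be sketched as a walk along the unicast path to the core. The function and variable names are hypothetical, routers are represented only by their names, and the sketch omits the per-interface state and the timer that expires an unacknowledged join.

```python
def join_group(path_to_core, on_tree):
    """path_to_core: router names from the receiver's CBT router to the
    core (inclusive). on_tree: set of routers already on the group tree;
    it is updated in place. Returns (acknowledging_router, new_branch)."""
    if not path_to_core:
        raise ValueError("path must end at the core")
    core = path_to_core[-1]
    transient = []                       # routers holding transient join state
    for hop in path_to_core:
        if hop == core or hop in on_tree:
            acker = hop                  # first on-tree router answers with JOIN_ACK
            break
        transient.append(hop)            # JOIN_REQUEST is forwarded towards the core
    # The JOIN_ACK travels back downstream and confirms each transient state.
    on_tree.add(acker)
    for hop in reversed(transient):
        on_tree.add(hop)
    return acker, transient
```

A first receiver joining through `r1`, `r2`, `r3` is acknowledged by the core itself. A later receiver whose path runs through `r2` is acknowledged by `r2`, so only the new branch carries control traffic.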

Loop Prevention There are three main loop prevention provisions.

At any instant there is only one active core per group which implies that a receiver can join only one core.

In the case that an upstream router is not reachable, the entire subtree is flushed and receivers must connect again. This prevents a router from being connected to two cores at the same time. This is also part of the solution to the problem of replacing a failed core by a new one.

A transient loop due to changes in unicast routing tables causes a message to loop back to the sender. Such a message is never acknowledged, so a cycle is never established as a routing path.

Routing

The state that is set by a JOIN_ACK message in routing tables is used for routing multicast traffic in both directions. The group address is used as an index into a routing table. An incoming packet is forwarded over all interfaces listed in the table entry except the incoming interface.

Upstream and downstream orientation is kept by distinguishing between the upstream interface (parent) and downstream interfaces (children) both in CBT and intermediate routers. The sense of direction is used only when forwarding control messages, while data packets are sent in both directions.
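The bidirectional forwarding rule fits in a few lines. The table layout (group address mapped to a set of tree interfaces) and the interface names are assumptions for illustration.

```python
def cbt_forward(tree_ifaces, group, arrival_iface):
    """tree_ifaces: {group_address: set of on-tree interfaces}.
    A data packet is copied to every tree interface except the one
    it arrived on, whether that is the parent or a child."""
    return tree_ifaces.get(group, set()) - {arrival_iface}


if __name__ == "__main__":
    table = {"224.1.2.3": {"parent", "child1", "child2"}}
    print(cbt_forward(table, "224.1.2.3", "child1"))   # goes up and to the sibling
    print(cbt_forward(table, "224.1.2.3", "parent"))   # goes down to both children
```

No per-sender state is consulted, which is why a single entry per group is enough.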

Tree Maintenance Shared trees are dynamic structures. Receivers that are still interested in group membership must send periodic messages that prevent routers from deleting the receivers' entries from the routing tables.

CBT multicast tree maintenance is done by keep-alive messages called ECHO_REQUESTs. They are sent periodically (at a granularity of minutes) by downstream routers to their upstream neighbors. One message may represent all children. A keep-alive message is multicast over multicast links to all CBT routers with TTL = 1. The effect of this is that the message suppresses other messages from other children having the same parent. If multicast is not supported at a router, the message is unicast.

ECHO_REQUEST messages are acknowledged by ECHO_REPLY messages. These messages notify child routers about the state of their parent (e.g., the time when the next ECHO_REPLY message will be sent, the group multicast tree to which the parent is connected, etc.).

Because different routing domains work to some extent independently, a "multiple edge" may appear in a routing tree. This is the case when one link is shared by two or more unicast routing domains and each of them chooses a different upstream router. Then multiple upstream routers would be inherited into the tree. To avoid this, all routers on the link have to agree on a single upstream router for all groups, a so-called designated router (DR).

[Figure 2: CBT - Shared multicast tree. Labels: Receiver, Core, CBT Router, Sender; messages: IGMP, Join Request, Echo_Request, Echo_Reply, Sender Unicast; router state: nontree, (group, iif, oif).]

Multicasting Messages by Non-Member Senders

A message sent by a non-member sender that is on the tree is forwarded over all outgoing interfaces. This is enabled by the bi-directional nature of tree edges.

A message sent by a non-member sender that is not attached to the tree is unicast to the core-router rst and then multicast over the shared tree.

5.3 Protocol Independent Multicast (PIM)

PIM protocols exist in two versions: PIM-DM, dense mode, and PIM-SM, sparse mode.

PIM-DM uses a source-rooted tree in a way similar to DVMRP. However, when receiver groups are sparsely distributed, traditional routing protocols with source-based trees incur a high overhead relative to the amount of data transferred. PIM-SM is especially suited for this case.

PIM-SM is similar to CBT but differs in its ability to switch between two modes. The first mode is similar to CBT and uses a shared tree. The second mode uses a source-based tree. PIM does not have the disadvantages of DVMRP. It does not broadcast the first packets of a session over the entire domain until branches are pruned. It also tries to optimize the routing-tree structure for each group. PIM is a receiver-initiated protocol that uses a shared multicast tree centered at a Rendezvous Point (PIM's term for the core). Additionally, source-specific trees are built in some special cases. PIM is independent of any unicast routing protocol and uses a soft-state mechanism to adapt to network and group changes.

The authors claim that the PIM protocol is robust, flexible, and scales well. Robustness is achieved by establishing a small set of Rendezvous-Point (RP) candidates and using a soft-state refresh mechanism. One of the RP candidates is elected as a replacement after the active RP fails. The protocol flexibly switches between shared-tree and shortest-path-tree routing. The scalability of the protocol is evaluated in terms of the growth of its overhead (bandwidth, processing, state storage) with the size of the network, the number of receivers in one group, and the distribution of a group's receivers and senders. The protocol has been implemented and is used by some routers as an intra-domain protocol in the same way as MOSPF is sometimes used.

Joining and pruning are explicit. This is different from classical DVMRP, where tree construction is data-driven and starts only after the first packet is received. In contrast, PIM routing trees are built ahead of time, before any multicasting starts. PIM uses a Rendezvous Point for senders to announce their presence and for receivers to learn about senders.

Details of the protocol operations follow.

Shortest Path Tree (SPT) Construction A SPT is established for delay-sensitive (high-rate) applications and in the case where there are many simultaneous multicast sessions from different sources. In the first case, the reason for a SPT is faster delivery of packets over the SPT than over the Rendezvous-Point Tree (RPT). In the second case, a SPT is used in order to avoid overloading the RPT links.

The Multicast Routing Table (MRT) contains entries, $(S, G)$, for source/group pairs. The pairs provide an index to the incoming and outgoing interfaces.

Receivers check the rate of data packets and may decide to build a SPT for receiving from this sender. In other cases, the decision is made by the rendezvous point.

The last-hop (closest) router of a receiver joins the source-rooted tree by sending a Join/Prune message to the source.

As the Join/Prune message travels through intermediate routers, PIM-prune messages are sent to the RP in order to disconnect the RPT from the SPT.

Rendezvous Point Tree (RPT) Construction

The MRT entry is $(*, G)$, where the content is the same as in the case of the SPT. The appropriate RP is substituted for the wild-card star.

If a receiver wants to join a group, its last hop PIM router sends a Join/Prune message to the RP for the group. As the message travels through intermediate routers, the route from the RP to the receiver is established.

Sending Messages

A source wanting to multicast to a group sends the data to a Designated Router (DR) in its domain. The designated router sends a PIM-Register message encapsulating the data to the appropriate RP. (See Figure 3.) The RP checks the rate of the source's data and decides which kind of tree is more appropriate. A shared tree is also more efficient when the number of senders is high. In the case when a shared tree is more convenient, the RP sends a Join message to the source, which establishes a route for packet delivery.

[Figure 3: PIM - Rendezvous-point tree. Labels: Receiver, RP, Sender, DR; messages: Join Request, PIM-register, Join/Prune.]

In the case when a SPT is needed, every receiver learns about it eventually (e.g., by measuring the rate of data packets received via the RP) and sends a prune message to disconnect from the shared tree. To establish a SPT, join messages are sent to find the shortest path to the sender. Once the RP is disconnected, it may start to send data directly over the shortest-path tree if it has already joined the tree.

Rendezvous Point Discovery

A bootstrap router keeps a list of possible RPs for all groups.

A set of active RPs is distributed to all PIM-servers.

Each router uses the same hash function to map a group to a RP.
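The key property of the discovery steps above is that every router, given the same set of active RPs, independently arrives at the same RP for a group. The sketch below shows one way to get that property; the SHA-256-based mapping, the function name, and the example addresses are assumptions for illustration and are not the hash function defined by PIM.

```python
import hashlib


def select_rp(group, active_rps):
    """Map a group address to one RP from the distributed RP set.
    Sorting first makes the result independent of list order."""
    if not active_rps:
        raise ValueError("no active RP available")
    ordered = sorted(active_rps)
    digest = hashlib.sha256(group.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]
```

Because the mapping depends only on the group address and the RP set, no per-group negotiation between routers is needed. A drawback of this simple modulo scheme is that changing the RP set can remap many groups at once.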

Summary Data packets from a source travel to a RP and then to receivers over the RPT. If the RP decides (based on the packet rate) that a SPT is more appropriate, receivers start to build a SPT, and the traffic is routed via the partially built SPT. Once the SPT is completely built, the source can stop sending data through the RP.

5.4 Multicast Internet Protocol (MIP)

MIP constructs group-shared and shortest-path multicast trees similarly to PIM, but it uses different algorithms to build routing trees. Shared trees are constructed on either a sender-initiated or a receiver-initiated basis. MIP multicast routing is independent of unicast routing. The protocol does not use soft state. A diffusing computation is used to disseminate multicast routing information. As communication via IP is not reliable, the delivery of control messages is acknowledged during the diffusing computation. The next diffusing computation is run only on demand after a topology change.