Internet Engineering Task Force                            A. Zimmermann
Internet-Draft                                              A. Hannemann
Intended status: Experimental                     RWTH Aachen University
Expires: August 1, 2009                                 January 28, 2009


          Make TCP more Robust to Long Connectivity Disruptions
                       draft-zimmermann-tcp-lcd-00

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on August 1, 2009.

Copyright Notice

Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

TCP was designed with fixed, wired networks in mind. As a result, TCP performs suboptimally in networks where connectivity disruptions are frequent, e.g., in wireless (multi-hop) networks. One reason for the performance degradation is TCP's over-conservative behavior in the face of long connectivity disruptions.

Zimmermann & Hannemann   Expires August 1, 2009                 [Page 1]
Internet-Draft        Make TCP more Robust to LCDs          January 2009

This document describes how connectivity disruption indications provided by standard ICMP messages may be exploited to improve TCP's performance. An RTO revert strategy is proposed that enables earlier detection of whether connectivity to a previously disconnected peer node has been restored. The scheme is a sender-only modification that fully respects the TCP congestion control principles.

1.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

The term "acceptable acknowledgment (ACK)" in this document refers to a TCP segment that acknowledges previously unacknowledged data (as defined in [RFC0793]).

The Transmission Control Protocol (TCP) sender state variable "SND.UNA" and the current segment variable "SEG.SEQ" are used as defined in [RFC0793]. SND.UNA holds the segment sequence number of the oldest outstanding segment. SEG.SEQ is the segment sequence number of a given segment.

2.  Introduction

Connectivity disruptions can occur in many different situations. The frequency of connectivity disruptions depends on the properties of the end-to-end path between the communicating hosts. While connectivity disruptions can occur in traditional wired networks too, e.g., simply due to an unplugged network cable, the likelihood of occurrence is significantly higher in wireless (multi-hop) networks. In particular, end-host mobility and wireless interference are crucial factors. If the hosts use the Transmission Control Protocol (TCP) [RFC0793] for their communication, the performance of the connection can be significantly reduced compared to a permanently connected path [SESB05].

According to Schuetz et al. [I-D.schuetz-tcpm-tcp-rlci], connectivity disruptions can be classified into two groups: "short" and "long" connectivity disruptions.
A connectivity disruption is short if connectivity returns before the retransmission timeout (RTO) fires for the first time. In this case, TCP recovers lost data segments through Fast Retransmit and lost ACKs through successfully delivered later ACKs. A connectivity disruption is long for a given TCP connection if the RTO fires at least once before connectivity returns. Whether or not path characteristics have changed when connectivity returns after a disruption is a second important aspect for TCP's retransmission scheme [I-D.schuetz-tcpm-tcp-rlci].

This memo focuses on TCP's behavior in the face of long connectivity disruptions in the time "before" connectivity is restored. Moreover, this document does not describe any additional optimization to detect whether the path characteristics remain unchanged. Therefore, TCP's RTO-based loss recovery, and in particular Slow-Start [RFC2581], remains unchanged.

When a long connectivity disruption occurs on the path between two communicating hosts, the TCP sender stops receiving ACKs. After expiration of the RTO, the TCP sender repeatedly retransmits the first unacknowledged segment (SND.UNA) until it is successfully acknowledged. TCP implementations that follow the recommended RTO management proposed in RFC 2988 [RFC2988] double the RTO value after each retransmission attempt. However, the RTO growth may be bounded by an upper limit, the maximum RTO, which is at least 60s but may be longer: Linux, for example, uses 120s. If connectivity is restored between two retransmission attempts, TCP still has to wait until the RTO expires before resuming transmission, since TCP has no means of knowing that connectivity has been re-established. Therefore, depending on when connectivity becomes available again, this can waste up to one maximum RTO of possible transmission time.

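As a rough illustration of the back-off behavior described above, the following Python sketch computes the retransmission times of a sender that doubles its RTO up to a cap, and how long a restored path then stays unused. The function names, the initial RTO of 1s, and the cap of 60s are assumptions chosen for illustration; they are not taken from any particular TCP implementation.

```python
from itertools import accumulate


def backoff_schedule(initial_rto, max_rto, attempts):
    """Return the RTO (in seconds) that is armed before each retransmission."""
    schedule = []
    rto = initial_rto
    for _ in range(attempts):
        schedule.append(rto)
        rto = min(rto * 2, max_rto)  # exponential back-off, bounded by the cap
    return schedule


def retransmission_times(initial_rto, max_rto, attempts):
    """Return the time of each retransmission, in seconds after the first transmission."""
    return list(accumulate(backoff_schedule(initial_rto, max_rto, attempts)))


def idle_time_after_restore(times, restored_at):
    """Return how long a restored path stays unused until the next retransmission."""
    for t in times:
        if t >= restored_at:
            return t - restored_at
    return None  # connectivity returned after the last modelled attempt


if __name__ == "__main__":
    times = retransmission_times(initial_rto=1.0, max_rto=60.0, attempts=8)
    print(times)
    print(idle_time_after_restore(times, restored_at=64.0))
```

Under these assumed values, the seventh retransmission is sent 123s after the first transmission, so a path that comes back at 64s stays unused for 59s, which is close to one maximum RTO.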
This retransmission behavior is not efficient, especially in scenarios or networks like wireless (multi-hop) networks where connectivity disruptions are frequent. In the ideal case, TCP would attempt a retransmission as soon as connectivity to its peer was re-established.

This document describes a method by which the standard Internet Control Message Protocol (ICMP) can be exploited to improve TCP's performance. The presented scheme is a sender-only modification, i.e., neither intermediate routers nor the TCP receiver have to be modified. Furthermore, the proposed modification approaches the ideal behavior if the network allows for it (i.e., no congestion is present). Through an RTO revert strategy, higher-frequency retransmissions can be realized to enable earlier detection of whether connectivity to a previously disconnected peer node has been restored.

3.  Connectivity Disruption Indication

As long as the queue of a router experiencing a link outage is deep enough, i.e., it can buffer all incoming packets, a connectivity disruption will only cause variation in delay, which is handled well by a contemporary TCP with the help of Eifel [RFC3522] or forward RTO (F-RTO) [RFC4138]. However, if the link outage lasts too long, the router experiencing the link outage is forced to drop packets and finally to discard the corresponding route. Means to detect such link outages include reacting to failed Address Resolution Protocol (ARP) queries, unsuccessful link sensing, and the like. However, this is solely the responsibility of the respective router.

Note: The focus of this memo is on introducing a method by which ICMP messages may be exploited to improve TCP's performance; how different physical- and link-layer mechanisms underneath the network layer may trigger ICMP destination unreachable messages is out of scope of this memo.

The removal of the route is usually accompanied by a notification to the corresponding TCP source about the dropped packets via ICMP destination unreachable messages of code 0 (net unreachable) or code 1 (host unreachable) [RFC1812]. Therefore, since ICMP destination unreachable messages of these codes are evidence that packets were dropped due to a link outage, they can be interpreted as a connectivity disruption indication.

Note that there are also other ICMP destination unreachable messages with different codes. Some of them are candidates for connectivity disruption indications too, but need further investigation, for example ICMP destination unreachable messages with code 5 (source route failed), code 11 (net unreachable for TOS), or code 12 (host unreachable for TOS). On the other hand, codes that flag hard errors [RFC1122] are of no use for the proposed scheme. In the following, the term "ICMP unreachable message" is used as a synonym for ICMP destination unreachable messages of code 0 or code 1.

A router experiencing a link outage is an obvious candidate for being heavily congested because it is not just unable to forward packets fast enough, it is unable to forward packets at all. Therefore, TCP's exponential back-off may seem appropriate. However, taking into account the congestion control principles [RFC2914], i.e., congestion is indicated by packet loss, receiving an ICMP unreachable message might be an indication that there is no congestion. For instance, when a (re-)transmission is answered with an ICMP unreachable message, this is a strong indication that there is no congestion in the network, at least on the part of the path that was traveled by both the TCP segment eliciting the ICMP unreachable message and the ICMP unreachable message itself.
Therefore, it seems unduly harsh for TCP to back off as if there were true congestion.

The accurate interpretation of ICMP unreachable messages as a connectivity disruption indication is complicated by the following two peculiarities of ICMP messages. Firstly, they do not necessarily operate on the same timescale as the packets, i.e., in the given case TCP segments, which elicited them. When a router drops a packet due to a missing route, it will not necessarily send an ICMP unreachable message immediately, but may rather queue it for later delivery. Secondly, ICMP messages are subject to rate limiting, e.g., when a router drops a whole window of data due to a link outage, it is unlikely to send as many ICMP unreachable messages as it dropped TCP segments. Depending on the load of the router, it may even send no ICMP unreachable messages at all. Both peculiarities originate from RFC 1812 [RFC1812].

Fortunately, according to RFC 792 [RFC0792], ICMP unreachable messages are obliged to contain in their body the Internet Protocol (IP) header of the datagram eliciting the ICMP unreachable message plus the first 64 bits of the payload of that datagram, i.e., in the case of a TCP segment, both port numbers and the sequence number. This allows the originating TCP to identify the connection about which an ICMP unreachable message is reporting an error. Moreover, it allows the originating TCP to identify which segment of the respective connection triggered the ICMP unreachable message, provided that there are not several segments in flight with the same sequence number. Several segments with the same sequence number may very well be in flight when TCP is recovering lost segments.

4.  Connectivity Disruption Reaction

The complete algorithm is specified in Section 4.1. In Section 4.2, the different steps of the algorithm are discussed in detail.

4.1.
The Algorithm

The following scheme MAY be used by a TCP sender to avoid over-conservative back-offs of the retransmission timer in the case of long connectivity disruptions:

(1)    Set an "UndoBackOff" variable to UNPROVED (equal 0)

          UndoBackOff := UNPROVED.

(2)    Wait for the expiration of the retransmission timer, then proceed to step (RTO).

(3)    Wait either

       (a) for the arrival of an acceptable ACK. When an acceptable ACK has arrived, proceed to step (ACK), or

       (b) for the arrival of an ICMP destination unreachable message. When an ICMP destination unreachable message has arrived, proceed to step (4), or

       (c) for the expiration of the retransmission timer, then proceed to step (RTO).

(4)    Extract the TCP segment header included in the ICMP destination unreachable message

          SEG := Extract(ICMP_MSG).

(5)    If "SEG.SEQ == SND.UNA", i.e., the ICMP unreachable message reports on a retransmission, then

       (a) if "UndoBackOff == UNPROVED", set the "UndoBackOff" variable to PROVED (equal 1)

              UndoBackOff := PROVED.

       (b) else revert one RTO back-off

              RTO := max(MINIMUM_RTO, RTO / 2).

(6)    Proceed to step (3).

(RTO)  This is a placeholder for the standard TCP behavior that must be executed at this point when the retransmission timer has expired. Proceed to step (3).

(ACK)  This is a placeholder for the standard TCP behavior that must be executed at this point when an acceptable ACK has arrived. Proceed to step (1).

4.2.  The Algorithm in Detail

When an RTO expires, TCP marks all outstanding segments as lost, sets the congestion window (CWND) to one segment, backs off the RTO, and retransmits the first unacknowledged segment SND.UNA (step 2). If the RTO expires again, TCP repeats the retransmission of the first unacknowledged segment and backs off again (step 3c).
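The steps of Section 4.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a TCP implementation: the class name LcdSender, its method names, the in_recovery flag, and the MAXIMUM_RTO of 60s are hypothetical, and the standard RTO and ACK processing is reduced to placeholders.

```python
from dataclasses import dataclass

UNPROVED = 0
PROVED = 1
MINIMUM_RTO = 1.0   # seconds; assumed, following the 1s minimum of RFC 2988
MAXIMUM_RTO = 60.0  # seconds; assumed cap for the placeholder back-off


@dataclass
class LcdSender:
    snd_una: int                  # sequence number of the oldest outstanding segment
    rto: float                    # current retransmission timeout in seconds
    undo_backoff: int = UNPROVED  # step (1)
    in_recovery: bool = False     # hypothetical flag: True once the RTO has fired

    def on_rto_expired(self):
        """Step (RTO): placeholder for the standard behavior (only the back-off is shown)."""
        self.in_recovery = True
        self.rto = min(self.rto * 2, MAXIMUM_RTO)

    def on_icmp_unreachable(self, seg_seq):
        """Steps (4) to (6); seg_seq is SEG.SEQ taken from the ICMP payload.

        Returns True if the revert branch of step (5) was executed.
        """
        if not self.in_recovery or seg_seq != self.snd_una:
            return False  # not waiting in step (3), or not a report on a retransmission
        if self.undo_backoff == UNPROVED:
            self.undo_backoff = PROVED  # first report may belong to the original transmission
            return False
        self.rto = max(MINIMUM_RTO, self.rto / 2)  # revert one RTO back-off
        return True

    def on_acceptable_ack(self, new_snd_una):
        """Step (ACK): placeholder for the standard behavior, then back to step (1)."""
        self.snd_una = new_snd_una
        self.in_recovery = False
        self.undo_backoff = UNPROVED
```

In this sketch, three timeouts starting from an RTO of 1s raise the RTO to 8s; the first matching ICMP unreachable message only sets UndoBackOff to PROVED, and each further one halves the RTO (4s, then 2s) until MINIMUM_RTO is reached.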
This pattern is repeated as long as no packet arrives or until the maximum RTO has expired.

If the first received packet after the retransmission(s) is an acceptable ACK (step 3a), TCP proceeds as normal, i.e., it slow-starts the connection. It ignores later ICMP unreachable messages from the window of data that experienced the RTO. Late ICMP unreachable messages are of no use, as the ACK clock is already restarting due to the successful retransmission.

On the other hand, if the first received packet after the retransmission(s) is an ICMP unreachable message, TCP SHOULD revert one back-off for each ICMP unreachable message reporting an error on a retransmission. To decide whether an ICMP unreachable message reports on a retransmission, the sequence number therein is exploited (step 4, step 5). Nevertheless, the first unacknowledged sequence number suffers from the ambiguity of whether it refers to the original transmission or to any of the retransmissions. To be conservative, the first such ICMP unreachable message should be considered to belong to the original transmission (step 5a). However, for each subsequent ICMP unreachable message reporting on the retransmission, TCP SHOULD revert one back-off (step 5b).

Upon receipt of an ICMP unreachable message that legitimately reverts one back-off, there is the possibility that the new RTO has already expired. In that case, TCP SHOULD retransmit immediately, i.e., perform an ICMP-message-clocked retransmission. If the new RTO has not expired yet, TCP MUST wait accordingly.

5.  Discussion

Apart from ICMP unreachable messages reporting on the sequence number of the retransmission, ICMP unreachable messages reporting on the original window of data may also arrive while TCP is in RTO-induced recovery.
As TCP cannot decide from a single or a few ICMP unreachable messages whether the whole window of data was dropped because of a link outage, it is possible that at least one of the segments was dropped due to true congestion in the network, calling for back-off. Therefore, to be conservative, TCP MUST NOT revert the back-off in such a case (step 5a). However, there is still the unlikely possibility that the intermediate router indeed sends an ICMP unreachable message for each dropped segment. Then, TCP should be allowed to revert even the first back-off. However, as this case is very unlikely and requires one more state variable to detect, it is not recommended in this document.

Besides the ambiguity of whether the first unacknowledged sequence number refers to the original transmission or to any of the retransmissions, there is another source of ambiguity about the sequence numbers contained in the ICMP unreachable messages. On high-bandwidth paths like modern gigabit links, the sequence space may wrap rather quickly, so that a late ICMP unreachable message reporting on an old error may coincidentally fit as input to the scheme explained above. As a result, the scheme would wrongly revert one back-off. However, the chances of this happening are minuscule. Moreover, as the scheme is designed conservatively, no threat to the network arises from this issue.

Finally, the scheme explicitly does not call for a differentiation of ICMP unreachable messages originating from different routers, as the evidence of no congestion still holds even if the reporting router changed.

Another exploitation of ICMP unreachable messages in the context of TCP congestion control might seem appropriate when the ICMP unreachable message is received while TCP is in steady-state and the message refers to a segment from within the current window of data.
As the round trip time (RTT) to the router that generates the ICMP unreachable message is likely to be substantially shorter than the overall RTT to the destination, the ICMP unreachable message may very well reach the originating TCP while it is transmitting the current window of data. If the remaining window is large, it might seem appropriate to refrain from transmitting the remaining window, as there is timely evidence that it will only trigger further ICMP unreachable messages at that very router. Although this might seem appropriate from a wastage perspective, it may be counterproductive from a security perspective since ICMP messages are easy to spoof, thereby allowing an easy attack on TCP by simply forging such ICMP messages.

An additional consideration is the following: in the presence of multi-path routing, even the receipt of a legitimate ICMP unreachable message cannot be exploited accurately because it is possible that only one of the multiple paths to the destination is suffering from a connectivity disruption that causes ICMP unreachable messages to be sent. In that case, the path along which the connectivity disruption occurred may have contributed considerably to the overall bandwidth, such that a congestion response is reasonable. However, this is not necessarily the case. Therefore, TCP has no means except for its inherent congestion control to decide on this matter.

All in all, it seems that for a connection in steady-state, i.e., not in RTO-induced recovery, reacting to ICMP unreachable messages with regard to congestion control is not appropriate. For the case of RTO-based retransmissions, however, there is a reasonable congestion response, which is skipping further back-off of the RTO because there is no congestion indication, as described above.

6.
Related Work

In the literature, there are several methods that address TCP's problems in the presence of connectivity disruptions. Some of them try to improve TCP's performance by modifying the lower layers. For example, [SM03] introduces a "smart link layer" that buffers one segment for each ongoing connection and replays these segments on connectivity reestablishment. This approach has a serious drawback: previously stateless intermediate routers have to be modified in order to inspect TCP headers, track the end-to-end connection, and provide additional buffer space, which all in all leads to an additional need for memory and processing power. On the other hand, stateless link layer schemes, as proposed in RFC 3819 [RFC3819], which unconditionally buffer some small number of packets, may have another problem: if a packet is buffered longer than the maximum segment lifetime (MSL) of 2 min [RFC0793], i.e., the disconnection lasts longer than the MSL, TCP's assumption that such segments will never be received will no longer be true, violating TCP's semantics [I-D.eggert-tcpm-tcp-retransmit-now].

Other approaches like TCP-F [CRVP01] or the Explicit Link Failure Notification (ELFN) [HV02] inform the TCP senders about disrupted paths by special messages generated by intermediate routers. In case of a link failure, they stop sending data segments and freeze TCP's retransmission timers. TCP-F stays in this state and remains silent until either a "route establishment notification" is received or an internal timer expires. In contrast, ELFN periodically probes the network to detect connectivity reestablishment. Both proposals rely on changes to intermediate routers, whereas the scheme proposed in this memo is a sender-only modification. Moreover, ELFN does not consider congestion in the network and may impose serious additional load on the network, depending on the probe interval.

The authors of ATCP [LS01] propose enhancements to identify different types of packet loss by introducing a layer between TCP and IP. They utilize ICMP destination unreachable messages to set TCP's receiver advertised window to zero, thus forcing the TCP sender to perform zero window probing with exponential back-off. ICMP destination unreachable messages that arrive during this probing period are ignored. This approach is nearly orthogonal to this memo, which exploits ICMP messages to revert an RTO back-off when TCP is already probing. In principle, both mechanisms could be combined; however, due to security considerations, it does not seem appropriate to adopt ATCP's reaction, as discussed in Section 5.

Schuetz et al. describe in [I-D.schuetz-tcpm-tcp-rlci] a set of TCP extensions that improve behavior when transmitting over paths whose characteristics can change on short time-scales. Their proposed TCP extensions modify the local behavior of TCP and introduce a new TCP option to signal locally received connectivity-change indications (CCIs) to remote peers. Upon reception of a CCI, they re-probe the path characteristics either by performing a speculative retransmission or by sending a single segment of new data, depending on whether the connection is currently in the loss state or transmitting in steady-state, respectively. The authors focus on specifying TCP response mechanisms; nevertheless, underlying layers would have to be modified to explicitly send CCIs to make these immediate responses possible.

7.  IANA Considerations

This memo includes no request to IANA.

8.  Security Considerations

The proposed mechanism is considered to be secure. For example, an attacker cannot make a TCP modified with the proposed scheme flood the network just by sending forged ICMP unreachable messages that revert RTO back-offs.
Even if the attacker could correctly guess the sequence number of the currently retransmitted segment, the retransmission frequency is limited by the minimum value for the RTO of 1s specified by RFC 2988 [RFC2988].

9.  References

9.1.  Normative References

[RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5, RFC 792, September 1981.

[RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[RFC1812]  Baker, F., "Requirements for IP Version 4 Routers", RFC 1812, June 1995.

[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission Timer", RFC 2988, November 2000.

9.2.  Informative References

[CRVP01]  Chandran, K., Raghunathan, S., Venkatesan, S., and R. Prakash, "A feedback-based scheme for improving TCP performance in ad hoc wireless networks", IEEE Personal Communications vol. 8, no. 1, pp. 34-39, February 2001.

[HV02]  Holland, G. and N. Vaidya, "Analysis of TCP performance over mobile ad hoc networks", Wireless Networks vol. 8, no. 2-3, pp. 275-288, March 2002.

[I-D.eggert-tcpm-tcp-retransmit-now]  Eggert, L., "TCP Extensions for Immediate Retransmissions", draft-eggert-tcpm-tcp-retransmit-now-02 (work in progress), June 2005.

[I-D.schuetz-tcpm-tcp-rlci]  Schuetz, S., Koutsianas, N., Eggert, L., Eddy, W., Swami, Y., and K. Le, "TCP Response to Lower-Layer Connectivity-Change Indications", draft-schuetz-tcpm-tcp-rlci-03 (work in progress), February 2008.

[LS01]  Liu, J. and S. Singh, "ATCP: TCP for mobile ad hoc networks", IEEE Journal on Selected Areas in Communications vol. 19, no. 7, pp. 1300-1315, July 2001.

[RFC1122]  Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914, September 2000.

[RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for TCP", RFC 3522, April 2003.

[RFC3819]  Karn, P., Bormann, C., Fairhurst, G., Grossman, D., Ludwig, R., Mahdavi, J., Montenegro, G., Touch, J., and L. Wood, "Advice for Internet Subnetwork Designers", BCP 89, RFC 3819, July 2004.

[RFC4138]  Sarolahti, P. and M. Kojo, "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP and the Stream Control Transmission Protocol (SCTP)", RFC 4138, August 2005.

[SESB05]  Schuetz, S., Eggert, L., Schmid, S., and M. Brunner, "Protocol enhancements for intermittently connected hosts", SIGCOMM Computer Communication Review vol. 35, no. 3, pp. 5-18, December 2005.

[SM03]  Scott, J. and G. Mapp, "Link layer-based TCP optimisation for disconnecting networks", SIGCOMM Computer Communication Review vol. 33, no. 5, pp. 31-42, October 2003.

Authors' Addresses

Alexander Zimmermann
RWTH Aachen University
Ahornstrasse 55
Aachen, 52074
Germany

Phone: +49 241 80 21422
Email: zimmermann@cs.rwth-aachen.de

Arnd Hannemann
RWTH Aachen University
Ahornstrasse 55
Aachen, 52074
Germany

Phone: +49 241 80 21423
Email: hannemann@nets.rwth-aachen.de