Transport Area Meeting (tsvarea)
Monday, November 6, 2006 -- 17:40-19:50
========================================

The meeting was chaired by the transport area directors, Lars Eggert and Magnus Westerlund, and was scribed by Matt Zekauskas.

AGENDA:

1. IEEE 802.1 Audio/Video Bridging -- Michael Johas Teener
2. IEEE 802.1Qau Project on Congestion Notification -- Pat Thaler
3. Flow Rate Fairness: Dismantling a Religion -- Bob Briscoe
4. PWE3 Congestion Control -- Bruce Davie
5. Achieving Gbps performance on operational networks with e-VLBI: Results from the Haystack Workshop -- Marshall Eubanks


1. AV Bridging and Ethernet AV 802.1 Task Group -- Michael Johas Teener

See slides. Michael described the position of 802.1 within IEEE 802, and the position of the audio/video bridging task group within 802.1. They are providing a specification for time synchronization of low-latency streams, along with admission control and QoS guarantees (see http://www.ieee802.org/1/pages/avbridges.html).

He described the performance goals -- how strict the timing guarantees are, and that the performance has to be achieved at very low cost to meet market requirements. The basic requirements are time synchronization to the sub-microsecond level, clock jitter under 100 ns, and an end-to-end latency of less than 2 ms across a 7-hop Ethernet.

He talked about home digital media distribution, which is one of the target markets. It is a heterogeneous environment, but it generally uses 802.1, and so they are looking for a unified Layer 2 QoS. They have a scheme that will go under the trademarked name "EthernetAV", because it will eventually be used as an enforcement mechanism. It is a change to 802.1 bridges; it changes 802.1Q and Layer 2. There are three basic additions: 802.1Qav (for 802.3 only) to support traffic shaping and prioritization, 802.1Qat for admission control, and 802.1AS for precise timing.

The big compromise to make it feasible was to require that all devices in a network connecting two endpoints must be participating devices. They must all have the changes; there can be no hub or legacy bridge. The changes were then summarized: there are MAC changes for timing, changes to queuing/DMA, and admission control. Management would use the same mechanisms as multicast registration.

An audience member asked if the admission control is orthogonal to UPnP and Bonjour. Michael noted that there is no admission control mechanism in Bonjour yet, and that the UPnP folks hope to use this (as a service).

The standardization is well underway. If you want to look at the drafts, ask Michael for them (mikejt@broadcom.com). Technical closure is expected in 2007. They expect that this service will follow the Ethernet product curve. Unified L2 QoS services will be available soon, and IP will likely be the dominant protocol run over them. Thus, they want to let the IETF know the services are coming, and to see if we have any comment on the capabilities and interfaces. Some potential interactions: RTP might use this; there could be interactions with RSVP and NSIS; should NTP be used for time synchronization?

Yakov Stein asked about time synchronization. The IETF has had a group called "tick tock", and there have been two meetings. They want an official BOF, but they were waiting for the new 1588v2 standard. There are two design teams working on requirements. They are waiting on input on what kinds of MTIE and MTDEV are needed for frequency, and what accuracies are needed for absolute wall-clock time. What is the right protocol going to be?
Originally they were thinking this would be NTPv5; now they are calling it "tick tock". Maybe it will be more 1588-based; they are looking for input.

David Oran asked about master clock generation. Is the notion that any device could be the master, with election of one? Or are only certain devices master clocks? The standard says that certain devices can be masters and certain devices cannot be. We expect a 1588 election mechanism to choose the master. David asked if there was a provision for external clock references. Yes, there has to be a house clock. He followed up by asking if the system will synchronize free-running if there is no external reference, and Michael answered yes.

Scott Bradner said he is worried about RTP and its ilk; they are not guaranteed that the function is available end to end.

Don O'Connor asked about scope. Michael said it was conceived for residential home LANs. Don followed up: could you build an AV bridge across the city of San Diego, or nationally? Michael said there are no scalability boundaries, but the initial market and requirements came from residential networks. However, there are people who want to make this available across an enterprise or metro area. A follow-up question asked about the 2 ms requirement. Michael answered that it was a requirement of the application, not the timing system. If you are willing to pay more for guaranteed latency, you can go farther. It depends on what your goals are, and what you are willing to put up with to get that goal. The 2 ms came from worst-case residential use and acceptable performance.

Someone asked if this was built into 802.1ah provider backbone bridges. Michael answered yes; they are working with G.packetiming and G.8261, so people can do SONET interfaces with Ethernet over long distances.

Dennis Ferguson said that he looked at these issues a long time ago. Was this a "returnable" time design? That is, say you have two boxes separated by 15 ms of fiber: would they end up with the same time to within 15 ns? Michael said yes, but this is a 1588 problem. You have to have the round-trip time be symmetric, and the longer the RTT is, the harder that is. However, there are games you can play at Layer 1 to cancel that effect... if you are willing to pay for them. Pat Thaler added yes, and talked about some physical layers. Some of the smarter physical processing used to send high speed over twisted pair has substantial delays. You find out from the physical layer what the delays are. The two ends might have different receiver delays, which can introduce asymmetry.
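To illustrate the symmetric-delay point raised by Dennis Ferguson and Pat Thaler, here is a minimal sketch (the editor's illustration, not text from 1588 or 802.1AS; the numbers are made up) of the classic two-way time-transfer calculation. The computed offset is exact only if the forward and reverse one-way delays are equal, so any path or PHY asymmetry shows up directly as a time error:

    # Sketch of a 1588/NTP-style offset calculation, for illustration only.
    # t1: master sends sync          (master clock)
    # t2: slave receives sync        (slave clock)
    # t3: slave sends delay request  (slave clock)
    # t4: master receives request    (master clock)
    def offset_and_rtt(t1, t2, t3, t4):
        offset = ((t2 - t1) - (t4 - t3)) / 2.0   # exact only for a symmetric path
        rtt = (t4 - t1) - (t3 - t2)              # round-trip path delay
        return offset, rtt

    # Hypothetical case: the clocks actually agree, 15 ms of fiber each way,
    # but 200 ns of extra receiver delay in the reverse direction only.
    fwd, rev = 15e-3, 15e-3 + 200e-9
    t1 = 0.0
    t2 = t1 + fwd
    t3 = t2 + 1e-6
    t4 = t3 + rev
    print(offset_and_rtt(t1, t2, t3, t4))   # offset comes out as -100 ns, not 0

Half of any uncompensated asymmetry appears as apparent offset, which is why sub-microsecond synchronization needs either symmetric paths or per-link delay knowledge from the physical layer, as discussed above.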
Steve Casner asked if this provides services like clock synchronization and queuing for priority, with applications building on that. For instance, RTP systems use clocks that are local; if the clocks were synchronized, things would work better, but you don't need synchronization for RTP to work. Is there a structure like that here? Michael said yes, they are trying to make sure that the services are such that you don't have to know they are there, but if you know they are there you can take advantage of them. For example, if there is time synchronization, the latencies can be lower because the buffers can be smaller. If there is QoS, you can make further application-layer assumptions. A comment was made about systems over a larger geographic scale, where you can't make assumptions about the facility end to end: would this still work if you can accept larger latencies?

Scott Bradner followed up on the notion of smaller buffers because of synchronization. Would this lead to packet ingress synchronization? Michael said not really. They are looking at actual forwarding and queuing of packets, the interaction between what priorities mean, and what the allocation is on a per-traffic-class basis. Scott said he worried about this, because he has seen startups that think they can synchronize the whole Internet. Michael said that isn't their goal, nor a requirement.


2. IEEE 802.1Qau Congestion Notification -- Pat Thaler, pthaler@broadcom.com

This project is younger than AV bridging. Some work started long ago in 802.3, but it moved to 802.1 in January 2006. It's past being a study group, and it's now a task force with its first project, congestion notification. (See slides for complementary detail.)

Pat started by talking about what she means by congestion notification -- backpressure at the link layer for a bridge to notify the source of congestion to slow down. The target environment is one with a low bandwidth-delay product, like a data center. If the latency is too large, the feedback is meaningless -- it arrives too late.

Tim Shepard asked about the target low-delay situation: wouldn't existing pause frames do everything you need? Pat responded that in multiple-switch situations pause frames cause "congestion spreading". Say there are many devices in a room, with a 1G flow to an iSCSI target and another 10G flow going through the same switch. If you want to pause the 1G flow, you end up stopping the 10G flow as well when pause frames are used, and you get pacing... the 10G flow gets the equivalent of 1G. Another problem was noted: you can't tell TCP to slow down.

A Project Authorization Request (PAR) is the equivalent of an IETF charter. Pat went through the 802.1Qau PAR, showing the scope and the purpose. In addition to the target environment mentioned above, the goal is to control long-lived flows, and to use VLAN tag priorities to segregate congestion-controlled flows. This would allow a single Ethernet network to be used for storage, inter-process communication, and wide-area communications. More and more high-speed data center networks are being created to run multiple applications, where earlier a specialized network was used for each application. To some extent, this should help iSCSI win over Fibre Channel. People like Ethernet, its cost factors, and its scalability, but want more predictable performance. As noted earlier, 802.3x (with pause frames) doesn't provide a good solution; it reduces the throughput of "innocent" flows, and increases latency and jitter.

The objectives of this work include independence from "upper layer" protocols and compatibility with TCP/IP. Pat stated that they had a concern that it might interact badly with some TCP options. Dave Oran asked for an example of a TCP option that might interact badly; Pat did not have a ready example -- it was just speculation among committee members. Sally Floyd thought that there was no TCP option that would cause a problem, but that the experimental IP option Quick-Start might be a problem. She noted that the abstract of the RFC says to be very careful.

Dave Oran asked how you would know that the endpoints can support this notification. Pat said they envisioned some kind of port-to-port discovery protocol. And it's not just the endpoints that have to support the notification; everything in the middle must support notification as well.
Lars Eggert wondered about the situation where this is a transit hop for an end-to-end connection; the ingress might have no control over the source TCP if the source is not a controlled device. Pat said the target is a fully controlled environment; both endpoints and everything in the middle must be in a congestion-managed cloud.

Bob Briscoe wondered if there was interest in working to extend this protocol to situations where you don't have a fully controlled cloud. He thought MPLS groups were working on similar protocols and were planning to think about Ethernet next. He was writing an IETF draft about the interaction of congestion notification between Layer 2 and Layer 3; should feedback go to the source or to the next IP-layer device? Pat responded that the IEEE group has not thought about that; so far, the objective has definitely been a completely managed cloud. It would be possible to consider this situation, but someone needs to come talk to the IEEE group to spawn the discussion; it would be a change in direction from the current work.

David Oran said he was trying to understand how to detect a completely managed cloud. Is there some end-to-end signaling protocol? Pat said that the group is just getting started, and that has not yet been determined. Another follow-up question was asked about signaling, and about a cloud whose boundary is router-to-host: how would a flow from the router through the cloud be handled? The router can have many flows on that port, some managed and some not. You can't use management to disambiguate, you can't use provisioning, and you can't define it on a port basis. Pat said the group hasn't discussed this. Right now, the router would be an end host, and the signaling would only be Layer 2. However, the intuition is that the scheme would therefore not work.

Dennis Ferguson followed up with another question. He thought he understood that congestion is reported to end hosts. However, you don't say what congestion is being reported; it's not just the last link, but any link on the path traversed over the network. If the link between two bridges is congested, you tell the end host to slow down, correct? Yes, you tell the Layer 2 end hosts. But what do you say... "slow down"? Send a pause frame? Send a congestion management frame? The answer is in the technical slides in the presentation.

Pat then returned to the objective slides. She mentioned that priority in 802.1 is not necessarily strict; the only scheme codified to date is strict priority, but other proprietary schemes are possible. They want to make sure that 802.1Qau does not require per-flow state or queuing in bridges.

Pat then provided a backward congestion notification (BCN) example. The project was authorized in September, so this is preliminary. There is queue-level monitoring at congestion points. There is a desired queue level and a change since the last sample, so you get both a level and a derivative. When the offset is above the equilibrium point and a sample is taken, the bridge checks whether the frame has a rate-limiter tag. If it has a tag, it will send congestion notification information. If it does not have a tag, it will only send a congestion notification if the flow needs to slow down. Dave Oran asked if there was anything marked in the packets going forward; the answer was no. There is a validation in progress; see the slides for a reference URL. Some sample simulation results were shown: queue lengths without BCN and with BCN were compared, and the BCN case is much more stable.
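To make the queue-level/derivative description concrete, here is a minimal sketch of that style of congestion-point logic. It is the editor's reading of the description above, with made-up threshold and weight values; the actual frame formats, field names, and constants are whatever the (preliminary) 802.1Qau drafts define:

    Q_EQ = 16     # assumed equilibrium (desired) queue level, in frames
    W    = 2.0    # assumed weight on the queue derivative

    class CongestionPoint:
        """One congestion point, e.g. an output queue on a bridge."""
        def __init__(self):
            self.prev_qlen = 0

        def feedback(self, qlen, tagged):
            """Feedback for a sampled frame; in BCN only a small fraction of
            arriving frames is sampled at all."""
            q_off = qlen - Q_EQ              # offset from the desired level
            q_delta = qlen - self.prev_qlen  # growth since the previous sample
            self.prev_qlen = qlen
            fb = -(q_off + W * q_delta)      # negative = slow down, positive = speed up
            if tagged:
                return fb                    # rate-limiter-tagged frames get feedback
                                             # in either direction
            return fb if fb < 0 else None    # untagged flows only hear "slow down"

    cp = CongestionPoint()
    print(cp.feedback(qlen=40, tagged=False))   # far above equilibrium -> -104.0
    print(cp.feedback(qlen=10, tagged=False))   # draining, untagged -> None

The feedback value would be carried backward toward the Layer 2 source in a congestion management frame, and presumably the source's rate limiter adjusts from it; keeping the rate-control state at the edges is consistent with the no-per-flow-state-in-bridges objective above.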
David Black wanted to emphasize an earlier point. He believes the results shown are from a world where everything is doing congestion-managed BCN. He thinks it will be important to understand how this scheme interacts with TCP reaction times: if the RTT with this mechanism is different from the RTT without it, there will be bad results in practice. Pat said that is exactly why they are presenting here -- to interest the IETF in helping to understand the issues while the protocol is still under development.

Dave Oran asked whether, if there are multiple congestion points on the path, you will get signals from each. Yes; as long as the signals say to slow down, you listen to all of them, but only remember the last one. For speed-up indications, you only listen once the last one also says to speed up. There was a question about whether anyone has simulated oscillations that this might cause; they have some simulations but not many yet -- they have only been simulating since May.

Sally Floyd thought there were some difficult cases to consider. One is where TCP has idle periods and an RTT longer than the link-level network's, because demand is waiting for the computer, or a button press, or whatever. If the link sees an idle period of multiple RTTs and then the endpoint sends at full speed, what happens? In most of the cases examined so far, nodes are sending at full speed; TCP behavior is not accounted for in the simulations.

Francois Le Faucheur noted that one thing the IEEE committee might consider is that the IETF is considering doing some work on different types of congestion notification for real-time traffic, with early congestion notification and end systems reacting. If that work happens, it would be useful to map it to Layer 2 where traffic traverses Layer 2 domains. Perhaps there can be some information sharing, and discussion of the different kinds of congestion notification; that scheme may require more basic forward congestion notification. Was this work considering fixed or variable bandwidth? Pat replied that Michael's work (above) has been looking at fixed rates; this work considers highly bursty traffic.

Matt Mathis wanted to offer words of encouragement. One potential opportunity is to combine this with recent work on the amount of buffering needed in routers, an open research question. Perhaps this would let you have switches with small, cheap buffers, and shift some of the queues back to routers, which still need to have some queues. It might make for a more efficient overall system.


3. Flow Rate Fairness: Dismantling a Religion -- Bob Briscoe

Bob noted that he was now taking us from Layer 2 to Layer 11, and giving a rant about fairness. His goal is to change the way we think about fairness. See the slides (and the associated paper) for details.

He felt that the IETF has been deciding what's fair for a while, using the unsubstantiated notion that equal flow rates are fair. That is badly off from what the market would allocate, leading to underinvestment, and doing nothing invites more middlebox kludges. He talked about what he thought a notion of equality might be (sharing congestion volume among users, rather than bit rate among flows), and said that fairness can then be defined from that (although you don't have to be equal to be fair). He spent some time saying why a bunch of other notions are wrong. re-ECN is one possible solution.

Matt Mathis asked: if we assume no one is cheating, isn't this equivalent to window fairness -- which is what TCP is doing now? Sally Floyd said no, because this takes time into account. Matt followed up by asking whether, in steady state, it is equivalent to window fairness. The answer was no, because this idea is about individuals, not flows.
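To illustrate what "congestion volume" means as a quantity to share among users, here is a hypothetical worked example from the editor (the numbers and scenario are made up, not from the talk): congestion volume is the traffic a user sends weighted by the congestion level it encounters, accumulated over time, so it charges a user for all of their traffic during congested periods rather than comparing the instantaneous rates of individual flows:

    def congestion_volume(intervals):
        """Bytes sent times the loss/marking fraction seen in each interval,
        summed over time -- i.e. the user's bytes that were marked or lost."""
        return sum(bytes_sent * p for bytes_sent, p in intervals)

    # Two hypothetical users over the same two one-second intervals,
    # given as (bytes sent, congestion level seen):
    heavy = [(10_000_000, 0.02), (10_000_000, 0.02)]   # e.g. many parallel flows
    light = [(500_000, 0.02), (0, 0.0)]                # one short transfer, then idle

    print(congestion_volume(heavy))   # 400000.0
    print(congestion_volume(light))   # 10000.0

Per-flow rate fairness would treat each of the heavy user's flows the same as the light user's single flow; the congestion-volume view instead accounts for the total congestion each user causes over time, which is the distinction behind the "individuals, not flows" and "takes time into account" answers above.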
4. Congestion Framework for Pseudowires -- Bruce Davie

This talk is to inform the group about issues being picked up in another area; they are looking for input from congestion control experts on the topic. The issue is congestion control for pseudowires, and there is a framework draft. Pseudowires can carry any sort of traffic, and to date they have been carried over well-engineered networks. However, that isn't a requirement, and someone will run them over the general Internet. Of particular concern are TDM pseudowires, and packet pseudowires carrying non-congestion-controlled traffic such as MPEG-2 streams.

There are a series of constraints, based to some degree on the fact that the devices are typically big routers with hardware-based forwarding engines, where it is hard to keep per-flow state. TCP-friendly rate control (TFRC) seems a good building block, but then you have to accurately measure loss. If you are below the TFRC rate, nothing needs to be done. However, what do you do if you are over the TFRC rate -- terminate whole pseudowires? Which one? They are thinking about this because Yakov Rekhter noted the bad experiences with TCP over ATM ABR, where there was policing by dropping excess traffic.

David Black said that, as someone who helped get TFRC into the Fibre Channel pseudowire, the Fibre Channel pseudowire is variable rate, but some of the most important pseudowires are fixed rate, doing timing recovery at egress. This problem needs a lot of thought. If the pseudowire is Ethernet, then shaping might work.

Benjamin Jenkins asked if a single generic solution is desired, or something that is more per-pseudowire-type. The thought is that it's too soon to tell; however, there is a difference between TDM and the other types. Liam Casey said that he was not sure that anyone would be motivated to run more than the raw pseudowire. The thought was that the motivation would be similar to why folks run TCP congestion control.

Matt Mathis asked about the nature of TDM pseudowires. Do they transmit idle? Yes, that is true. Are they the only ones? Yes, TDM pseudowires are the only ones that transmit idle. Andy noted that low-rate TDM pseudowires do zero suppression, though. Lars said that would be on-off traffic, which is probably worse than on constantly. Another person wanted to know if it was true that traffic on a pseudowire matches the traffic at higher layers, except for TDM. The answer was "maybe". David Black said that Fibre Channel wouldn't fit into 1500-byte frames.

David McDysan noted that it looks like there is a plan to signal the source, and it looks like that signal will traverse an untrusted, unengineered network. Maybe that's the place where you really want to apply some control, or build that notion into the signaling. It's also clear you don't need to enable congestion control all the time; you could signal it, or turn it off if you know that there is a well-engineered network end to end.

The goal of this presentation is not to design a solution, but to make transport experts aware and invite their input. David Black wanted to follow up on the Fibre Channel pseudowire: it is variable rate, and it can tolerate some delay variation, but it has a strong dislike for having any of its packets dropped.
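As background on the "over or under the TFRC rate" comparison above, here is a sketch of how a pseudowire's rate could be checked against the TCP throughput equation that TFRC (RFC 3448) uses. This is the editor's illustration with made-up parameter values; it is not taken from the PWE3 framework draft:

    from math import sqrt

    def tfrc_rate(s, rtt, p, t_rto=None, b=1):
        """Allowed sending rate in bytes/sec from the TCP throughput equation
        used by TFRC (RFC 3448): s = segment size in bytes, rtt = round-trip
        time in seconds, p = loss event rate, t_rto = retransmit timeout
        (defaults to 4*rtt), b = packets acknowledged per ACK."""
        if t_rto is None:
            t_rto = 4 * rtt
        return s / (rtt * sqrt(2 * b * p / 3)
                    + t_rto * 3 * sqrt(3 * b * p / 8) * p * (1 + 32 * p * p))

    # Hypothetical check for a fixed-rate E1 TDM pseudowire (2.048 Mbit/s):
    pw_rate = 2.048e6 / 8                      # bytes/sec the pseudowire must carry
    allowed = tfrc_rate(s=1500, rtt=0.100, p=0.01)
    print(pw_rate > allowed)                   # True here: the pseudowire exceeds
                                               # its TFRC rate, so something must give

Because a TDM pseudowire cannot slow down, being over the TFRC rate leaves only the unattractive options discussed above (terminating a whole pseudowire, and deciding which one), which is exactly the kind of input being sought.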
5. Achieving Gbps performance on operational networks with e-VLBI: Results from the Haystack Workshop -- Marshall Eubanks

VLBI is Very-Long Baseline Interferometry: radio astronomers create the illusion of a very large antenna by correlating the observations of a number of antennas located far apart. Originally, the data was brought to the correlator on tape, and later on disk packs. e-VLBI uses networks to send the data. Data streams are currently about 1 Gbps, and they are looking to go to 100 Gbps. The streams are somewhat loss-tolerant, but there is an implosion problem with all streams coming to a central correlator.

There was a recent workshop at MIT/Haystack. There are still complaints about not being able to achieve the desired data rates. There is work on tuning TCP stacks, using new stacks, and moving away from TCP. This includes the use of "lightpaths" and GMPLS -- a move toward dynamic circuits, and toward "hybrid networks" that merge circuit services and IP services.

In the short term, e-VLBI is using RTP and RTCP. Given the high data rates, though, there is some concern that the timestamps will wrap too fast; they are proposing a timestamp scaling bit and scaling factor. Colin Perkins did some work with uncompressed high-definition video over RTP a while ago, and had a different approach to the timestamp problem. This will be, or was, recommended to the Haystack group.
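As a rough worked example of the wrap concern (the editor's arithmetic with illustrative clock rates, not figures from the workshop): the 32-bit RTP timestamp wraps after 2^32 ticks, so the wrap interval shrinks in direct proportion to the media clock rate, and a scaling factor stretches it back out:

    # Seconds until a 32-bit RTP timestamp wraps, for a given clock rate.
    def wrap_seconds(clock_hz, scale=1):
        # With a scaling factor, each timestamp unit covers `scale` clock ticks.
        return (2 ** 32) * scale / clock_hz

    print(wrap_seconds(90_000))                    # ~47722 s (about 13 hours) at the
                                                   # common 90 kHz RTP video clock
    print(wrap_seconds(1_000_000_000))             # ~4.3 s at an illustrative 1 GHz clock
    print(wrap_seconds(1_000_000_000, scale=256))  # ~1100 s with a 256x scaling factor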