Transport Area Meeting (tsvarea)
Monday, November 6, 2006 -- 17:40-19:50
========================================

The meeting was chaired by the transport area directors, Lars Eggert and Magnus Westerlund, and was scribed by Matt Zekauskas.

AGENDA:

1. IEEE 802.1 Audio/Video Bridging -- Michael Johas Teener
2. IEEE 802.1Qau Project on Congestion Notification -- Pat Thaler
3. Flow Rate Fairness: Dismantling a Religion -- Bob Briscoe
4. PWE3 Congestion Control -- Bruce Davie
5. Achieving Gbps performance on operational networks with e-VLBI: Results from the Haystack Workshop -- Marshall Eubanks


1. AV Bridging and Ethernet AV 802.1 Task Group -- Michael Johas Teener

See slides. Michael described the position of 802.1 within IEEE 802, and the position of the audio/video bridging task group within 802.1. They are providing a specification for time synchronization of low-latency streams, along with admission control and QoS guarantees (see http://www.ieee802.org/1/pages/avbridges.html).

He described the performance goals -- how strict the timing guarantees are, and that the performance has to be achieved at very low cost to meet market requirements. The basic requirements are time synchronization to the sub-microsecond level, clock jitter under 100 ns, and an end-to-end latency of less than 2 ms across a 7-hop Ethernet.

He talked about home digital media distribution, which is one of the target markets. It is a heterogeneous environment, but it generally uses 802.1, and so they are looking for a unified Layer 2 QoS. They have a scheme that will go under the trademarked name "EthernetAV", because it will eventually be used as an enforcement mechanism. It is a change to 802.1 bridges; it changes 802.1Q and Layer 2. There are three basic additions: 802.1Qav (for 802.3 only) to support traffic shaping and prioritization, 802.1Qat for admission control, and 802.1AS for precise timing.

The big compromise to make it feasible was to require that all devices in a network connecting two endpoints must be participating devices. They must all have the changes; there can be no hub or legacy bridge. The changes were then summarized: there are MAC changes for timing, changes to queuing/DMA, and admission control. Management would use the same mechanisms as multicast registration.

An audience member asked if the admission control is orthogonal to UPnP and Bonjour. Michael noted that there is no admission control mechanism in Bonjour yet, and that the UPnP folks hope to use this (as a service).

The standardization is well underway. If you want to look at the drafts, ask Michael for them (mikejt@broadcom.com). Technical closure is expected in 2007. They expect that this service will follow the Ethernet product curve. Unified L2 QoS services will be available soon, and IP will likely be the dominant protocol run over them. Thus, they want to let the IETF know the services are coming, and to see if we have any comment on the capabilities and interfaces. Some potential interactions: RTP might use this; there could be interactions with RSVP and NSIS; should NTP be used for time synchronization?

Yakov Stein asked about time synchronization. The IETF has had a group called "tick tock", and there have been two meetings. They want an official BOF, but they were waiting for the new 1588v2 standard. There are two design teams working on requirements. They are waiting on input on what kinds of MTIE and MTDEV are needed for frequency, and what accuracies are needed for absolute wall-clock time. What is the right protocol going to be?
Originally they were thinking this would be NTPv5; now they are calling it "tick tock". Maybe it will be more 1588-based; they are looking for input.

David Oran asked about master clock generation. Is the notion that any device could be the master, with election of one? Or are only certain devices master clocks? The standard says that certain devices can be masters and certain devices cannot be. We expect a 1588 election mechanism to choose the master. David asked if there was a provision for external clock references. Yes, there has to be a house clock. He followed up by asking if the system will synchronize free-running if there is no external reference, and Michael answered yes.

Scott Bradner said he is worried about RTP and its ilk; they are not guaranteed that the function is available end to end.

Don O'Connor asked about scope. Michael said it was conceived for residential home LANs. Don followed up: could you build an AV bridge across the city of San Diego, or nationally? Michael said there are no scalability boundaries, but the initial market and requirements came from residential networks. However, there are people who want to make this available across an enterprise or metro area. A follow-up question asked about the 2 ms requirement. Michael answered that it was a requirement of the application, not the timing system. If you are willing to pay more for guaranteed latency, you can go farther. It depends on what your goals are, and what you are willing to put up with to get that goal. The 2 ms came from worst-case residential use and acceptable performance.

Someone asked if this was built into 802.1ah provider backbone bridges. Michael answered yes; they are working with G.packetiming and G.8261, so people can do SONET interfaces with Ethernet over long distances.

Dennis Ferguson said that he looked at these issues a long time ago. Was this a "returnable" time design? That is, say you have two boxes separated by 15 ms of fiber: would they end up with the same time to within 15 ns? Michael said yes, but this is a 1588 problem. You have to have the round-trip time be symmetric, and the longer the RTT is, the harder that is. However, there are games you can play at Layer 1 to cancel that effect... if you are willing to pay for them. Pat Thaler added yes, and talked about some physical layers. Some of the smarter physical processing used to send high speed over twisted pair has substantial delays. You find out from the physical layer what the delays are. The two ends might have different receiver delays, which can introduce asymmetry.
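To illustrate the symmetric-delay point raised by Dennis Ferguson and Pat Thaler, here is a minimal sketch (the editor's illustration, not text from 1588 or 802.1AS; the numbers are made up) of the classic two-way time-transfer calculation. The computed offset is exact only if the forward and reverse one-way delays are equal, so any path or PHY asymmetry shows up directly as a time error:

    # Sketch of a 1588/NTP-style offset calculation, for illustration only.
    # t1: master sends sync          (master clock)
    # t2: slave receives sync        (slave clock)
    # t3: slave sends delay request  (slave clock)
    # t4: master receives request    (master clock)
    def offset_and_rtt(t1, t2, t3, t4):
        offset = ((t2 - t1) - (t4 - t3)) / 2.0   # exact only for a symmetric path
        rtt = (t4 - t1) - (t3 - t2)              # round-trip path delay
        return offset, rtt

    # Hypothetical case: the clocks actually agree, 15 ms of fiber each way,
    # but 200 ns of extra receiver delay in the reverse direction only.
    fwd, rev = 15e-3, 15e-3 + 200e-9
    t1 = 0.0
    t2 = t1 + fwd
    t3 = t2 + 1e-6
    t4 = t3 + rev
    print(offset_and_rtt(t1, t2, t3, t4))   # offset comes out as -100 ns, not 0

Half of any uncompensated asymmetry appears as apparent offset, which is why sub-microsecond synchronization needs either symmetric paths or per-link delay knowledge from the physical layer, as discussed above.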
Steve Casner asked if this provides services like clock synchronization and queuing for priority, with applications building on that. For instance, RTP systems use clocks that are local; if the clocks were synchronized, things would work better, but you don't need synchronization for RTP to work. Is there a structure like that here? Michael said yes, they are trying to make sure that the services are such that you don't have to know they are there, but if you know they are there you can take advantage of them. For example, if there is time synchronization, the latencies can be lower because the buffers can be smaller. If there is QoS, you can make further application-layer assumptions. A comment was made about systems over a larger geographic scale, where you can't make assumptions about the facility end to end: would this still work if you can accept larger latencies?

Scott Bradner followed up on the notion of smaller buffers because of synchronization. Would this lead to packet ingress synchronization? Michael said not really. They are looking at actual forwarding and queuing of packets, the interaction between what priorities mean, and what the allocation is on a per-traffic-class basis. Scott said he worried about this, because he has seen startups that think they can synchronize the whole Internet. Michael said that isn't their goal, nor a requirement.


2. IEEE 802.1Qau Congestion Notification -- Pat Thaler, pthaler@broadcom.com

This project is younger than AV bridging. Some work started long ago in 802.3, but it moved to 802.1 in January 2006. It's past being a study group, and it's now a task force with its first project, congestion notification. (See slides for complementary detail.)

Pat started by talking about what she means by congestion notification -- backpressure at the link layer for a bridge to notify the source of congestion to slow down. The target environment is one with a low bandwidth-delay product, like a data center. If the latency is too large, the feedback is meaningless -- it arrives too late.

Tim Shepard asked about the target low-delay situation: wouldn't existing pause frames do everything you need? Pat responded that in multiple-switch situations pause frames cause "congestion spreading". Say there are many devices in a room, with a 1G flow to an iSCSI target and another 10G flow going through the same switch. If you want to pause the 1G flow, you end up stopping the 10G flow as well when pause frames are used, and you get pacing... the 10G flow gets the equivalent of 1G. Another problem was noted: you can't tell TCP to slow down.

A Project Authorization Request (PAR) is the equivalent of an IETF charter. Pat went through the 802.1Qau PAR, showing the scope and the purpose. In addition to the target environment mentioned above, the goal is to control long-lived flows, and to use VLAN tag priorities to segregate congestion-controlled flows. This would allow a single Ethernet network to be used for storage, inter-process communication, and wide-area communications. More and more high-speed data center networks are being created to run multiple applications, where earlier a specialized network was used for each application. To some extent, this should help iSCSI win over Fibre Channel. People like Ethernet, its cost factors, and its scalability, but want more predictable performance. As noted earlier, 802.3x (with pause frames) doesn't provide a good solution; it reduces the throughput of "innocent" flows, and increases latency and jitter.

The objectives of this work include independence from "upper layer" protocols and compatibility with TCP/IP. Pat stated that they had a concern that it might interact badly with some TCP options. Dave Oran asked for an example of a TCP option that might interact badly; Pat did not have a ready example -- it was just speculation among committee members. Sally Floyd thought that there was no TCP option that would cause a problem, but that the experimental IP option Quick-Start might be a problem. She noted that the abstract of the RFC says to be very careful.

Dave Oran asked how you would know that the endpoints can support this notification. Pat said they envisioned some kind of port-to-port discovery protocol. And it's not just the endpoints that have to support the notification; everything in the middle must support notification as well.
Lars Eggert wondered about the situation where this is a transit hop for an end-to-end connection; the ingress might have no control over the source TCP if the source is not a controlled device. Pat said the target is a fully controlled environment; both endpoints and everything in the middle must be in a congestion-managed cloud.

Bob Briscoe wondered if there was interest in working to extend this protocol to situations where you don't have a fully controlled cloud. He thought MPLS groups were working on similar protocols and were planning to think about Ethernet next. He was writing an IETF draft about the interaction of congestion notification between Layer 2 and Layer 3; should feedback go to the source or to the next IP-layer device? Pat responded that the IEEE group has not thought about that; so far, the objective has definitely been a completely managed cloud. It would be possible to consider this situation, but someone needs to come talk to the IEEE group to spawn the discussion; it would be a change in direction from the current work.

David Oran said he was trying to understand how to detect a completely managed cloud. Is there some end-to-end signaling protocol? Pat said that the group is just getting started, and that has not yet been determined. Another follow-up question was asked about signaling, and about a cloud whose boundary is router-to-host: how would a flow from the router through the cloud be handled? The router can have many flows on that port, some managed and some not. You can't use management to disambiguate, you can't use provisioning, and you can't define it on a port basis. Pat said the group hasn't discussed this. Right now, the router would be an end host, and the signaling would only be Layer 2. However, the intuition is that the scheme would therefore not work.

Dennis Ferguson followed up with another question. He thought he understood that congestion is reported to end hosts. However, you don't say what congestion is being reported; it's not just the last link, but any link on the path traversed over the network. If the link between two bridges is congested, you tell the end host to slow down, correct? Yes, you tell the Layer 2 end hosts. But what do you say... "slow down"? Send a pause frame? Send a congestion management frame? The answer is in the technical slides in the presentation.

Pat then returned to the objective slides. She mentioned that priority in 802.1 is not necessarily strict; the only scheme codified to date is strict priority, but other proprietary schemes are possible. They want to make sure that 802.1Qau does not require per-flow state or queuing in bridges.

Pat then provided a backward congestion notification (BCN) example. The project was authorized in September, so this is preliminary. There is queue-level monitoring at congestion points. There is a desired queue level and a change since the last sample, so you get both a level and a derivative. When the offset is above the equilibrium point and a sample is taken, the bridge checks whether the frame has a rate-limiter tag. If it has a tag, it will send congestion notification information. If it does not have a tag, it will only send a congestion notification if the flow needs to slow down. Dave Oran asked if there was anything marked in the packets going forward; the answer was no. There is a validation in progress; see the slides for a reference URL. Some sample simulation results were shown: queue lengths without BCN and with BCN were compared, and the BCN case is much more stable.
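To make the queue-level/derivative description concrete, here is a minimal sketch of that style of congestion-point logic. It is the editor's reading of the description above, with made-up threshold and weight values; the actual frame formats, field names, and constants are whatever the (preliminary) 802.1Qau drafts define:

    Q_EQ = 16     # assumed equilibrium (desired) queue level, in frames
    W    = 2.0    # assumed weight on the queue derivative

    class CongestionPoint:
        """One congestion point, e.g. an output queue on a bridge."""
        def __init__(self):
            self.prev_qlen = 0

        def feedback(self, qlen, tagged):
            """Feedback for a sampled frame; in BCN only a small fraction of
            arriving frames is sampled at all."""
            q_off = qlen - Q_EQ              # offset from the desired level
            q_delta = qlen - self.prev_qlen  # growth since the previous sample
            self.prev_qlen = qlen
            fb = -(q_off + W * q_delta)      # negative = slow down, positive = speed up
            if tagged:
                return fb                    # rate-limiter-tagged frames get feedback
                                             # in either direction
            return fb if fb < 0 else None    # untagged flows only hear "slow down"

    cp = CongestionPoint()
    print(cp.feedback(qlen=40, tagged=False))   # far above equilibrium -> -104.0
    print(cp.feedback(qlen=10, tagged=False))   # draining, untagged -> None

The feedback value would be carried backward toward the Layer 2 source in a congestion management frame, and presumably the source's rate limiter adjusts from it; keeping the rate-control state at the edges is consistent with the no-per-flow-state-in-bridges objective above.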
David Black wanted to emphasize an earlier point. He believes the results shown are from a world where everything is doing congestion-managed BCN. He thinks it will be important to understand how this scheme interacts with TCP reaction times: if the RTT with this mechanism is different from the RTT without it, there will be bad results in practice. Pat said that is exactly why they are presenting here -- to interest the IETF in helping to understand the issues while the protocol is still under development.

Dave Oran asked whether, if there are multiple congestion points on the path, you will get signals from each. Yes; as long as the signals say to slow down, you listen to all of them, but only remember the last one. For speed-up indications, you only listen once the last one also says to speed up. There was a question about whether anyone has simulated oscillations that this might cause; they have some simulations but not many yet -- they have only been simulating since May.

Sally Floyd thought there were some difficult cases to consider. One is where TCP has idle periods and an RTT longer than the link-level network's, because demand is waiting for the computer, or a button press, or whatever. If the link sees an idle period of multiple RTTs and then the endpoint sends at full speed, what happens? In most of the cases examined so far, nodes are sending at full speed; TCP behavior is not accounted for in the simulations.

Francois Le Faucheur noted that one thing the IEEE committee might consider is that the IETF is considering doing some work on different types of congestion notification for real-time traffic, with early congestion notification and end systems reacting. If that work happens, it would be useful to map it to Layer 2 where traffic traverses Layer 2 domains. Perhaps there can be some information sharing, and discussion of the different kinds of congestion notification; that scheme may require more basic forward congestion notification. Was this work considering fixed or variable bandwidth? Pat replied that Michael's work (above) has been looking at fixed rates; this work considers highly bursty traffic.

Matt Mathis wanted to offer words of encouragement. One potential opportunity is to combine this with recent work on the amount of buffering needed in routers, an open research question. Perhaps this would let you have switches with small, cheap buffers, and shift some of the queues back to routers, which still need to have some queues. It might make for a more efficient overall system.


3. Flow Rate Fairness: Dismantling a Religion -- Bob Briscoe

Bob noted that he was now taking us from Layer 2 to Layer 11, and giving a rant about fairness. His goal is to change the way we think about fairness. See the slides (and the associated paper) for details.

He felt that the IETF has been deciding what's fair for a while, using the unsubstantiated notion that equal flow rates are fair. That is badly off from what the market would allocate, leading to underinvestment, and doing nothing invites more middlebox kludges. He talked about what he thought a notion of equality might be (sharing congestion volume among users, rather than bit rate among flows), and said that fairness can then be defined from that (although you don't have to be equal to be fair). He spent some time saying why a bunch of other notions are wrong. re-ECN is one possible solution.

Matt Mathis asked: if we assume no one is cheating, isn't this equivalent to window fairness -- which is what TCP is doing now? Sally Floyd said no, because this takes time into account. Matt followed up by asking whether, in steady state, it is equivalent to window fairness. The answer was no, because this idea is about individuals, not flows.
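To illustrate what "congestion volume" means as a quantity to share among users, here is a hypothetical worked example from the editor (the numbers and scenario are made up, not from the talk): congestion volume is the traffic a user sends weighted by the congestion level it encounters, accumulated over time, so it charges a user for all of their traffic during congested periods rather than comparing the instantaneous rates of individual flows:

    def congestion_volume(intervals):
        """Bytes sent times the loss/marking fraction seen in each interval,
        summed over time -- i.e. the user's bytes that were marked or lost."""
        return sum(bytes_sent * p for bytes_sent, p in intervals)

    # Two hypothetical users over the same two one-second intervals,
    # given as (bytes sent, congestion level seen):
    heavy = [(10_000_000, 0.02), (10_000_000, 0.02)]   # e.g. many parallel flows
    light = [(500_000, 0.02), (0, 0.0)]                # one short transfer, then idle

    print(congestion_volume(heavy))   # 400000.0
    print(congestion_volume(light))   # 10000.0

Per-flow rate fairness would treat each of the heavy user's flows the same as the light user's single flow; the congestion-volume view instead accounts for the total congestion each user causes over time, which is the distinction behind the "individuals, not flows" and "takes time into account" answers above.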
4. Congestion Framework for Pseudowires -- Bruce Davie

This talk is to inform the group about issues being picked up in another area; they are looking for input from congestion control experts on the topic. The issue is congestion control for pseudowires, and there is a framework draft. Pseudowires can carry any sort of traffic, and to date they have been carried over well-engineered networks. However, that isn't a requirement, and someone will run them over the general Internet. Of particular concern are TDM pseudowires, and packet pseudowires carrying non-congestion-controlled traffic such as MPEG-2 streams.

There are a series of constraints, based to some degree on the fact that the devices are typically big routers with hardware-based forwarding engines, where it is hard to keep per-flow state. TCP-friendly rate control (TFRC) seems a good building block, but then you have to accurately measure loss. If you are below the TFRC rate, nothing needs to be done. However, what do you do if you are over the TFRC rate -- terminate whole pseudowires? Which one? They are thinking about this because Yakov Rekhter noted the bad experiences with TCP over ATM ABR, where there was policing by dropping excess traffic.

David Black said that, as someone who helped get TFRC into the Fibre Channel pseudowire, the Fibre Channel pseudowire is variable rate, but some of the most important pseudowires are fixed rate, doing timing recovery at egress. This problem needs a lot of thought. If the pseudowire is Ethernet, then shaping might work.

Benjamin Jenkins asked if a single generic solution is desired, or something that is more per-pseudowire-type. The thought is that it's too soon to tell; however, there is a difference between TDM and the other types. Liam Casey said that he was not sure that anyone would be motivated to run more than the raw pseudowire. The thought was that the motivation would be similar to why folks run TCP congestion control.

Matt Mathis asked about the nature of TDM pseudowires. Do they transmit idle? Yes, that is true. Are they the only ones? Yes, TDM pseudowires are the only ones that transmit idle. Andy noted that low-rate TDM pseudowires do zero suppression, though. Lars said that would be on-off traffic, which is probably worse than on constantly. Another person wanted to know if it was true that traffic on a pseudowire matches the traffic at higher layers, except for TDM. The answer was "maybe". David Black said that Fibre Channel wouldn't fit into 1500-byte frames.

David McDysan noted that it looks like there is a plan to signal the source, and it looks like that signal will traverse an untrusted, unengineered network. Maybe that's the place where you really want to apply some control, or build that notion into the signaling. It's also clear you don't need to enable congestion control all the time; you could signal it, or turn it off if you know that there is a well-engineered network end to end.

The goal of this presentation is not to design a solution, but to make transport experts aware and invite their input. David Black wanted to follow up on the Fibre Channel pseudowire: it is variable rate, and it can tolerate some delay variation, but it has a strong dislike for having any of its packets dropped.
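As background on the "over or under the TFRC rate" comparison above, here is a sketch of how a pseudowire's rate could be checked against the TCP throughput equation that TFRC (RFC 3448) uses. This is the editor's illustration with made-up parameter values; it is not taken from the PWE3 framework draft:

    from math import sqrt

    def tfrc_rate(s, rtt, p, t_rto=None, b=1):
        """Allowed sending rate in bytes/sec from the TCP throughput equation
        used by TFRC (RFC 3448): s = segment size in bytes, rtt = round-trip
        time in seconds, p = loss event rate, t_rto = retransmit timeout
        (defaults to 4*rtt), b = packets acknowledged per ACK."""
        if t_rto is None:
            t_rto = 4 * rtt
        return s / (rtt * sqrt(2 * b * p / 3)
                    + t_rto * 3 * sqrt(3 * b * p / 8) * p * (1 + 32 * p * p))

    # Hypothetical check for a fixed-rate E1 TDM pseudowire (2.048 Mbit/s):
    pw_rate = 2.048e6 / 8                      # bytes/sec the pseudowire must carry
    allowed = tfrc_rate(s=1500, rtt=0.100, p=0.01)
    print(pw_rate > allowed)                   # True here: the pseudowire exceeds
                                               # its TFRC rate, so something must give

Because a TDM pseudowire cannot slow down, being over the TFRC rate leaves only the unattractive options discussed above (terminating a whole pseudowire, and deciding which one), which is exactly the kind of input being sought.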
5. Achieving Gbps performance on operational networks with e-VLBI: Results from the Haystack Workshop -- Marshall Eubanks

VLBI is Very-Long Baseline Interferometry: radio astronomers create the illusion of a very large antenna by correlating the observations of a number of antennas located far apart. Originally, the data was brought to the correlator on tape, and later on disk packs. e-VLBI uses networks to send the data. Data streams are currently about 1 Gbps, and they are looking to go to 100 Gbps. The streams are somewhat loss-tolerant, but there is an implosion problem with all streams coming to a central correlator.

There was a recent workshop at MIT/Haystack. There are still complaints about not being able to achieve the desired data rates. There is work on tuning TCP stacks, using new stacks, and moving away from TCP. This includes the use of "lightpaths" and GMPLS -- a move toward dynamic circuits, and toward "hybrid networks" that merge circuit services and IP services.

In the short term, e-VLBI is using RTP and RTCP. Given the high data rates, though, there is some concern that the timestamps will wrap too fast; they are proposing a timestamp scaling bit and scaling factor. Colin Perkins did some work with uncompressed high-definition video over RTP a while ago, and had a different approach to the timestamp problem. This will be, or was, recommended to the Haystack group.
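As a rough worked example of the wrap concern (the editor's arithmetic with illustrative clock rates, not figures from the workshop): the 32-bit RTP timestamp wraps after 2^32 ticks, so the wrap interval shrinks in direct proportion to the media clock rate, and a scaling factor stretches it back out:

    # Seconds until a 32-bit RTP timestamp wraps, for a given clock rate.
    def wrap_seconds(clock_hz, scale=1):
        # With a scaling factor, each timestamp unit covers `scale` clock ticks.
        return (2 ** 32) * scale / clock_hz

    print(wrap_seconds(90_000))                    # ~47722 s (about 13 hours) at the
                                                   # common 90 kHz RTP video clock
    print(wrap_seconds(1_000_000_000))             # ~4.3 s at an illustrative 1 GHz clock
    print(wrap_seconds(1_000_000_000, scale=256))  # ~1100 s with a 256x scaling factor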