[net-perf] bad performance w/o jumbo frames over large BDPs

Michael Van Norman mvn at ucla.edu
Wed Oct 18 08:15:36 PDT 2006


If fragmentation is happening, it would have to be happening on the local
machine.  The network paths in question are jumbo-frame clean (as the
testing has included both jumbo and non-jumbo frames).
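
For reference, the per-segment arithmetic behind the fragmentation question
can be sketched in a few lines of Python.  This is only an illustration: it
assumes IPv4 with no IP options and TCP timestamps as the only TCP option,
and the function names are made up for the example.  Under those assumptions
a 1500-byte MTU leaves 1448 bytes of payload per segment, which matches the
MSS iperf reports in the quoted output below; a larger segment sent as a
single packet would have to be fragmented.

```python
# Sketch only: how much TCP payload fits in one frame without IP fragmentation.
# Assumptions (not from the thread): IPv4 without options, TCP timestamps only.
IPV4_HEADER = 20      # bytes, IPv4 header without options
TCP_HEADER = 20       # bytes, TCP header without options
TCP_TIMESTAMPS = 12   # bytes, 10-byte timestamp option padded to 12


def max_payload(mtu, timestamps=True):
    """Largest TCP payload (bytes) that fits in a single packet of size mtu."""
    overhead = IPV4_HEADER + TCP_HEADER
    if timestamps:
        overhead += TCP_TIMESTAMPS
    return mtu - overhead


def needs_fragmentation(payload, path_mtu, timestamps=True):
    """True if a segment carrying `payload` bytes exceeds the path MTU."""
    return payload > max_payload(path_mtu, timestamps)


print(max_payload(1500))                 # 1448
print(max_payload(1500, False))          # 1460
print(max_payload(9000))                 # 8948
print(needs_fragmentation(8948, 1500))   # True
```

The last line is the case being asked about: a jumbo-sized segment handed to
a 1500-byte path cannot go out as a single packet.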

/Mike


On 10/18/06 7:48 AM, "Jim Madden" <jmadden at ucsd.edu> wrote:

> Yep, it's the wrong direction for traditional MTU management problems,
> but there is the possibility of some broken MSS handling on your
> local machine if it's being told by the Abilene machines to use an
> MSS larger than 1488 (or whatever the right number is) and is trying
> to send large frames that have to be fragmented.
> Jim Madden
> 
> At 11:33 PM -0700 10/17/06, Michael Sinatra wrote:
>> If that were the case, I would expect the problem to happen in the
>> other direction.  In this case, the machine with the lower MTU is
>> really slow *sending* to the machine with the higher MTU, while the
>> machine with the higher MTU performs much better when sending to the
>> machine with the lower MTU.  If anything, it looks like the MSS
>> negotiation is okay.
>> 
>> michael
>> 
>> Jim Madden wrote:
>>> You've probably long since confirmed that MSS negotiation from your
>>> local machine to off-campus machines works and produces something
>>> around 1500 bytes, so that there's no packet fragmentation going on?
>>> I don't see any indication of such a negotiation in the slow
>>> transmit test.
>>> Jim Madden
>>> 
>>> At 10:34 PM -0700 10/17/06, Michael Van Norman wrote:
>>>> I'm getting about a 6x increase in send throughput by turning off TCP
>>>> segmentation offload (TSO).  This was theoretically fixed in the
>>>> 2.6.12 kernel, but I'm running 2.6.17.6-web100, so maybe not.  I am
>>>> still not getting the performance levels I should, but it is much
>>>> better with TSO off.
>>>> 
>>>> /Mike
>>>> 
>>>> Michael Sinatra wrote:
>>>>> We're making progress at Berkeley building a performance measurement and
>>>>> characterization infrastructure.  The measurement nodes connect to
>>>>> backbone routers, and all of the measurement nodes (and the paths
>>>>> between them) are jumbo-frame-capable.  Recently, I set up an ad-hoc
>>>>> active measurement system on a user network that is not
>>>>> jumbo-frame-capable.  While the performance is very good between this
>>>>> net and our other campus measurement nodes (including the node at our
>>>>> border), performance is very poor between this host and nodes that are
>>>>> farther away.  This performance problem only exists when sending from
>>>>> the host in question to another node; receive performance is very good.
>>>>> 
>>>>> Here's an example:
>>>>> 
>>>>> drl10 ~ # /home/piPEs/bwctl/bin/bwctl -s nms1-chin.abilene.ucaid.edu -w
>>>>> 8m -i 2 -L 1800 -A A AESKEY ucb /home/piPEs/bwctl/etc/bwctld.keys
>>>>> bwctl: 34 seconds until test results available
>>>>> 
>>>>> RECEIVER START
>>>>> 3370091618.641665: /usr/bin/iperf -B 169.229.144.125 -P 1 -s -f b -m -p
>>>>> 5001 -w 8388608 -t 10 -i 2
>>>>> ------------------------------------------------------------
>>>>> Server listening on TCP port 5001
>>>>> Binding to local address 169.229.144.125
>>>>> TCP window size: 16777216 Byte (WARNING: requested 8388608 Byte)
>>>>> ------------------------------------------------------------
>>>>> [ 14] local 169.229.144.125 port 5001 connected with 198.32.8.162
>>>>> port 59750
>>>>> [ 14]  0.0- 2.0 sec  116840032 Bytes  467360128 bits/sec
>>>>> [ 14]  2.0- 4.0 sec  211912536 Bytes  847650144 bits/sec
>>>>> [ 14]  4.0- 6.0 sec  213602576 Bytes  854410304 bits/sec
>>>>> [ 14]  6.0- 8.0 sec  210712912 Bytes  842851648 bits/sec
>>>>> [ 14]  8.0-10.0 sec  212375576 Bytes  849502304 bits/sec
>>>>> [ 14]  0.0-10.0 sec  968015872 Bytes  772489277 bits/sec
>>>>> [ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
>>>>> 
>>>>> RECEIVER END
>>>>> drl10 ~ # /home/piPEs/bwctl/bin/bwctl -c nms1-chin.abilene.ucaid.edu -w
>>>>> 8m -i 2 -L 1800 -A A AESKEY ucb /home/piPEs/bwctl/etc/bwctld.keys
>>>>> bwctl: 38 seconds until test results available
>>>>> 
>>>>> RECEIVER START
>>>>> 3370091715.986859: /ami/bin/iperf -B 198.32.8.162 -P 1 -s -f b -m -p
>>>>> 5001 -w 8388608 -t 10 -i 2
>>>>> ------------------------------------------------------------
>>>>> Server listening on TCP port 5001
>>>>> Binding to local address 198.32.8.162
>>>>> TCP window size: 16777216 Byte (WARNING: requested 8388608 Byte)
>>>>> ------------------------------------------------------------
>>>>> [ 15] local 198.32.8.162 port 5001 connected with 169.229.144.125 port
>>>>> 5001
>>>>> [ 15]  0.0- 2.0 sec  4023992 Bytes  16095968 bits/sec
>>>>> [ 15]  2.0- 4.0 sec  12415152 Bytes  49660608 bits/sec
>>>>> [ 15]  4.0- 6.0 sec  15610888 Bytes  62443552 bits/sec
>>>>> [ 15]  6.0- 8.0 sec  15926552 Bytes  63706208 bits/sec
>>>>> [ 15]  8.0-10.0 sec  16485480 Bytes  65941920 bits/sec
>>>>> 
>>>>> RECEIVER END
>>>>> 
>>>>> 
>>>>> As you can see, we peak at 850+ Mb/s from the Abilene Chicago node to
>>>>> this host, but we only get 65+ Mb/s from our host to Abilene-Chicago.
>>>>> Performance on a gig-connected host is actually worse than what is
>>>>> achieved on a host that is only connected to a 100 Mb/s interface.
>>>>> 
>>>>> I checked on more than one of our stationary measurement nodes,
>>>>> switching between jumbo and non-jumbo frames.  I can replicate the
>>>>> performance issue with non-jumbo frames on each of these nodes.  Again,
>>>>> it appears to manifest itself only when the bandwidth-delay product
>>>>> gets above a certain threshold.  Performance is still fine in both
>>>>> directions with jumbo frames.  This is the case with FreeBSD 6-STABLE
>>>>> and 7-CURRENT (both csup'ed a few days ago) and with Gentoo Linux
>>>>> (vanilla kernel 2.6.18-web100, with very few other modifications).
>>>>> The platform is amd64; I haven't been able to compare with i386 yet.
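
A rough sketch of the bandwidth-delay arithmetic, in Python, for anyone
following along.  The 60 ms round-trip time is a placeholder assumption (no
RTT is given in this thread), and the function names are illustrative only.
The 8388608-byte window is the one requested with -w 8m in the tests above.

```python
# Sketch only: bandwidth-delay product and the rate ceiling a window imposes.
# Integer math throughout; RTT is given in milliseconds.

def bdp_bytes(rate_bps, rtt_ms):
    """Bytes that must be in flight to fill a path of rate_bps at rtt_ms."""
    return rate_bps * rtt_ms // 8000


def window_limited_rate_bps(window_bytes, rtt_ms):
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 * 1000 // rtt_ms


RTT_MS = 60                # ASSUMPTION: hypothetical RTT, not from the thread
GIGE_BPS = 1_000_000_000   # nominal gigabit Ethernet line rate
WINDOW = 8388608           # bytes, as requested with -w 8m

print(bdp_bytes(GIGE_BPS, RTT_MS))               # 7500000
print(WINDOW >= bdp_bytes(GIGE_BPS, RTT_MS))     # True
print(window_limited_rate_bps(WINDOW, RTT_MS))   # ceiling in bits/sec
```

If the real RTT is in that neighborhood, the requested window is already
larger than the BDP of a gigabit path, so window size alone would not
explain the slow send direction.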
>>>>> 
>>>>> I feel like I must be doing something wrong, especially since the
>>>>> Abilene nodes are clearly able to send traffic to me at near-line-rate
>>>>> using an MSS of only 1460, sending to our non-jumbo-enabled host.
>>>>> 
>>>>> Here are my sysctl variable changes.
>>>>> 
>>>>> In Linux:
>>>>> 
>>>>> # increase TCP maximum buffer size
>>>>> net.core.rmem_max = 16777216
>>>>> net.core.wmem_max = 16777216
>>>>> 
>>>>> # increase Linux autotuning TCP buffer limits
>>>>> # min, default, and maximum number of bytes to use
>>>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>>>> net.ipv4.tcp_wmem = 4096 65536 16777216
>>>>> # don't cache ssthresh from previous connection
>>>>> net.ipv4.tcp_no_metrics_save = 1
>>>>> # recommended to increase this for 1000 BT or higher
>>>>> net.core.netdev_max_backlog = 2500
>>>>> # for 10 GigE, use this
>>>>> #net.core.netdev_max_backlog = 30000
>>>>> 
>>>>> In FreeBSD:
>>>>> 
>>>>> kern.ipc.maxsockbuf=16777216
>>>>> net.inet.tcp.sendspace=8388608
>>>>> net.inet.tcp.recvspace=8388608
>>>>> net.inet.tcp.inflight.enable=0 # <-- doesn't seem to have any effect
>>>>>                                # either way
>>>>> 
>>>>> I am going to try configuring the kernel for the advanced TCP congestion
>>>>> stuff that Linux has built in and see how that goes.  I will also set up
>>>>> to capture a packet trace, so that I can better figure out what
>>>>> is going on.
>>>>> 
>>>>> I am hoping that Jeff Boote and/or Eric Boyd might be able to shed some
>>>>> light on how the Abilene nodes are configured so as to get the
>>>>> performance they do.
>>>>> 
>>>>> And if anyone else can point out the folly of my ways, let me know...
>>>>> 
>>>>> thanks,
>>>>> michael
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Network-performance mailing list
>>>>> Network-performance at lists.cenic.org
>>>>> http://lists.cenic.org/mailman/listinfo/network-performance



