[net-perf] bad performance w/o jumbo frames over large BDPs
Michael Van Norman
mvn at ucla.edu
Wed Oct 18 08:15:36 PDT 2006
If fragmentation is happening, it would have to be happening on the local
machine. The network paths in question are jumbo-frame clean (as the
testing has included both jumbo and non-jumbo frames).
/Mike
On 10/18/06 7:48 AM, "Jim Madden" <jmadden at ucsd.edu> wrote:
> Yep, it's the wrong direction for traditional MTU management problems,
> but there is still the possibility of broken MSS handling on your
> local machine: if the Abilene machines are telling it to use an MSS
> larger than 1488 (or whatever the right number is), it may be trying
> to send large frames that then have to be fragmented.
> Jim Madden
>
> At 11:33 PM -0700 10/17/06, Michael Sinatra wrote:
>> If that were the case, I would expect the problem to happen in the
>> other direction. In this case, the machine with the lower MTU is
>> really slow *sending* to the machine with the higher MTU, while the
>> machine with the higher MTU performs much better when sending to the
>> machine with the lower MTU. If anything, it looks like the MSS
>> negotiation is okay.
>>
>> michael
>>
>> Jim Madden wrote:
>>> You've probably long since confirmed that MSS negotiation from your
>>> local machine to off-campus machines works and produces something
>>> around 1500 bytes, so that there's no packet fragmentation going
>>> on? I don't see any indication of such a negotiation in the slow
>>> transmit test.
>>> Jim Madden
>>>
>>> At 10:34 PM -0700 10/17/06, Michael Van Norman wrote:
>>>> I'm getting about a 6x increase in send throughput by turning off
>>>> TCP segmentation offload (TSO). This was theoretically fixed in the
>>>> 2.6.12 kernel, but I'm running 2.6.17.6-web100, so maybe not. I am
>>>> still not getting the performance levels I should, but it is much
>>>> better with TSO off.
>>>>
>>>> /Mike
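
For anyone wanting to repeat the TSO on/off comparison, here is a minimal sketch of toggling the offload through ethtool from Python. The interface name "eth0" and both helper names are assumptions for illustration and do not come from the messages in this thread. Actually running the command needs root, so the sketch defaults to a dry run that only returns the command line.

```python
import subprocess


def tso_command(interface: str, enable: bool) -> list[str]:
    """Build the ethtool command line that turns TSO on or off."""
    return ["ethtool", "-K", interface, "tso", "on" if enable else "off"]


def set_tso(interface: str, enable: bool, dry_run: bool = True) -> list[str]:
    """Return the command; execute it only when dry_run is False (needs root)."""
    cmd = tso_command(interface, enable)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "eth0" is a placeholder interface name, not one taken from the thread.
    print(" ".join(set_tso("eth0", enable=False)))
```

Rerunning the same bwctl test with the offload on and then off should show whether the sender-side difference reproduces on a given NIC and driver.
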
>>>>
>>>> Michael Sinatra wrote:
>>>>> We're making progress at Berkeley building a performance measurement and
>>>>> characterization infrastructure. The measurement nodes connect to
>>>>> backbone routers, and all of the measurement nodes (and the paths
>>>>> between them) are jumbo-frame-capable. Recently, I set up an ad-hoc
>>>>> active measurement system on a user network that is not
>>>>> jumbo-frame-capable. While the performance is very good between this
>>>>> net and our other campus measurement nodes (including the node at our
>>>>> border), performance is very poor between this host and nodes that are
>>>>> farther away. This performance problem only exists when sending from
>>>>> the host in question to another node; receive performance is very good.
>>>>>
>>>>> Here's an example:
>>>>>
>>>>> drl10 ~ # /home/piPEs/bwctl/bin/bwctl -s nms1-chin.abilene.ucaid.edu -w
>>>>> 8m -i 2 -L 1800 -A A AESKEY ucb /home/piPEs/bwctl/etc/bwctld.keys
>>>>> bwctl: 34 seconds until test results available
>>>>>
>>>>> RECEIVER START
>>>>> 3370091618.641665: /usr/bin/iperf -B 169.229.144.125 -P 1 -s -f b -m -p
>>>>> 5001 -w 8388608 -t 10 -i 2
>>>>> ------------------------------------------------------------
>>>>> Server listening on TCP port 5001
>>>>> Binding to local address 169.229.144.125
>>>>> TCP window size: 16777216 Byte (WARNING: requested 8388608 Byte)
>>>>> ------------------------------------------------------------
>>>>> [ 14] local 169.229.144.125 port 5001 connected with 198.32.8.162
>>>>> port 59750
>>>>> [ 14] 0.0- 2.0 sec 116840032 Bytes 467360128 bits/sec
>>>>> [ 14] 2.0- 4.0 sec 211912536 Bytes 847650144 bits/sec
>>>>> [ 14] 4.0- 6.0 sec 213602576 Bytes 854410304 bits/sec
>>>>> [ 14] 6.0- 8.0 sec 210712912 Bytes 842851648 bits/sec
>>>>> [ 14] 8.0-10.0 sec 212375576 Bytes 849502304 bits/sec
>>>>> [ 14] 0.0-10.0 sec 968015872 Bytes 772489277 bits/sec
>>>>> [ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
>>>>>
>>>>> RECEIVER END
>>>>> drl10 ~ # /home/piPEs/bwctl/bin/bwctl -c nms1-chin.abilene.ucaid.edu -w
>>>>> 8m -i 2 -L 1800 -A A AESKEY ucb /home/piPEs/bwctl/etc/bwctld.keys
>>>>> bwctl: 38 seconds until test results available
>>>>>
>>>>> RECEIVER START
>>>>> 3370091715.986859: /ami/bin/iperf -B 198.32.8.162 -P 1 -s -f b -m -p
>>>>> 5001 -w 8388608 -t 10 -i 2
>>>>> ------------------------------------------------------------
>>>>> Server listening on TCP port 5001
>>>>> Binding to local address 198.32.8.162
>>>>> TCP window size: 16777216 Byte (WARNING: requested 8388608 Byte)
>>>>> ------------------------------------------------------------
>>>>> [ 15] local 198.32.8.162 port 5001 connected with 169.229.144.125 port
>>>>> 5001
>>>>> [ 15] 0.0- 2.0 sec 4023992 Bytes 16095968 bits/sec
>>>>> [ 15] 2.0- 4.0 sec 12415152 Bytes 49660608 bits/sec
>>>>> [ 15] 4.0- 6.0 sec 15610888 Bytes 62443552 bits/sec
>>>>> [ 15] 6.0- 8.0 sec 15926552 Bytes 63706208 bits/sec
>>>>> [ 15] 8.0-10.0 sec 16485480 Bytes 65941920 bits/sec
>>>>>
>>>>> RECEIVER END
>>>>>
>>>>>
>>>>> As you can see, we peak at 850+ Mb/s from the Abilene Chicago node to
>>>>> this host, but we only get 65+ Mb/s from our host to Abilene-Chicago.
>>>>> Performance on a gig-connected host is actually worse than what is
>>>>> achieved on a host that is only connected to a 100 Mb/s interface.
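
To put a number on the asymmetry, here is a small sketch that pulls the per-interval rates out of the two iperf reports quoted above and compares the peaks. The regular expression and the function name are my own assumptions about the report layout, written against the lines shown in this thread rather than any documented iperf format.

```python
import re

# Matches interval lines such as:
# "[ 15] 8.0-10.0 sec 16485480 Bytes 65941920 bits/sec"
INTERVAL = re.compile(
    r"\[\s*\d+\]\s+[\d.]+-\s*[\d.]+\s+sec\s+(\d+)\s+Bytes\s+(\d+)\s+bits/sec"
)

# Interval lines copied from the two reports in the quoted message.
INBOUND = """
[ 14] 0.0- 2.0 sec 116840032 Bytes 467360128 bits/sec
[ 14] 2.0- 4.0 sec 211912536 Bytes 847650144 bits/sec
[ 14] 4.0- 6.0 sec 213602576 Bytes 854410304 bits/sec
[ 14] 6.0- 8.0 sec 210712912 Bytes 842851648 bits/sec
[ 14] 8.0-10.0 sec 212375576 Bytes 849502304 bits/sec
"""

OUTBOUND = """
[ 15] 0.0- 2.0 sec 4023992 Bytes 16095968 bits/sec
[ 15] 2.0- 4.0 sec 12415152 Bytes 49660608 bits/sec
[ 15] 4.0- 6.0 sec 15610888 Bytes 62443552 bits/sec
[ 15] 6.0- 8.0 sec 15926552 Bytes 63706208 bits/sec
[ 15] 8.0-10.0 sec 16485480 Bytes 65941920 bits/sec
"""


def peak_bits_per_sec(report: str) -> int:
    """Return the highest per-interval rate found in an iperf report."""
    rates = [int(m.group(2)) for m in INTERVAL.finditer(report)]
    if not rates:
        raise ValueError("no iperf interval lines found")
    return max(rates)


if __name__ == "__main__":
    ratio = peak_bits_per_sec(INBOUND) / peak_bits_per_sec(OUTBOUND)
    print(f"inbound peak is {ratio:.1f}x the outbound peak")
```

On these two runs the inbound peak comes out roughly 13 times the outbound peak.
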
>>>>>
>>>>> I checked on more than one of our stationary measurement nodes,
>>>>> switching between jumbo and non-jumbo frames. I can replicate the
>>>>> performance issue with non-jumbo frames on each of these nodes. Again,
>>>>> it appears to manifest itself only when the bandwidth-delay product
>>>>> gets above a certain threshold. Performance is still fine in both
>>>>> directions with
>>>>> jumbo frames. This is the case both with FreeBSD 6-STABLE, 7-CURRENT
>>>>> (both csup'ed a few days ago) and Gentoo Linux (vanilla kernel
>>>>> 2.6.18-web100, with very few other modifications). The platform is
>>>>> amd64--I haven't been able to compare with i386 yet.
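
Since the problem seems tied to the bandwidth-delay product, a rough sketch of the arithmetic may help. The 50 ms round-trip time below is purely an assumption for illustration, as no RTT is reported anywhere in this thread, and the function name is mine.

```python
ASSUMED_RTT_S = 0.050  # assumption: the thread never states the actual RTT


def bytes_in_flight(rate_bps: float, rtt_s: float) -> float:
    """Bytes that must be unacknowledged to sustain rate_bps over rtt_s."""
    return rate_bps * rtt_s / 8


if __name__ == "__main__":
    line_rate = 1_000_000_000   # gigabit Ethernet
    observed = 65_941_920       # best outbound interval from the iperf report

    # Bandwidth-delay product: the window needed to fill the pipe.
    print(f"BDP at line rate: {bytes_in_flight(line_rate, ASSUMED_RTT_S):,.0f} bytes")
    # Effective window implied by the throughput actually observed.
    print(f"Implied window:   {bytes_in_flight(observed, ASSUMED_RTT_S):,.0f} bytes")
```

Under that assumed RTT the pipe holds about 6.25 MB, which is below the 8 MB requested with -w 8m, while 65 Mb/s corresponds to only about 400 KB in flight. If the real RTT is in that neighborhood, the socket buffer setting alone would not explain the slow direction.
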
>>>>>
>>>>> I feel like I must be doing something wrong, especially since the
>>>>> Abilene nodes are clearly able to send traffic to me at near-line-rate
>>>>> using an MSS of only 1460, sending to our non-jumbo-enabled host.
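
On the 1460 versus the 1448 that iperf printed: a short sketch of the header arithmetic, assuming IPv4 with no IP options and the TCP timestamps option in use. The constant and function names are my own, and the assumption that timestamps were enabled on these connections is not confirmed anywhere in the thread.

```python
IPV4_HEADER = 20        # bytes, no IP options
TCP_HEADER = 20         # bytes, no TCP options
TIMESTAMP_OPTION = 12   # 10-byte timestamps option padded to 12 bytes


def mss_for_mtu(mtu: int) -> int:
    """MSS a host would advertise for a given MTU over IPv4."""
    return mtu - IPV4_HEADER - TCP_HEADER


def payload_per_segment(mtu: int, timestamps: bool = True) -> int:
    """Data bytes per full-sized segment once per-segment options are counted."""
    options = TIMESTAMP_OPTION if timestamps else 0
    return mss_for_mtu(mtu) - options


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, mss_for_mtu(mtu), payload_per_segment(mtu))
```

For a 1500-byte MTU this gives an MSS of 1460 and 1448 bytes of data per segment with timestamps, which is consistent with both the figure in the text and the "MSS size 1448" line in the report.
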
>>>>>
>>>>> Here are my sysctl variable changes.
>>>>>
>>>>> In Linux:
>>>>>
>>>>> # increase TCP maximum buffer size
>>>>> net.core.rmem_max = 16777216
>>>>> net.core.wmem_max = 16777216
>>>>>
>>>>> # increase Linux autotuning TCP buffer limits
>>>>> # min, default, and maximum number of bytes to use
>>>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>>>> net.ipv4.tcp_wmem = 4096 65536 16777216
>>>>> # don't cache ssthresh from previous connection
>>>>> net.ipv4.tcp_no_metrics_save = 1
>>>>> # recommended to increase this for 1000 BT or higher
>>>>> net.core.netdev_max_backlog = 2500
>>>>> # for 10 GigE, use this
>>>>> #net.core.netdev_max_backlog = 30000
>>>>>
>>>>> In FreeBSD:
>>>>>
>>>>> kern.ipc.maxsockbuf=16777216
>>>>> net.inet.tcp.sendspace=8388608
>>>>> net.inet.tcp.recvspace=8388608
>>>>> net.inet.tcp.inflight.enable=0 # <-- doesn't seem to have any effect
>>>>> # either way
>>>>>
>>>>> I am going to try configuring the kernel for the advanced TCP congestion
>>>>> stuff that Linux has built in and see how that goes. I will also set up
>>>>> to capture a packet trace, so that I can better figure out what
>>>>> is going on.
>>>>>
>>>>> I am hoping that Jeff Boote and/or Eric Boyd might be able to shed some
>>>>> light on how the Abilene nodes are configured so as to get the
>>>>> performance they do.
>>>>>
>>>>> And if anyone else can point out the folly of my ways, let me know...
>>>>>
>>>>> thanks,
>>>>> michael
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Network-performance mailing list
>>>>> Network-performance at lists.cenic.org
>>>>> http://lists.cenic.org/mailman/listinfo/network-performance