It took me a long time to realize that nobody really understands networks. Many say they do, many more think they do - but very few actually know what they are talking about, especially when it comes to setting up massive data centers. At the beginning of March 2008 we at JAJAH plan to turn the lights on in our new data center in New York City. If you want to visit us, you can probably find our IT & DBA teams here around the beginning of March, but don’t piss them off - they’ve been working day and night for the past couple of months, so they might be a little cranky by the time you get there.
This post is greatly influenced by our head of engineering, Alon, and his quest for the flawless data center. Alon is not alone in this quest (hhm…): he is joined by our system administrators, storage people and other experts, all working to get the perfect, scalable and robust solution that will serve our growing user community and provide a sound base for JAJAH’s business growth. But before we pin on the medals, let me share with you one bizarre experience that led me to the belief that networking is more of an ART than a science.
Alon set up a site-to-site VPN between our existing data centers and the new NY data center. However, when our DBA tried to transfer data from a remote database to our new monster DB machine in NY, the connection would simply hang. Alon and his team looked everywhere: they checked the ISP connection, firewall cluster, routers, BGP setup, MTU, blade center setup, DB cluster and OS, and installed and re-installed every piece of hardware - you name it - but nothing worked and time was running out. Our DBAs started to get agitated (big guys - you don’t want to upset them), so we had to look for a workaround. We found the most bizarre workaround ever - TCP OFFLOADING.
TCP Offloading is a great feature which moves TCP stack processing from the main CPU to the network interface. It works great and improves performance, but sometimes causes problems. Just by chance we discovered that if we disabled TCP Offloading and moved the network processing back to the main processor, things started to work. Dotan, our head DBA, was smiling again. Yet another day in the office.
Although we could not measure any performance loss from disabling TCP Offloading, we knew that we had paid for the feature, so we might as well get the damn thing to work. So while our DBA team was back on track, we started to investigate deeper. We found two things which you don’t always read about in the school textbooks, but which you run into in daily life.
First, some background. There are two important values that play a significant role when two sites have to be connected over the web (especially when firewalls are involved): the MSS (maximum segment size) - the largest amount of TCP data that one end-point will accept in a single segment, and the MTU (maximum transmission unit) - the largest packet a link will carry. MSS is especially important when trying not to fragment packets (for obvious reasons packet fragmentation is a problem when it comes to encrypted networks). The other important mechanism is PMTUD, which stands for Path MTU Discovery - an automated mechanism for finding the smallest MTU along the route when sending traffic around the web. A nice article from CISCO that explains MSS negotiation and path MTU discovery can be found here.
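To make the relation between the two numbers concrete, here is a minimal sketch in Python. It assumes plain IPv4 and TCP headers of 20 bytes each with no options, and the 60-byte tunnel overhead is a made-up figure for illustration only - the real overhead depends on your VPN settings, and this is not the configuration we actually used.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented packet of size mtu."""
    return mtu - IPV4_HEADER - TCP_HEADER


def needs_fragmentation(tcp_payload: int, link_mtu: int) -> bool:
    """True if a segment carrying tcp_payload bytes is too big for link_mtu."""
    return tcp_payload + IPV4_HEADER + TCP_HEADER > link_mtu


ETHERNET_MTU = 1500
ASSUMED_TUNNEL_OVERHEAD = 60  # hypothetical value, for illustration only
tunnel_mtu = ETHERNET_MTU - ASSUMED_TUNNEL_OVERHEAD

lan_mss = mss_for_mtu(ETHERNET_MTU)   # what a host on plain Ethernet would pick
tunnel_mss = mss_for_mtu(tunnel_mtu)  # what actually fits through the tunnel

print("LAN MSS:", lan_mss)
print("Tunnel MSS:", tunnel_mss)
print("Full-size LAN segment fits the tunnel:",
      not needs_fragmentation(lan_mss, tunnel_mtu))
```

Under these assumptions a host that picks its MSS from the local Ethernet MTU sends segments that are too big for the tunnel, which is exactly the situation where fragmentation or PMTUD has to kick in.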
A few things you have to be aware of when setting the MSS & PMTU:
- MSS is announced by each end-point when the connection is opened, and it is fixed in nature - the smaller of the two MSS values will be used by both sides.
- While MSS is set between the two end-points, the path MTU is determined by the route between them.
- The path MTU is discovered automatically using ICMP. Namely, one side sends packets of a certain size, and a router along the way that cannot forward them returns a notification using the ICMP protocol. However, ICMP is also used by the common ‘ping’ utility (ping actually sends an ICMP echo request and waits for the ICMP echo reply to return). Because of potential ‘denial of service’ attacks many security officers block ICMP messages (Alon is also our security officer, the guy would block HTTP if we weren’t looking), and this may hamper PMTUD. You need to make sure your firewall rules are set correctly to allow the relevant ICMP messages to be sent between the two VPN sites. Your access list should look something like:
    access-list 101 permit icmp any any unreachable
    access-list 101 permit icmp any any time-exceeded
    access-list 101 deny icmp any any
    access-list 101 permit ip any any
- For PMTUD to work, the DF (‘don’t fragment’) bit must be set on the packets, so make sure the routers (including the VPN end-points) keep it set and do not clear it.
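If you want to see why blocked ICMP makes a connection hang rather than fail loudly, here is a toy model of PMTUD in Python. It is only a sketch: real PMTUD lives inside the OS network stack, and the function names and the hop MTUs below are invented for illustration. The one real-world fact it relies on is that a router which cannot forward a DF packet drops it and sends back an ICMP ‘fragmentation needed’ message carrying the MTU of the next hop.

```python
from typing import List, Optional


def first_bottleneck(packet_size: int, hop_mtus: List[int]) -> Optional[int]:
    """Return the MTU of the first hop that cannot carry the packet, else None."""
    for mtu in hop_mtus:
        if packet_size > mtu:
            return mtu
    return None


def discover_path_mtu(initial_mtu: int, hop_mtus: List[int],
                      icmp_allowed: bool, max_attempts: int = 10) -> Optional[int]:
    """Simulate a sender that sets DF and shrinks its packets on ICMP feedback.

    Returns the discovered path MTU, or None if the sender never finds out
    (the 'black hole' case, where large packets silently disappear).
    """
    size = initial_mtu
    for _ in range(max_attempts):
        bottleneck = first_bottleneck(size, hop_mtus)
        if bottleneck is None:
            return size  # the packet made it across every hop
        if not icmp_allowed:
            return None  # packet dropped, ICMP filtered: sender keeps retrying in vain
        size = bottleneck  # ICMP reported the next-hop MTU, so retry smaller
    return None


def effective_mss(local_mss: int, remote_mss: int) -> int:
    """The smaller of the two announced MSS values wins."""
    return min(local_mss, remote_mss)


hypothetical_path = [1500, 1440, 1500]  # made-up hop MTUs, tunnel in the middle

print("ICMP allowed:", discover_path_mtu(1500, hypothetical_path, icmp_allowed=True))
print("ICMP blocked:", discover_path_mtu(1500, hypothetical_path, icmp_allowed=False))
print("Effective MSS:", effective_mss(1460, 1400))
```

In the blocked case small packets still get through, so the connection opens fine and only stalls once full-size segments start flowing - which looks a lot like a bulk DB transfer that simply hangs.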
Once Alon got the MSS correctly configured, and proper rules were applied to allow PMTUD - we could enable TCP Offloading again. Go figure.
Amichay
p.s.
Ori, Alon, Nir, Ilya, Eran, Dotan, Dani - this one is for you…
