Wednesday, July 6, 2011

IPv4 route lookup is now scalable

People have long been talking about the IPv4 FIB structures that were maintained as hash tables. With a huge number of routes installed on a Linux system, this does not scale: hash-table lookups are much slower than a tree walk. Now, with 2.6.39 (or maybe a slightly earlier release), the IPv4 FIB structures are maintained as a trie. This makes route lookup much more efficient in a scaled environment.

Even though Linux is full of networking features, the hash-table lookup had been a mental block. Now that mental block is gone, and the naysayers can stop blabbering about all this :(

v2.6.39/net/ipv4/fib_trie.c:fib_table_lookup()

The route caches are still maintained as a hash table -

v2.6.39/net/ipv4/route.c:__ip_route_output_key()
rt_hash_table

Friday, November 19, 2010

about grammar issues

I was quite disappointed after going through readers' comments. I understand that I'm not a native English speaker, and at the same time this was my first experience working on such a big write-up. But I also understand that the above cannot be excuses. I feel bad when readers mention the difficulty :(

This time I'm working very hard to overcome all the grammar issues and will make sure that things won't be the same. It will be much, much better this time: the latest kernel, minus the grammatical errors.

Saturday, September 11, 2010

kernel ticks, jiffies and high resolution timers...

Kernel ticks and jiffies are explained beautifully in many places, and the best explanation can be found in "Understanding the Linux Kernel" (O'Reilly). I'm just trying to summarize some of the facts here!!!!

A 'jiffy', or 'tick', is the finest time resolution that we can get on a Linux system. For any sleep/poll/select/timer functionality, in the kernel or in user land, the unit of time is jiffies. For example, we can block for a minimum of one jiffy, whether in the kernel or in user space. What does all this mean?

Any event like sleep/poll/timers needs a notification mechanism to wake up after the specified time. To be more specific, if a process asks to be woken up after 1 second by way of sleep or poll, the kernel registers a timer for this event. The timer fires after 1 second, and the callback routine does nothing but move the process from the wait queue to the run queue so that it gets scheduled some time in the future. Every wait event is converted into a timer event by the kernel this way. What is the minimum time that one can ask the kernel to wait before notifying you? How is this time measured?

The minimum time we can request the kernel to wait before notifying us is one 'jiffy'. A jiffy is nothing but a counter that is incremented periodically in the timer interrupt handler. The timer interrupt is programmed to occur periodically at a pre-defined rate. This rate, or frequency, of the timer interrupt is defined by 'HZ': the number of timer interrupts per second. Earlier this value was set to 100, but some time back it was changed to 250 for x86 systems.

HZ=100 means the timer interrupt happens 100 times a second, i.e., every 1/100 s = (1000/100) 10 ms. Which essentially means the jiffy counter is incremented every 10 ms, or that the highest time resolution that can be achieved is 10 ms. With respect to a high-speed CPU, this can be a very long time. If we increase HZ, we can achieve higher resolution, but the CPU will spend more time per second processing the timer interrupt. So there is a tradeoff between timer resolution and the percentage of time spent in timer-interrupt processing. On fast CPUs this should not be an issue.

How do we relate jiffies to kernel notification events (timers)? As we already discussed, sleep/poll/timers are converted into notification events (timers) by the kernel in terms of a number of jiffies, which can be a minimum of one jiffy from now. When are these kernel timers processed?

Kernel timers are processed whenever the timer interrupt occurs. So, if we register a timer to expire after one jiffy, we will be woken up on the occurrence of the next timer interrupt. Imagine a situation where we need to be woken up after 1 ms, or one where we need to be woken up after 5 ms, with HZ=100. In the first case, we might get woken up after 1 ms if the timer interrupt is 1 ms away, or after as much as 10 ms if a timer interrupt has just occurred. In the latter case, we face the same problem: we can be woken up before 5 ms have passed, or as late as 10 ms. One classic example is RTT calculation. If the RTT is below one jiffy, we still see it as a minimum of one jiffy; and if we transmit a packet just before a timer interrupt and get the ACK just after it, we still see an RTT of one jiffy, whereas it may really be a few microseconds. All of this gets us into serious issues as far as the system's timer resolution is concerned. These issues can, to some extent, be solved by increasing the frequency of the timer interrupt (HZ). So, current x86 systems are programmed for a timer-interrupt frequency of 250 (see CONFIG_HZ set to 250 in autoconf.h), which means an interrupt every 4 ms. On high-end systems this HZ value is as high as 1000, where one jiffy is 1 ms. Normally, we see HZ defined as -

include/*/param.h

#ifdef __KERNEL__
# define HZ CONFIG_HZ /* Internal kernel timer frequency */
# define USER_HZ 100 /* some user interfaces are */
# define CLOCKS_PER_SEC (USER_HZ) /* in "ticks" like times() */
#endif


Let's throw some light on the implementation -

There are two variables, jiffies and jiffies_64. jiffies_64 is a 64-bit value, whereas jiffies is an unsigned long: on a 32-bit architecture, jiffies is just a 32-bit value, while jiffies_64 is still 64 bits. The reason for keeping two variables is that with a higher timer-interrupt frequency, a 32-bit jiffies value wraps around much faster, so the 64-bit jiffies_64 helps. The lower-order 32 bits of jiffies_64 are what most parts of the kernel actually use, since 32 bits satisfy the usual time-resolution requirements and accessing a 64-bit value on a 32-bit architecture takes more than one memory access -

include/linux/jiffies.h

extern u64 __jiffy_data jiffies_64;
extern unsigned long volatile __jiffy_data jiffies;

jiffies_64 is incremented on the timer interrupt in the do_timer() call -

void do_timer(unsigned long ticks)
{
        jiffies_64 += ticks;
        ....
}

So, where is jiffies incremented? We don't increment jiffies explicitly: 'jiffies' aliases the lower-order 32 bits of the 64-bit jiffies_64 variable, and we can see this in the linker script -

arch/x86/kernel/vmlinux.lds.S

#ifdef CONFIG_X86_32
OUTPUT_ARCH(i386)
ENTRY(phys_startup_32)
jiffies = jiffies_64;


What happens on an SMP machine? On SMP we have a timer interrupt for each CPU, and these occur independently on each CPU. So, how do we increment jiffies?

On SMP, only one CPU is responsible for incrementing jiffies, while every CPU carries out the rest of the timer-interrupt functionality as usual. Whichever CPU gets to the tick_setup_device() routine first does the jiffies management -

kernel/time/tick-common.c

static void tick_setup_device(struct tick_device *td,
                              ....)
{
        ...
        if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
                tick_do_timer_cpu = cpu;
                ....
        }
        ....
}

So, tick_do_timer_cpu stores the ID of the CPU that will manage jiffies. This variable is checked against the current CPU's ID in tick_periodic(); only if we are that CPU is do_timer() called. Similarly, we call tick_do_update_jiffies64() to update jiffies_64 from tick_sched_timer().


What are high resolution timers?

This is about timer resolution far finer than one jiffy. If the kernel and hardware support high-resolution timers of the order of nanoseconds, it is not practical to have a periodic timer interrupt every few nanoseconds: it would drag down system performance because the CPU would spend most of its time in the timer-interrupt handler. Instead, the timer hardware is programmed on demand to fire at a point in the future, depending on the next pending timer event. If that event is a few nanoseconds away, we program the interrupt to occur at that time; otherwise we program it to happen after a default time (10 ticks; if a tick is 1 ms, then after 10 ms). Check tick_do_update_jiffies64(), called from tick_sched_timer(). All high-resolution timer code is compiled under the CONFIG_HIGH_RES_TIMERS macro.

Let's learn more about tickless kernel and high-resolution timers in our next discussion....

Let's find out more ....

Saturday, March 20, 2010

various TCP flavors modularized in kernel 2.6.xx



Modularized TCP congestion implementation

The Linux TCP implementation has been revolutionized by modularizing the various TCP congestion-control flavors. From kernel 2.6.13 onwards we see this kind of organization. Different modules were introduced for the various TCP flavors like Vegas, Westwood, HighSpeed TCP, Reno, etc. One file per TCP flavor exists under the directory -

net/ipv4/

Flexibility for applications to select TCP flavor -

The biggest flexibility that Linux offers by modularizing the congestion flavors is that each TCP connection can choose a congestion algorithm according to its own requirements. A given application can have multiple TCP connections, each one using a different link to connect to its peer: over a LAN, a slow WAN, a high-speed optical network, or a wireless network. These connections require different congestion algorithms to be active because of the varying network characteristics. So the application has the flexibility of selecting a congestion algorithm per connection. If an application does not select any specific TCP congestion flavor, Reno is the default selection. A new sysctl (net.ipv4.tcp_congestion_control) was introduced to change the default congestion flavor!!!!

The TCP_CONGESTION socket option (at level IPPROTO_TCP) can be exploited to change the congestion flavor for your connection -

The name of the TCP flavor is passed to the socket option, and that flavor needs to be present on the system, either built in or as a loadable module.

Inside the kernel -

Each TCP flavor module has to fill in its own struct tcp_congestion_ops and register it with the system. The key to finding a specific TCP congestion module is the flavor name, like "vegas" or "highspeed", which is stored in the name field of struct tcp_congestion_ops. Congestion parameters like the slow-start threshold and the congestion window are evaluated by routines registered by each module, and the congestion-avoidance and recovery algorithms are likewise implemented as part of these modules.


More on the way .....

Monday, August 10, 2009

Major differences in TCP/IP stack from 2.4 to 2.6 kernel

I got a lot of comments on the book saying that it is not up-to-date. This is partially correct, and it is all in the mind. I have myself looked into the latest 2.6 networking kernel as related to TCP/IP. I found that, except for a few features, the major algorithms remain the same. If I take the route-lookup algorithm, I really don't see any difference. So performance, as far as route lookup is concerned, should not vary much, since the data structure for the routing table remains the same and Linux still implements a hash table. Where I do see a major difference is in the locking mechanism for the route caches.

- Instead of read spin locks, we now use RCU (read-copy-update) locks. This gives a big gain in route-cache lookup performance when huge numbers of route lookups are performed on an SMP Linux machine.
- ip_route_output_key() takes an additional argument, 'struct flowi', and so does ip_route_output_slow(). In 2.4.x, 'struct rt_key' was used for searching instead. struct flowi is a more comprehensive search key for finding a more exact match for a given flow, e.g., by source/destination port numbers. This is one of the major differences, but updating the base routing table (FIB) has not changed, and there is no performance gain in searching/inserting/deleting entries there.
- Another area with a major difference is the socket data structure. Once again, the changes have little to do with performance; it is just a better organization of the socket structure for better code readability. The following macros are used to access protocol-specific socket information from 'struct sock':
- inet_sk() for getting a pointer to the inet-protocol-family part.
- tcp_sk()/udp_sk() for getting the TCP/UDP protocol-specific socket pointer. These map the original 'struct sock' to struct tcp_sock/udp_sock.

The entire description can't go here, as it requires a lot of space. But I'll keep updating the entries wherever I find major differences that can be explained briefly. If we know the basic skeleton and terminology of the Linux networking kernel, it is easy to understand the most recent code and reason it out.
I'm working on the next revision, where I'm putting in all the details related to the differences, and much more........!!!!

Friday, February 13, 2009

book piracy and terrorism

I find myself in a helpless situation when I see sites allowing free downloads of recently published titles. Some countries exploit loopholes in the DMCA and allow uploads of complete book content that has not yet soaked into the market. There are much better ways to share knowledge than uploading scanned books. People who cannot afford to buy a book can always make use of libraries or share a copy. But downloading free books from these sites is almost supporting crime. First of all, it is the moral responsibility of the site owners to avoid such illegal practice in the name of "sharing knowledge". If this were for a social cause, they would open a library and buy books for everyone. A knowledge-sharing agenda can never go together with theft and illegal practices. Yes, I'm talking about http://www.rapidshare.com and all other such sites that are responsible for these illegal practices. I appeal to the respectable audience to avoid downloading books from these sites. Instead, use libraries or share the book with others, but avoid supporting illegal practices. There is effort and there are resources involved, from both the author's and the publisher's end. Publishing companies and book stores earn income by selling these books; the sale of a copy of a title earns bread for many in the chain. Lastly, knowledge gained by illegal means will never serve any purpose. I hope for co-operation, at least from those who are downloading books. If they avoid downloading from such illegal sites, it will surely discourage the site owners from carrying on such illegal and immoral practices.

Monday, February 2, 2009

about my book Linux TCP/IP - ISBN 0470147733

Hi! This is the first time I'm blogging on this site, so some introduction is in order. I'm Sameer Seth, and I have been associated with the software industry for around a decade. I worked with Sun Microsystems as an Internet engineer, where I worked on NFS, the TCP/IP stack, and the STREAMS framework. Currently I'm working with Juniper as a senior engineer in the JUNOS kernel team. My activities revolve mostly around the open-source community, which I try to serve in whatever way I can. I've worked on OpenSolaris, and Linux is an all-time favourite, as I've known the industry by this name since I entered it. As far as OpenSolaris is concerned, my contributions are presentations on OpenSolaris Internet technology, and my blogging goes here - http://blogs.sun.com/sameers

I've recently written a book, along with my co-author Ajay, that takes a different approach: "TCP/IP Architecture, Design and Implementation in Linux" - ISBN 0470147733. The Amazon link to the title is - http://www.amazon.com/exec/obidos/ASIN/0470147733/ref=nosim/porfessionalp4-20

Conceptualization of "TCP/IP Architecture, Design and Implementation in Linux"

The book was started long back, when kernel 2.4 was the most stable and accepted release, with the intent of just documenting TCP/IP internals in Linux; I never thought it would take the shape of a book. By the time the title was accepted by the publisher (IEEE Computer Society - Wiley), kernel 2.6 had already appeared and was slowly being accepted. That was when a decision had to be made whether to target kernel 2.6. There was little time to do all this, because the format of the book and its descriptions was such that a huge amount of work would have been required to convert the existing material: the code snippets included line numbers, and the explanations referred to those line numbers. So we planned to keep it on kernel 2.4 and to revise the contents for kernel 2.6 in the next edition.
What is different in "TCP/IP Architecture, Design and Implementation in Linux"

So much for the 2.4-vs-2.6 fight. The intent of writing the book was not to show off the latest, but to bridge a gap. There are good books that explain Linux network internals, and one of them is "Understanding Linux Network Internals" by Christian Benvenuti. It explains IP and the related protocols to a great extent. But as a newcomer to the Linux networking stack, I'd like to know, very systematically, what happens when -
- we do recv/send over a TCP socket?
- a connection is established?
- what do the socket, bind, listen, accept, and connect system calls do internally?
- how is route lookup done?
- how is a packet transmitted and received, end to end, in the Linux kernel?
- how are TCP timers and memory managed?
- how is the TCP state machine implemented, and how is congestion handled?
- how does the QOS implementation fit into the stack, with implementation details illustrating CBQ?
- how are the various layers in the stack linked, and what are the boundaries of each layer?
- how are the netfilter hooks implemented, and what is their relevance in the stack?
- what are the implementation details of ipchains and iptables?
- what is the NET softIRQ implementation, and what is its relevance to packet processing on SMP architectures?

I did not find any book tying these ends together and linking each part of the kernel responsible for making up the stack. I did not find an exact explanation of the working of the NET softIRQ, and even where someone explains it, its relevance to packet reception and transmission, and to scalability on SMP architectures, is not very well explained. Moreover, no book explains TCP and its implementation in Linux. Nothing covers all of this together with netfilter (the network hooks in the Linux stack framework) and the QOS implementation in one place. For example, the netfilter framework gives a very clear picture of what needs to be done when you want to implement your own extensions to the TCP/IP stack, like NAT or masquerading, but no book explains the details of NAT and masquerading in that light. So I thought of bridging these gaps, along with explaining, in my own way, all those topics that have already been covered in detail by other authors. The intent was not to compete with the good existing titles; it was to bring something that didn't exist.

Why not bother about 2.4 vs. 2.6 with "TCP/IP Architecture, Design and Implementation in Linux"

Many question me - why not kernel 2.6? I'd say that if you are a learner, the kernel version should not matter much (2.4<->2.6). These are subsequent versions, and differences will surely be there, but the basic skeleton never changes. Most of the important changes, as far as the networking stack and its supporting framework are concerned, were done from 2.2 to 2.4 (softIRQ). The rest of what we see is new features added on top of the existing skeleton, data-structure layout changes to accommodate those features, code optimization, or performance enhancements. Some of the changes are just the merging of two or more functions into a new one.
These changes matter for those who are contributing to kernel 2.6 directly. For those who are starting their journey into the Linux networking stack (even for newcomers who are directly touching kernel 2.6), 2.4 vs. 2.6 does not matter as far as the networking stack is concerned. If you are a starter with no background, knowledge of the kernel 2.4 implementation will surely help you dive into kernel 2.6. All you need to know is - what to look for where, which part is responsible for what in the TCP/IP stack, which frameworks support it, and how these various parts are linked. Even if you already know most of this, the book can still serve as a reference point.

The debugging chapter in "TCP/IP Architecture, Design and Implementation in Linux"

As far as the debugging chapter is concerned, lcrash may be outdated for kernel 2.6, but I'd say it is still supported there. The intent was not to show what lcrash or other debuggers can do, but to let readers visualize which kernel data structures are involved, and to intersect the stack while different operations are performed. Other debuggers offer commands equivalent to lcrash's that can be used to do the same things shown in the debugging chapter, so this should not matter.

Expectations from "TCP/IP Architecture, Design and Implementation in Linux"

So, the main attraction of the book is its systematic approach to understanding the Linux TCP/IP stack implementation and the other supporting kernel frameworks. The intent is to bridge gaps left by the existing titles, not to re-work what others have done beautifully. This will not only help new enthusiasts gain a better understanding of the subject, but will also act as a reference point for experienced contributors to the Linux community. I've started planning the next edition, which will explain the latest kernel 2.6.

That is a long way off, but I'll make sure it appears before the next kernel version is out. I hope this information is helpful. If anyone is interested in reviewing the book, please let me know, along with brief biodata about yourself (required by the publisher). If someone is interested and qualifies, a copy will be sent to the mentioned address.