Discussion:
dynamic linking files slow fork down significantly
(too old to reply)
David Lang
2007-03-03 01:20:08 UTC
Permalink
I have a fork-heavy workload (a proxy that forks per connection, I know it's not
the most efficiant design) and I discovered a 2x performance difference between
a static and dynamicly linked version of the same program (2200 connections/sec
vs 4700 connections/sec)

I know that there is overhead on program startup, but didn't expect to find it
on a fork with no exec. If I has been asked I would have guessed that the static
version would have been slower due to the need to mark more memory as COW.

what is it that costs so much with dynamic libraries on a fork/clone?

according to strace, the clone call that's being made is
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xb7c92c08)

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Eric Dumazet
2007-03-03 07:50:13 UTC
Permalink
Post by David Lang
I have a fork-heavy workload (a proxy that forks per connection, I know
it's not the most efficiant design) and I discovered a 2x performance
difference between a static and dynamicly linked version of the same
program (2200 connections/sec vs 4700 connections/sec)
I know that there is overhead on program startup, but didn't expect to
find it on a fork with no exec. If I has been asked I would have guessed
that the static version would have been slower due to the need to mark
more memory as COW.
what is it that costs so much with dynamic libraries on a fork/clone?
man ld.so

LD_BIND_NOW
If set to non-empty string, causes the dynamic linker to resolve
all symbols at program startup instead of deferring function
call resolval to the point when they are first referenced.


If you do :
export LD_BIND_NOW=1
before starting your dynamicaly linked version, do you get different numbers ?

If some symbols are resolved dynamically after your forks(), the dynamic
linker has to dirty some parts of memory and each child gets its own copy of
modified pages. The cpu cost is not factorized, and memory needs are larger,
so cpu caches are less efficient.

With LD_BIND_NOW=1, the initial exec of your programm will be a litle bit
longer, but in the end you win.

You may see effect of immediate binding with ldd command :
Its -r option asks to do the full binding :

# time ldd ./groff
libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0xf7ea0000)
libm.so.6 => /lib/tls/libm.so.6 (0xf7e7e000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7e73000)
libc.so.6 => /lib/tls/libc.so.6 (0x42000000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xf7f5d000)
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+696minor)pagefaults 0swaps

# time ldd -r ./groff
libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0xf7e8f000)
libm.so.6 => /lib/tls/libm.so.6 (0xf7e6d000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7e62000)
libc.so.6 => /lib/tls/libc.so.6 (0x42000000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xf7f4c000)
0.00user 0.00system 0:00.00elapsed 50%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+777minor)pagefaults 0swaps

You can see 777 pagefaults instead of 696 on this example.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
David Lang
2007-03-04 08:50:10 UTC
Permalink
Post by Eric Dumazet
Post by David Lang
I have a fork-heavy workload (a proxy that forks per connection, I know
it's not the most efficiant design) and I discovered a 2x performance
difference between a static and dynamicly linked version of the same
program (2200 connections/sec vs 4700 connections/sec)
I know that there is overhead on program startup, but didn't expect to find
it on a fork with no exec. If I has been asked I would have guessed that
the static version would have been slower due to the need to mark more
memory as COW.
what is it that costs so much with dynamic libraries on a fork/clone?
man ld.so
LD_BIND_NOW
If set to non-empty string, causes the dynamic linker to resolve
all symbols at program startup instead of deferring
function
call resolval to the point when they are first referenced.
export LD_BIND_NOW=1
before starting your dynamicaly linked version, do you get different numbers ?
slightly, with LD_BIND_NOW=1 the dynamicly linked version goes up to 3000
connections/sec, gaining back about half the difference.
Post by Eric Dumazet
If some symbols are resolved dynamically after your forks(), the dynamic
linker has to dirty some parts of memory and each child gets its own copy of
modified pages. The cpu cost is not factorized, and memory needs are larger,
so cpu caches are less efficient.
thanks, this makes sense.
Post by Eric Dumazet
With LD_BIND_NOW=1, the initial exec of your programm will be a litle bit
longer, but in the end you win.
definantly, any cost that can be moved to startup instead of the per-connection
handling pays off _very_ quickly
Post by Eric Dumazet
# time ldd ./groff
libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0xf7ea0000)
libm.so.6 => /lib/tls/libm.so.6 (0xf7e7e000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7e73000)
libc.so.6 => /lib/tls/libc.so.6 (0x42000000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xf7f5d000)
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+696minor)pagefaults 0swaps
# time ldd -r ./groff
libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0xf7e8f000)
libm.so.6 => /lib/tls/libm.so.6 (0xf7e6d000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7e62000)
libc.so.6 => /lib/tls/libc.so.6 (0x42000000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xf7f4c000)
0.00user 0.00system 0:00.00elapsed 50%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+777minor)pagefaults 0swaps
You can see 777 pagefaults instead of 696 on this example.
the version of time on my system (debian 3.1) doesn't give me the cpu or
pagefault info, just the times. where should I get the version that gives more
info? (although I don't think I can apply it directly to my troubleshooting as
the work is all being done in child processes).

David Lang
Eric Dumazet
2007-03-04 10:20:06 UTC
Permalink
Post by David Lang
Post by Eric Dumazet
# time ldd -r ./groff
libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0xf7e8f000)
libm.so.6 => /lib/tls/libm.so.6 (0xf7e6d000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7e62000)
libc.so.6 => /lib/tls/libc.so.6 (0x42000000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xf7f4c000)
0.00user 0.00system 0:00.00elapsed 50%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+777minor)pagefaults 0swaps
You can see 777 pagefaults instead of 696 on this example.
the version of time on my system (debian 3.1) doesn't give me the cpu or
pagefault info, just the times. where should I get the version that
gives more info? (although I don't think I can apply it directly to my
troubleshooting as the work is all being done in child processes).
time is a shell builtin.
But there is normally an /usr/bin/time standalone program.

# type time
time is a shell keyword
# time date
Sun Mar 4 12:15:54 CET 2007

real 0m0.003s
user 0m0.000s
sys 0m0.000s

# /usr/bin/time date
Sun Mar 4 12:15:56 CET 2007
0.00user 0.00system 0:00.00elapsed ?%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+192minor)pagefaults 0swaps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Loading...