Discussion:
[PATCH] Linux Kernel Markers
Mathieu Desnoyers
2006-09-18 23:50:07 UTC
Hello,

Following this huge discussion thread, I tried to come up with a marker
mechanism (something everyone seems to agree is a necessity) that would be
useful to each kind of tracing, dynamic and static (concerned projects:
SystemTAP, LKET, LKST, LTTng), and even to combinations of those. Religious
considerations aside, I really think that this kind of generic markup is
necessary to fill *everybody*'s needs. If I forgot a specific aspect of
genericity, please tell me.

I take it as agreed that both static and dynamic tracing are useful for
different needs and that a full markup must support both, and combinations of
both, letting the user or the distribution choose.

If you like it, please add the right menuconfig lines in arch/*/Kconfig and a
NOPS macro in include/asm-*/marker.h.

Comments are, as always, welcome.

Mathieu

--- BEGIN ---

--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -1082,6 +1082,8 @@ config KPROBES
for kernel debugging, non-intrusive instrumentation and testing.
If in doubt, say "N".

+source "kernel/Kconfig.marker"
+
source "ltt/Kconfig"

endmenu
--- /dev/null
+++ b/include/asm-i386/marker.h
@@ -0,0 +1,12 @@
+/*****************************************************************************
+ * marker.h
+ *
+ * Code markup for dynamic and static tracing. i386 support.
+ *
+ * Mathieu Desnoyers <***@polymtl.ca>
+ *
+ * September 2006
+ */
+
+#define JPROBE_TARGET \
+ __asm__ ( GENERIC_NOP5 )
--- /dev/null
+++ b/include/linux/marker.h
@@ -0,0 +1,77 @@
+/*****************************************************************************
+ * marker.h
+ *
+ * Code markup for dynamic and static tracing.
+ *
+ * Use either :
+ * MARK
+ * MARK_NOPRINT (will never call printk)
+ * MARK_STATIC (not dynamically instrumentable, will never call printk)
+ *
+ * Example :
+ *
+ * MARK(subsystem_event, "Event happened %d %s", someint, somestring);
+ * Where :
+ * - subsystem is the name of your subsystem.
+ * - event is the name of the event to mark.
+ * - "Event happened %d %s" is the formatted string for printk.
+ * - someint is an integer.
+ * - somestring is a char *.
+ *
+ * Mathieu Desnoyers <***@polymtl.ca>
+ *
+ * September 2006
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+
+#include <asm/marker.h>
+
+#define MARK_SYM(event) \
+ __asm__ ( "__mark_" KBUILD_BASENAME "_" #event ":" )
+
+#define MARK_INACTIVE(event, format, args...)
+
+#define MARK_PRINT(event, format, args...) printk(format, ##args);
+
+#define MARK_FPROBE(event, format, args...) fprobe_##event(args);
+
+#define MARK_KPROBE(event, format, args...) MARK_SYM(event);
+
+#define MARK_JPROBE(event, format, args...) \
+ do { \
+ MARK_SYM(event); \
+ JPROBE_TARGET; \
+ } while(0)
+
+/* Menu configured markers */
+#ifndef CONFIG_MARK
+#define MARK MARK_INACTIVE
+#elif defined(CONFIG_MARK_PRINT)
+#define MARK MARK_PRINT
+#elif defined(CONFIG_MARK_FPROBE)
+#define MARK MARK_FPROBE
+#elif defined(CONFIG_MARK_KPROBE)
+#define MARK MARK_KPROBE
+#elif defined(CONFIG_MARK_JPROBE)
+#define MARK MARK_JPROBE
+#endif
+
+#ifndef CONFIG_MARK_NOPRINT
+#define MARK_NOPRINT MARK_INACTIVE
+#elif defined(CONFIG_MARK_NOPRINT_FPROBE)
+#define MARK_NOPRINT MARK_FPROBE
+#elif defined(CONFIG_MARK_NOPRINT_KPROBE)
+#define MARK_NOPRINT MARK_KPROBE
+#elif defined(CONFIG_MARK_NOPRINT_JPROBE)
+#define MARK_NOPRINT MARK_JPROBE
+#endif
+
+#ifndef CONFIG_MARK_STATIC
+#define MARK_STATIC MARK_INACTIVE
+#else
+#define MARK_STATIC MARK_FPROBE
+#endif
+
+
--- /dev/null
+++ b/kernel/Kconfig.marker
@@ -0,0 +1,75 @@
+# Code markers configuration
+
+menu "Marker configuration"
+
+
+config MARK
+ bool "Enable MARK code markers"
+ default y
+ help
+ Activate markers that can call printk or can be instrumented
+ dynamically.
+
+choice
+ prompt "MARK code marker behavior"
+ default MARK_KPROBE
+ depends on MARK
+ help
+ Configuration of markers that can call printk or can be
+ instrumented dynamically.
+
+config MARK_KPROBE
+ bool "KPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event".
+config MARK_JPROBE
+ bool "JPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event"
+ and create a target for a high speed dynamic probe.
+config MARK_FPROBE
+ bool "FPROBE"
+ ---help---
+ Change markers for a function call.
+config MARK_PRINT
+ bool "PRINT"
+ ---help---
+ Call printk from the marker.
+endchoice
+
+config MARK_NOPRINT
+ bool "Enable MARK_NOPRINT code markers"
+ default y
+ help
+ Activate markers that cannot call printk.
+
+choice
+ prompt "MARK_NOPRINT code marker behavior"
+ default MARK_NOPRINT_KPROBE
+ depends on MARK_NOPRINT
+ help
+ Configuration of markers that cannot call printk.
+
+config MARK_NOPRINT_KPROBE
+ bool "KPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event".
+config MARK_NOPRINT_JPROBE
+ bool "JPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event"
+ and create a target for a high speed dynamic probe.
+config MARK_NOPRINT_FPROBE
+ bool "FPROBE"
+ ---help---
+ Change markers for a function call.
+endchoice
+
+config MARK_STATIC
+ bool "Enable MARK_STATIC code markers"
+ default y
+ help
+ Activate markers that cannot be instrumented dynamically. They will
+ generate function calls to each function-style probe.
+
+endmenu


--- END ---
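
A minimal userspace sketch of how the build-time selection above plays out at a
call site. It is not part of the patch: printk is replaced by printf, the
CONFIG_MARK_FPROBE define is hard-wired instead of coming from Kconfig,
fprobe_subsystem_event() and do_work() are hypothetical names, and the macros
are written without the trailing semicolon as a choice of this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Normally chosen by Kconfig; hard-wired here so the sketch builds as-is. */
#define CONFIG_MARK_FPROBE 1

/* Userspace stand-in for the kernel's printk() (assumption of this sketch). */
#define printk(...) printf(__VA_ARGS__)

/* Hypothetical probe for the FPROBE flavour; it only records its arguments. */
static int fprobe_calls;
static int fprobe_last_int;

static void fprobe_subsystem_event(int someint, const char *somestring)
{
	assert(somestring != NULL);
	fprobe_calls++;
	fprobe_last_int = someint;
}

/*
 * Marker flavours (GNU named variadic macros, as in the patch), written
 * without a trailing semicolon so "if (x) MARK(...); else ..." still parses.
 */
#define MARK_INACTIVE(event, format, args...)	do { } while (0)
#define MARK_PRINT(event, format, args...)	printk(format, ##args)
#define MARK_FPROBE(event, format, args...)	fprobe_##event(args)

#if defined(CONFIG_MARK_PRINT)
#define MARK MARK_PRINT
#elif defined(CONFIG_MARK_FPROBE)
#define MARK MARK_FPROBE
#else
#define MARK MARK_INACTIVE
#endif

/* A call site: the same source line serves every flavour. */
void do_work(int someint, const char *somestring)
{
	MARK(subsystem_event, "Event happened %d %s\n", someint, somestring);
}
```

With the FPROBE flavour selected, each do_work() call ends up in
fprobe_subsystem_event(); removing the CONFIG define turns the marker into an
empty statement.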


OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Alan Cox
2006-09-19 00:20:09 UTC
Post by Mathieu Desnoyers
+#define MARK_KPROBE(event, format, args...) MARK_SYM(event);
+
+#define MARK_JPROBE(event, format, args...) \
+ do { \
+ MARK_SYM(event); \
+ JPROBE_TARGET; \
+ } while(0)
Seems a good path, and it has scope to be combined with some of our debug
trace printks to take them out into trace tool space instead of
cluttering up mainstream.

Dave Jones
2006-09-19 01:20:09 UTC
Post by Mathieu Desnoyers
+ *
+ * September 2006
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
config.h is automatically included in the build process.
kernel.h is too iirc.

Dave
Ingo Molnar
2006-09-19 08:30:12 UTC
Post by Mathieu Desnoyers
+choice
+ prompt "MARK code marker behavior"
+config MARK_KPROBE
+config MARK_JPROBE
+config MARK_FPROBE
+ Change markers for a function call.
+config MARK_PRINT
as indicated before in great detail, NACK on this proliferation of
marker options, especially the function call one. I'd like to see _one_
marker mechanism that distros could enable, preferably with zero (or at
most one NOP) in-code overhead. (You can of course patch whatever
extension on top of it, in out-of-tree code, to gain further performance
advantage by generating direct system-calls.)

There might be a hodgepodge of methods and tools in userspace to do
debugging, but in the kernel we should get our act together and only
take _one_ (or none at all), and then spend all our efforts on improving
that primary method of debug instrumentation. As kprobes/SystemTap has
proven, it is possible to have zero-overhead inactive probes.

Furthermore, for such a patch to make sense in the upstream kernel,
downstream tracing code has to make actual use of that NOP-marker. I.e.
a necessary (but not sufficient) requirement for upstream inclusion (in
my view) would be for this mechanism to be used by LTT and LKST. (again,
you can patch LTT for your own purposes in your own patchset if you
think the performance overhead of probes is too much)

Ingo
Ingo Molnar
2006-09-19 08:30:15 UTC
Post by Ingo Molnar
Post by Mathieu Desnoyers
+choice
+ prompt "MARK code marker behavior"
+config MARK_KPROBE
+config MARK_JPROBE
+config MARK_FPROBE
+ Change markers for a function call.
+config MARK_PRINT
as indicated before in great detail, NACK on this proliferation of
marker options, especially the function call one. I'd like to see _one_
marker mechanism that distros could enable, preferably with zero (or at
most one NOP) in-code overhead. (You can of course patch whatever
extension on top of it, in out-of-tree code, to gain further performance
advantage by generating direct system-calls.)
^---function

Ingo
Martin J. Bligh
2006-09-19 15:20:11 UTC
Post by Ingo Molnar
Post by Mathieu Desnoyers
+choice
+ prompt "MARK code marker behavior"
+config MARK_KPROBE
+config MARK_JPROBE
+config MARK_FPROBE
+ Change markers for a function call.
+config MARK_PRINT
as indicated before in great detail, NACK on this proliferation of
marker options, especially the function call one. I'd like to see _one_
marker mechanism that distros could enable, preferably with zero (or at
most one NOP) in-code overhead. (You can of course patch whatever
extension on top of it, in out-of-tree code, to gain further performance
advantage by generating direct system-calls.)
There might be a hodgepodge of methods and tools in userspace to do
debugging, but in the kernel we should get our act together and only
take _one_ (or none at all), and then spend all our efforts on improving
that primary method of debug instrumentation. As kprobes/SystemTap has
proven, it is possible to have zero-overhead inactive probes.
Furthermore, for such a patch to make sense in the upstream kernel,
downstream tracing code has to make actual use of that NOP-marker. I.e.
a necessary (but not sufficient) requirement for upstream inclusion (in
my view) would be for this mechanism to be used by LTT and LKST. (again,
you can patch LTT for your own purposes in your own patchset if you
think the performance overhead of probes is too much)
You know ... it strikes me that there's another way to do this, that's
zero overhead when not enabled, and gets rid of the inflexibility in
kprobes. It might not work well in all cases, but at least for simple
non-inlined functions, it'd seem to.

Why don't we just copy the whole damned function somewhere else, and
make an instrumented copy (as a kernel module)? Then reroute all the
function calls through it, instead of the original version. OK, it's
not completely trivial to do, but simpler than kprobes (probably
doing the switchover atomically is the hard part, but not impossible).
There's NO overhead when not using, and much lower than probes when
you are.

That way we can do whatever the hell we please with internal variables,
however GCC optimises it, can write flexible instrumenting code to just
about anything, program in C as God intended, etc, etc. No, it probably
won't fix every case under the sun, but hopefully most of them, and we
can still use kprobes/djprobes/bodilyprobes for the rest of the cases.

M.
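
A rough userspace sketch of the rerouting idea above, under heavy assumptions:
a single function pointer stands in for "all the function calls", whereas a
real implementation would have to patch call sites or the function entry in
kernel text. scan_pages(), scan_pages_instrumented() and the counter are
hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Counter maintained only by the instrumented copy. */
static unsigned long pages_scanned;

/* Original function: hypothetical stand-in for a simple non-inlined function. */
static int scan_pages(int nr)
{
	return nr * 2;
}

/* Instrumented copy, as it might be built into a module: same logic + counter. */
static int scan_pages_instrumented(int nr)
{
	assert(nr >= 0);
	pages_scanned += (unsigned long)nr;
	return nr * 2;
}

/* Every caller goes through this reference; it models the rerouted calls. */
static int (*scan_pages_ref)(int) = scan_pages;

void enable_instrumentation(void)
{
	scan_pages_ref = scan_pages_instrumented;
}

void disable_instrumentation(void)
{
	scan_pages_ref = scan_pages;
}

/* A caller that is unaware of which copy it reaches. */
int caller(int nr)
{
	return scan_pages_ref(nr);
}
```

The instrumented copy is ordinary C with full access to its own locals, which
is the flexibility argued for above; the sketch says nothing about how the
switchover would be made atomic in a running kernel.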
Frank Ch. Eigler
2006-09-19 15:50:08 UTC
Hi -
[...] Why don't we just copy the whole damned function somewhere
else, and make an instrumented copy (as a kernel module)? Then
reroute all the function calls through it [...]
Interesting idea. Are you imagining this instrumented copy being
built at kernel compile time (something like building a "-g -O0"
parallel)? Or compiled anew from original sources after deployment?
Or on-the-fly binary-level rewriting a la SPIN?
OK, it's not completely trivial to do, but simpler than kprobes [...]
None of the three above are that easy. Do you have an implementation
idea?


- FChE
Martin Bligh
2006-09-19 16:10:12 UTC
Post by Frank Ch. Eigler
Hi -
[...] Why don't we just copy the whole damned function somewhere
else, and make an instrumented copy (as a kernel module)? Then
reroute all the function calls through it [...]
Interesting idea. Are you imagining this instrumented copy being
built at kernel compile time (something like building a "-g -O0"
parallel)? Or compiled anew from original sources after deployment?
Or on-the-fly binary-level rewriting a la SPIN?
"compiled anew from original sources after deployment" seems the most
practical to do to me. From second hand info on using systemtap, you
seem to need the same compiler and source tree to work from anyway, so
this doesn't seem much of a burden.
Post by Frank Ch. Eigler
OK, it's not completely trivial to do, but simpler than kprobes [...]
None of the three above are that easy. Do you have an implementation
idea?
not in detail, but given the problems that the other probe technologies
solved, it seems easy in comparison. It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue", but perhaps I'm oversimplifying it ... if it's
all just straight calls, it'd seem easy. The freeze would be very short,
it's just poking a few addresses.

Having multiple hooks inside the same function pieced in at different
times, etc gets tricky, but you can always fall back on one of the other
methods if you get something complicated (or enforce some self-discipline
in userspace on how to compound them together).
Post by Frank Ch. Eigler
yeah, this would be nice - if it weren't for function pointers,
and if all kernel functions were relocatable. But if you can think of
a method to do this, it would be nice.
Well, it doesn't have to work for everything. But would be much nicer
for when it does work, it seems to me. Which functions are not
relocatable? Function pointers are indeed a problem, for the functions
they're used on, but they're not common. Some simple markup for these
types of functions would fix it easily enough, I'd think.

A more common problem would seem to me to be instrumenting an inlined
function that was pulled into multiple places, but even that doesn't
seem particularly difficult.

M.
Ingo Molnar
2006-09-19 16:50:10 UTC
On Tue, 19 Sep 2006 09:04:43 -0700
Post by Martin Bligh
It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue"
"overwrite first 5 bytes of old function with `jmp new_function'".
Yes, that's simple. but slower, as you have a double jump. Probably a
damned sight faster than int3 though.
modern CPUs will probably even optimize that intermediate jump away in
their BTB-ish caches. But in any case this would solve the function
pointer problem too.

Ingo
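
For reference, a small sketch of what "overwrite first 5 bytes of old function
with jmp new_function" has to compute on x86: the near jump is opcode 0xE9
followed by a 32-bit little-endian displacement measured from the end of the
5-byte instruction. The function name encode_jmp_rel32() is made up for this
sketch, and it only builds the bytes in a buffer; it does not patch live code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_OPCODE 0xE9
#define JMP_REL32_SIZE   5

/*
 * Encode "jmp target" as it would appear at address `site`.
 * Returns 0 on success, -1 if the target is out of rel32 range.
 */
int encode_jmp_rel32(uint8_t out[JMP_REL32_SIZE], uint64_t site, uint64_t target)
{
	/* Displacement is relative to the address right after the jump. */
	int64_t disp = (int64_t)(target - (site + JMP_REL32_SIZE));
	uint32_t rel;

	assert(out != NULL);
	if (disp < INT32_MIN || disp > INT32_MAX)
		return -1;

	rel = (uint32_t)(int32_t)disp;
	out[0] = JMP_REL32_OPCODE;
	out[1] = (uint8_t)(rel & 0xff);		/* little-endian rel32 */
	out[2] = (uint8_t)((rel >> 8) & 0xff);
	out[3] = (uint8_t)((rel >> 16) & 0xff);
	out[4] = (uint8_t)((rel >> 24) & 0xff);
	return 0;
}
```

For example, a jump placed at 0x1000 that targets 0x2000 gets displacement
0x2000 - 0x1005 = 0xFFB, i.e. the bytes E9 FB 0F 00 00.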
Richard J Moore
2006-09-19 16:50:14 UTC
Post by Martin Bligh
Post by Frank Ch. Eigler
Hi -
[...] Why don't we just copy the whole damned function somewhere
else, and make an instrumented copy (as a kernel module)? Then
reroute all the function calls through it [...]
Interesting idea. Are you imagining this instrumented copy being
built at kernel compile time (something like building a "-g -O0"
parallel)? Or compiled anew from original sources after deployment?
Or on-the-fly binary-level rewriting a la SPIN?
"compiled anew from original sources after deployment" seems the most
practical to do to me. From second hand info on using systemtap, you
seem to need the same compiler and source tree to work from anyway, so
this doesn't seem much of a burden.
If I'm not mistaken, this has been done before under the guise of dynamic
patch. Doesn't Solaris have the capability? I'm certain that some UNIXes do
as well as non-UNIX O/Ss.

Richard

Andrew Morton
2006-09-19 16:50:18 UTC
On Tue, 19 Sep 2006 09:04:43 -0700
Post by Martin Bligh
It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue"
"overwrite first 5 bytes of old function with `jmp new_function'".
Martin Bligh
2006-09-19 16:50:16 UTC
On Tue, 19 Sep 2006 09:04:43 -0700
Post by Martin Bligh
It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue"
"overwrite first 5 bytes of old function with `jmp new_function'".
Yes, that's simple. but slower, as you have a double jump. Probably
a damned sight faster than int3 though.

M.
S. P. Prasanna
2006-09-19 17:10:06 UTC
On Tue, 19 Sep 2006 09:04:43 -0700
Post by Martin Bligh
It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue"
"overwrite first 5 bytes of old function with `jmp new_function'".
Yes, that's simple. but slower, as you have a double jump. Probably
a damned sight faster than int3 though.
M.
The advantage of using int3 over jmp to launch the instrumented
module is that int3 (or breakpoint in most architectures) is an
atomic operation to insert.

I am getting some more ideas...

1. Copy the original functions, instrument them and insert them as
a part of kernel module with different name prefix.
2. Insert breakpoint only on those routines at runtime.
3. When the breakpoint gets hit, change the instruction pointer to
the instrumented routine. No need to single step at all.

Adv:
Can be enabled/disabled dynamically by inserting/removing
breakpoints. No overhead of single stepping.
No restriction of running the handler in interrupt context.
You can have pre-compiled instrumented routines.
This mechanism can be used for a pre-defined set of routines; for
arbitrary probe points, you can use kprobes/jprobes/systemtap.
No need to be super-user for predefined breakpoints.

Dis:
Maintenance of the code, since the code base needs to be
duplicated and instrumented.

The above idea is similar to runtime or dynamic patching, but here we
use int3(breakpoint) rather than jump instruction.

Please correct me if I am wrong.
Please let me know if you need more information.

Thanks
Prasanna
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: ***@in.ibm.com
Ph: 91-80-41776329
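
A simulated sketch of step 3 above (redirecting the instruction pointer from
the breakpoint handler instead of single-stepping). Everything here is an
assumption made for illustration: struct fake_regs stands in for the saved
register frame, the table and function names are hypothetical, and no real
trap is taken. The one architectural fact relied on is that on x86 the saved
instruction pointer after an int3 trap points one byte past the breakpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the saved register frame seen by a trap handler. */
struct fake_regs {
	uintptr_t ip;
};

/* One entry per routine that has a pre-built instrumented copy. */
struct redirect {
	uintptr_t orig;		/* address where the int3 was inserted */
	uintptr_t instrumented;	/* entry point of the instrumented copy */
};

#define MAX_REDIRECTS 8
static struct redirect redirect_table[MAX_REDIRECTS];
static size_t nr_redirects;

/* Register a routine; returns 0 on success, -1 if the table is full. */
int add_redirect(uintptr_t orig, uintptr_t instrumented)
{
	if (nr_redirects == MAX_REDIRECTS)
		return -1;
	redirect_table[nr_redirects].orig = orig;
	redirect_table[nr_redirects].instrumented = instrumented;
	nr_redirects++;
	return 0;
}

/*
 * Simulated breakpoint handler. Returns 1 and rewrites regs->ip when the
 * breakpoint belongs to a registered routine, 0 otherwise.
 */
int breakpoint_hit(struct fake_regs *regs)
{
	uintptr_t bp_addr;
	size_t i;

	assert(regs != NULL);
	bp_addr = regs->ip - 1;	/* saved ip is one byte past the int3 */
	for (i = 0; i < nr_redirects; i++) {
		if (redirect_table[i].orig == bp_addr) {
			regs->ip = redirect_table[i].instrumented;
			return 1;	/* resume in the instrumented copy */
		}
	}
	return 0;	/* not ours: leave it to other handlers */
}
```

Enabling or disabling a probe then amounts to inserting or removing the one-byte
breakpoint; the cost of taking the trap itself remains on every call.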
Martin Bligh
2006-09-19 17:20:09 UTC
Post by S. P. Prasanna
Post by Martin Bligh
It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue"
"overwrite first 5 bytes of old function with `jmp new_function'".
Yes, that's simple. but slower, as you have a double jump. Probably
a damned sight faster than int3 though.
The advantage of using int3 over jmp to launch the instrumented
module is that int3 (or breakpoint in most architectures) is an
atomic operation to insert.
Ah, good point. Though ... how much do we care what the speed of
insertion/removal actually is? If we can tolerate it being slow,
then just sync everyone up in an IPI to freeze them out whilst
doing the insert.
Post by S. P. Prasanna
I am getting some more ideas...
1. Copy the original functions, instrument them and insert them as
a part of kernel module with different name prefix.
2. Insert breakpoint only on those routines at runtime.
3. When the breakpoint gets hit, change the instruction pointer to
the instrumented routine. No need to single step at all.
Surely this still carries the overhead of doing the breakpoint,
which was part of what we were trying to get away from? I suppose
we get more flexibility this way. Or does the slowness not actually
come from the int3, but only the single-stepping?

How about we combine all three ideas together ...

1. Load modified copy of the function in question.
2. overwrite the first instruction of the routine with an int3 that
does what you say (atomically)
3. Then overwrite the second instruction with a jump that's faster
4. Now atomically overwrite the int3 with a nop, and let the jump
take over.
Post by S. P. Prasanna
Can be enabled/disabled dynamically by inserting/removing
breakpoints. No overhead of single stepping.
No restriction of running the handler in interrupt context.
You can have pre-compiled instrumented routines.
This mechanism can be used for a pre-defined set of routines; for
arbitrary probe points, you can use kprobes/jprobes/systemtap.
No need to be super-user for predefined breakpoints.
Maintenance of the code, since the code base needs to be
duplicated and instrumented.
CONFIG_FOO_BAR .... turn it on or off to turn on the instrumentation.
compiled out by default. Compiled in when making the tracing functions.
Post by S. P. Prasanna
The above idea is similar to runtime or dynamic patching, but here we
use int3(breakpoint) rather than jump instruction.
Depends what we're trying to fix. I was trying to fix two things:

1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).

2. Overhead of the int3, which was allegedly 1000 cycles or so, though
faster after Ingo had played with it, it's still significant.

M.
S. P. Prasanna
2006-09-19 17:40:22 UTC
Post by Martin Bligh
Post by S. P. Prasanna
Post by Martin Bligh
It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue"
"overwrite first 5 bytes of old function with `jmp new_function'".
Yes, that's simple. but slower, as you have a double jump. Probably
a damned sight faster than int3 though.
The advantage of using int3 over jmp to launch the instrumented
module is that int3 (or breakpoint in most architectures) is an
atomic operation to insert.
Ah, good point. Though ... how much do we care what the speed of
insertion/removal actually is? If we can tolerate it being slow,
then just sync everyone up in an IPI to freeze them out whilst
doing the insert.
I guess using an IPI occasionally would be acceptable. But I think
using an IPI for each probe will add lots of overhead.
Post by Martin Bligh
Surely this still carries the overhead of doing the breakpoint,
which was part of what we were trying to get away from? I suppose
we get more flexibility this way. Or does the slowness not actually
come from the int3, but only the single-stepping?
Yes, it comes from int3 as well.
Post by Martin Bligh
How about we combine all three ideas together ...
1. Load modified copy of the function in question.
2. overwrite the first instruction of the routine with an int3 that
does what you say (atomically)
3. Then overwrite the second instruction with a jump that's faster
4. Now atomically overwrite the int3 with a nop, and let the jump
take over.
That's a good solution.

Thanks
Prasanna
Post by Martin Bligh
Post by S. P. Prasanna
Can be enabled/disabled dynamically by inserting/removing
breakpoints. No overhead of single stepping.
No restriction of running the handler in interrupt context.
You can have pre-compiled instrumented routines.
This mechanism can be used for a pre-defined set of routines; for
arbitrary probe points, you can use kprobes/jprobes/systemtap.
No need to be super-user for predefined breakpoints.
Maintenance of the code, since the code base needs to be
duplicated and instrumented.
CONFIG_FOO_BAR .... turn it on or off to turn on the instrumentation.
compiled out by default. Compiled in when making the tracing functions.
Post by S. P. Prasanna
The above idea is similar to runtime or dynamic patching, but here we
use int3(breakpoint) rather than jump instruction.
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
2. Overhead of the int3, which was allegedly 1000 cycles or so, though
faster after Ingo had played with it, it's still significant.
M.
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: ***@in.ibm.com
Ph: 91-80-41776329
Martin Bligh
2006-09-19 18:10:19 UTC
Post by S. P. Prasanna
Post by Martin Bligh
Ah, good point. Though ... how much do we care what the speed of
insertion/removal actually is? If we can tolerate it being slow,
then just sync everyone up in an IPI to freeze them out whilst
doing the insert.
I guess using an IPI occasionally would be acceptable. But I think
using an IPI for each probe will add lots of overhead.
Depends how often you're inserting/removing probes, I guess.
Aren't these being done manually, in which case it really can't
be that many? Still doesn't fix the problem Mathieu just pointed
out though. Humpf.
Post by S. P. Prasanna
Post by Martin Bligh
How about we combine all three ideas together ...
1. Load modified copy of the function in question.
2. overwrite the first instruction of the routine with an int3 that
does what you say (atomically)
3. Then overwrite the second instruction with a jump that's faster
4. Now atomically overwrite the int3 with a nop, and let the jump
take over.
That's a good solution.
It's not exactly elegant or simple, but I guess it'd work if we have
to go to that extent. Seems like a lot of complexity though, I'd
rather get rid of the int3 trap if we can.

M.
Karim Yaghmour
2006-09-19 21:00:14 UTC
Post by Martin Bligh
be that many? Still doesn't fix the problem Mathieu just pointed
out though. Humpf.
There's one possibility if we're willing to insert a placeholder
at function entry that allows us to do essentially what Andrew
suggests without much impact. Specifically, if you need a 5-byte
operation to jump to the alternate instrumented function, you
can then do something like:
1- At build time insert 5-byte unconditional jump to instruction
right after placeholder.
2- At runtime for diverting flow:
- Replace first byte with int3 (atomically)
- Replace next 4 bytes with instrumented function destination
- Replace first byte
3- At runtime for returning flow:
- Do #2 but for the original placeholder jump.

There's no race condition here or fear of interrupt return in
the middle of anything, or any need to stop the kernel from
operating and the like, or even dependency on kprobes or need
for dprobes, at least as far as I can see -- so this should
be trivial on m68k ;). The price to pay is an additional
unconditional jump at all times, which should be optimized at
runtime by the CPU. Benchmarks could help show the real impact,
but as Ingo said, these things should be minimal.

In sum, this would work for function pointers and wouldn't
require having to walk the code in search of instances of
"call foo" to replace.

Just a thought.

Karim
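
The sequence in steps 1 to 3 above can be sketched on a plain byte buffer. This
is only a model under stated assumptions: the buffer stands in for the 5-byte
placeholder in kernel text, the function names are hypothetical, and the
cross-CPU serialization a real implementation would need between the steps is
left out.

```c
#include <assert.h>
#include <stdint.h>

#define OP_INT3      0xCC	/* one-byte breakpoint */
#define OP_JMP_REL32 0xE9	/* 5-byte near jump: opcode + rel32 */

/* Store a 32-bit displacement little-endian into p[0..3]. */
static void put_rel32(uint8_t *p, int32_t rel)
{
	uint32_t u = (uint32_t)rel;

	p[0] = (uint8_t)(u & 0xff);
	p[1] = (uint8_t)((u >> 8) & 0xff);
	p[2] = (uint8_t)((u >> 16) & 0xff);
	p[3] = (uint8_t)((u >> 24) & 0xff);
}

/* Step 1 (build time): "jmp +0", i.e. to the instruction right after. */
void init_placeholder(uint8_t site[5])
{
	site[0] = OP_JMP_REL32;
	put_rel32(site + 1, 0);
}

/* Step 2a: a single-byte store arms the breakpoint. */
void patch_arm_int3(uint8_t site[5])
{
	site[0] = OP_INT3;
}

/* Step 2b: the displacement may only change while int3 guards the site. */
void patch_set_target(uint8_t site[5], int32_t rel)
{
	assert(site[0] == OP_INT3);
	put_rel32(site + 1, rel);
}

/* Step 2c: a single-byte store puts the jump opcode back. */
void patch_restore_jmp(uint8_t site[5])
{
	site[0] = OP_JMP_REL32;
}

/* Divert flow to the instrumented copy at displacement `rel`. */
void divert(uint8_t site[5], int32_t rel)
{
	patch_arm_int3(site);
	patch_set_target(site, rel);
	patch_restore_jmp(site);
}

/* Step 3: return flow by running the same sequence with the "+0" jump. */
void undivert(uint8_t site[5])
{
	divert(site, 0);
}
```

At every point the first byte is either a complete jump opcode in front of a
consistent displacement or an int3, which is the property the scheme relies on.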

Masami Hiramatsu
2006-09-20 13:30:14 UTC
Hi Karim,
Post by Karim Yaghmour
Post by Martin Bligh
be that many? Still doesn't fix the problem Mathieu just pointed
out though. Humpf.
There's one possibility if we're willing to insert a placeholder
at function entry that allows to essentially do what Andrew
suggests without much impact. Specifically, if you need a 5-byte
operation to jump to the alternate instrumented function, you
This method is very similar to djprobe.
I had the same idea for supporting preemptive kernels.
Post by Karim Yaghmour
1- At build time insert 5-byte unconditional jump to instruction
right after placeholder.
This means the code below, doesn't it?
---
jmp 1f /* short jump consumes 2 bytes */
nop
nop
nop
1:
---
Post by Karim Yaghmour
- Replace first byte with int3 (atomically)
- Replace next 4 bytes with instrumented function destination
- Serialize all processor's cache by using IPI and cpuid.
Post by Karim Yaghmour
- Replace first byte
- Do #2 but for the original placeholder jump.
I think djprobe can provide most of the functionality that
your idea requires.
I'll update djprobe against 2.6.17 or later as soon as
possible. Would you try using it?

Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: ***@hitachi.com





Karim Yaghmour
2006-09-20 17:20:12 UTC
Post by Masami Hiramatsu
This method is very similar to the djprobe.
And I had gotten the same idea to support preemptive kernel.
...
Post by Masami Hiramatsu
This means the below code, doesn't this?
---
jmp 1f /* short jump consumes 2 bytes */
nop
nop
nop
Actually this is slightly different (and requires more support
on behalf of the underlying mechanism than what I was suggesting).
Basically, as was discussed elsewhere, there are some complex
mechanisms required for taking care of the case where you got
an interrupt at, say, the second or third nop. With the
mechanism I'm suggesting (replacing a 5 byte jmp with a 5 byte
jmp), the underlying mechanics do not require having to take
care of the above-mentioned case.
Post by Masami Hiramatsu
- Serialize all processor's cache by using IPI and cpuid.
Yes.
Post by Masami Hiramatsu
I think the djprobe can provide most of functionalities which
your idea requires.
I'll update djprobe for 2.6.17 or later as soon as
possible. Would you try to use it?
Basically I'm trying to come up with a mechanism that will be
relatively trivial to implement on any architecture. My
understanding is that the kprobes/djprobes combo does not necessarily
fit this description. Of course, that's not a justification for
not trying to get it to work, but my understanding is that
Martin's proposal, if it were implemented, would have a number
of advantages over just having kprobes/djprobes.

Though, in fact, djprobes can be used on the x86 (since it
already works on that) for doing exactly what I'm looking
for: replacing a 5 byte jmp with a 5 byte jmp. My understanding
is that djprobes doesn't need any special intelligence (even
on preemptable kernels) here since it shouldn't need to worry
about an IP pointing back anywhere inside a series of nops. IOW, we
should be able to do what Martin suggests fairly easily (if
we agree on a 5-byte "null" jump at the entry of functions
of interest). Right?

Karim
Mathieu Desnoyers
2006-09-20 17:30:09 UTC
Post by Karim Yaghmour
Post by Masami Hiramatsu
This method is very similar to the djprobe.
And I had gotten the same idea to support preemptive kernel.
...
Post by Masami Hiramatsu
This means the below code, doesn't this?
---
jmp 1f /* short jump consumes 2 bytes */
nop
nop
nop
Actually this is slightly different (and requires more support
on behalf of the underlying mechanism than what I was suggesting).
Basically, as was discussed elsewhere, there are some complex
mechanisms required for taking care of the case where you got
an interrupt at, say, the second or third nop. With the
mechanism I'm suggesting (replacing a 5 byte jmp with a 5 byte
jmp), the underlying mechanics do not require having to take
care of the above-mentioned case.
Karim, the jmp already there targets the end of the region: the three
following nops can never be executed. Clever :)

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Karim Yaghmour
2006-09-20 17:30:12 UTC
Post by Mathieu Desnoyers
Karim, the jmp already there targets the end of the region: the three
following nops can never be executed. Clever :)
Must get more coffee ...

Karim

Frank Ch. Eigler
2006-09-20 18:10:13 UTC
Hi -
Post by Karim Yaghmour
[...] IOW, we should be able to do what Martin suggests fairly
easily (if we agree on a 5-byte "null" jump at the entry of
functions of interest). Right? [...]
My interpretation of Martin's Monday proposal is that, if implemented,
we wouldn't need any of this nop/int3 stuff. If function being
instrumented were recompiled on-the-fly, then it could sport plain &
direct C-level calls to the instrumentation handlers.

- FChE
Karim Yaghmour
2006-09-20 18:20:13 UTC
Hello Frank,
Post by Frank Ch. Eigler
My interpretation of Martin's Monday proposal is that, if implemented,
we wouldn't need any of this nop/int3 stuff. If function being
instrumented were recompiled on-the-fly, then it could sport plain &
direct C-level calls to the instrumentation handlers.
Absolutely. I guess the length of these threads is just fertile
ground for misunderstandings. Basically what Hiramatsu-san and
myself were discussing was just the mechanism for selecting/
forking in between the uninstrumented function and the instrumented
one.

So, to recap:

If you had 100,000 instrumentation points in the scheduler (obviously
a totally bogus number here ...) you'd have 2 functions:
1- one with no instrumentation at all, but with a 5byte filler such
as the one presented by Hiramatsu-san.
2- one with the instrumentation.

Early in the proposal, the mechanics of switching in between "1" and "2"
seemed to be problematic, but I think with Hiramatsu-san's proposal
and, on the x86, djprobes, we've got it figured out.

Let me know if I'm not providing enough detail.

Thanks,

Karim

Martin Bligh
2006-09-20 18:30:17 UTC
Post by Frank Ch. Eigler
Hi -
[...] IOW, we should be able to do what Martin suggests fairly
easily (if we agree on a 5-byte "null" jump at the entry of
functions of interest). Right? [...]
My interpretation of Martin's Monday proposal is that, if implemented,
we wouldn't need any of this nop/int3 stuff. If function being
instrumented were recompiled on-the-fly, then it could sport plain &
direct C-level calls to the instrumentation handlers.
It's looking to me like it might still need djprobes to implement, in
order to get the atomic and safe switchover from the original function
into the traced one. All rather sad, but seems to be true from all the
CPU errata, etc. If anyone can see a way round that, I'd love to hear
it.

What it would give you above and beyond djprobes is an easier and more
flexible way to actually do the instrumentation itself.

M.
Karim Yaghmour
2006-09-20 18:50:15 UTC
Post by Martin Bligh
It's looking to me like it might still need djprobes to implement, in
order to get the atomic and safe switchover from the original function
into the traced one. All rather sad, but seems to be true from all the
CPU errata, etc. If anyone can see a way round that, I'd love to hear
it.
But we don't need to fight the errata, there are fortunately solutions
that take care of it where it does exist (x86: djprobes/kprobes.)
What's more interesting, though, is that the method as it is proposed
at this stage *seems* to be easily portable to other archs. And where
such binary trickery is difficult to pull off, nothing precludes
having a universally "portable" mechanism including something akin to
switching between instrumented vs. normal function at function entry.
Even such conditional ifs can be optimized by the CPU nowadays.

The picture is, nevertheless, very bright at the moment (I think).
Just have a 5byte filler at function entry such as Hiramatsu-san
suggested, and use djprobes to fork to instrumented function. The
unconditional jump in the filler will most likely be utterly
unmeasurable, and benchmarks should confirm this.

So:
On x86: use 5byte filler and djprobes.
On "sane" archs: use filler and override as explained earlier.
Elsewhere: use standard "if" or function pointer at function entry.
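For the "Elsewhere" case, a minimal userspace C sketch of switching between the uninstrumented and instrumented copies through a function pointer at function entry could look like this. All names (do_work, trace_hits, set_tracing) are hypothetical, and a kernel version would additionally need to think about how the pointer update becomes visible to other CPUs, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long trace_hits;    /* hypothetical stand-in for a trace buffer */

/* Copy 1: no instrumentation at all. */
static int do_work_plain(int x)
{
    return x * 2;
}

/* Copy 2: same logic, with a direct C-level instrumentation statement. */
static int do_work_traced(int x)
{
    trace_hits++;
    return x * 2;
}

/* The "portable" switch: one indirect call at function entry. */
static int (*do_work_impl)(int) = do_work_plain;

static int do_work(int x)
{
    return do_work_impl(x);
}

static void set_tracing(int on)
{
    do_work_impl = on ? do_work_traced : do_work_plain;
}
```

Callers always go through do_work(); enabling tracing only changes which copy the pointer selects, so the uninstrumented copy stays free of any instrumentation code.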
Post by Martin Bligh
What it would give you above and beyond djprobes is an easier and more
flexible way to actually do the instrumentation itself.
Absolutely agree.

Karim

Martin Bligh
2006-09-20 19:30:06 UTC
Post by Karim Yaghmour
Post by Martin Bligh
It's looking to me like it might still need djprobes to implement, in
order to get the atomic and safe switchover from the original function
into the traced one. All rather sad, but seems to be true from all the
CPU errata, etc. If anyone can see a way round that, I'd love to hear
it.
But we don't need to fight the errata, there are fortunately solutions
that take care of it where it does exist (x86: djprobes/kprobes.)
What's more interesting, though, is that the method as it is proposed
at this stage *seems* to be easily portable to other archs. And where
such binary trickery is difficult to pull off, nothing precludes
having a universally "portable" mechanism including something akin to
switching between instrumented vs. normal function at function entry.
Even such conditional ifs can be optimized by the CPU nowadays.
The picture is, nevertheless, very bright at the moment (I think).
Just have a 5byte filler at function entry such as Hiramatsu-san
suggested, and use djprobes to fork to instrumented function. The
unconditional jump in the filler will most likely be utterly
unmeasurable, and benchmarks should confirm this.
On x86: use 5byte filler and djprobes.
On "sane" archs: use filler and override as explained earlier.
Elsewhere: use standard "if" or function pointer at function entry.
Do we even need the filler padding? I thought we could insert kprobes
at the beginning of any function without that ... it was only a
requirement for mid-function (sometimes). If we copy the whole function,
we don't even need that any more ...

if kprobes can do it, I don't see why djprobes can't ... after all, it
just seems to use kprobes to insert a jump, AFAICS.

M.

Karim Yaghmour
2006-09-20 19:40:14 UTC
Post by Martin Bligh
Do we even need the filler padding? I thought we could insert kprobes
at the beginning of any function without that ... it was only a
requirement for mid-function (sometimes). If we copy the whole function,
we don't even need that any more ...
if kprobes can do it, I don't see why djprobes can't ... after all, it
just seems to use kprobes to insert a jump, AFAICS.
I guess I must not be explaining myself properly.

The padding is for one purpose and one purpose only: having
a known-to-be-good location at the beginning of the
uninstrumented function for later using djprobes on. Once
you've got that, then you can indeed copy the entire
function and do whatever you want *without* using djprobes
or kprobes, but using direct calls.

If you don't have the padding, then you might find yourself in
a case where you're replacing bytes from multiple instructions
where something somewhere may have an IP within the replaced
range. And to get around that you have to pull a few magic
tricks *and* make a few assumptions. But if you replace a
5-byte instruction (or the equivalent as in Hiramatsu-san's
proposal) with another 5-byte instruction, none of that is
needed and djprobes can be used *today* to do that.
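The difference between the two cases can be sketched as a simple check: given the offsets at which reachable instructions start, could a saved IP point strictly inside the range being overwritten? This is a userspace illustration only; ip_may_resume_inside() is a hypothetical name, and the model assumes a saved IP can only point at a reachable instruction start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * starts[] holds the offsets of reachable instruction starts in the original
 * code; the patch overwrites the byte range [0, patch_len).
 * Returns true if an instruction starts strictly inside that range, meaning
 * an interrupted or preempted context could resume in the middle of the
 * newly written instruction.
 */
static bool ip_may_resume_inside(const size_t *starts, size_t n, size_t patch_len)
{
    assert(starts != NULL);

    for (size_t i = 0; i < n; i++) {
        if (starts[i] > 0 && starts[i] < patch_len)
            return true;
    }
    return false;
}
```

With an ordinary prologue (instruction starts at offsets 0, 1, 3, 4) a 5-byte patch returns true, since a context could resume at offset 1, 3 or 4. With a single 5-byte jump as filler (starts at 0 and 5) it returns false. In Hiramatsu-san's short-jump variant the nops at offsets 2 to 4 are skipped by the jump, so they do not count as reachable starts in this model.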

Using this, you've got an arguably non-existent penalty
for the function with the filler and a very fast jump to
the instrumented function. The best of both worlds
actually.

Let me know if I'm still not being clear.

Karim

Karim Yaghmour
2006-09-20 19:50:08 UTC
Post by Martin Bligh
You mean using the jump-over thing that was posted earlier?
I thought the CPU errata prevented doing that atomically.
From my understanding of the last 24 hours of discussion,
it seemed like the ONLY thing we could do safely atomically was
insert an int3. Which sucks, frankly, but still.
No. djprobes already does safely insert other stuff than just
int3, that's the whole point.

Here are the relevant postings by Hiramatsu-san:
http://marc.theaimsgroup.com/?l=linux-kernel&m=115875912510827&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=115875867519302&w=2

Unless there's something *I* fundamentally misunderstood from
Hiramatsu-san's implementation and input, djprobes can replace
the 5-byte filler with a 5-byte unconditional jump. IOW your
mechanism works, no int3s involved.

Karim

Martin Bligh
2006-09-20 19:50:08 UTC
Post by Karim Yaghmour
Post by Martin Bligh
Do we even need the filler padding? I thought we could insert kprobes
at the beginning of any function without that ... it was only a
requirement for mid-function (sometimes). If we copy the whole function,
we don't even need that any more ...
if kprobes can do it, I don't see why djprobes can't ... after all, it
just seems to use kprobes to insert a jump, AFAICS.
I guess I must not be explaining myself properly.
The padding is for one purpose and one purpose only: having
a known-to-be-good location at the beginning of the
uninstrumented function for later using djprobes on. Once
you've got that, then you can indeed copy the entire
function and do whatever you want *without* using djprobes
or kprobes, but using direct calls.
If you don't have the padding, then you might find yourself in
a case where you're replacing bytes from multiple instructions
where something somewhere may have an IP within the replaced
range. And to get around that you have to pull a few magic
tricks *and* make a few assumptions. But if you replace a
5-byte instruction (or the equivalent as in Hiramatsu-san's
proposal) with another 5-byte instruction, none of that is
needed and djprobes can be used *today* to do that.
Using this, you've got an arguably non-existent penalty
for the function with the filler and a very fast jump to
the instrumented function. The best of both worlds
actually.
Let me know if I'm still not being clear.
You mean using the jump-over thing that was posted earlier?
I thought the CPU errata prevented doing that atomically.
From my understanding of the last 24 hours of discussion,
it seemed like the ONLY thing we could do safely atomically was
insert an int3. Which sucks, frankly, but still.

Or are we talking about locking everyone in an NMI? Having
proposed that, I now think it doesn't work ... we still return
from it when it's done, and might be in the middle of the
instruction stream we just crapped on.

So, maybe I missed a bit of the conversation, or didn't understand
it, but I was trying to follow it pretty closely. Even with the
padding, I don't see how overwriting it is atomic ... they could
be off processing an interrupt / NMI or whatever when you were
in the midst of it.

One thing Michael (cc'ed) pointed out was the possibility of using
"jump to self" as a small marker instruction: we put the function
into a busy wait at its start while we overwrite the next few,
then overwrite the jump-to-self with a nop to release it again.
But I'm unconvinced that gets around the CPU errata Alan was
pointing to.

M.
Karim Yaghmour
2006-09-20 17:40:18 UTC
Hello Hiramatsu-san,

So here's a more intelligent answer than last time :)
Post by Masami Hiramatsu
This method is very similar to the djprobe.
And I had gotten the same idea to support preemptive kernel.
...
Post by Masami Hiramatsu
This means the below code, doesn't this?
---
jmp 1f /* short jump consumes 2 bytes */
nop
nop
nop
---
YES, as pointed out by Mathieu, this does essentially the same.
And, yes, as mentioned earlier, this should work fine on
preemptable kernels.
Post by Masami Hiramatsu
I think the djprobe can provide most of functionalities which
your idea requires.
Indeed.

Thanks,

Karim

Mathieu Desnoyers
2006-09-19 18:00:16 UTC
Post by Martin Bligh
How about we combine all three ideas together ...
1. Load modified copy of the function in question.
2. overwrite the first instruction of the routine with an int3 that
does what you say (atomically)
3. Then overwrite the second instruction with a jump that's faster
4. Now atomically overwrite the int3 with a nop, and let the jump
take over.
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump. I think it would be doable to overwrite a 5+ byte instruction with a NOP
non-atomically in all cases, but as the instructions in the prologue seem to
be smaller:

prologue on x86
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
epilogue on x86
3: 5d pop %ebp
4: c3 ret

Then it can be a problem. Ideas are welcome.

Mathieu


OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Martin Bligh
2006-09-19 18:10:13 UTC
Post by Mathieu Desnoyers
Post by Martin Bligh
How about we combine all three ideas together ...
1. Load modified copy of the function in question.
2. overwrite the first instruction of the routine with an int3 that
does what you say (atomically)
3. Then overwrite the second instruction with a jump that's faster
4. Now atomically overwrite the int3 with a nop, and let the jump
take over.
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump. I think it would be doable to overwrite a 5+ byte instruction with a NOP
non-atomically in all cases, but as the instructions in the prologue seem to
be smaller:
prologue on x86
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
epilogue on x86
3: 5d pop %ebp
4: c3 ret
Then it can be a problem. Ideas are welcome.
Ugh, yes that's somewhat problematic. It does seem rather unlikely that
there's a function call in the function prologue when we're busy
offloading stuff onto the stack, but still ...

For the cases where we're prepared to overwrite the call instruction in
the caller, rather than insert an extra jump in the callee, can we not
do that atomically by overwriting the address we're jumping to (the
call is obviously there already)? Doesn't fix function pointers, etc,
but might work well for the simple case at least.

M.
Mathieu Desnoyers
2006-09-19 18:20:10 UTC
Post by Martin Bligh
Post by Mathieu Desnoyers
jump. I think it would be doable to overwrite a 5+ byte instruction with a NOP
non-atomically in all cases, but as the instructions in the prologue seem to
be smaller:
prologue on x86
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
epilogue on x86
3: 5d pop %ebp
4: c3 ret
Then it can be a problem. Ideas are welcome.
Ugh, yes that's somewhat problematic. It does seem rather unlikely that
there's a function call in the function prologue when we're busy
offloading stuff onto the stack, but still ...
A function call is not the cause of the problem : an interrupt/trap is.
Post by Martin Bligh
For the cases where we're prepared to overwrite the call instruction in
the caller, rather than insert an extra jump in the callee, can we not
do that atomically by overwriting the address we're jumping to (the
call is obviously there already)? Doesn't fix function pointers, etc,
but might work well for the simple case at least.
I don't think we have any guarantee that the function pointer in the call is
aligned, so I guess it would not be an atomic replacement.
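To make the alignment concern concrete, here is a small userspace C sketch. It assumes the call is the 5-byte i386 form (opcode E8 followed by a 4-byte relative displacement, so the operand starts at call address + 1) and treats natural alignment as a rough proxy for whether a single 4-byte store could update it in one piece. The function names and the cache-line parameter are hypothetical, and nothing here addresses the cross-modifying-code errata.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The 4-byte displacement of "call rel32" sits at call_addr + 1. */
static bool call_operand_is_aligned(uintptr_t call_addr)
{
    return ((call_addr + 1) % 4) == 0;
}

/* Does the 4-byte operand straddle a boundary of size line_size (e.g. 64)? */
static bool call_operand_crosses_line(uintptr_t call_addr, uintptr_t line_size)
{
    uintptr_t first = call_addr + 1;    /* first operand byte */
    uintptr_t last  = call_addr + 4;    /* last operand byte */

    assert(line_size != 0);
    return (first / line_size) != (last / line_size);
}
```

Since the compiler places call instructions wherever they fall in the instruction stream, only about one call site in four would pass the first check by chance, which is why the replacement cannot be assumed atomic in general.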

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Alan Cox
2006-09-19 23:50:35 UTC
Post by Mathieu Desnoyers
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump.
No: on x86 it is the *same* case for all of these even writing an int3.
One byte or a megabyte,

You MUST ensure that every CPU executes a serializing instruction before
it hits code that was modified by another processor. Otherwise you get
CPU errata and the CPU produces results which vendors like to describe
as "undefined".

Thus you have to serialize, and if you are serializing it really doesn't
matter if you write a byte, a paragraph or a page.

Alan

Karim Yaghmour
2006-09-20 00:40:06 UTC
Post by Alan Cox
Post by Mathieu Desnoyers
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump.
No: on x86 it is the *same* case for all of these even writing an int3.
One byte or a megabyte,
You MUST ensure that every CPU executes a serializing instruction before
it hits code that was modified by another processor. Otherwise you get
CPU errata and the CPU produces results which vendors like to describe
as "undefined".
I was aware that this erratum existed, but never knew its
specifics. Are these two separate problems or just
one?
a) the errata & a possible thread having an IP leading back within (not
at the start of) the range to be replaced.
b) the errata & replacing single instruction with single instruction of
same size.

In a), there's almost an intractable problem of making sure no IP leads
back within the range to be replaced. In b) we still have to take care
of the errata part, but no worry about the stalled thread with invalid
IP.
Post by Alan Cox
Thus you have to serialize, and if you are serializing it really doesn't
matter if you write a byte, a paragraph or a page.
I was vaguely aware of the issue on x86. Do you know whether the same
applies on other architectures?

Also, this is SMP-only, right? (Not that UP matters for desktop
anymore, but just checking.)

Any pointers to the errata?

Karim
--
President / Opersys Inc.
Embedded Linux Training and Expertise
www.opersys.com / 1.866.677.4546
Alan Cox
2006-09-20 10:30:17 UTC
Post by Karim Yaghmour
a) the errata & a possible thread having an IP leading back within (not
at the start of) the range to be replaced.
b) the errata & replacing single instruction with single instruction of
same size.
Intel doesn't distinguish. Richard's reply later in the thread answers a
lot more, including what Intel's architecture team said about int3 being a
specific safe case for some reason.
Post by Karim Yaghmour
I was vaguely aware of the issue on x86. Do you know whether the same
applies on other architectures?
I wouldn't know.
Post by Karim Yaghmour
Also, this is SMP-only, right? (Not that single UP matters for desktop
anymore, but just checking.)
There are some uniprocessor errata but I cannot see how you could patch
code, somehow take an interrupt (or return from one) without executing a
serializing instruction, so I likewise think it's SMP-only.
Post by Karim Yaghmour
Any pointers to the errata?
developer.intel.com 'specification update' documents (which are always
good reading).
Richard J Moore
2006-09-20 23:10:08 UTC
Post by Alan Cox
Post by Karim Yaghmour
a) the errata & a possible thread having an IP leading back within (not
at the start of) the range to be replaced.
b) the errata & replacing single instruction with single instruction of
same size.
Intel doesn't distinguish. Richard's reply later in the thread answers a
lot more, including what Intel's architecture team said about int3 being a
specific safe case for some reason.
Post by Karim Yaghmour
I was vaguely aware of the issue on x86. Do you know whether the same
applies on other architectures?
I wouldn't know.
It can for another reason - score-boarding: that's where a byte being
stored assumes intermediate values due to the bits not being set
simultaneously. Generally this doesn't cause a problem because data across
processors is serialised for update by mutexes. However, when applied to
code all sorts of interesting instructions can execute before the bits
settle down. I haven't heard of this troubling Intel, but it does occur on
some current architectures.

Richard

Hugh Dickins
2006-09-23 15:40:08 UTC
Post by Richard J Moore
It can for another reason - score-boarding: that's where a byte being
stored assumes intermediate values due to the bits not being set
simultaneously. Generally this doesn't cause a problem because data across
processors is serialised for update by mutexes. However, when applied to
code all sorts of interesting instructions can execute before the bits
settle down. I haven't heard of this troubling Intel, but it does occur on
some current architectures.
I'd not heard of this phenomenon, and it worries me. There are places
in kernel code where we peek at some volatile variable (perhaps a long)
without locking, and expect to see it in any one of several well-defined
states. Are you saying that there are architectures supported by Linux,
on which we might see an "impossible" mix of states, due to score-boarding?

Hugh
Richard J Moore
2006-09-26 08:50:12 UTC
Post by Hugh Dickins
Post by Richard J Moore
It can for another reason - score-boarding: that's where a byte being
stored assumes intermediate values due to the bits not being set
simultaneously. Generally this doesn't cause a problem because data across
processors is serialised for update by mutexes. However, when applied to
code all sorts of interesting instructions can execute before the bits
settle down. I haven't heard of this troubling Intel, but it does occur on
some current architectures.
I'd not heard of this phenomenon, and it worries me. There are places
in kernel code where we peek at some volatile variable (perhaps a long)
without locking, and expect to see it in any one of several well-defined
states. Are you saying that there are architectures supported by Linux,
on which we might see an "impossible" mix of states, due to score-boarding?
Hugh
These things tend not to be discussed in specific detail in the processor
reference manuals. If there are exposures they are generally covered by
blanket statements about the need to ensure correct serialization between
processors when reading from, and writing to, the same location. As far as
I am aware Linux is protected from such effects because we do use locks, or
serializing instructions, to protect the updating of variables that are
accessed by multiple processors. My guess is that the exposure to
score-boarding, if it exists at all, tends to be limited to concurrent
bitwise operations.

Richard


S. P. Prasanna
2006-09-20 01:10:10 UTC
Hi Alan,
Post by Alan Cox
Post by Mathieu Desnoyers
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump.
No: on x86 it is the *same* case for all of these even writing an int3.
One byte or a megabyte,
You MUST ensure that every CPU executes a serializing instruction before
it hits code that was modified by another processor. Otherwise you get
CPU errata and the CPU produces results which vendors like to describe
as "undefined".
Are you referring to the Intel erratum "unsynchronized cross-modifying code",
which covers the practice of modifying code on one processor
where another has prefetched the unmodified version of the code?

Thanks
Prasanna
Post by Alan Cox
Thus you have to serialize, and if you are serializing it really doesn't
matter if you write a byte, a paragraph or a page.
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: ***@in.ibm.com
Ph: 91-80-41776329
Richard J Moore
2006-09-20 08:20:08 UTC
Post by S. P. Prasanna
Hi Alan,
Post by Alan Cox
Post by Mathieu Desnoyers
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump.
No: on x86 it is the *same* case for all of these even writing an int3.
One byte or a megabyte,
You MUST ensure that every CPU executes a serializing instruction before
it hits code that was modified by another processor. Otherwise you get
CPU errata and the CPU produces results which vendors like to describe
as "undefined".
Are you referring to Intel erratum "unsynchronized cross-modifying code"
- where it refers to the practice of modifying code on one processor
where another has prefetched the unmodified version of the code.
Thanks
Prasanna
In the special case of replacing an opcode with int3 that erratum doesn't
apply. I know that's not in the manuals but it has been confirmed by the
Intel microarchitecture group. And it's not reasonable for it to be any
other way.



- -
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072

Alan Cox
2006-09-20 10:20:10 UTC
Permalink
Post by Richard J Moore
Post by S. P. Prasanna
Are you referring to Intel erratum "unsynchronized cross-modifying code"
- where it refers to the practice of modifying code on one processor
where another has prefetched the unmodified version of the code.
In the special case of replacing an opcode with int3 that erratum doesn't
apply. I know that's not in the manuals but it has been confirmed by the
Intel microarchitecture group. And it's not reasonable for it to be any
other way.
OK, that's cool to know and I wish they'd documented it. Is the same true
for AMD?

Alan

Andi Kleen
2006-09-20 12:00:13 UTC
Permalink
Post by Alan Cox
Post by Richard J Moore
Post by S. P. Prasanna
Are you referring to Intel erratum "unsynchronized cross-modifying code"
- where it refers to the practice of modifying code on one processor
where another has prefetched the unmodified version of the code.
In the special case of replacing an opcode with int3 that erratum doesn't
apply. I know that's not in the manuals but it has been confirmed by the
Intel microarchitecture group. And it's not reasonable for it to be any
other way.
OK, that's cool to know and I wish they'd documented it. Is the same true
for AMD?
It pretty much has to, otherwise lots of debuggers would be unhappy

-Andi
Richard J Moore
2006-09-20 13:50:11 UTC
Permalink
Post by Alan Cox
Post by Richard J Moore
Post by S. P. Prasanna
Are you referring to Intel erratum "unsynchronized cross-modifying code"
- where it refers to the practice of modifying code on one processor
where another has prefetched the unmodified version of the code.
In the special case of replacing an opcode with int3 that erratum doesn't
apply. I know that's not in the manuals but it has been confirmed by the
Intel microarchitecture group. And it's not reasonable for it to be any
other way.
OK, that's cool to know and I wish they'd documented it. Is the same true
for AMD?
Alan
Not sure probably - I can ask.

Intel explained it to me thus:

When the i-fetch has been done and the micro-ops are in the trace cache
then there's no longer a direct correlation between the original machine
instruction boundaries and the micro ops. This is due to optimization. For
example (artificial one for illustrative purposes):

mov eax,ebx
mov memory,eax
mov eax,1

(using intel notation not ATT - force of habit)

In the trace cache there would be no micro ops to update eax with ebx.

Altering the "mov eax,ebx" to "mov ecx,ebx" on the fly invalidates the
optimized trace cache, hence the only recourse is a GPF.
If the modification doesn't invalidate the trace cache then no GPF. The
question is: "can we predict the circumstances when the trace cache has not
been invalidated", and the answer in general is no since the
microarchitecture is not public. But one can guess that modifying the single
byte opcode with an interrupting instruction - int3 - doesn't cause an
inconsistency that can't be handled. And that's what Intel confirmed. Go
ahead and store int3 without the need to synchronise (i.e. force the trace
cache to be flushed).

My guess is that AMD behaves exactly the same way. But I'll check.


Richard

Pavel Machek
2006-09-22 19:00:17 UTC
Permalink
Hi!
Post by Ingo Molnar
Post by Mathieu Desnoyers
Post by Alan Cox
Post by Mathieu Desnoyers
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump.
No: on x86 it is the *same* case for all of these even writing an int3.
One byte or a megabyte,
You MUST ensure that every CPU executes a serializing instruction before
it hits code that was modified by another processor. Otherwise you get
CPU errata and the CPU produces results which vendors like to describe
as "undefined".
Are you referring to Intel erratum "unsynchronized cross-modifying code"
- where it refers to the practice of modifying code on one processor
where another has prefetched the unmodified version of the code.
In the special case of replacing an opcode with int3 that erratum doesn't
apply. I know that's not in the manuals but it has been confirmed by the
Intel microarchitecture group. And it's not reasonable for it to be any
other way.
What about replacing int3 with old instruction (i.e. marker being
deleted)?
Pavel
--
Thanks for all the (sleeping) penguins.
Mathieu Desnoyers
2006-09-20 01:20:06 UTC
Permalink
Post by Alan Cox
Post by Mathieu Desnoyers
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump.
No: on x86 it is the *same* case for all of these even writing an int3.
One byte or a megabyte,
You MUST ensure that every CPU executes a serializing instruction before
it hits code that was modified by another processor. Otherwise you get
CPU errata and the CPU produces results which vendors like to describe
as "undefined".
Thus you have to serialize, and if you are serializing it really doesn't
matter if you write a byte, a paragraph or a page.
Hi Alan,

What I am trying to address is not "code patching with INT3", but "code patching
with a 5 bytes JMP". The errata you point to applies to both and kprobes
mechanism already takes care of this with the serialization method you describe.

However, there is a supplemental problem with the fact that a JMP is 5 bytes,
not 1. You are right about saying that overwriting code with any amount of
*int3* does not matter, but what happens when you put one or more 5 bytes long
jumps instead ?

Think about it: if you are replacing instructions that are 1, 2, 3 or 4 bytes
long and, unluckily, a thread preempted on any CPU has a saved instruction
pointer on its stack pointing into the middle of the region where you want
to put the 5-byte JMP, the processor will likely trigger an illegal
instruction fault when that thread is scheduled back.
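
[Illustration] The hazard described above can be expressed as a simple range
check. This is only a sketch under stated assumptions: the helper names are
hypothetical, the saved instruction pointers are passed in as a plain array,
and nothing here walks real kernel stacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_LEN 5 /* x86 near jump: opcode 0xE9 plus 32-bit displacement */

/*
 * True when a saved instruction pointer lies strictly inside the bytes that
 * a 5-byte jump written at patch_start would cover. An IP equal to
 * patch_start is harmless (the thread resumes on the new jmp); an IP at
 * patch_start+1 .. patch_start+4 would resume in the middle of the jmp.
 */
static bool ip_in_middle_of_jmp(uintptr_t saved_ip, uintptr_t patch_start)
{
    return saved_ip > patch_start && saved_ip < patch_start + JMP_REL32_LEN;
}

/* Hypothetical helper: scan one saved IP per preempted thread. */
static bool patch_is_unsafe(const uintptr_t *saved_ips, size_t n,
                            uintptr_t patch_start)
{
    assert(saved_ips != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (ip_in_middle_of_jmp(saved_ips[i], patch_start))
            return true;
    return false;
}
```

With a 1-byte int3 the "middle" range is empty, which is why the problem only
appears once the patch is longer than the shortest instruction it replaces.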

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Vara Prasad
2006-09-19 19:20:13 UTC
Permalink
Post by Martin Bligh
[...]
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
Using tools like systemtap one can consult DWARF information and put
probes in the middle of the function and access local variables as well,
that is not the real problem. The issue here is compiler doesn't seem to
generate required DWARF information in some cases due to optimizations.
The other related problem is when there exists debug information, the
way to specify the breakpoint location is using line number which is not
maintainable, having a marker solves this problem as well. Your proposal
still doesn't solve the need for markers if i understood correctly.
Post by Martin Bligh
2. Overhead of the int3, which was allegedly 1000 cycles or so, though
faster after Ingo had played with it, it's still significant.
The reason Kprobes use breakpoint instruction as pointed out by Prasanna
is, it is atomic on most platforms. We are already working on an
improved idea using jump instruction with which overhead is less than
100 cycles on modern CPU's but it has some limitations and issues
related to preemption and SMP.

You can get a glimpse of some of the issues here
http://sourceware.org/ml/systemtap/2006-q3/msg00507.html
http://sourceware.org/ml/systemtap/2005-q4/msg00117.html
For more details do a search for djprobe in the systemtap mailing list
(sorry i am not able to find few threads to summarize all the issues).

Here is the algorithm djprobes uses to

        IA
        |
[-2][-1][0][1][2][3][4][5][6][7]
        [ins1][ins2][  ins3 ]
        [<-     DCR       ->]
           [<- JTPR ->]

ins1: 1st Instruction
ins2: 2nd Instruction
ins3: 3rd Instruction
IA: Insertion Address
JTPR: Jump Target Prohibition Region
DCR: Detoured Code Region


The replacement procedure of djprobes is the following (I have simplified the actual steps djprobes uses for readability):

(1) copy the instruction(s) in the DCR
(2) put a breakpoint instruction at IA
(3) make sure no CPU has the instructions being replaced in its cache, to avoid a jump into the middle of the jmp instruction
(4) replace the original instruction(s) with the jump instruction


As you can see from the above your suggestion is very similar to the
djprobes hence i believe all the issues related to djprobes will be
valid for yours as well.
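
[Illustration] A minimal sketch of how the two regions in the figure could be
computed, assuming the instruction lengths at IA are already known (for
example 2, 2 and 3 bytes for ins1, ins2 and ins3). The function names are
hypothetical and this is not the djprobes code itself.

```c
#include <assert.h>
#include <stddef.h>

#define JMP_REL32_LEN 5 /* size of the jump written at IA */

/*
 * DCR length: whole instructions starting at IA, accumulated until they
 * cover at least the 5 bytes the jump will overwrite. Returns 0 when the
 * supplied instructions do not cover 5 bytes.
 */
static size_t dcr_length(const unsigned char *insn_len, size_t n_insns)
{
    size_t covered = 0;

    assert(insn_len != NULL || n_insns == 0);
    for (size_t i = 0; i < n_insns; i++) {
        assert(insn_len[i] > 0);
        covered += insn_len[i];
        if (covered >= JMP_REL32_LEN)
            return covered;
    }
    return 0;
}

/*
 * JTPR: offsets 1..4 relative to IA. Nothing may jump to these bytes,
 * because after patching they hold the jump's displacement.
 */
static int offset_in_jtpr(size_t offset_from_ia)
{
    return offset_from_ia >= 1 && offset_from_ia < JMP_REL32_LEN;
}
```

For lengths 2, 2 and 3 the DCR is 7 bytes: the third instruction straddles
the end of the jump, so it has to be copied out whole.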
Post by Martin Bligh
M.
Martin Bligh
2006-09-19 19:30:17 UTC
Permalink
Post by Vara Prasad
Post by Martin Bligh
[...]
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
Using tools like systemtap one can consult DWARF information and put
probes in the middle of the function and access local variables as well,
that is not the real problem. The issue here is compiler doesn't seem to
generate required DWARF information in some cases due to optimizations.
It seems difficult to separate those two from each other. If the
subsystem you're relying on doesn't work, then ....
Post by Vara Prasad
The other related problem is when there exists debug information, the
way to specify the breakpoint location is using line number which is not
maintainable, having a marker solves this problem as well. Your proposal
still doesn't solve the need for markers if i understood correctly.
It could, but I think we're better off with the markers, yes.
Post by Vara Prasad
Post by Martin Bligh
2. Overhead of the int3, which was allegedly 1000 cycles or so, though
faster after Ingo had played with it, it's still significant.
The reason Kprobes use breakpoint instruction as pointed out by Prasanna
is, it is atomic on most platforms. We are already working on an
improved idea using jump instruction with which overhead is less than
100 cycles on modern CPU's but it has some limitations and issues
related to preemption and SMP.
You can get a glimpse of some of the issues here
http://sourceware.org/ml/systemtap/2006-q3/msg00507.html
http://sourceware.org/ml/systemtap/2005-q4/msg00117.html
For more details do a search for djprobe in the systemtap mailing list
(sorry i am not able to find few threads to summarize all the issues).
"This djprobe is NOT a replacement of kprobes. Djprobe and kprobes
have complementary qualities. (ex: djprobe's overhead is low, and
kprobes can be inserted in anywhere.)". Hmm. that seems problematic.

From what I was describing for function replacement, we could do an NMI
IPI to everyone, and lock them in there whilst we insert the probe, but
it's a bit sucky.
Post by Vara Prasad
Here is the algorithm djprobes uses to
        IA
        |
[-2][-1][0][1][2][3][4][5][6][7]
        [ins1][ins2][  ins3 ]
        [<-     DCR       ->]
           [<- JTPR ->]
ins1: 1st Instruction
ins2: 2nd Instruction
ins3: 3rd Instruction
IA: Insertion Address
JTPR: Jump Target Prohibition Region
DCR: Detoured Code Region
The replacement procedure of djprobes is the following (I have
simplified for readability the actual steps djprobes uses)
(1) copying instruction(s) in DCR
(2) putting break point instruction at IA
(3) make sure no cpu's have replacing instructions in the cache to avoid
jump to the middle of jmp instruction
(4) replacing original instruction(s) with jump instruction
As you can see from the above your suggestion is very similar to the
djprobes hence i believe all the issues related to djprobes will be
valid for yours as well.
The hooking seems very similar, yes, perhaps I can be lazy and just
steal djprobes for this. The difference is that if we just replace the
whole function, we can just shove arbitrary changes into functions, and
do whatever we please. Plus we don't have to worry about locating
internal variables, etc.

M.
S. P. Prasanna
2006-09-19 20:10:58 UTC
Permalink
Post by Martin Bligh
Post by Vara Prasad
Post by Martin Bligh
[...]
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
Using tools like systemtap one can consult DWARF information and put
probes in the middle of the function and access local variables as well,
that is not the real problem. The issue here is compiler doesn't seem to
generate required DWARF information in some cases due to optimizations.
It seems difficult to separate those two from each other. If the
subsystem you're relying on doesn't work, then ....
Post by Vara Prasad
The other related problem is when there exists debug information, the
way to specify the breakpoint location is using line number which is not
maintainable, having a marker solves this problem as well. Your proposal
still doesn't solve the need for markers if i understood correctly.
It could, but I think we're better off with the markers, yes.
Post by Vara Prasad
Post by Martin Bligh
2. Overhead of the int3, which was allegedly 1000 cycles or so, though
faster after Ingo had played with it, it's still significant.
The reason Kprobes use breakpoint instruction as pointed out by Prasanna
is, it is atomic on most platforms. We are already working on an
improved idea using jump instruction with which overhead is less than
100 cycles on modern CPU's but it has some limitations and issues
related to preemption and SMP.
You can get a glimpse of some of the issues here
http://sourceware.org/ml/systemtap/2006-q3/msg00507.html
http://sourceware.org/ml/systemtap/2005-q4/msg00117.html
For more details do a search for djprobe in the systemtap mailing list
(sorry i am not able to find few threads to summarize all the issues).
"This djprobe is NOT a replacement of kprobes. Djprobe and kprobes
have complementary qualities. (ex: djprobe's overhead is low, and
kprobes can be inserted in anywhere.)". Hmm. that seems problematic.
From what I was describing for function replacement, we could do an NMI
IPI to everyone, and lock them in there whilst we insert the probe, but
it's a bit sucky.
We can do batch processing here: send one IPI to everyone
and then insert a bunch of jump instructions. This will reduce the number
of IPIs required.
Post by Martin Bligh
Post by Vara Prasad
Here is the algorithm djprobes uses to
        IA
        |
[-2][-1][0][1][2][3][4][5][6][7]
        [ins1][ins2][  ins3 ]
        [<-     DCR       ->]
           [<- JTPR ->]
ins1: 1st Instruction
ins2: 2nd Instruction
ins3: 3rd Instruction
IA: Insertion Address
JTPR: Jump Target Prohibition Region
DCR: Detoured Code Region
The replacement procedure of djprobes is the following (I have
simplified for readability the actual steps djprobes uses)
(1) copying instruction(s) in DCR
(2) putting break point instruction at IA
(3) make sure no cpu's have replacing instructions in the cache to avoid
jump to the middle of jmp instruction
(4) replacing original instruction(s) with jump instruction
As you can see from the above your suggestion is very similar to the
djprobes hence i believe all the issues related to djprobes will be
valid for yours as well.
The hooking seems very similar, yes, perhaps I can be lazy and just
steal djprobes for this. The difference is that if we just replace the
whole function, we can just shove arbitrary changes into functions, and
do whatever we please. Plus we don't have to worry about locating
internal variables, etc.
Here is a somewhat more complicated method.
How about inserting one breakpoint per byte of the instruction being
replaced, then waiting until all the threads have been scheduled at least
once (so that any thread whose IP is in the middle of the instructions we
want to replace would hit a breakpoint), and only then replacing them with
the jump instruction?

Thanks
Prasanna
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: ***@in.ibm.com
Ph: 91-80-41776329
Mathieu Desnoyers
2006-09-19 20:20:12 UTC
Permalink
Post by S. P. Prasanna
Some more complicated method.
How about inserting a (instruction size) number of breakpoints and
wait until all the threads get scheduled at least once (so that
threads would hit the breakpoint, if their IPs are in the middle of
instruction we want to replace with jump) and then replace with
jump instruction.
What happens if a thread is stopped?

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Masami Hiramatsu
2006-09-20 11:10:07 UTC
Permalink
Hi,
Post by S. P. Prasanna
Some more complicated method.
How about inserting a (instruction size) number of breakpoints and
wait until all the threads get scheduled at least once (so that
threads would hit the breakpoint, if their IPs are in the middle of
instruction we want to replace with jump) and then replace with
jump instruction.
I think there is no need to insert so many breakpoints.
Instead, you merely wait until all the threads that were running on
each processor at that time have been scheduled, if the kernel
is *NOT* preemptive.

If the kernel is preemptive, some threads might sleep on the target
address. In this case, we can use freeze_processes() to ensure safety.
This idea was proposed by Ingo.

Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: ***@hitachi.com

Mathieu Desnoyers
2006-09-19 19:30:21 UTC
Permalink
Post by Vara Prasad
Post by Martin Bligh
[...]
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
Using tools like systemtap one can consult DWARF information and put
probes in the middle of the function and access local variables as well,
that is not the real problem. The issue here is compiler doesn't seem to
generate required DWARF information in some cases due to optimizations.
The other related problem is when there exists debug information, the
way to specify the breakpoint location is using line number which is not
maintainable, having a marker solves this problem as well. Your proposal
still doesn't solve the need for markers if i understood correctly.
His implementation makes heavy use of a marker mechanism: this is exactly
what makes it possible to create the instrumented objects from the same
source code, but with different #defines.

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Martin Bligh
2006-09-19 19:30:21 UTC
Permalink
Post by Mathieu Desnoyers
Post by Vara Prasad
Post by Martin Bligh
[...]
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
Using tools like systemtap one can consult DWARF information and put
probes in the middle of the function and access local variables as well,
that is not the real problem. The issue here is compiler doesn't seem to
generate required DWARF information in some cases due to optimizations.
The other related problem is when there exists debug information, the
way to specify the breakpoint location is using line number which is not
maintainable, having a marker solves this problem as well. Your proposal
still doesn't solve the need for markers if i understood correctly.
His implementation makes heavy use of a marker mechanism: this is exactly
what makes it possible to create the instrumented objects from the same
source code, but with different #defines.
I don't think it ties us to markers, though I think they're superior for
maintenance, personally. It could equally well be an out of tree
normal flat patch with all the tracing in, which would make Andrew
happy, even if I think it sucks ;-)

M.
Satoshi Oshima
2006-09-19 22:40:13 UTC
Permalink
Post by Mathieu Desnoyers
Post by Vara Prasad
Post by Martin Bligh
[...]
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
Using tools like systemtap one can consult DWARF information and put
probes in the middle of the function and access local variables as well,
that is not the real problem. The issue here is compiler doesn't seem to
generate required DWARF information in some cases due to optimizations.
The other related problem is when there exists debug information, the
way to specify the breakpoint location is using line number which is not
maintainable, having a marker solves this problem as well. Your proposal
still doesn't solve the need for markers if i understood correctly.
His implementation makes heavy use of a marker mechanism: this is exactly
what makes it possible to create the instrumented objects from the same
source code, but with different #defines.
Djprobes don't depend on markers. Actually, markers help to find the
safe place to probe, but they are not necessary. At least, instructions
that are more than 4 bytes long are probeable.

As Vara pointed out, we are developing the tools that find the
safe place for djprobes.


Satoshi OSHIMA

Helge Hafting
2006-09-20 09:50:11 UTC
Permalink
Post by S. P. Prasanna
Yes, that's simple. but slower, as you have a double jump. Probably
a damned sight faster than int3 though.
M.
The advantage of using int3 over jmp to launch the instrumented
module is that int3 (or breakpoint in most architectures) is an
atomic operation to insert.
Yes, 5 bytes is not an atomic write except on 64-bit. So a race is possible.

How about this workaround:
1. Overwrite the start of the function with a hlt, which is atomic.
2. Write that 5-byte jump after the hlt.
3. Overwrite the hlt with nop so things will work
4. interrupt any cpus that got stuck on the hlt - or just wait for the
timer.

Helge Hafting

Alan Cox
2006-09-20 10:10:11 UTC
Permalink
Post by Helge Hafting
Yes, 5 bytes is not an atomic write except on 64-bit. So a race is possible.
Untrue as well. Pentium and later have CMPXCHG8B.
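
[Illustration] A user-space sketch of the idea, assuming a little-endian x86
host and the GCC/Clang `__atomic` builtins: the 5 jump bytes are merged into
an aligned 8-byte word and stored with one compare-and-swap, which a compiler
targeting 32-bit x86 can implement with CMPXCHG8B. The function name is
hypothetical, the buffer is plain data rather than live code, and the store
being atomic does not remove the need to synchronize the other CPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Replace the first 5 bytes of the aligned 8-byte word at *site with
 * "jmp rel32", leaving bytes 5..7 untouched, using a single 8-byte
 * compare-and-swap. Returns false if the word changed underneath us.
 */
static bool write_jmp_atomically(uint64_t *site, int32_t rel32)
{
    uint64_t old_word, new_word;
    unsigned char bytes[8];

    assert(((uintptr_t)site & 7) == 0); /* must not straddle an 8-byte boundary */

    old_word = __atomic_load_n(site, __ATOMIC_RELAXED);
    memcpy(bytes, &old_word, sizeof(bytes));
    bytes[0] = 0xE9;                          /* jmp rel32 opcode */
    memcpy(&bytes[1], &rel32, sizeof(rel32)); /* little-endian displacement */
    memcpy(&new_word, bytes, sizeof(new_word));

    return __atomic_compare_exchange_n(site, &old_word, new_word, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

The alignment check matters: a patch site that crosses an 8-byte boundary
cannot be covered by one such operation.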
Post by Helge Hafting
1. Overwrite the start of the function with a hlt, which is atomic.
2. Write that 5-byte jump after the hlt.
3. Overwrite the hlt with nop so things will work
4. interrupt any cpus that got stuck on the hlt - or just wait for the
timer.
CPU errata time again. You have to synchronize.

Alan

Masami Hiramatsu
2006-09-20 13:30:13 UTC
Permalink
Hi,
Post by Alan Cox
Post by Helge Hafting
1. Overwrite the start of the function with a hlt, which is atomic.
2. Write that 5-byte jump after the hlt.
3. Overwrite the hlt with nop so things will work
4. interrupt any cpus that got stuck on the hlt - or just wait for the
timer.
CPU errata time again. You have to synchronize.
Sure, and the method used by the djprobe I developed handles it as follows:
1. Overwrite the 1st instruction with int3. (atomic)
2. Wait until all processes running on every CPU have been scheduled.
(I'm using synchronize_sched(). This step ensures that no one is still on
the instructions which will be overwritten by the dest-addr.)
3. Write the destination address.
4. Interrupt all CPUs to serialize their caches (using CPUID).
5. Overwrite the int3 with the jmp opcode. (atomic)

In this method, the instructions are updated like below;
0. [ insn1 ][ insn2]
1. [int3]1 ][ insn2]
2. wait
3. [int3][ destaddr]
4. sync
5. [jmp to destaddr]
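
[Illustration] The five steps can be walked through on an ordinary byte
buffer. This is a sketch only: the two synchronization helpers are empty,
hypothetical stand-ins for the waiting and cross-CPU serialization described
above, the displacement is stored in host byte order (little-endian on x86),
and nothing here touches executable code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INT3_OPCODE 0xCC
#define JMP_OPCODE  0xE9

/* Hypothetical stand-in for step 2 (e.g. synchronize_sched() in the kernel). */
static void wait_for_all_cpus_to_schedule(void) { }

/* Hypothetical stand-in for step 4 (an IPI that runs CPUID on each CPU). */
static void serialize_all_cpus(void) { }

/* Turn the 5 bytes at site into "jmp rel32" in the order described above. */
static void djprobe_style_patch(unsigned char *site, int32_t rel32)
{
    assert(site != NULL);

    site[0] = INT3_OPCODE;                   /* 1. one-byte write           */
    wait_for_all_cpus_to_schedule();         /* 2. nobody left on bytes 1-4 */
    memcpy(&site[1], &rel32, sizeof(rel32)); /* 3. write destination        */
    serialize_all_cpus();                    /* 4. sync                     */
    site[0] = JMP_OPCODE;                    /* 5. one-byte write           */
}
```

Between steps 1 and 5 any CPU that reaches the site hits the int3, so the
partially written displacement bytes are never decoded as instructions.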

Actually, #2 is not enough for the preemptive kernel. So, current
djprobe doesn't support CONFIG_PREEMPT. But Ingo proposed some
good ideas (use freeze_processes()). I'll try his ideas.

What would you think about djprobe's method?

Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: ***@hitachi.com



Martin Bligh
2006-09-19 17:00:18 UTC
Permalink
[...] "compiled anew from original sources after deployment" seems
the most practical to do to me. From second hand info on using
systemtap, you seem to need the same compiler and source tree to
work from anyway [...]
Not quite. Systemtap does not look at sources, only object code and
its embedded debugging information. (How many distributions keep
around compilable source trees?)
???? Boggle. Any distro that cannot find the source code for its kernel
deserves a swift kick to the head, plus a red hot poker somewhere else.
[...] It seems like all we'd need to do is "list all references to
function, freeze kernel, update all references, continue", [...]
One additional problem are external references made *by* the function.
Those too would all have to be relocated to the live data.
Not sure what you mean ... could you give a quick example?
Live code patching is theoretically useful for all kinds of things,
but I've never heard it described as relatively simple before! :-)
well, on a whole-function basis, it seems somewhat simpler.

M.
Frank Ch. Eigler
2006-09-19 17:10:22 UTC
Permalink
Hi -
Post by Martin Bligh
[...] (How many distributions keep around compilable source
trees?)
???? Boggle. Any distro that cannot find the source code for its kernel
deserves a swift kick to the head, plus a red hot poker somewhere else.
My question is more whether they package up such a buildable
configured patched source tree (/usr/src/redhat/BUILD/* in RH-speak),
or just some extract like the .c/.h files.
Post by Martin Bligh
[...] It seems like all we'd need to do is "list all references to
function, freeze kernel, update all references, continue", [...]
One additional problem are external references made *by* the function.
Those too would all have to be relocated to the live data.
Not sure what you mean ... could you give a quick example?
Think about stuff that any function does. It calls other functions,
and manipulates global data, which all show up as external references
in the object code. All those references would have to be patched to
refer to the live running copy of the original compilation unit.
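
[Illustration] A small sketch of what "external references made by the
function" means, using made-up names. Both the original function and its
instrumented copy touch the same counter and call the same helper; if the
copy were linked against fresh private copies of them instead, it would
silently operate on different state.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for data and helpers owned by the running kernel. */
static unsigned long nr_scanned;
static int page_is_hot(int page) { return page % 2; }

/* Original function: one external data reference, one external call. */
static unsigned long count_cold_pages(const int *pages, size_t n)
{
    unsigned long cold = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        nr_scanned++;               /* reference to live global data */
        if (!page_is_hot(pages[i])) /* reference to another function */
            cold++;
    }
    return cold;
}

static unsigned long trace_calls; /* state added by the instrumentation */

/* Instrumented copy: must resolve nr_scanned and page_is_hot to the SAME
 * objects the original uses, not to duplicates. */
static unsigned long count_cold_pages_traced(const int *pages, size_t n)
{
    unsigned long cold = 0;

    assert(pages != NULL || n == 0);
    trace_calls++;
    for (size_t i = 0; i < n; i++) {
        nr_scanned++;
        if (!page_is_hot(pages[i]))
            cold++;
    }
    return cold;
}
```

Here the shared counter keeps accumulating across both versions, which is
the behaviour the relocation step has to preserve.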


- FChE
Frank Ch. Eigler
2006-09-19 17:00:22 UTC
Permalink
[...] "compiled anew from original sources after deployment" seems
the most practical to do to me. From second hand info on using
systemtap, you seem to need the same compiler and source tree to
work from anyway [...]
Not quite. Systemtap does not look at sources, only object code and
its embedded debugging information. (How many distributions keep
around compilable source trees?)
[...] It seems like all we'd need to do is "list all references to
function, freeze kernel, update all references, continue", [...]
One additional problem is the external references made *by* the function.
Those too would all have to be relocated to the live data.

Live code patching is theoretically useful for all kinds of things,
but I've never heard it described as relatively simple before! :-)

- FChE
Ingo Molnar
2006-09-19 15:50:11 UTC
Permalink
Post by Martin J. Bligh
You know ... it strikes me that there's another way to do this, that's
zero overhead when not enabled, and gets rid of the inflexibility in
kprobes. It might not work well in all cases, but at least for simple
non-inlined functions, it'd seem to.
Why don't we just copy the whole damned function somewhere else, and
make an instrumented copy (as a kernel module)? Then reroute all the
function calls through it, instead of the original version. OK, it's
not completely trivial to do, but simpler than kprobes (probably doing
the switchover atomically is the hard part, but not impossible).
There's NO overhead when not using, and much lower than probes when
you are.
That way we can do whatever the hell we please with internal
variables, however GCC optimises it, can write flexible instrumenting
code to just about anything, program in C as God intended, etc, etc.
No, it probably won't fix every case under the sun, but hopefully most
of them, and we can still use kprobes/djprobes/bodilyprobes for the
rest of the cases.
yeah, this would be nice - if it weren't for function pointers, and if
all kernel functions were relocatable. But if you can think of a method
to do this, it would be nice.

Ingo
Andi Kleen
2006-09-20 11:30:12 UTC
Permalink
Post by Ingo Molnar
yeah, this would be nice - if it weren't for function pointers, and if
all kernel functions were relocatable. But if you can think of a method
to do this, it would be nice.
x86-64 did this statically for some time to replace memory copies and some other
functions. Basically it just patches the beginning of the other function
into a jump. However, this assumes that the code doesn't contain absolute addresses
(e.g. no switches). On x86-64 it's easy because only assembly functions
are treated this way.
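
A minimal sketch of the byte sequence involved in patching the start of a function with a jump, assuming the common 5-byte x86 `jmp rel32` form (opcode `0xE9` followed by a 32-bit displacement measured from the end of the instruction). The helper name `encode_jmp_rel32` is hypothetical, and nothing is actually patched: the bytes are only written into a caller-supplied buffer.

```c
#include <stdint.h>

#define JMP_REL32_LEN 5   /* opcode byte + 32-bit displacement */

/*
 * Encode "jmp rel32" as it would sit at address `from`, transferring
 * control to address `to`. Returns 0 on success, or -1 if the target is
 * outside the signed 32-bit displacement range.
 */
int encode_jmp_rel32(uint8_t out[JMP_REL32_LEN], uint64_t from, uint64_t to)
{
    /* Displacement is relative to the address after the instruction. */
    int64_t disp = (int64_t)(to - (from + JMP_REL32_LEN));

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    uint32_t rel = (uint32_t)(int32_t)disp;

    out[0] = 0xE9;
    for (int i = 0; i < 4; i++)           /* little-endian byte order */
        out[1 + i] = (uint8_t)(rel >> (8 * i));
    return 0;
}
```

For example, a jump placed at 0x1000 that targets 0x2000 encodes as `E9 FB 0F 00 00`. Writing such bytes over live code would additionally need writable text and synchronization with other CPUs, which this sketch does not attempt.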

-Andi
Vara Prasad
2006-09-19 16:10:11 UTC
Permalink
Post by Martin J. Bligh
Post by Ingo Molnar
Post by Mathieu Desnoyers
+choice
+ prompt "MARK code marker behavior"
+config MARK_KPROBE
+config MARK_JPROBE
+config MARK_FPROBE
+ Change markers for a function call.
+config MARK_PRINT
as indicated before in great detail, NACK on this proliferation of
marker options, especially the function call one. I'd like to see
_one_ marker mechanism that distros could enable, preferably with
zero (or at most one NOP) in-code overhead. (You can of course patch
whatever extension on top of it, in out-of-tree code, to gain further
performance advantage by generating direct system-calls.)
There might be a hodgepodge of methods and tools in userspace to do
debugging, but in the kernel we should get our act together and only
take _one_ (or none at all), and then spend all our efforts on
improving that primary method of debug instrumentation. As
kprobes/SystemTap has proven, it is possible to have zero-overhead
inactive probes.
Furthermore, for such a patch to make sense in the upstream kernel,
downstream tracing code has to make actual use of that NOP-marker.
I.e. a necessary (but not sufficient) requirement for upstream
inclusion (in my view) would be for this mechanism to be used by LTT
and LKST. (again, you can patch LTT for your own purposes in your own
patchset if you think the performance overhead of probes is too much)
You know ... it strikes me that there's another way to do this, that's
zero overhead when not enabled, and gets rid of the inflexibility in
kprobes. It might not work well in all cases, but at least for simple
non-inlined functions, it'd seem to.
Why don't we just copy the whole damned function somewhere else, and
make an instrumented copy (as a kernel module)? Then reroute all the
function calls through it, instead of the original version. OK, it's
not completely trivial to do, but simpler than kprobes (probably
doing the switchover atomically is the hard part, but not impossible).
There's NO overhead when not using, and much lower than probes when
you are.
That way we can do whatever the hell we please with internal variables,
however GCC optimises it, can write flexible instrumenting code to just
about anything, program in C as God intended, etc, etc. No, it probably
won't fix every case under the sun, but hopefully most of them, and we
can still use kprobes/djprobes/bodilyprobes for the rest of the cases.
M.
It is an interesting idea, but there appear to be the following hard issues
(some of which you have already listed), and I am not able to see how we can
overcome them:

1) We are going to have a duplicate of the whole function, which means
any significant change in the original function needs to be made in the
copy as well. Do you think maintainers would like this double-work idea?

2) Inline functions are often the place where we need a fast path to
overcome the current kprobes overhead.

3) As you said, it is not trivial across all platforms to switch from
the original function to the instrumented one during execution.
This problem is similar to the issue we are dealing with in djprobes.

Martin Bligh
2006-09-19 16:20:16 UTC
Permalink
Post by Vara Prasad
It is an interesting idea, but there appear to be the following hard issues
(some of which you have already listed), and I am not able to see how we can
overcome them:
1) We are going to have a duplicate of the whole function, which means
any significant change in the original function needs to be made in the
copy as well. Do you think maintainers would like this double-work idea?
No, no ... the duplicate function isn't duplicated source code, only
object code. Either a config option via the markup macros that we've
been discussing, or something I hack up on the fly to debug a problem
dynamically. In terms of how the debugging-type source code is kept,
it's no different than something like systemtap or LTT (either would
work, and a normal diff could be used to keep out of tree stuff),
it's just how it hooks in is different to kprobes.
Post by Vara Prasad
2) Inline functions are often the place where we need a fast path to
overcome the current kprobes overhead.
You can still instrument inline functions, you just need to hook all
the callers, not the inline itself.
Post by Vara Prasad
3) As you said, it is not trivial across all platforms to switch from
the original function to the instrumented one during execution.
This problem is similar to the issue we are dealing with in djprobes.
If we just freeze all kernel operations for a split second whilst we do
this, does it matter? Or even if we don't ... there's a brief race where
some calls are traced, and some are not ... does that even matter?
Doesn't seem like most usages would care.

M.

Mathieu Desnoyers
2006-09-19 17:50:11 UTC
Permalink
Post by Vara Prasad
It is an interesting idea, but there appear to be the following hard issues
(some of which you have already listed), and I am not able to see how we can
overcome them:
1) We are going to have a duplicate of the whole function, which means
any significant change in the original function needs to be made in the
copy as well. Do you think maintainers would like this double-work idea?
Not with my marker proposal. The function only needs to be compiled with
different flags.
Post by Vara Prasad
2) Inline functions are often the place where we need a fast path to
overcome the current kprobes overhead.
3) As you said, it is not trivial across all platforms to switch from
the original function to the instrumented one during execution.
This problem is similar to the issue we are dealing with in djprobes.
I would really like to know how good djprobes is at instrumenting the
prologue of a function.

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Martin Bligh
2006-09-19 16:20:09 UTC
Permalink
Post by Martin J. Bligh
Why don't we just copy the whole damned function somewhere else, and
make an instrumented copy (as a kernel module)?
If you're going to go with that, then why not just use a comment-based
markup?
Comment, marker macro, flat patch, don't care much. All would work.
Then your alternate copy gets to be generated from the same codebase.
That was always the intent, or codebase + flat patch if really
necessary. Sorry if that wasn't clear.
It also solves the inherent problem of deciding whether
a macro-based markup is far too intrusive, since you can mildly allow
yourself more verbosity in a comment. Not only that, but if it's
comment-based, it's even foreseeable, though maybe not desirable, that
*everything* that deals with this type of markup be maintained out
of tree (i.e. scripts generating alternate functions and all.)
Not sure we need scripts, just a normal patch diff would do. I'm not
sure any of this alters the markup debate much ... it just would seem
to provide a simpler, faster, and more flexible way of hooking in than
kprobes.

M.
Karim Yaghmour
2006-09-19 16:30:52 UTC
Permalink
Post by Martin Bligh
That was always the intent, or codebase + flat patch if really
necessary. Sorry if that wasn't clear.
Ah, ok.
Post by Martin Bligh
Not sure we need scripts, just a normal patch diff would do. I'm not
sure any of this alters the markup debate much ...
It doesn't, just wasn't clear on the function duplication part.
Post by Martin Bligh
it just would seem
to provide a simpler, faster, and more flexible way of hooking in than
kprobes.
Sure.

Karim

Karim Yaghmour
2006-09-19 16:50:13 UTC
Permalink
Post by Martin Bligh
That was always the intent, or codebase + flat patch if really
necessary. Sorry if that wasn't clear.
Actually rereading through your posts with this correction in mind
I find this to actually be one of the most interesting ideas I've
seen of late. There's probably not a 1-to-1 correlation here, but
some of the problems mentioned seem similar to RCU stuff (modify
pointer, make sure nobody's got a copy of it, etc.), though I could
be wrong.

Random thoughts -- no guarantees:

Instead of freezing everything and making sure all text refs to
function are modified, you might just be able to use kprobes (on
the architectures that have it) as a trampoline for on-the-fly
address call modifications. And on the archs that don't have
kprobes, you could at build time degrade this by replacing direct
calls to instrumented functions by function pointers or localized
ifs.
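
The build-time fallback mentioned above, replacing direct calls with function pointers, might look roughly like this in user space. The names (`do_op`, `do_op_traced`, `call_do_op`, `set_tracing`) are made up for illustration, and C11 atomics stand in for whatever primitive a kernel implementation would use for the switchover.

```c
#include <stdatomic.h>

typedef int (*op_fn)(int);

int calls_traced;             /* stand-in for a trace buffer */

int do_op(int x)              /* original function */
{
    return x + 1;
}

int do_op_traced(int x)       /* instrumented variant, same result */
{
    calls_traced++;
    return x + 1;
}

/* Call sites go through this pointer instead of calling do_op() directly. */
static _Atomic(op_fn) do_op_ptr = do_op;

int call_do_op(int x)
{
    op_fn fn = atomic_load(&do_op_ptr);   /* one atomic read per call */
    return fn(x);
}

/* Switch every call site over (or back) with a single atomic store. */
void set_tracing(int on)
{
    atomic_store(&do_op_ptr, on ? do_op_traced : do_op);
}
```

Every call pays for an indirect call even when tracing is off, which is the cost this fallback accepts in exchange for not needing kprobes on that architecture.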

Not sure.

Karim

Karim Yaghmour
2006-09-19 16:20:13 UTC
Permalink
Post by Martin J. Bligh
Why don't we just copy the whole damned function somewhere else, and
make an instrumented copy (as a kernel module)?
If you're going to go with that, then why not just use a comment-based
markup? Then your alternate copy gets to be generated from the same
codebase. It also solves the inherent problem of deciding whether
a macro-based markup is far too intrusive, since you can mildly allow
yourself more verbosity in a comment. Not only that, but if it's
comment-based, it's even foreseeable, though maybe not desirable, that
*everything* that deals with this type of markup be maintained out
of tree (i.e. scripts generating alternate functions and all.)

Karim

Mathieu Desnoyers
2006-09-19 17:50:06 UTC
Permalink
Hi Martin,
Post by Martin J. Bligh
Why don't we just copy the whole damned function somewhere else, and
make an instrumented copy (as a kernel module)? Then reroute all the
function calls through it, instead of the original version. OK, it's
not completely trivial to do, but simpler than kprobes (probably
doing the switchover atomically is the hard part, but not impossible).
There's NO overhead when not using, and much lower than probes when
you are.
I just thought about your idea and I think it can be very powerful. I think it
can be a lot easier with a probe at the beginning of the function than changing
function pointers everywhere. First of all, if we just think about easily
accessing internal variables, we could use this simple trampoline scheme:

1 - load the instrumented function with modprobe
2 - use kprobe to reroute the first instructions of the original function to the
new one.
3 - _not_ use the special kprobe_ret, simply return at the end of the
instrumented function.

Then, if we want to optimize the speed of this mechanism, we can deploy
djprobes : it would greatly help them to know in advance where the probe is
located. We would have to see if the prologue of a function is a good spot to
put a jump (it does not seem to be the case however) :( .

To stop this tracing behavior, we would just have to remove the kprobe.
Unloading of the instrumented module can be difficult though (we have to be sure
the code will no longer be executed).

Mathieu


OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Karim Yaghmour
2006-09-20 17:30:12 UTC
Permalink
Post by Martin J. Bligh
can still use kprobes/djprobes/bodilyprobes for the rest of the cases.
May I suggest we call your mechanism "bprobes" ... which stands for "branch"
probes of course ;)

Karim

Frank Ch. Eigler
2006-09-19 15:30:08 UTC
Permalink
[...] I take for agreed that both static and dynamic tracing are
useful for different needs and that a full markup must support both
and combinations, letting the user or the distribution choose.
Elaborating on Ingo's "one mechanism" comments, I believe a marker
widget needs to be generic at run time. We're not just looking for a
way of hiding direct calls to lttng in a marker macro. We're looking
for a way of marking spots & data in a uniform way, then later
(run-time) binding each of those markers to (tools such as) lttng
and/or systemtap.
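
A minimal user-space sketch of run-time binding, assuming a small fixed table of named markers. The API names (`marker_declare`, `marker_bind`, `marker_hit`) and the two sample handlers are hypothetical; they are not the interface of the posted patch, of LTTng, or of SystemTap.

```c
#include <stddef.h>
#include <string.h>

#define MAX_MARKERS 8

typedef void (*marker_handler)(const char *name, long value);

struct marker {
    const char *name;
    marker_handler handler;   /* NULL until a tool binds to the marker */
};

static struct marker markers[MAX_MARKERS];
static size_t nr_markers;

/* Declare a marker site; returns NULL if the table is full. */
struct marker *marker_declare(const char *name)
{
    if (nr_markers == MAX_MARKERS)
        return NULL;
    markers[nr_markers].name = name;
    markers[nr_markers].handler = NULL;
    return &markers[nr_markers++];
}

/* Bind a handler by marker name at run time (NULL unbinds). */
int marker_bind(const char *name, marker_handler handler)
{
    for (size_t i = 0; i < nr_markers; i++) {
        if (strcmp(markers[i].name, name) == 0) {
            markers[i].handler = handler;
            return 0;
        }
    }
    return -1;   /* no marker with that name */
}

/* What a marked spot in the code would execute. */
void marker_hit(const struct marker *m, long value)
{
    if (m->handler)
        m->handler(m->name, value);
}

/* Two sample consumers standing in for different tracing tools. */
long tracer_a_total;
long tracer_b_hits;

void tracer_a_handler(const char *name, long value)
{
    (void)name;
    tracer_a_total += value;
}

void tracer_b_handler(const char *name, long value)
{
    (void)name;
    (void)value;
    tracer_b_hits++;
}
```

The marked code only ever calls `marker_hit()`; which tool receives the event is decided later by `marker_bind()`, so the same marker can serve one consumer, then another, without being recompiled.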

- FChE
Masami Hiramatsu
2006-09-20 13:30:12 UTC
Permalink
Hi,
Post by Mathieu Desnoyers
Hello,
Following this huge discussion thread, I tried to come up with a marker mechanism
(which is something everyone seems to agree that is a necessity) that would be
useful to each kind of tracing (dynamic and static) (concerned projects :
SystemTAP, LKET, LKST, LTTng) and even combinations of those. Religious
considerations aside, I really think that this kind of generic markup is
necessary to fill *everybody*'s need. If I forgot about a specific genericity
aspect, please tell me.
I take for agreed that both static and dynamic tracing are useful for different
needs and that a full markup must support both and combinations, letting the
user or the distribution choose.
Basically, I like this static marker concept.
But I wonder why you wouldn't use the architecture-independent
marker which SystemTap already supports.
If we use NOPs, the marker depends heavily on the architecture and is hard
to port.

Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: ***@hitachi.com




Mathieu Desnoyers
2006-09-20 13:40:06 UTC
Permalink
Post by Masami Hiramatsu
Hi,
Post by Mathieu Desnoyers
Hello,
Following this huge discussion thread, I tried to come up with a marker mechanism
(which is something everyone seems to agree that is a necessity) that would be
useful to each kind of tracing (dynamic and static) (concerned projects :
SystemTAP, LKET, LKST, LTTng) and even combinations of those. Religious
considerations aside, I really think that this kind of generic markup is
necessary to fill *everybody*'s need. If I forgot about a specific genericity
aspect, please tell me.
I take for agreed that both static and dynamic tracing are useful for different
needs and that a full markup must support both and combinations, letting the
user or the distribution choose.
Basically, I like this static marker concept.
But I wonder why you wouldn't use the architecture-independent
marker which SystemTap already supports.
If we use NOPs, the marker depends heavily on the architecture and is hard
to port.
Hi Masami,

Are you talking about the marker presented by Frank in his OLS paper
(void (*dest)(void) = NULL; if (dest) dest();)? I think it is a very good
idea to use it instead of nops.
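
Written out as compilable C, the shorthand in this message corresponds to a function pointer that is tested before each call. This is a sketch based only on that shorthand; the names `sample_probe`, `hits`, and `marked_code_path` are hypothetical, and the details of the paper's proposal are not reproduced here.

```c
#include <stddef.h>

/* Probe destination; NULL means "no probe attached". */
void (*dest)(void) = NULL;

int hits;   /* hypothetical counter bumped by the sample probe */

void sample_probe(void)
{
    hits++;
}

/* A marked spot: a pointer test, and a call only when a probe is set. */
void marked_code_path(void)
{
    if (dest)
        dest();
    /* ... the real work of the function would go here ... */
}
```

Nothing here is architecture-specific, which is the portability advantage over NOP padding raised in the quoted message. Attaching a probe is just `dest = sample_probe;`, and detaching is `dest = NULL;`.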

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68