Discussion:
[RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
Borislav Petkov
2015-02-20 19:10:02 UTC
+ Linus.

I'm sure he'll have something to say about it :-)

We introduced eager FPU restore in:

304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
The result is rather messy. There are two code paths in almost all of the
FPU code, and only one of them (the eager case) is tested frequently, since
most kernel developers have new enough hardware that we use eagerfpu.
glibc uses SSE2, so laziness is probably overoptimistic, and, in any
case, manipulating TS is far slower than saving and restoring the full
state.
To try to shake out any latent issues on old hardware, this changes
the default to eager on all CPUs. If no performance or functionality
problems show up, a subsequent patch could remove lazy mode entirely.
---
arch/x86/kernel/xsave.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 0de1fae2bdf0..1928c8f34ce5 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -637,8 +637,8 @@ static void __init xstate_enable_boot_cpu(void)
prepare_fx_sw_frame();
setup_init_fpu_buf();
- /* Auto enable eagerfpu for xsaveopt */
- if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+ /* Auto enable eagerfpu for everyone */
+ if (eagerfpu != DISABLE)
eagerfpu = ENABLE;
if (pcntxt_mask & XSTATE_EAGER) {
--
2.1.0
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-02-21 16:40:01 UTC
So it would be nice to test this on at least one reasonably old (but
not uncomfortably old - say 5 years old) system, to get a feel for
what kind of performance impact it has there.
Yeah, this is exactly what Andy and I were talking about yesterday on
IRC. So let's measure our favourite workload - the kernel build! :-) My
assumption is that libc uses SSE for memcpy and thus the FPU will be
used. (I'll trace FPU-specific PMCs later to confirm).

Machine is an AMD F10h which should be 5-10 years old depending on what
you're looking at (uarch, revision, ...).

Numbers look great to me in the sense that we have a very small
improvement and the rest stays the same. Which would mean that killing
lazy FPU brings no slowdown (even if no improvement), but will bring a
huge improvement in code quality and in the handling of the FPU state by
getting rid of the laziness...

IPC is the same, branch misses are *down* a bit, cache misses go up a
bit probably because we're shuffling FPU state more often to mem, page
faults go down and runtime goes down by half a second:

plain 3.19:
==========

perf stat -a -e task-clock,cycles,instructions,branch-misses,cache-misses,faults,context-switches,migrations --repeat 10 --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

Performance counter stats for 'system wide' (10 runs):

1408897.576594 task-clock (msec) # 6.003 CPUs utilized ( +- 0.15% ) [100.00%]
3,137,565,760,188 cycles # 2.227 GHz ( +- 0.02% ) [100.00%]
2,849,228,161,721 instructions # 0.91 insns per cycle ( +- 0.00% ) [100.00%]
32,391,188,891 branch-misses # 22.990 M/sec ( +- 0.02% ) [100.00%]
27,879,813,595 cache-misses # 19.788 M/sec ( +- 0.01% )
27,195,402 faults # 0.019 M/sec ( +- 0.01% ) [100.00%]
1,293,241 context-switches # 0.918 K/sec ( +- 0.09% ) [100.00%]
69,548 migrations # 0.049 K/sec ( +- 0.22% )

234.681331200 seconds time elapsed ( +- 0.15% )


eagerfpu=ENABLE
===============

Performance counter stats for 'system wide' (10 runs):

1405208.771580 task-clock (msec) # 6.003 CPUs utilized ( +- 0.19% ) [100.00%]
3,137,381,829,748 cycles # 2.233 GHz ( +- 0.03% ) [100.00%]
2,849,059,336,718 instructions # 0.91 insns per cycle ( +- 0.00% ) [100.00%]
32,380,999,636 branch-misses # 23.044 M/sec ( +- 0.02% ) [100.00%]
27,884,281,327 cache-misses # 19.844 M/sec ( +- 0.01% )
27,193,985 faults # 0.019 M/sec ( +- 0.01% ) [100.00%]
1,293,300 context-switches # 0.920 K/sec ( +- 0.08% ) [100.00%]
69,791 migrations # 0.050 K/sec ( +- 0.18% )

234.066525648 seconds time elapsed ( +- 0.19% )
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-02-21 17:40:02 UTC
My assumption is that libc uses SSE for memcpy and thus the FPU will
be used. (I'll trace FPU-specific PMCs later to confirm).
Ok, so I slapped a trace_printk() at the beginning of fpu_save_init()
and did a kernel build once with default 3.19 and once with Andy's
patch. So we end up with these counts:

default: 712000
eager: 780000

This would explain the very small difference in the performance counters
data from the previous email.
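
The instrumentation itself was nothing fancier than a hunk along these
lines (a sketch reconstructed from the description above - the context
lines are borrowed from the debugfs patch later in this thread):

--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -242,6 +242,9 @@ static inline void fpu_fxsave(struct fpu *fpu)
  */
 static inline int fpu_save_init(struct fpu *fpu)
 {
+	/* one trace line per FPU state save; count them in the trace buffer */
+	trace_printk("%s\n", __func__);
+
 	if (use_xsave()) {
 		fpu_xsave(fpu);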

Provided I've not made a mistake, this leads me to think that this
simple workload - and pretty much everything else - uses the FPU through
glibc, which does the SSE memcpy and so on. Which basically kills the
whole idea behind lazy FPU: thanks to glibc, you practically don't
encounter workloads nowadays which don't use the FPU, so the lazy
strategy doesn't really bring anything.

Which would then mean, we don't really need the lazy handling as
userspace is making it eager, so to speak, for us.
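
To illustrate the point (a minimal userspace example, not from the
original mail; note GCC may expand small constant-size copies inline as
__builtin_memcpy, so -fno-builtin-memcpy forces the libc call):

#include <string.h>

int main(void)
{
	static char src[4096], dst[4096];

	/* On x86-64 glibc resolves memcpy to an SSE-based variant, so
	 * even this FP-free program dirties XMM register state. */
	memcpy(dst, src, sizeof(dst));
	return 0;
}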

Thoughts?
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-02-21 19:20:01 UTC
So the workload improved by ~600,000 usecs, and there's
68,000 fewer calls, so it saved 8.8 usecs per call. Isn't
that a bit too high?
I think you mean more calls. The eager measurement has more calls. Let
me do some primitive math:

def   = (234.681331200 / 712000) * 10^6 = 329.609 microsecs/call
eager = (234.066525648 / 780000) * 10^6 = 300.085 microsecs/call

The diff is 29.523 microsecs speedup per call, which could
explain the cost of the CR0.TS serialization semantics in lazy mode.

Now, is 29 microseconds too high? I'm not sure this is even correct and
not some noise interfering.
I'd sleep a lot better if we had some runtime debug flag to
be able to do run-to-run comparisons on the same booted up
kernel, or so.
Let me take a look whether we could do some knob... The nice thing is,
the code uses use_eager_fpu() to check stuff, so we should be able to
clear the cpufeature flag.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Ingo Molnar
2015-02-21 19:30:01 UTC
Post by Borislav Petkov
I'd sleep a lot better if we had some runtime debug
flag to be able to do run-to-run comparisons on the
same booted up kernel, or so.
Let me take a look whether we could do some knob... The
nice thing is, the code uses use_eager_fpu() to check stuff,
so we should be able to clear the cpufeature flag.
Either that, or you could quickly hack those checks to use
panic_timeout, add an 'extern int panic_timeout;' and after
bootup do:

echo 0 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/panic

to switch between the modes?
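
Spelled out, the hack could look like this (a sketch only, not code
from this thread - it abuses the already sysctl-exported panic_timeout
as the switch):

/* in arch/x86/include/asm/fpu-internal.h (sketch) */
extern int panic_timeout;	/* writable via /proc/sys/kernel/panic */

static __always_inline bool use_eager_fpu(void)
{
	return panic_timeout != 0;	/* 1 = eager, 0 = lazy */
}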

Thanks,

Ingo
Borislav Petkov
2015-02-21 21:40:03 UTC
Post by Ingo Molnar
to switch between the modes?
I went all out and did a debugfs file, see patch at the end, which
counts FPU saves. Then I ran this script:

---
#!/bin/bash

D="/sys/kernel/debug/fpu/eager"

echo "Lazy FPU: "
echo 0 > $D
echo -n " FPU saves before: "; cat $D

perf stat -a -e task-clock,cycles,instructions,branch-misses,cache-misses,faults,context-switches,migrations --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

echo -n " FPU saves after: "; cat $D

echo ""
echo "Eager FPU: "
echo 1 > $D
echo -n " FPU saves before: "; cat $D

perf stat -a -e task-clock,cycles,instructions,branch-misses,cache-misses,faults,context-switches,migrations --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

echo -n " FPU saves after: "; cat $D
---

which spit this:

Lazy FPU:
FPU saves before: 3
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC c79a13ab
Kernel: arch/x86/boot/bzImage is ready (#41)

Performance counter stats for 'system wide':

1315527.989020 task-clock (msec) # 6.003 CPUs utilized [100.00%]
3,042,312,057,208 cycles # 2.313 GHz [100.00%]
2,790,807,863,402 instructions # 0.92 insns per cycle [100.00%]
31,658,299,111 branch-misses # 24.065 M/sec [100.00%]
27,504,255,277 cache-misses # 20.907 M/sec
26,802,015 faults # 0.020 M/sec [100.00%]
1,248,899 context-switches # 0.949 K/sec [100.00%]
69,553 migrations # 0.053 K/sec

219.127929718 seconds time elapsed

FPU saves after: 704186

Eager FPU:
FPU saves before: 4
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC 6767bb2e
Kernel: arch/x86/boot/bzImage is ready (#42)

Performance counter stats for 'system wide':

1321651.543922 task-clock (msec) # 6.003 CPUs utilized [100.00%]
3,044,403,437,364 cycles # 2.303 GHz [100.00%]
2,790,835,886,565 instructions # 0.92 insns per cycle [100.00%]
31,638,090,259 branch-misses # 23.938 M/sec [100.00%]
27,491,643,095 cache-misses # 20.801 M/sec
26,869,732 faults # 0.020 M/sec [100.00%]
1,252,034 context-switches # 0.947 K/sec [100.00%]
69,247 migrations # 0.052 K/sec

220.148034331 seconds time elapsed

FPU saves after: 901638

---
so we have a one-second slowdown and ~200K more FPU saves in eager mode.

Provided I've not made a mistake, it looks like the increase in cycles
is mirrored in a runtime that is one second longer. I've not done the
--repeat 10 thing again; maybe I should, just to be fair, as this is a
single run.

---
arch/x86/include/asm/fpu-internal.h | 4 ++++
arch/x86/kernel/xsave.c | 47 ++++++++++++++++++++++++++++++++++++-
2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index e97622f57722..7141f353e960 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -38,6 +38,8 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,
# define ia32_setup_rt_frame __setup_rt_frame
#endif

+
+extern unsigned long fpu_saved;
extern unsigned int mxcsr_feature_mask;
extern void fpu_init(void);
extern void eager_fpu_init(void);
@@ -242,6 +244,8 @@ static inline void fpu_fxsave(struct fpu *fpu)
*/
static inline int fpu_save_init(struct fpu *fpu)
{
+ fpu_saved++;
+
if (use_xsave()) {
fpu_xsave(fpu);

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 0de1fae2bdf0..029de8b629d0 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -14,6 +14,8 @@
#include <asm/sigframe.h>
#include <asm/xcr.h>

+#include <linux/debugfs.h>
+
/*
* Supported feature mask by the CPU and the kernel.
*/
@@ -638,7 +640,7 @@ static void __init xstate_enable_boot_cpu(void)
setup_init_fpu_buf();

/* Auto enable eagerfpu for xsaveopt */
- if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+ if (eagerfpu != DISABLE)
eagerfpu = ENABLE;

if (pcntxt_mask & XSTATE_EAGER) {
@@ -739,3 +741,46 @@ void *get_xsave_addr(struct xsave_struct *xsave, int xstate)
return (void *)xsave + xstate_comp_offsets[feature];
}
EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+unsigned long fpu_saved;
+
+static int eager_get(void *data, u64 *val)
+{
+ *val = fpu_saved;
+
+ return 0;
+}
+
+static int eager_set(void *data, u64 val)
+{
+ if (val)
+ setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+ else
+ setup_clear_cpu_cap(X86_FEATURE_EAGER_FPU);
+
+ fpu_saved = 0;
+
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(eager_fops, eager_get, eager_set, "%llu\n");
+
+static int __init setup_eagerfpu_knob(void)
+{
+ static struct dentry *d_eager, *f_eager;
+
+ d_eager = debugfs_create_dir("fpu", NULL);
+ if (!d_eager) {
+ pr_err("Error creating fpu debugfs dir\n");
+ return -ENOMEM;
+ }
+
+ f_eager = debugfs_create_file("eager", 0644, d_eager, NULL, &eager_fops);
+ if (!f_eager) {
+ pr_err("Error creating fpu debugfs node\n");
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+late_initcall(setup_eagerfpu_knob);
--
2.2.0.33.gc18b867
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Ingo Molnar
2015-02-22 08:20:01 UTC
Post by Borislav Petkov
219.127929718 seconds time elapsed
220.148034331 seconds time elapsed
so we have a second slowdown and 200K FPU saves more in eager mode.
So am I interpreting the older and your latest numbers
correctly in stating that the cost observation has flipped
around 180 degrees: the first measurement showed eager FPU
to be a win, but now that we can do more precise
measurements, eager FPU has actually slowed down the kernel
build by ~0.5%?

That's not good, and kernel builds are just a random load
that isn't even that FPU or context switch heavy - there
will certainly be other loads that would be hurt even more.

So just before we base wide-reaching decisions on any
of these measurements, would you mind helping us increase our
confidence in the numbers some more:

- It might make sense to do a 'perf stat --null --repeat'
measurement as well [without any -e arguments], to make
sure the rich PMU stats you are gathering are not
interfering?

With 'perf stat --null --repeat' perf acts essentially
as a /usr/bin/time replacement, but can measure down to
microseconds and will calculate noise/stddev properly.

- Perhaps also double check the debug switch: is it
really properly switching FPU handling mode?

- Do you have enough RAM that there's essentially no IO
in the system worth speaking of? Do you have enough RAM
to copy a whole kernel tree to /tmp/linux/ and do the
measurement there, on ramfs?
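
For example (a sketch; paths are illustrative):

	mkdir -p /tmp/linux
	mount -t tmpfs tmpfs /tmp/linux
	cp -a ~/src/linux/. /tmp/linux/
	cd /tmp/linux
	perf stat --null --repeat 10 make -s -j12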

Thanks,

Ingo
Ingo Molnar
2015-02-22 08:30:01 UTC
Post by Ingo Molnar
- Do you have enough RAM that there's essentially no IO
in the system worth speaking of? Do you have enough RAM
to copy a whole kernel tree to /tmp/linux/ and do the
measurement there, on ramfs?
Doing that will also pin down the page cache: kernel build
times are very sensitive to the page cache layout, and once
a page cache layout is established on systems with lots of
RAM it does not tend to be flushed out. But the next bootup
will generate another random page cache layout - which
makes kernel build time comparisons across boots much
noisier than the run-to-run numbers suggest.

So to get more precise measurements, a 'pinned' page cache
layout and the dynamic debug switch you implemented are very
helpful, and the --repeat stddev will be pretty representative
of the true noise of the measurement.

Thanks,

Ingo
Borislav Petkov
2015-02-22 11:00:02 UTC
Post by Ingo Molnar
So am I interpreting the older and your latest numbers
correctly in stating that the cost observation has flipped
around 180 degrees: the first measurement showed eager FPU
to be a win, but now that we can do more precise
measurements, eager FPU has actually slowed down the kernel
build by ~0.5%?
Well, I wouldn't take the latest numbers too seriously - that was a
single run without --repeat.
Post by Ingo Molnar
That's not good, and kernel builds are just a random load
that isn't even that FPU or context switch heavy - there
will certainly be other loads that would be hurt even more.
That is my fear.
Post by Ingo Molnar
So just before we base wide-reaching decisions on any
of these measurements, would you mind helping us increase our
- It might make sense to do a 'perf stat --null --repeat'
measurement as well [without any -e arguments], to make
sure the rich PMU stats you are gathering are not
interfering?
With 'perf stat --null --repeat' perf acts essentially
as a /usr/bin/time replacement, but can measure down to
microseconds and will calculate noise/stddev properly.
Cool, let me do that.
Post by Ingo Molnar
- Perhaps also double check the debug switch: is it
really properly switching FPU handling mode?
I've changed the use_eager_fpu() test to do:

static __always_inline __pure bool use_eager_fpu(void)
{
return boot_cpu_has(X86_FEATURE_EAGER_FPU);
}

and I'm clearing/setting eager FPU with
setup_force_cpu_cap/setup_clear_cpu_cap, see full diff below.
Post by Ingo Molnar
- Do you have enough RAM that there's essentially no IO
in the system worth speaking of? Do you have enough RAM
to copy a whole kernel tree to /tmp/linux/ and do the
measurement there, on ramfs?
/proc/meminfo says "MemTotal: 4011860 kB" which is probably not enough.
But I could find one somewhere :-)

---
arch/x86/include/asm/fpu-internal.h | 6 +++-
arch/x86/kernel/xsave.c | 57 ++++++++++++++++++++++++++++++++++++-
2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index e97622f57722..c8a161d02056 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -38,6 +38,8 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,
# define ia32_setup_rt_frame __setup_rt_frame
#endif

+
+extern unsigned long fpu_saved;
extern unsigned int mxcsr_feature_mask;
extern void fpu_init(void);
extern void eager_fpu_init(void);
@@ -87,7 +89,7 @@ static inline int is_x32_frame(void)

static __always_inline __pure bool use_eager_fpu(void)
{
- return static_cpu_has_safe(X86_FEATURE_EAGER_FPU);
+ return boot_cpu_has(X86_FEATURE_EAGER_FPU);
}

static __always_inline __pure bool use_xsaveopt(void)
@@ -242,6 +244,8 @@ static inline void fpu_fxsave(struct fpu *fpu)
*/
static inline int fpu_save_init(struct fpu *fpu)
{
+ fpu_saved++;
+
if (use_xsave()) {
fpu_xsave(fpu);

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 0de1fae2bdf0..943af0adacff 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -14,6 +14,8 @@
#include <asm/sigframe.h>
#include <asm/xcr.h>

+#include <linux/debugfs.h>
+
/*
* Supported feature mask by the CPU and the kernel.
*/
@@ -638,7 +640,7 @@ static void __init xstate_enable_boot_cpu(void)
setup_init_fpu_buf();

/* Auto enable eagerfpu for xsaveopt */
- if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+ if (eagerfpu != DISABLE)
eagerfpu = ENABLE;

if (pcntxt_mask & XSTATE_EAGER) {
@@ -739,3 +741,56 @@ void *get_xsave_addr(struct xsave_struct *xsave, int xstate)
return (void *)xsave + xstate_comp_offsets[feature];
}
EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+unsigned long fpu_saved;
+
+static void my_clts(void *arg)
+{
+ asm volatile("clts");
+}
+
+static int eager_get(void *data, u64 *val)
+{
+ *val = fpu_saved;
+
+ return 0;
+}
+
+static int eager_set(void *data, u64 val)
+{
+ preempt_disable();
+ if (val) {
+ on_each_cpu(my_clts, NULL, 1);
+ setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+ } else {
+ setup_clear_cpu_cap(X86_FEATURE_EAGER_FPU);
+ stts();
+ }
+ preempt_enable();
+
+ fpu_saved = 0;
+
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(eager_fops, eager_get, eager_set, "%llu\n");
+
+static int __init setup_eagerfpu_knob(void)
+{
+ static struct dentry *d_eager, *f_eager;
+
+ d_eager = debugfs_create_dir("fpu", NULL);
+ if (!d_eager) {
+ pr_err("Error creating fpu debugfs dir\n");
+ return -ENOMEM;
+ }
+
+ f_eager = debugfs_create_file("eager", 0644, d_eager, NULL, &eager_fops);
+ if (!f_eager) {
+ pr_err("Error creating fpu debugfs node\n");
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+late_initcall(setup_eagerfpu_knob);
--
2.2.0.33.gc18b867
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Ingo Molnar
2015-02-22 13:00:02 UTC
219.406449195 seconds time elapsed ( +- 0.17% )
218.791122148 seconds time elapsed ( +- 0.13% )
Timing improvement of 0.6 secs on average looks nice to
me. Next I'll do the same thing in tmpfs to see how the
numbers look without the page cache.
This is also very similar to the ~0.6 secs improvement your
first set of numbers gave.

So now that it appears we have consistent numbers, it would
be nice to check it on older hardware (and other workloads)
as well, to see whether it's a consistent win.

Thanks,

Ingo
Borislav Petkov
2015-02-22 13:00:02 UTC
Post by Ingo Molnar
- It might make sense to do a 'perf stat --null --repeat'
measurement as well [without any -e arguments], to make
sure the rich PMU stats you are gathering are not
interfering?
Well, the --repeat thing definitely is good to do and the previous
single run is simply not meaningful due to variance I'd guess:

perf stat --null --repeat 10 --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

...

Lazy FPU:
FPU saves before: 5
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC 5417fa4f
Kernel: arch/x86/boot/bzImage is ready (#43)

...

Kernel: arch/x86/boot/bzImage is ready (#52)

Performance counter stats for 'make -s -j12' (10 runs):

219.406449195 seconds time elapsed ( +- 0.17% )

FPU saves after: 6974975

Eager FPU:
FPU saves before: 4
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC 823c1801
Kernel: arch/x86/boot/bzImage is ready (#53)

...

Kernel: arch/x86/boot/bzImage is ready (#62)

Performance counter stats for 'make -s -j12' (10 runs):

218.791122148 seconds time elapsed ( +- 0.13% )

FPU saves after: 8939084


The counter numbers are consistent with before: ~200000 FPU saves more
in eager mode per kernel build.

Timing improvement of 0.6 secs on average looks nice to me. Next I'll
do the same thing in tmpfs to see how the numbers look without the page
cache.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Maciej W. Rozycki
2015-02-22 00:40:01 UTC
Post by Borislav Petkov
Provided I've not made a mistake, this leads me to think that this
simple workload and pretty much everything else uses the FPU through
glibc which does the SSE memcpy and so on. Which basically kills the
whole idea behind lazy FPU as practically you don't really encounter
workloads nowadays which don't use the FPU thanks to glibc and the lazy
strategy doesn't really bring anything.
Which would then mean, we don't really need the lazy handling as
userspace is making it eager, so to speak, for us.
Please correct me if I'm wrong, but it looks to me like you're confusing
lazy FPU context allocation and lazy FPU context switching. These build
on the same hardware principles, but they are different concepts.

Your "userspace is making it eager" statement in the context of glibc
using SSE for `memcpy' is certainly true for lazy FPU context allocation,
however I wouldn't be so sure about lazy FPU context switching, and a
kernel compilation (or in fact any compilation) does not appear to be a
representative benchmark to me. I am sure lots of software won't be
calling `memcpy' all the time, there should be context switches between
which the FPU is not referred to at all.

Also, does `__builtin_memcpy' also expand to SSE? I'd expect it rather
than external `memcpy' to be used by GCC for copying fixed amounts of
data, especially smaller ones such as when passing structures by value in
function calls or for string operations like `strdup' or suchlike. These
I'd expect to be ubiquitous, whereas external `memcpy' I'd expect to be
called from time to time only.

Additionally I believe long-executing FPU instructions (i.e.
transcendentals) can take advantage of continuing to execute in parallel
where the context has already been switched rather than stalling an eager
FPU context switch until the FPU instruction has completed.

And last but not least, why does the handling of CR0.TS traps have to be
complicated? It does not look like rocket science to me, it should be a
mere handful of instructions; the time required to move the two FP
contexts out of and into the FPU respectively should dominate the
processing time. Where quoted, the optimisation manual states 250 cycles
for FXSAVE and FXRSTOR combined.
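
In outline, the textbook lazy scheme amounts to this (an illustrative
sketch with made-up names, not the actual kernel handler):

/* #NM, raised by the first FPU insn executed while CR0.TS is set */
void do_device_not_available(void)
{
	clts();					/* FPU insns stop trapping */
	if (fpu_owner && fpu_owner != current)
		fxsave(&fpu_owner->thread.fpu);	/* move the old context out... */
	fxrstor(&current->thread.fpu);		/* ...and the new one in */
	fpu_owner = current;
}	/* IRET; the trapped FPU instruction then re-executes */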

And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
bootstrap depending on processor features, you don't have to do all the
run-time check on every trap. You can even optimise the FSAVE handler
away at the build time if you know it won't ever be used based on the
minimal supported processor family selected.

Do you happen to know, or can you determine, how much time (in clock
cycles) a CR0.TS trap itself takes, including any time required to
preserve the execution state in the handler, such as pushing/popping
GPRs to/from the stack (as opposed to processing time spent on moving
the FP contexts back and forth)? Is there no room for improvement left
there? How many task scheduling slots, say per million, must be poking
at the FPU for eager FPU context switching to gain an advantage over
the lazy one?

Maciej
Borislav Petkov
2015-02-22 11:10:02 UTC
That's true. The question is whether there are enough of them, and
whether twiddling TS is fast enough, that it's worth it.
Yes, and let me make it clear what I'm trying to do here: I want to make
sure that eager FPU handling (both allocation and switching - and no,
I'm not confusing the concepts) *doesn't* *hurt* any relevant workload.
If it does, then we'll stop wasting time right there.

But(!), if the CR0.TS lazy dance turns out to be really slow and the
eager handling doesn't bring a noticeable slowdown, in comparison, we
should consider doing the eager thing by default. After running a lot
more benchmarks, of course.

Which brings me to the main reason why we're doing this: code
simplification. If we switch to eager, we'll kill a lot of non-trivial
code and the FPU handling in the kernel becomes dumb and nice again.
Post by Maciej W. Rozycki
And last but not least, why does the handling of CR0.TS traps have to be
complicated? It does not look like rocket science to me, it should be a
If you think so, just go and look at the latest tip/x86/fpu commits and
the amount of WTF in the commit messages.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Rik van Riel
2015-02-23 01:50:01 UTC
Post by Borislav Petkov
That's true. The question is whether there are enough of them,
and whether twiddling TS is fast enough, that it's worth it.
Yes, and let me make it clear what I'm trying to do here: I want to
make sure that eager FPU handling (both allocation and switching -
and no, I'm not confusing the concepts) *doesn't* *hurt* any
relevant workload. If it does, then we'll stop wasting time right
there.
But(!), if the CR0.TS lazy dance turns out to be really slow and
the eager handling doesn't bring a noticeable slowdown, in
comparison, we should consider doing the eager thing by default.
After running a lot more benchmarks, of course.
Which brings me to the main reason why we're doing this: code
simplification. If we switch to eager, we'll kill a lot of
non-trivial code and the FPU handling in the kernel becomes dumb
and nice again.
Currently the FPU handling does something really dumb for
KVM VCPU threads. Specifically, every time we enter a
KVM guest, we save the userspace FPU state of the VCPU
thread, and every time we leave the KVM guest, we load
the userspace FPU state of the VCPU thread.

This is done for a thread that hardly ever exits to
userspace, and generally just switches between kernel
and guest mode.

The reason for this acrobatics is that at every
context switch time, the userspace FPU state is
saved & loaded.

I am working on a patch series to avoid that needless
FPU save & restore, by moving the point at which the
user FPU state is loaded out to the point where we
return to userspace, in do_notify_resume.

One implication of this is that in kernel mode, we
can no longer just assume that the user space FPU
state is always loaded, and we need to check for that
(like the lazy FPU code does today). I would really
like to keep that code around, for obvious reasons :)
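
In rough outline, the idea is (a sketch with hypothetical names, not
the actual patch series):

/* Called on the way out to userspace, from do_notify_resume(): the
 * context switch itself no longer loads user FPU state, so it gets
 * loaded at most once per return to userspace. */
void switch_fpu_finished(struct task_struct *tsk)
{
	if (!tsk->thread.fpu.user_state_loaded) {	/* hypothetical flag */
		fpu_restore(&tsk->thread.fpu);		/* load user FP registers */
		tsk->thread.fpu.user_state_loaded = 1;
	}
}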

--
All rights reversed
Andy Lutomirski
2015-02-23 05:30:02 UTC
Post by Rik van Riel
Post by Borislav Petkov
That's true. The question is whether there are enough of them,
and whether twiddling TS is fast enough, that it's worth it.
Yes, and let me make it clear what I'm trying to do here: I want to
make sure that eager FPU handling (both allocation and switching -
and no, I'm not confusing the concepts) *doesn't* *hurt* any
relevant workload. If it does, then we'll stop wasting time right
there.
But(!), if the CR0.TS lazy dance turns out to be really slow and
the eager handling doesn't bring a noticeable slowdown, in
comparison, we should consider doing the eager thing by default.
After running a lot more benchmarks, of course.
Which brings me to the main reason why we're doing this: code
simplification. If we switch to eager, we'll kill a lot of
non-trivial code and the FPU handling in the kernel becomes dumb
and nice again.
Currently the FPU handling does something really dumb for
KVM VCPU threads. Specifically, every time we enter a
KVM guest, we save the userspace FPU state of the VCPU
thread, and every time we leave the KVM guest, we load
the userspace FPU state of the VCPU thread.
This is done for a thread that hardly ever exits to
userspace, and generally just switches between kernel
and guest mode.
The reason for this acrobatics is that at every
context switch time, the userspace FPU state is
saved & loaded.
I am working on a patch series to avoid that needless
FPU save & restore, by moving the point at which the
user FPU state is loaded out to the point where we
return to userspace, in do_notify_resume.
One implication of this is that in kernel mode, we
can no longer just assume that the user space FPU
state is always loaded, and we need to check for that
(like the lazy FPU code does today). I would really
like to keep that code around, for obvious reasons :)
I like that stuff, except for the fact that it still has code that
depends on whether we're in eager or lazy mode, even though eager is a
little less eager with your patches. Ideally I'd like to see your
patches applied *and* lazy mode removed.

--Andy
Rik van Riel
2015-02-23 13:00:01 UTC
Post by Andy Lutomirski
One implication of this is that in kernel mode, we can no longer
just assume that the user space FPU state is always loaded, and
we need to check for that (like the lazy FPU code does today). I
would really like to keep that code around, for obvious reasons
:)
I like that stuff, except for the fact that it still has code that
depends on whether we're in eager or lazy mode, even though eager
is a little less eager with your patches. Ideally I'd like to see
your patches applied *and* lazy mode removed.
The latest version of my code (which I should forward-port
again to the latest tip bits) simplifies things a lot.

It moves all new task handling to switch_fpu_finished(),
which is called from do_notify_resume().

At that point we either load the FPU context, or we
set CR0.TS.

--
All rights reversed
Borislav Petkov
2015-02-23 15:10:02 UTC
Permalink
Post by Rik van Riel
At that point we either load the FPU context, or we
set CR0.TS.
Right, but provided eager doesn't bring any slowdown, we can drop the TS
fiddling altogether and only load FPU context.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Rik van Riel
2015-02-23 16:00:02 UTC
Post by Borislav Petkov
At that point we either load the FPU context, or we set CR0.TS.
Right, but provided eager doesn't bring any slowdown, we can drop
the TS fiddling altogether and only load FPU context.
Agreed.

However, we would still need the rest of the kernel code to
check whether userspace FPU context is loaded in the registers
or not. We cannot simplify to the point of assuming that the
FPU state will always be loaded when in eager mode.

--
All rights reversed
Borislav Petkov
2015-02-23 18:10:03 UTC
However, we would still need the rest of the kernel code to ...
Yeah, let's wait it out first and see what the benchmarks say. Mel started
a bunch of them on a couple of boxes here, we'll have results in the
coming days.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Maciej W. Rozycki
2015-02-23 21:20:02 UTC
Post by Maciej W. Rozycki
Additionally I believe long-executing FPU instructions (i.e.
transcendentals) can take advantage of continuing to execute in parallel
where the context has already been switched rather than stalling an eager
FPU context switch until the FPU instruction has completed.
It seems highly unlikely to me that a slow FPU instruction can retire
*after* a subsequent fxsave, which would need to happen for this to
work.
I meant something else -- a slow FPU instruction can retire after a task
has been switched where the FP context has been left intact, i.e. in the
lazy FP context switching case, where only the MMU context and GPRs have
been replaced. Whereas in the eager FP context switching case you can get
through to FXSAVE while a slow FPU instruction hasn't completed yet (e.g.
started just as preemption was about to happen).

Obviously that FXSAVE will have to stall until the FPU instruction has
completed (IIRC the i486 aborted transcendental instructions on any
exceptions/interrupts instead, leading to the risk of process starvation
in heavily interrupt-loaded systems, but I also believe this has been
fixed as of the Pentium). Though if, as you say, the taking of a
trap/interrupt gate alone can take hundreds of cycles, perhaps indeed no
FPU instruction will execute *that* long on modern silicon.
Post by Maciej W. Rozycki
And last but not least, why does the handling of CR0.TS traps have to be
complicated? It does not look like rocket science to me, it should be a
mere handful of instructions, the time required to move the two FP
contexts out from and in to the FPU respectively should dominate
processing time. Where quoted the optimisation manual states 250 cycles
for FXSAVE and FXRSTOR combined.
The TS traps aren't complicated -- they're just really slow. I think
that each of setting *and* clearing TS serializes and takes well over
a hundred cycles. A #NM interrupt (the thing that happens if you try
to use the FPU with TS set) serializes and does all kinds of slow
things, so it takes many hundreds of cycles. The handler needs to
clear TS (another hundred cycles or more), load the FPU state
(actually rather fast on modern CPUs), and then IRET back to userspace
(hundreds of cycles). This adds up to a lot of cycles. A round trip
through an exception handler seems to be thousands of cycles.
That sucks wet goat farts! :(

I have to admit I got moved a bit away from the x86 world and didn't
realise things have become so bad. Some 10 years ago or so taking a trap
or interrupt gate would need some 30 cycles (of course task gates are
unusable for anything that does not absolutely require them such as a #DF;
their usability for anything real ended with the 80286 or suchlike).
Similarly an IRET to reverse the actions taken. That was already rather
bad, but understandable, after all the CPU had to read the gate
descriptor, access the TSS, switch both CS and SS descriptors, etc.

What I don't understand is why CLTS, a dedicated instruction that avoids
the need to access whole CR0 (that again can understandably be costly,
because of the grouping of all the important bits there), has to be so
slow. It flips a single bit down there and does not need to serialise
anything, as any instruction down the pipeline it could affect would
trigger a #NM anyway! And there's an IRET somewhere on the way too,
before the instruction that originally triggered the fault will be
reexecuted.

And why the heck, over all these years, hasn't a mechanism similar to
SYSENTER and its bunch of complementing MSRs been invented for the common
exceptions, to avoid all this gate descriptor dance!
Post by Maciej W. Rozycki
And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
bootstrap depending on processor features, you don't have to do all the
run-time check on every trap. You can even optimise the FSAVE handler
away at the build time if you know it won't ever be used based on the
minimal supported processor family selected.
None of this matters. A couple of branches in an #NM handler are
completely lost in the noise.
Agreed, given what you state, completely understood.
Post by Maciej W. Rozycki
Do you happen to know or can determine how much time (in clock cycles) a
CR0.TS trap itself takes, including any time required to preserve the
execution state in the handler such as pushing/popping GPRs to/from the
stack (as opposed to processing time spent on moving the FP contexts back
and forth)? Is there no room for improvement left there? How many task
scheduling slots say per million must be there poking at the FPU for eager
FPU context switching to take advantage over lazy one?
Thousands of cycles. Considerably worse in an SVM guest. x86
exception handling sucks.
I must have been spoilt with the MIPS exception handling. Taking an
exception on a MIPS processor is virtually instantaneous, just like
retiring another instruction. Of course there's a cost equivalent to a
branch misprediction, as you need to flush the whole pipeline. So
depending on how many stages you have there, you can expect a latency of
say 3-7 clocks.

Granted, on a MIPS processor taking an exception does not change much --
it switches into the kernel mode (1 bit set in a control register, a
special kernel-mode-override bit dedicated to exception handling), saves
the old PC (another control register updated; called Exception PC or EPC)
and loads the PC with the exception vector. All the rest is left to the
kernel. Which is good!

The same stands for ERET, the exception return instruction -- it merely
loads the PC back from EPC and clears the kernel-mode-override bit in the
other control register. More recently it also serves the purpose of an
instruction hazard barrier, which you'd call synchronisation, the
strongest kind provided in the MIPS architecture (in older architecture
revisions you had to take care of any hazards caused by preceding
instructions that could affect user-mode execution, by inserting the right
number of NOPs before ERET, possibly taking other instructions already
executed since the origin of the hazard into account). So rather than 3-7
clocks that could be 20 or so, though usually much fewer.

A while ago I cooperated with the hardware team in adding an extra
instruction to the architecture under the assumption that it will be
emulated on legacy hardware, by taking the RI or Reserved Instruction
exception (the equivalent to x86's #UD) and doing the rest there.
Another assumption was a fast path would be taken for this single
instruction and all the handling done in assembly, without even reaching
the usual C-language RI handlers that we've accumulated over the years.

Mind that exceptions actually have to be decoded and dispatched to
individual handlers on MIPS processors first, it's not that individual
exception classes have individual vectors like with x86 -- there's only
one! And you need to update EPC too or you'd be trapping back. Finally
the instruction itself had to be decoded, so instruction memory had to be
read and compared against the pattern expected.

To make a long story short I was able to squeeze all the handling into
some 30 cycles, with a slight variation across different processor
implementations. How much different!

Oh well, some further benchmarking is still needed, but given the
circumstances I suppose the good old design will have to go after all,
sigh... Thanks for your input!

Maciej
Rik van Riel
2015-02-23 21:30:01 UTC
Post by Maciej W. Rozycki
Post by Maciej W. Rozycki
Additionally I believe long-executing FPU instructions (i.e.
transcendentals) can take advantage of continuing to execute in
parallel where the context has already been switched rather
than stalling an eager FPU context switch until the FPU
instruction has completed.
It seems highly unlikely to me that a slow FPU instruction can
retire *after* a subsequent fxsave, which would need to happen
for this to work.
I meant something else -- a slow FPU instruction can retire after a
task has been switched where the FP context has been left intact,
i.e. in the lazy FP context switching case, where only the MMU
context and GPRs have been replaced.
I don't think that's true, because changing the MMU context and GPRs
also includes changing the instruction pointer, and changing over the
execution to the new task.

After a context switch, the instructions from the old task are no
longer in the pipeline.

--
All rights reversed
Linus Torvalds
2015-02-23 22:20:02 UTC
Post by Rik van Riel
Post by Maciej W. Rozycki
It seems highly unlikely to me that a slow FPU instruction can
retire *after* a subsequent fxsave, which would need to happen
for this to work.
I meant something else -- a slow FPU instruction can retire after a
task has been switched where the FP context has been left intact,
i.e. in the lazy FP context switching case, where only the MMU
context and GPRs have been replaced.
I don't think that's true, because changing the MMU context and GPRs
also includes changing the instruction pointer, and changing over the
execution to the new task.
We have one traditional special case, which actually did something
like Maciej's nightmare scenario: the completely broken "FPU errors
over irq13" IBM PC/AT FPU linkage.

But since we don't actually support old i386 machines any more, we
don't really need to care. The only way you can get into that
situation is with an external i387. I don't think we need to worry
about it.

But with the old horrid irq13 error handling, you literally could get
into a situation that you got an error "exception" (irq) from the
previous state, *after* you had already switched to the new one. We
had some code to mitigate the problem, but as mentioned, I don't think
it's an issue any more.

Linus
Andy Lutomirski
2015-02-24 01:10:02 UTC
Post by Linus Torvalds
We have one traditional special case, which actually did something
like Maciej's nightmare scenario: the completely broken "FPU errors
over irq13" IBM PC/AT FPU linkage.
But since we don't actually support old i386 machines any more, we
don't really need to care. The only way you can get into that
situation is with an external i387. I don't think we need to worry
about it.
But with the old horrid irq13 error handling, you literally could get
into a situation that you got an error "exception" (irq) from the
previous state, *after* you had already switched to the new one. We
had some code to mitigate the problem, but as mentioned, I don't think
it's an issue any more.
Correct, the horrid hack is gone, it was so horrible (though I understand
why IBM had to do it with their PC/AT) that back in the mid-1990s, some 10
years after the inception of the problem, Intel felt so compelled to make
people get the handling of it right as to release a dedicated application
note: "AP-578 Software and Hardware Considerations for FPU Exception
Handlers for Intel Architecture Processors", Order Number 243291-002.
Anyway, my point through this consideration has been about the
performance benefit from continuing the execution of an x87 instruction in
parallel, perhaps until after a context has been fully switched. Which
benefit is lost if a FSAVE/FXSAVE executed eagerly at the context switch
stalls waiting for the instruction to complete instead.
And the save is indeed executed eagerly. Conceivably we could get rid
of that on UP, but that seems to be a lot of complexity for extremely
little gain.

--Andy
Maciej W. Rozycki
2015-02-24 01:20:01 UTC
Post by Linus Torvalds
We have one traditional special case, which actually did something
like Maciej's nightmare scenario: the completely broken "FPU errors
over irq13" IBM PC/AT FPU linkage.
But since we don't actually support old i386 machines any more, we
don't really need to care. The only way you can get into that
situation is with an external i387. I don't think we need to worry
about it.
But with the old horrid irq13 error handling, you literally could get
into a situation that you got an error "exception" (irq) from the
previous state, *after* you had already switched to the new one. We
had some code to mitigate the problem, but as mentioned, I don't think
it's an issue any more.
Correct, the horrid hack is gone, it was so horrible (though I understand
why IBM had to do it with their PC/AT) that back in the mid-1990s, some 10
years after the inception of the problem, Intel felt so compelled to make
people get the handling of it right as to release a dedicated application
note: "AP-578 Software and Hardware Considerations for FPU Exception
Handlers for Intel Architecture Processors", Order Number 243291-002.

Anyway, my point through this consideration has been about the
performance benefit from continuing the execution of an x87 instruction in
parallel, perhaps until after a context has been fully switched. Which
benefit is lost if a FSAVE/FXSAVE executed eagerly at the context switch
stalls waiting for the instruction to complete instead.

Maciej
Maciej W. Rozycki
2015-02-23 22:30:02 UTC
Post by Rik van Riel
Post by Maciej W. Rozycki
I meant something else -- a slow FPU instruction can retire after a
task has been switched where the FP context has been left intact,
i.e. in the lazy FP context switching case, where only the MMU
context and GPRs have been replaced.
I don't think that's true, because changing the MMU context and GPRs
also includes changing the instruction pointer, and changing over the
execution to the new task.
That does not matter. The instructions in question only operate on x87
internal registers: the data stack registers, specifically ST(0) and
possibly also ST(1), and consequently the Tag Word register, and the
Status Word register. No CPU resource such as the MMU or GPRs needs to be
referenced for an x87 instruction to complete. Any unmasked IEEE 754 FPU
exception recorded on the way is only signalled at the next x87
instruction.
Post by Rik van Riel
After a context switch, the instructions from the old task are no
longer in the pipeline.
I'd say it's implementation-specific. As I mentioned the i486 aborted
any transcendental x87 instruction in progress upon taking an exception or
interrupt. That was a model like the one you refer to, but as I also
mentioned it
had its shortcomings.

Any newer implementation I'd expect to, and Pentium class processors
certainly did, continue executing these instructions in parallel in the
FPU pipeline regardless of what the CPU does until completed. If WAIT or
a waiting x87 instruction was encountered while a previous x87 instruction
was still in progress, the CPU pipeline would stall until the earlier x87
instruction has completed. The FPU has no way to determine that the CPU
context has been switched, nor does it recognise execution privilege
levels.

I can't speak of SIMD instructions, I don't know offhand. OTOH AFAIK
they don't suffer from latencies as long as those of some x87
instructions, which may be in the range of 400 clock cycles.

Maciej
Andy Lutomirski
2015-02-23 23:50:01 UTC
Post by Maciej W. Rozycki
Post by Rik van Riel
Post by Maciej W. Rozycki
I meant something else -- a slow FPU instruction can retire after a
task has been switched where the FP context has been left intact,
i.e. in the lazy FP context switching case, where only the MMU
context and GPRs have been replaced.
I don't think that's true, because changing the MMU context and GPRs
also includes changing the instruction pointer, and changing over the
execution to the new task.
That does not matter. The instructions in question only operate on x87
internal registers: the data stack registers, specifically ST(0) and
possibly also ST(1), and consequently the Tag Word register, and the
Status Word register. No CPU resource such as the MMU or GPRs need to be
referred for an x87 instruction to complete. Any unmasked IEEE 754 FPU
exception recorded on the way is only signalled at the next x87
instruction.
Post by Rik van Riel
After a context switch, the instructions from the old task are no
longer in the pipeline.
I'd say it's implementation-specific. As I mentioned the i486 aborted
any transcendental x87 instruction in progress upon taking an exception or
interrupt. That was a model like you refer to, but as I also mentioned it
had its shortcomings.
IRET is serializing, according to the docs (I think) and according
to the Intel engineers I asked (I'm absolutely certain about this
part). So FPU ops are entirely done at the end of a normal context
switch.

We also always save the FPU context on every context switch away from
a task that used the FPU, even in lazy mode. This is because we might
switch the task back in on a different CPU, and we don't want to use
an IPI to move the FPU context.

Given that we're only talking about old CPUs here, I sincerely doubt
that there's any relevant case in which an fxsave can usefully wait
for a long-running transcendental op to finish while we continue doing
useful work. *Especially* since there will almost certainly be
several more mfences or atomic ops before the end of the context
switch, even if we're lucky enough to complete the context switching
using sysret.

--Andy
Maciej W. Rozycki
2015-02-24 02:20:02 UTC
Permalink
Post by Andy Lutomirski
Post by Maciej W. Rozycki
Post by Rik van Riel
After a context switch, the instructions from the old task are no
longer in the pipeline.
I'd say it's implementation-specific. As I mentioned the i486 aborted
any transcendental x87 instruction in progress upon taking an exception or
interrupt. That was a model like you refer to, but as I also mentioned it
had its shortcomings.
IRET is serializing, according to the docs (I think) and according
to the Intel engineers I asked (I'm absolutely certain about this
part). So FPU ops are entirely done at the end of a normal context
switch.
No question about the serialising property of IRET, it has been like this
since the original Pentium implementation. Do you have an architecture
specification reference to back up your claim though as far as the FPU is
concerned? I'm asking because I am genuinely curious.

The x87 case is so special, there isn't anything there really that is
externally observable or should be affected by IRET or any other
synchronisation barriers apart from WAIT (or a waiting x87 instruction)
that has been there for this purpose since forever. And it would defeat
some documented benefits of running the FP pipeline in parallel.

And certainly such synchronisation didn't happen in the old days.
Post by Andy Lutomirski
We also always save the FPU context on every context switch away from
a task that used the FPU, even in lazy mode. This is because we might
switch the task back in on a different CPU, and we don't want to use
an IPI to move the FPU context.
That's an interesting case too, although not necessarily related. If you
say that we always save the FP context eagerly for the purpose of process
migration, then sure, that invalidates any benefit we'd have from letting
the x87 proceed.

However I can see different ways to address this case, avoiding the
need for eager FP context saving or an IPI:

1. We could bind any currently suspended process with an unsaved FP
   context to the CPU it last executed on (a rough sketch follows
   below).

2. We could mark such a process for migration next time and let it execute
on the CPU that holds its FP context once more, and then save the FP
context eagerly on the way out.

In some cases a lazily retained FP context would be preempted by another
process before the process in question would resume anyway. In this case
any temporary binding to a CPU could be given up.
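
A rough model of option 1 in code, just to make the idea concrete --
all names here are made up, this is not existing kernel API:

	struct task {
		int fpu_cpu;		/* CPU whose registers hold our FP state */
		int fpu_state_live;	/* 1 while that is the only copy */
	};

	/* Refuse migration while the only copy of the task's FP state
	 * is in the registers of the CPU it last ran on; the scheduler
	 * would fall back to saving the state when it must move it.
	 */
	static int can_migrate(const struct task *p, int dst_cpu)
	{
		if (p->fpu_state_live && p->fpu_cpu != dst_cpu)
			return 0;
		return 1;
	}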
Post by Andy Lutomirski
Given that we're only talking about old CPUs here, I sincerely doubt
that there's any relevant case in which an fxsave can usefully wait
for a long-running transcendental op to finish while we continue doing
useful work. *Especially* since there will almost certainly be
several more mfences or atomic ops before the end of the context
switch, even if we're lucky enough to complete the context switching
using sysret.
I am not sure what you mean by FXSAVE usefully waiting for an op,
please elaborate. At the point you've reached FXSAVE and an earlier x87
instruction hasn't completed, you've already lost. The pipeline will be
stalled until the x87 instruction has completed and that can be
hundreds of cycles. My point therefore has been about avoiding
executing FXSAVE for the old task until absolutely necessary, which,
with lazy FP context switching, would be at the next x87 (or SSE)
instruction reached by the new task.

Likewise I don't see why MFENCE or an atomic operation should affect
the execution of, say, FSINCOS. Whether the results of FSINCOS arrive
before or after MFENCE, etc. is not externally observable.

And I'm not sure if this all affects old CPUs only -- I don't know how
much x87 software is out there, but after all these years I'd expect
quite a lot. Sure, much of it can be recompiled to use SSE instead, but
not all, and even where it is feasible, that's an extra burden for
people, beyond say a routine hardware or Linux distribution upgrade, or
for that matter a lone kernel upgrade. Therefore I think we need to be
careful not to pessimise things too much for a subset of people, and
ideally not at all.

And to be clear, I am not against removing lazy FP context switching
per se. I am just emphasising the need to be careful with that and to
be absolutely sure that it does not cause excessive harm.

I still wonder why Intel hasn't addressed some issues around this stuff
-- is it that there are not enough people using proper IEEE 754
arithmetic on x86 hardware to attract the interest of hardware
architecture maintainers? After all, the same issue applies to enabled
IEEE 754 exceptions: a #MF/#XM exception isn't going to take any less
time than a #NM fault. Or maybe I'm just missing something?

Maciej
Andy Lutomirski
2015-02-24 02:40:01 UTC
Permalink
Post by Maciej W. Rozycki
Post by Andy Lutomirski
Post by Maciej W. Rozycki
Post by Rik van Riel
After a context switch, the instructions from the old task are no
longer in the pipeline.
I'd say it's implementation-specific. As I mentioned the i486 aborted
any transcendental x87 instruction in progress upon taking an exception or
interrupt. That was a model like you refer to, but as I also mentioned it
had its shortcomings.
IRET is serializing, according to the docs (I think) and according
to the Intel engineers I asked (I'm absolutely certain about this
part). So FPU ops are entirely done at the end of a normal context
switch.
No question about the serialising property of IRET, it has been like this
since the original Pentium implementation. Do you have an architecture
specification reference to back up your claim though as far as the FPU is
concerned? I'm asking because I am genuinely curious.
The x87 case is so special, there isn't anything there really that is
externally observable or should be affected by IRET or any other
synchronisation barriers apart from WAIT (or a waiting x87 instruction)
that has been there for this purpose since forever. And it would defeat
some documented benefits of running the FP pipeline in parallel.
It's plausible that this is special, but I doubt it. Especially since
this optimization would be nuts post-SSE2.
Post by Maciej W. Rozycki
And certainly such synchronisation didn't happen in the old days.
Post by Andy Lutomirski
We also always save the FPU context on every context switch away from
a task that used the FPU, even in lazy mode. This is because we might
switch the task back in on a different CPU, and we don't want to use
an IPI to move the FPU context.
That's an interesting case too, although not necessarily related. If you
say that we always save the FP context eagerly for the purpose of process
migration, then sure, that invalidates any benefit we'd have from letting
the x87 proceed.
However I can see different ways to address this case, avoiding the
need for eager FP context saving or an IPI:
1. We could bind any currently suspended process with an unsaved FP
context to the CPU it last executed on.
This would be insane.
Post by Maciej W. Rozycki
2. We could mark such a process for migration next time and let it execute
on the CPU that holds its FP context once more, and then save the FP
context eagerly on the way out.
This would be worse than insane. In order to wake such a process on a
different CPU, we'd have to force a *context switch* on the source CPU.
We'd be replacing a few hundred cycles at worst for a transcendental
function with at least 10k cycles (at a guess), and possibly many
orders of magnitude more if locks are held, plus priority issues, plus
total craziness.
Post by Maciej W. Rozycki
In some cases a lazily retained FP context would be preempted by another
process before the process in question would resume anyway. In this case
any temporary binding to a CPU could be given up.
Post by Andy Lutomirski
Given that we're only talking about old CPUs here, I sincerely doubt
that there's any relevant case in which an fxsave can usefully wait
for a long-running transcendental op to finish while we continue doing
useful work. *Especially* since there will almost certainly be
several more mfences or atomic ops before the end of the context
switch, even if we're lucky enough to complete the context switching
using sysret.
I am not sure what you mean by FXSAVE usefully waiting for an op, please
elaborate. At the point you've reached FXSAVE and an earlier x87
instruction hasn't completed, you've already lost. The pipeline will be
stalled until the x87 instruction has completed and it can be hundreds of
cycles. My point therefore has been about avoiding to execute FXSAVE for
the old task until absolutely necessary, that with the lazy FP context
switching would be at the next x87 (or SSE) instruction reached by the new
task.
Likewise I don't see why MFENCE or an atomic operation should affect the
execution of say FSINCOS. Whether the results of FSINCOS arrive before
or after MFENCE, etc. are not externally observable.
FSINCOS; FXSAVE; MFENCE had better serialize all the way, no matter
what weird architectural crud is going on.
Post by Maciej W. Rozycki
And I'm not sure if this all affects old CPUs only -- I don't know how
much x87 software is out there, but after all these years I'd expect quite
some. Sure, lots of this can be recompiled to use SSE instead, but not
all, and even where it is feasible, that's an extra burden for people,
beyond say a routine hardware or Linux distribution or for that matter
lone kernel upgrade. Therefore I think we need to be careful not to
pessimise things for a subset of people too much and ideally at all.
And to be clear, I am not against removing lazy FP context switching per
se. I am just emphasizing to be careful with that and be absolutely sure
that it does not cause excessive harm.
We're talking about the unusual case in which we context switch within
~100 cycles of a legacy transcendental operation, and, even so,
there's *still* no regression, since we don't optimize this case
today.

--Andy
Rik van Riel
2015-02-24 14:50:01 UTC
Permalink
On Mon, Feb 23, 2015 at 6:14 PM, Maciej W. Rozycki
Post by Maciej W. Rozycki
That's an interesting case too, although not necessarily related.
If you say that we always save the FP context eagerly for the
purpose of process migration, then sure, that invalidates any
benefit we'd have from letting the x87 proceed.
However I can see different ways to address this case, avoiding the
need for eager FP context saving or an IPI:
1. We could bind any currently suspended process with an unsaved
FP context to the CPU it last executed on.
This would be insane.
The task would only be bound to the CPU as long as nothing else
ran that saved the FPU state to memory. This means the task
would be bound to the CPU while the CPU is idle, or running a
kernel thread. When another user space thread is about to run,
the FPU state would be saved.

This sounds slightly less insane already.

Of course, once you factor in CPU power saving and CPU hotplug, the
level of insanity is far beyond what you seem to imply above...

In other words, almost certainly not worth doing :)

--
All rights reversed
Ingo Molnar
2015-02-21 18:40:01 UTC
Permalink
Post by Borislav Petkov
plain 3.19:
234.681331200 seconds time elapsed ( +- 0.15% )
eagerfpu=ENABLE
234.066525648 seconds time elapsed ( +- 0.19% )
hm, a win of more than 600 milliseconds - more than I'd
have expected.

I really want to believe this, but I worry about systematic
boot-to-boot noise, which cannot be eliminated via --repeat 10.

Would it be possible to add a simple runtime debug switch
to switch between the two FPU modes dynamically via a
sysctl?

It should not crash tasks, I think, if it's flipped on a reasonably
idle system, so it should allow much better apples-to-apples
comparisons with the same kind of page cache layout, etc.
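
Something along these lines, perhaps (a completely untested sketch
with made-up names -- the real work would be making the context-switch
code consult the flag safely):

	#include <linux/init.h>
	#include <linux/sysctl.h>

	static int sysctl_eagerfpu = 1;	/* read by the FPU switch path */
	static int zero;
	static int one = 1;

	static struct ctl_table eagerfpu_table[] = {
		{
			.procname	= "x86_eagerfpu",
			.data		= &sysctl_eagerfpu,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec_minmax,
			.extra1		= &zero,
			.extra2		= &one,
		},
		{ }
	};

	static int __init eagerfpu_sysctl_init(void)
	{
		register_sysctl("kernel", eagerfpu_table);
		return 0;
	}
	late_initcall(eagerfpu_sysctl_init);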

Thanks,

Ingo
Oleg Nesterov
2015-02-23 15:10:01 UTC
Permalink
304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
The result is rather messy. There are two code paths in almost all of the
FPU code, and only one of them (the eager case) is tested frequently, since
most kernel developers have new enough hardware that we use eagerfpu.
glibc uses SSE2, so laziness is probably overoptimistic, and, in any
case, manipulating TS is far slower than saving and restoring the full
state.
To try to shake out any latent issues on old hardware, this changes
the default to eager on all CPUs. If no performance or functionality
problems show up, a subsequent patch could remove lazy mode entirely.
---
arch/x86/kernel/xsave.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 0de1fae2bdf0..1928c8f34ce5 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -637,8 +637,8 @@ static void __init xstate_enable_boot_cpu(void)
prepare_fx_sw_frame();
setup_init_fpu_buf();
- /* Auto enable eagerfpu for xsaveopt */
- if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+ /* Auto enable eagerfpu for everyone */
+ if (eagerfpu != DISABLE)
eagerfpu = ENABLE;
Well, but if we want this change then perhaps we should simply change
the default value? This way "AUTO" can still work.

Oleg.

--- x/arch/x86/kernel/xsave.c
+++ x/arch/x86/kernel/xsave.c
@@ -563,7 +563,7 @@ static void __init setup_init_fpu_buf(vo
xsave_state_booting(init_xstate_buf, -1);
}

-static enum { AUTO, ENABLE, DISABLE } eagerfpu = AUTO;
+static enum { AUTO, ENABLE, DISABLE } eagerfpu = ENABLE;
static int __init eager_fpu_setup(char *s)
{
if (!strcmp(s, "on"))

Borislav Petkov
2015-02-23 15:20:02 UTC
Permalink
Post by Oleg Nesterov
Well, but if we want this change then perhaps we should simply change
the default value? This way "AUTO" still can work.
Yeah, sure, let's do some measurements first, to see whether this is
even worth it.

Btw, Mel pointed me at some cleanups he did a while ago in that area
too: https://lkml.org/lkml/2014/8/6/201

If you guys could take a look, that would be awesome :)
--
Regards/Gruss,
Boris.

Rik van Riel
2015-02-23 16:00:02 UTC
Permalink
Post by Borislav Petkov
Post by Oleg Nesterov
Well, but if we want this change then perhaps we should simply
change the default value? This way "AUTO" still can work.
Yeah, sure, let's do some measurements first, to see whether this
is even worth it.
Btw, Mel pointed me at some cleanups he did a while ago in that
area too: https://lkml.org/lkml/2014/8/6/201
If you guys could take a look, that would be awesome :)
I don't think any of those changes would survive the
"defer FPU loading to do_notify_resume" patch series.

--
All rights reversed
Oleg Nesterov
2015-02-23 18:50:01 UTC
Permalink
Post by Rik van Riel
Post by Borislav Petkov
Post by Oleg Nesterov
Well, but if we want this change then perhaps we should simply
change the default value? This way "AUTO" still can work.
Yeah, sure, let's do some measurements first, to see whether this
is even worth it.
Btw, Mel pointed me at some cleanups he did a while ago in that
area too: https://lkml.org/lkml/2014/8/6/201
If you guys could take a look, that would be awesome :)
I don't think any of those changes would survive the
"defer FPU loading to do_notify_resume" patch series.
Agreed, at least

+static inline void switch_eagerfpu_prepare(struct task_struct *old, struct task_struct *new, int cpu,
+ fpu_switch_t *fpu)
+{
+ fpu->preload = tsk_used_math(new);

doesn't look right if we change the rules.

OTOH, perhaps it makes sense to redo these cleanups later.

Oleg.

Denys Vlasenko
2015-02-24 19:20:01 UTC
Permalink
304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
The result is rather messy. There are two code paths in almost all of the
FPU code, and only one of them (the eager case) is tested frequently, since
most kernel developers have new enough hardware that we use eagerfpu.
glibc uses SSE2, so laziness is probably overoptimistic, and, in any
case, manipulating TS is far slower than saving and restoring the full
state.
To try to shake out any latent issues on old hardware, this changes
the default to eager on all CPUs. If no performance or functionality
problems show up, a subsequent patch could remove lazy mode entirely.
I'm a big fan of simplifying things, but.

SIMD registers have been growing on x86, and they are going to grow
again, this time four-fold in Intel MIC: from sixteen 256-bit registers
to thirty-two 512-bit registers.

That's 2 kbytes of data (32 registers x 64 bytes = 2048 bytes). Just
moving this data to/from memory will take some time.

And some people talk about 1024-bit registers already...

Let's not completely remove lazy FPU saving code just yet.
Maybe we'll be forced to reinstate it.
Andy Lutomirski
2015-02-25 00:10:02 UTC
Permalink
On Tue, Feb 24, 2015 at 11:15 AM, Denys Vlasenko
Post by Denys Vlasenko
304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
The result is rather messy. There are two code paths in almost all of the
FPU code, and only one of them (the eager case) is tested frequently, since
most kernel developers have new enough hardware that we use eagerfpu.
glibc uses SSE2, so laziness is probably overoptimistic, and, in any
case, manipulating TS is far slower than saving and restoring the full
state.
To try to shake out any latent issues on old hardware, this changes
the default to eager on all CPUs. If no performance or functionality
problems show up, a subsequent patch could remove lazy mode entirely.
I'm a big fan of simplifying things, but.
SIMD registers have been growing on x86, and they are going to grow
again, this time four-fold in Intel MIC: from sixteen 256-bit registers
to thirty-two 512-bit registers.
That's 2 kbytes of data. Just moving this data out to/from memory
will take some time.
And some people talk about 1024-bit registers already...
Let's not completely remove lazy FPU saving code just yet.
Maybe we'll be forced to reinstate it.
I'd prefer a different partial solution: encourage everyone to clear
the xstate before making syscalls (using e.g. vzeroall). In fact,
maybe user code should aggressively clear newly-unused xstate.
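
For instance (illustrative only; assumes AVX and compiling with -mavx,
and the wrapper name is made up):

	#include <immintrin.h>
	#include <unistd.h>

	ssize_t write_with_clean_xstate(int fd, const void *buf, size_t len)
	{
		/* VZEROALL puts all xmm/ymm registers back into their
		 * init configuration before we enter the kernel.
		 */
		_mm256_zeroall();
		return write(fd, buf, len);
	}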

--Andy
Borislav Petkov
2015-02-25 10:40:01 UTC
Permalink
Post by Andy Lutomirski
I'd prefer a different partial solution: encourage everyone to clear
the xstate before making syscalls (using e.g. vzeroall). In fact,
maybe user code should aggressively clear newly-unused xstate.
We don't trust userspace.
--
Regards/Gruss,
Boris.

Ingo Molnar
2015-02-25 11:00:02 UTC
Permalink
Post by Borislav Petkov
Post by Andy Lutomirski
I'd prefer a different partial solution: encourage
everyone to clear the xstate before making syscalls
(using e.g. vzeroall). In fact, maybe user code should
aggressively clear newly-unused xstate.
We don't trust userspace.
We certainly don't, but in this case it's a performance optimization
detail: if user-space clears unused xstate in a way that the CPU
recognizes (for example VZEROALL), then it might get faster context
switches, e.g. via XSAVEOPT's init optimization, which skips writing
state components that are in their initial configuration.

Thanks,

Ingo
Borislav Petkov
2015-02-25 17:20:02 UTC
Permalink
- /* Auto enable eagerfpu for xsaveopt */
- if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+ /* Auto enable eagerfpu for everyone */
+ if (eagerfpu != DISABLE)
eagerfpu = ENABLE;
So Mel ran some measurements with it on an old Intel box and an old
AMD box. Attached are netperf results from both. The Intel box is a
quad core very similar to this one:

http://ark.intel.com/products/28020/Intel-Xeon-Processor-5063-4M-Cache-3_20-GHz-1066-MHz-FSB

netperf-udp-rr shows some impact from the eagerfpu patch which is
outside the noise level. netperf-tcp-rr not so much, but eager is still
a bit behind.

The AMD box is an old K8 and the results there look like eager is
better :-) Not across the board though - pipetest is worse.

netperf-udp-rr is better at almost every data point and tcp-rr looks a
bit useless with those high noise levels.

As a summary, the patch has some, albeit small, impact. We would need
more benchmarks. We have one speccpu run which is currently taking
forever to finish...
--
Regards/Gruss,
Boris.
