Discussion:
[RFC PATCH 0/5] GHES NMI handler cleanup
Borislav Petkov
2015-03-27 09:30:01 UTC
Permalink
From: Borislav Petkov <***@suse.de>

So this patchset is the result of us seeing this while debugging a
customer issue:

[ 118.113136] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.005 msecs

Looking at that NMI handler, it could use a good scrubbing as it has
grown some needless fat. So let's try it.

The first four patches simplify and clean it up, and the last one makes
the bold move of having a single CPU read the status, based on the
not-100-percent-confirmed observation that GHES error sources are global
in the firmware glue and thus only one reader suffices to see all
errors.

This last thing still needs to be confirmed but I'm sending the patches
now so that people can look at them and poke holes. Thus the RFC tag.

Thanks.

Borislav Petkov (4):
GHES: Carve out error queueing in a separate function
GHES: Carve out the panic functionality
GHES: Panic right after detection
GHES: Eliminate double-loop in the NMI handler

Jiri Kosina (1):
GHES: Make NMI handler have a single reader

drivers/acpi/apei/ghes.c | 108 ++++++++++++++++++++++++-----------------------
1 file changed, 55 insertions(+), 53 deletions(-)
--
2.3.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Borislav Petkov
2015-03-27 09:30:02 UTC
Permalink
From: Borislav Petkov <***@suse.de>

There's no real need to iterate twice over the HW error sources in the
NMI handler. With the previous cleanups, eliminating the second loop is
almost trivial.

Signed-off-by: Borislav Petkov <***@suse.de>
---
drivers/acpi/apei/ghes.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 0de3adcca03e..94a44bad5576 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -851,25 +851,18 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
if (sev >= GHES_SEV_PANIC)
__ghes_panic(ghes);

- ret = NMI_HANDLED;
- }
-
- if (ret == NMI_DONE)
- goto out;
-
- list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
if (!(ghes->flags & GHES_TO_CLEAR))
continue;

__process_error(ghes);
ghes_clear_estatus(ghes);
+
+ ret = NMI_HANDLED;
}

#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
irq_work_queue(&ghes_proc_irq_work);
#endif
-
-out:
raw_spin_unlock(&ghes_nmi_lock);
return ret;
}
--
2.3.3

Borislav Petkov
2015-03-27 09:30:02 UTC
Permalink
From: Borislav Petkov <***@suse.de>

... into another function for more clarity. No functionality change.

Signed-off-by: Borislav Petkov <***@suse.de>
---
drivers/acpi/apei/ghes.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index fe1e41bf5609..712ed95b1dca 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -823,6 +823,18 @@ static void __process_error(struct ghes *ghes)
#endif
}

+static void __ghes_panic(struct ghes *ghes)
+{
+ oops_begin();
+ ghes_print_queued_estatus();
+ __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
+
+ /* reboot to log the error! */
+ if (panic_timeout == 0)
+ panic_timeout = ghes_panic_timeout;
+ panic("Fatal hardware error!");
+}
+
static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
{
struct ghes *ghes, *ghes_global = NULL;
@@ -846,16 +858,8 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
if (ret == NMI_DONE)
goto out;

- if (sev_global >= GHES_SEV_PANIC) {
- oops_begin();
- ghes_print_queued_estatus();
- __ghes_print_estatus(KERN_EMERG, ghes_global->generic,
- ghes_global->estatus);
- /* reboot to log the error! */
- if (panic_timeout == 0)
- panic_timeout = ghes_panic_timeout;
- panic("Fatal hardware error!");
- }
+ if (sev_global >= GHES_SEV_PANIC)
+ __ghes_panic(ghes_global);

list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
if (!(ghes->flags & GHES_TO_CLEAR))
--
2.3.3

Borislav Petkov
2015-03-27 09:30:03 UTC
Permalink
From: Jiri Kosina <***@suse.cz>

Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.

Do that.

Signed-off-by: Jiri Kosina <***@suse.cz>
Signed-off-by: Borislav Petkov <***@suse.de>
---
drivers/acpi/apei/ghes.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 94a44bad5576..2bfd53cbfe80 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -729,10 +729,10 @@ static struct llist_head ghes_estatus_llist;
static struct irq_work ghes_proc_irq_work;

/*
- * NMI may be triggered on any CPU, so ghes_nmi_lock is used for
- * mutual exclusion.
+ * NMI may be triggered on any CPU, so ghes_in_nmi is used for
+ * having only one concurrent reader.
*/
-static DEFINE_RAW_SPINLOCK(ghes_nmi_lock);
+static atomic_t ghes_in_nmi = ATOMIC_INIT(0);

static LIST_HEAD(ghes_nmi);

@@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
struct ghes *ghes;
int sev, ret = NMI_DONE;

- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
if (ghes_read_estatus(ghes, 1)) {
ghes_clear_estatus(ghes);
@@ -863,7 +865,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
irq_work_queue(&ghes_proc_irq_work);
#endif
- raw_spin_unlock(&ghes_nmi_lock);
+ atomic_dec(&ghes_in_nmi);
return ret;
}
--
2.3.3

Jiri Kosina
2015-04-01 07:50:03 UTC
Permalink
Post by Borislav Petkov
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
I originally wasn't 100% sure whether GHES sources are global (i.e. if it
really doesn't matter which CPU is reading the registers), but looking at
the code more it actually seems that this is really the right thing to do.

Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page
with the registers, performs apei_read() (which is ghes-source specific,
but not CPU-specific) and unmaps the page again.

There is nothing that would make this CPU-specific. Adding Huang Ying (the
original author of the code) to confirm this. Huang?
Post by Borislav Petkov
Do that.
I think this should indeed be pushed forward. It fixes horrible spinlock
contention on systems which are under NMI storm (such as when perf is
active) unrelated to GHES.

Thanks,
--
Jiri Kosina
SUSE Labs
Borislav Petkov
2015-04-01 14:00:03 UTC
Permalink
Post by Jiri Kosina
Post by Borislav Petkov
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
I originally wasn't 100% sure whether GHES sources are global (i.e. if it
really doesn't matter which CPU is reading the registers), but looking at
the code more it actually seems that this is really the right thing to do.
Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page
with the registers, performs apei_read() (which is ghes-source specific,
but not CPU-specific) and unmaps the page again.
There is nothing that would make this CPU-specific. Adding Huang Ying (the
original author of the code) to confirm this. Huang?
Post by Borislav Petkov
Do that.
I think this should indeed be pushed forward. It fixes horrible spinlock
contention on systems which are under NMI storm (such as when perf is
active) unrelated to GHES.
Right, so I tested injecting an error without your patch and same
behavior. So it all points at global sources AFAICT. It would be cool,
though, if someone who knows the fw confirms unambiguously.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-04-23 09:00:01 UTC
Permalink
Three weeks have passed, therefore I find this an appropriate time for a
friendly ping :)
Rafael? Naoya? Huang?
This fixes a spinlock contention problem in NMI context observed on real HW, so
it would be really nice to have it fixed.
I think we should apply this.

Here's why: nothing in the ghes_notify_nmi() handler does CPU-specific
accesses. It iterates over the list of ghes sources which do NMI
notification but those sources are the *same* regardless of which core
does the access as their addresses are per-source, i.e. in that

struct acpi_generic_address error_status_address;

thing.

And it is a safe bet to say that all that error information is
serialized in the firmware for the error source to consume.

So I'm going to route this through the RAS tree unless Rafael wants to
take it.

Ok?

Tony, objections?
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Luck, Tony
2015-04-23 18:10:02 UTC
Permalink
Post by Borislav Petkov
I think we should apply this.
Here's why: nothing in the ghes_notify_nmi() handler does CPU-specific
accesses
This looks to be true.
Post by Borislav Petkov
Tony, objections?
No objections.

-Tony
Borislav Petkov
2015-04-27 20:30:02 UTC
Permalink
Post by Luck, Tony
Post by Borislav Petkov
I think we should apply this.
Here's why: nothing in the ghes_notify_nmi() handler does CPU-specific
accesses
This looks to be true.
Post by Borislav Petkov
Tony, objections?
No objections.
Thanks, queued for 4.2, pending one last test I'm doing on a machine
which says

[ 24.332560] INFO: NMI handler (ghes_notify_nmi) took too long to run: 3.265 msecs
[ 24.332567] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.946 msecs
[ 24.332568] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.948 msecs

during boot.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Don Zickus
2015-04-28 14:40:03 UTC
Permalink
Post by Jiri Kosina
Post by Borislav Petkov
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
I originally wasn't 100% sure whether GHES sources are global (i.e. if it
really doesn't matter which CPU is reading the registers), but looking at
the code more it actually seems that this is really the right thing to do.
Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page
with the registers, performs apei_read() (which is ghes-source specific,
but not CPU-specific) and unmaps the page again.
There is nothing that would make this CPU-specific. Adding Huang Ying (the
original author of the code) to confirm this. Huang?
Hi,

I believe the answer to this question is no, they are not global but
instead external. All external NMIs are routed to one cpu, normally cpu0.
This spinlock was made global to handle the 'someday' case of hotplugging
the bsp cpu (cpu0).

The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
problem. I tried using an irq_workqueue to work around quirks there and
PeterZ wasn't very fond of it (though he couldn't think of a better way to
solve it).

This patch seems interesting but you might still run into the thundering
herd problem with the global spinlock in
arch/x86/kernel/nmi.c::default_do_nmi(). That function grabs a global
spinlock before processing the external NMI list (which GHES is a part of).


So I am assuming this patch solves the 'thundering herd' problem by
minimizing all the useless writes the spinlock would do for each cpu that
noticed it had no work to do?

In that case, I am in favor of this solution and would like to apply a
similar solution to arch/x86/kernel/nmi.c, to see if that helps there too.

Cheers,
Don
Post by Jiri Kosina
Post by Borislav Petkov
Do that.
I think this should indeed be pushed forward. It fixes horrible spinlock
contention on systems which are under NMI storm (such as when perf is
active) unrelated to GHES.
Thanks,
--
Jiri Kosina
SUSE Labs
Don Zickus
2015-04-28 14:50:03 UTC
Permalink
Post by Don Zickus
Post by Jiri Kosina
Post by Borislav Petkov
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
I originally wasn't 100% sure whether GHES sources are global (i.e. if it
really doesn't matter which CPU is reading the registers), but looking at
the code more it actually seems that this is really the right thing to do.
Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page
with the registers, performs apei_read() (which is ghes-source specific,
but not CPU-specific) and unmaps the page again.
There is nothing that would make this CPU-specific. Adding Huang Ying (the
original author of the code) to confirm this. Huang?
Hi,
I believe the answer to this question is no, they are not global but
instead external. All external NMIs are routed to one cpu, normally cpu0.
This spinlock was made global to handle the 'someday' case of hotplugging
the bsp cpu (cpu0).
The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
problem. I tried using an irq_workqueue to work around quirks there and
PeterZ wasn't very fond of it (though he couldn't think of a better way to
solve it).
This patch seems interesting but you might still run into the thundering
herd problem with the global spinlock in
arch/x86/kernel/nmi.c::default_do_nmi(). That functions grabs a global
spinlock before processing the external NMI list (which GHES is a part of).
Grr, I misspoke. I sent a patchset a year ago to split out internal and
external NMIs to simplify the problem. So I wrote the above paragraph
thinking the GHES NMI handler was wrapped with the external NMI spinlock,
when in fact it isn't. However, with perf running with lots of events, it
is possible to start 'swallowing' NMIs which requires passing through the
spinlock I just mentioned. This might cause random delays in your
measurements and is still worth modifying.

Cheers,
Don
Post by Don Zickus
So I am assuming this patch solves the 'thundering herd' problem by
minimizing all the useless writes the spinlock would do for each cpu that
noticed it had no work to do?
In that case, I am in favor of this solution and would like to apply a
similar solution to arch/x86/kernel/nmi.c, to see if that helps there too.
Cheers,
Don
Post by Jiri Kosina
Post by Borislav Petkov
Do that.
I think this should indeed be pushed forward. It fixes horrible spinlock
contention on systems which are under NMI storm (such as when perf is
active) unrelated to GHES.
Thanks,
--
Jiri Kosina
SUSE Labs
Borislav Petkov
2015-04-28 15:00:01 UTC
Permalink
Post by Don Zickus
Post by Jiri Kosina
Post by Borislav Petkov
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
I originally wasn't 100% sure whether GHES sources are global (i.e. if it
really doesn't matter which CPU is reading the registers), but looking at
the code more it actually seems that this is really the right thing to do.
Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page
with the registers, performs apei_read() (which is ghes-source specific,
but not CPU-specific) and unmaps the page again.
There is nothing that would make this CPU-specific. Adding Huang Ying (the
original author of the code) to confirm this. Huang?
Hi,
I believe the answer to this question is no, they are not global but
instead external. All external NMIs are routed to one cpu, normally cpu0.
Actually, the things we're talking about here are not global NMIs but
global error sources. I.e., the GHES sources which use an NMI to report
errors with and which we noodle over in ghes_notify_nmi() twice:

list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
...
Post by Don Zickus
This spinlock was made global to handle the 'someday' case of hotplugging
the bsp cpu (cpu0).
The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
problem. I tried using an irq_workqueue to work around quirks there and
PeterZ wasn't very fond of it (though he couldn't think of a better way to
solve it).
This patch seems interesting but you might still run into the thundering
herd problem with the global spinlock in
arch/x86/kernel/nmi.c::default_do_nmi(). That functions grabs a global
spinlock before processing the external NMI list (which GHES is a part of).
nmi_reason_lock? And our handler is registered with
register_nmi_handler() and so we iterate over it in nmi_handle(). It'd
be interesting to know what NMI reason we get for those GHES NMIs so
that we know whether nmi_handle() is called under the lock or not...
Post by Don Zickus
So I am assuming this patch solves the 'thundering herd' problem by
minimizing all the useless writes the spinlock would do for each cpu that
noticed it had no work to do?
Not that spinlock. It used to take another one in ghes_notify_nmi().
Removing it solved the issue. There were even boxes which would do:

[ 24.332560] INFO: NMI handler (ghes_notify_nmi) took too long to run: 3.265 msecs
[ 24.332567] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.946 msecs
[ 24.332568] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.948 msecs

just like that during boot.

It was basically a lot of purely useless lock grabbing for no reason
whatsoever.

Thanks.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Don Zickus
2015-04-28 15:40:01 UTC
Permalink
Post by Borislav Petkov
Post by Don Zickus
Post by Jiri Kosina
Post by Borislav Petkov
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
I originally wasn't 100% sure whether GHES sources are global (i.e. if it
really doesn't matter which CPU is reading the registers), but looking at
the code more it actually seems that this is really the right thing to do.
Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page
with the registers, performs apei_read() (which is ghes-source specific,
but not CPU-specific) and unmaps the page again.
There is nothing that would make this CPU-specific. Adding Huang Ying (the
original author of the code) to confirm this. Huang?
Hi,
I believe the answer to this question is no, they are not global but
instead external. All external NMIs are routed to one cpu, normally cpu0.
Actually, the things we're talking about here are not global NMIs but
global error sources. I.e., the GHES sources which use an NMI to report
list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
Sure, most systems use NMI to report them I believe (at least the ones I
have played with). Regardless, I have hit performance problems on this lock
while doing perf testing.

I tried to work around it by splitting the NMI list into an
nmi_register(INTERNAL,...) and nmi_register(EXTERNAL,...) to allow perf to
avoid hitting this lock. But the fallout of that change included missed
NMIs and various solutions from that resulted in complexities that blocked
inclusion last year.

Your solution seems much simpler. :-)
Post by Borislav Petkov
...
Post by Don Zickus
This spinlock was made global to handle the 'someday' case of hotplugging
the bsp cpu (cpu0).
The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
problem. I tried using an irq_workqueue to work around quirks there and
PeterZ wasn't very fond of it (though he couldn't think of a better way to
solve it).
This patch seems interesting but you might still run into the thundering
herd problem with the global spinlock in
arch/x86/kernel/nmi.c::default_do_nmi(). That functions grabs a global
spinlock before processing the external NMI list (which GHES is a part of).
nmi_reason_lock? And our handler is registered with
register_nmi_handler() and so we iterate over it in nmi_handle(). It'd
be interesting to know what NMI reason we get for those GHES NMIs so
that we know whether nmi_handle() is called under the lock or not...
I followed up in another email stating I misspoke. I forgot this still
uses the NMI_LOCAL shared NMI. So every perf NMI will also call the GHES
handler to make sure NMIs did not piggy back each other. So I don't believe
the NMI reason lock is called a majority of the time (except when the NMI is
swallowed, but that is under heavy perf load...).
Post by Borislav Petkov
Post by Don Zickus
So I am assuming this patch solves the 'thundering herd' problem by
minimizing all the useless writes the spinlock would do for each cpu that
noticed it had no work to do?
Not that spinlock. It used to take another one in ghes_notify_nmi().
[ 24.332560] INFO: NMI handler (ghes_notify_nmi) took too long to run: 3.265 msecs
[ 24.332567] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.946 msecs
[ 24.332568] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.948 msecs
just like that during boot.
It was basically a lot of purely useless lock grabbing for no reason
whatsoever.
Yes, I meant the 'raw_spin_lock(&ghes_nmi_lock);' one. I poorly typed my
email and confused you into thinking I was referring to the wrong one.

Well, it isn't exactly no reason, but more along the lines of 'we do not
know who gets the external NMI, so everyone tries to talk with the GHES
firmware'. But I think that is just me squabbling.

We both agree the mechanics of the spinlock are overkill here and cause much
cache contention. Simplifying it to just 'reads' and return removes most of
the problem.

Cheers,
Don
Borislav Petkov
2015-04-28 16:30:01 UTC
Permalink
Post by Don Zickus
Your solution seems much simpler. :-)
... and I love simpler :-)
Post by Don Zickus
I followed up in another email stating I mis-spoke. I forgot this still
uses the NMI_LOCAL shared NMI. So every perf NMI, will also call the GHES
handler to make sure NMIs did not piggy back each other. So I don't believe
And this is something we should really fix - perf and RAS should
not have anything to do with each other. But I don't know the NMI
code to even have an idea how. I don't even know whether we can
differentiate NMIs, hell, I can't imagine the hardware giving us a
different NMI reason through get_nmi_reason(). Maybe that byte returned
from NMI_REASON_PORT is too small and hangs on too much legacy crap to
even be usable. Questions over questions...
Post by Don Zickus
the NMI reason lock is called a majority of the time (except when the NMI is
swallowed, but that is under heavy perf load...).
..
Post by Don Zickus
We both agree the mechanics of the spinlock are overkill here and cause much
cache contention. Simplifying it to just 'reads' and return removes most of
the problem.
Right.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Don Zickus
2015-04-28 18:50:01 UTC
Permalink
Post by Borislav Petkov
Post by Don Zickus
Your solution seems much simpler. :-)
... and I love simpler :-)
Post by Don Zickus
I followed up in another email stating I misspoke. I forgot this still
uses the NMI_LOCAL shared NMI. So every perf NMI will also call the GHES
handler to make sure NMIs did not piggy back each other. So I don't believe
And this is something we should really fix - perf and RAS should
not have anything to do with each other. But I don't know the NMI
code to even have an idea how. I don't even know whether we can
differentiate NMIs, hell, I can't imagine the hardware giving us a
different NMI reason through get_nmi_reason(). Maybe that byte returned
from NMI_REASON_PORT is too small and hangs on too much legacy crap to
even be usable. Questions over questions...
:-) Well, let me first clear up some of your questions.

RAS doesn't go through the legacy ports (i.e. get_nmi_reason()). Instead it
triggers the external NMI through a different bit (ioapic I think).

The nmi code has no idea what io_remap'ed address apei is using to map its
error handling register that GHES uses. Unlike the legacy port which is
always port 0x61.

So NMI is basically a shared interrupt, with no ability to discern who
sent it (and, even worse, no ability to know how _many_ were sent, as
the NMI is edge-triggered instead of level-triggered). As a result we rely
on the NMI handlers to talk to their address space/registers to determine if
they were the source of the interrupt.

Now I can agree that perf and RAS have nothing to do with each other, but
they both use NMI to interrupt. Perf is fortunate enough to be internal to
each cpu and therefore needs no global lock unlike GHES (hence part of the
problem).

The only way to determine who sent the NMI is to have each handler read its
register, which is time consuming for GHES.

Of course, we could go back to playing tricks knowing that external NMIs
like GHES and IO_CHECK/SERR are only routed to one cpu (cpu0 mainly) and
optimize things that way, but that inhibits the bsp cpu hotplugging folks.



I also played tricks like last year's patchset that split out the
nmi_handlers into LOCAL and EXTERNAL queues. Perf would be part of the
LOCAL queue while GHES was part of the EXTERNAL queue. The thought was to
never touch the EXTERNAL queue if perf claimed an NMI. This led to all
sorts of missed external NMIs, so it didn't work out.


Anyway, any ideas or thoughts for improvement are always welcomed. :-)


Cheers,
Don
Post by Borislav Petkov
Post by Don Zickus
the NMI reason lock is called a majority of the time (except when the NMI is
swallowed, but that is under heavy perf load...).
..
Post by Don Zickus
We both agree the mechanics of the spinlock are overkill here and cause much
cache contention. Simplifying it to just 'reads' and return removes most of
the problem.
Right.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-05-04 15:50:01 UTC
Permalink
Post by Don Zickus
RAS doesn't go through the legacy ports (ie get_nmi_reason()). Instead it
triggers the external NMI through a different bit (ioapic I think).
Well, I see it getting registered with __register_nmi_handler() which
adds it to the NMI_LOCAL type, i.e., ghes_notify_nmi() gets called by

default_do_nmi
|-> nmi_handle(NMI_LOCAL, regs, b2b);

AFAICT.

Which explains also the issue we were seeing as that handler is called
on each NMI, even when the machine is running a perf workload.
Post by Don Zickus
The nmi code has no idea what io_remap'ed address apei is using to map its
error handling register that GHES uses. Unlike the legacy port which is
always port 0x61.
So, with NMI being basically a shared interrupt, with no ability to discern
who sent the interrupt (and even worse no ability to know how _many_ were sent as
the NMI is edge triggered instead of level triggered). As a result we rely
on the NMI handlers to talk to their address space/registers to determine if
they were they source of the interrupt.
I was afraid it would be something like that. We probably should poke hw
people to extend that NMI fun so that we can know who caused it.

<snip stuff I agree with>
Post by Don Zickus
Anyway, any ideas or thoughts for improvement are always welcomed. :-)
Yeah, I'm afraid without hw support, that won't be doable. We need the
hw to tell us who caused the NMI. Otherwise we'll be round-robining
(:-)) through handlers like nuts.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Zheng, Lv
2015-04-27 03:20:01 UTC
Permalink
Hi,
Sent: Friday, March 27, 2015 5:23 PM
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
Do that.
---
drivers/acpi/apei/ghes.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 94a44bad5576..2bfd53cbfe80 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -729,10 +729,10 @@ static struct llist_head ghes_estatus_llist;
static struct irq_work ghes_proc_irq_work;
/*
- * NMI may be triggered on any CPU, so ghes_nmi_lock is used for
- * mutual exclusion.
+ * NMI may be triggered on any CPU, so ghes_in_nmi is used for
+ * having only one concurrent reader.
*/
-static DEFINE_RAW_SPINLOCK(ghes_nmi_lock);
+static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
static LIST_HEAD(ghes_nmi);
@@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
struct ghes *ghes;
int sev, ret = NMI_DONE;
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
Just a simple question.
Why not just using cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.

Thanks and best regards
-Lv
list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
if (ghes_read_estatus(ghes, 1)) {
ghes_clear_estatus(ghes);
@@ -863,7 +865,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
irq_work_queue(&ghes_proc_irq_work);
#endif
- raw_spin_unlock(&ghes_nmi_lock);
+ atomic_dec(&ghes_in_nmi);
return ret;
}
--
2.3.3
Borislav Petkov
2015-04-27 08:50:02 UTC
Permalink
Post by Zheng, Lv
Post by Borislav Petkov
@@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
struct ghes *ghes;
int sev, ret = NMI_DONE;
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
Just a simple question.
Why not just using cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.
What do you think atomic_add_unless ends up doing:

#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
671:
lock; cmpxchgl %edx,ghes_in_nmi(%rip) # D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP

And you need to atomic_dec() so that another reader can enter, i.e. how
the exclusion primitive works.

Or did you have something else in mind?
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
Zheng, Lv
2015-04-28 00:50:01 UTC
Permalink
Hi,
Sent: Monday, April 27, 2015 4:47 PM
Post by Zheng, Lv
Post by Borislav Petkov
@@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
struct ghes *ghes;
int sev, ret = NMI_DONE;
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
Just a simple question.
Why not just using cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.
#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
lock; cmpxchgl %edx,ghes_in_nmi(%rip) # D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP
And you need to atomic_dec() so that another reader can enter, i.e. how
the exclusion primitive works.
Or did you have something else in mind?
My mistake.
I mean cmpxchg() and xchg() (or atomic_cmpxchg() and atomic_xchg()) pair here, so nothing can be reduced.

But IMO, atomic_add_unless() is implemented via cmpxchg on many architectures,
and it might be better to use cmpxchg() directly here, which is a bit faster, as you actually only need one value switch here.

Thanks and best regards
-Lv
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Zheng, Lv
2015-04-28 02:30:02 UTC
Permalink
Hi,
From: Zheng, Lv
Sent: Tuesday, April 28, 2015 8:44 AM
Hi,
Sent: Monday, April 27, 2015 4:47 PM
Post by Zheng, Lv
Post by Borislav Petkov
@@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
struct ghes *ghes;
int sev, ret = NMI_DONE;
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
Just a simple question.
Why not just using cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.
#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
lock; cmpxchgl %edx,ghes_in_nmi(%rip) # D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP
And you need to atomic_dec() so that another reader can enter, i.e. how
the exclusion primitive works.
Or did you have something else in mind?
My mistake.
I mean cmpxchg() and xchg() (or atomic_cmpxchg() and atomic_xchg()) pair here, so nothing can be reduced.
Let me correct, it should be atomic_cmpxchg() and atomic_set() here as you only need to switch between 0 and 1.
Sorry for the noise.

Thanks and best regards
-Lv
But IMO, atomic_add_unless() is implemented via cmpxchg on many architectures.
And it might be better to use it directly here which is a bit faster as you actually only need one value switch here.
Thanks and best regards
-Lv
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-04-28 07:40:02 UTC
Permalink
Post by Zheng, Lv
Post by Zheng, Lv
Post by Borislav Petkov
#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
lock; cmpxchgl %edx,ghes_in_nmi(%rip) # D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP
And you need to atomic_dec() so that another reader can enter, i.e. how
the exclusion primitive works.
Or did you have something else in mind?
My mistake.
I mean cmpxchg() and xchg() (or atomic_cmpxchg() and atomic_xchg()) pair here, so nothing can be reduced.
Let me correct, it should be atomic_cmpxchg() and atomic_set() here as you only need to switch between 0 and 1.
Sorry for the noise.
I still don't understand what you want from me here. You need to go into
more detail.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
Zheng, Lv
2015-04-28 13:40:07 UTC
Permalink
Hi,

I was talking about this patch.
Sent: Friday, March 27, 2015 5:23 PM
Subject: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.
Do that.
---
drivers/acpi/apei/ghes.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 94a44bad5576..2bfd53cbfe80 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -729,10 +729,10 @@ static struct llist_head ghes_estatus_llist;
static struct irq_work ghes_proc_irq_work;
/*
- * NMI may be triggered on any CPU, so ghes_nmi_lock is used for
- * mutual exclusion.
+ * NMI may be triggered on any CPU, so ghes_in_nmi is used for
+ * having only one concurrent reader.
*/
-static DEFINE_RAW_SPINLOCK(ghes_nmi_lock);
+static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
static LIST_HEAD(ghes_nmi);
@@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
struct ghes *ghes;
int sev, ret = NMI_DONE;
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
return ret;
list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
if (ghes_read_estatus(ghes, 1)) {
ghes_clear_estatus(ghes);
@@ -863,7 +865,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
irq_work_queue(&ghes_proc_irq_work);
#endif
- raw_spin_unlock(&ghes_nmi_lock);
+ atomic_dec(&ghes_in_nmi);
atomic_set(&ghes_in_nmi, 0);

It seems most of the drivers (under drivers/) are written in this way,
while users of atomic_add_unless() are rare.

Can this work for you?

Thanks and best regards
-Lv
return ret;
}
--
2.3.3
Borislav Petkov
2015-04-28 14:00:03 UTC
Permalink
Post by Zheng, Lv
Post by Borislav Petkov
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
return ret;
Ok, now I understand what you mean.

We absolutely want to use atomic_add_unless() because we get to save us
the expensive

LOCK; CMPXCHG

if the value was already 1. Which is exactly what this patch is trying
to avoid - a thundering herd of cores CMPXCHGing a global variable.

I.e.,

movl ghes_in_nmi(%rip), %ecx # MEM[(const int *)&ghes_in_nmi], c
cmpl $1, %ecx #, c
je .L311 #, <--- exit here if ghes_in_nmi == 1.
leal 1(%rcx), %edx #, D.37163
movl %ecx, %eax # c, c
#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
671:
lock; cmpxchgl %edx,ghes_in_nmi(%rip) # D.37163, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
Zheng, Lv
2015-04-29 00:30:01 UTC
Permalink
Hi,
Sent: Tuesday, April 28, 2015 9:59 PM
Subject: Re: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
Post by Zheng, Lv
Post by Borislav Petkov
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
return ret;
Ok, now I understand what you mean.
We absolutely want to use atomic_add_unless() because we get to save us
the expensive
LOCK; CMPXCHG
if the value was already 1. Which is exactly what this patch is trying
to avoid - a thundering herd of cores CMPXCHGing a global variable.
IMO, on most architectures, the "cmp" part should work just like what you've done with the "if".
And on some architectures, if the "xchg" doesn't happen, the "cmp" part won't even cause a pipeline hazard.

Thanks and best regards
-Lv
I.e.,
movl ghes_in_nmi(%rip), %ecx # MEM[(const int *)&ghes_in_nmi], c
cmpl $1, %ecx #, c
je .L311 #, <--- exit here if ghes_in_nmi == 1.
leal 1(%rcx), %edx #, D.37163
movl %ecx, %eax # c, c
#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
lock; cmpxchgl %edx,ghes_in_nmi(%rip) # D.37163, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Zheng, Lv
2015-04-29 01:00:01 UTC
Permalink
Hi,
From: Zheng, Lv
Sent: Wednesday, April 29, 2015 8:25 AM
Subject: RE: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
Hi,
Sent: Tuesday, April 28, 2015 9:59 PM
Subject: Re: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
Post by Zheng, Lv
Post by Borislav Petkov
- raw_spin_lock(&ghes_nmi_lock);
+ if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+ return ret;
+
if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
return ret;
Ok, now I understand what you mean.
We absolutely want to use atomic_add_unless() because we get to save us
the expensive
LOCK; CMPXCHG
if the value was already 1. Which is exactly what this patch is trying
to avoid - a thundering herd of cores CMPXCHGing a global variable.
IMO, on most architectures, the "cmp" part should work just like what you've done with "if".
And on some architectures, if the "xchg" doesn't happen, the "cmp" part even won't cause a pipe line hazard.
If you mean the LOCK prefix, I understand now.

Thanks and best regards
-Lv
Thanks and best regards
-Lv
I.e.,
movl ghes_in_nmi(%rip), %ecx # MEM[(const int *)&ghes_in_nmi], c
cmpl $1, %ecx #, c
je .L311 #, <--- exit here if ghes_in_nmi == 1.
leal 1(%rcx), %edx #, D.37163
movl %ecx, %eax # c, c
#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
lock; cmpxchgl %edx,ghes_in_nmi(%rip) # D.37163, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-04-29 08:20:02 UTC
Permalink
Post by Zheng, Lv
Post by Zheng, Lv
Post by Borislav Petkov
We absolutely want to use atomic_add_unless() because we get to save us
the expensive
LOCK; CMPXCHG
if the value was already 1. Which is exactly what this patch is trying
to avoid - a thundering herd of cores CMPXCHGing a global variable.
IMO, on most architectures, the "cmp" part should work just like what you've done with "if".
And on some architectures, if the "xchg" doesn't happen, the "cmp" part even won't cause a pipe line hazard.
Even if CMPXCHG is being split into several microops, they all still
need to flow down the pipe and require resources and tracking. And you
only know at retire time what the CMP result is and can "discard" the
XCHG part. Provided the uarch is smart enough to do that.

This is probably why CMPXCHG needs 5,6,7,10,22,... cycles depending on
uarch and vendor, if I can trust Agner Fog's tables. And I bet those
numbers are best-case only and in real-life they probably tend to fall
out even worse.

CMP needs only 1. On almost every uarch and vendor. And even that cycle
probably gets hidden with a good branch predictor.
Post by Zheng, Lv
If you mean the LOCK prefix, I understand now.
And that makes it several times worse: 22, 40, 80, ... cycles.
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
Zheng, Lv
2015-04-30 08:10:01 UTC
Permalink
Hi,
Sent: Wednesday, April 29, 2015 4:14 PM
Subject: Re: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
Post by Zheng, Lv
Post by Zheng, Lv
Post by Borislav Petkov
We absolutely want to use atomic_add_unless() because we get to save us
the expensive
LOCK; CMPXCHG
if the value was already 1. Which is exactly what this patch is trying
to avoid - a thundering herd of cores CMPXCHGing a global variable.
IMO, on most architectures, the "cmp" part should work just like what you've done with "if".
And on some architectures, if the "xchg" doesn't happen, the "cmp" part even won't cause a pipe line hazard.
Even if CMPXCHG is being split into several microops, they all still
need to flow down the pipe and require resources and tracking. And you
only know at retire time what the CMP result is and can "discard" the
XCHG part. Provided the uarch is smart enough to do that.
This is probably why CMPXCHG needs 5,6,7,10,22,... cycles depending on
uarch and vendor, if I can trust Agner Fog's tables. And I bet those
numbers are best-case only and in real-life they probably tend to fall
out even worse.
CMP needs only 1. On almost every uarch and vendor. And even that cycle
probably gets hidden with a good branch predictor.
Are there any such data for SC and LL (on MIPS)?
Post by Zheng, Lv
If you man the LOCK prefix, I understand now.
And that makes several times worse: 22, 40, 80, ... cycles.
I'm OK if the code still keeps the readability then.

Thanks and best regards
-Lv
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Borislav Petkov
2015-04-30 08:50:01 UTC
Permalink
Post by Zheng, Lv
Are there any such data around the SC and LL (MIPS)?
What, you can't search the internet yourself?
--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Zheng, Lv
2015-05-02 00:40:02 UTC
Permalink
Hi,
Sent: Thursday, April 30, 2015 4:49 PM
Post by Zheng, Lv
Are there any such data around the SC and LL (MIPS)?
What, you can't search the internet yourself?
I mean: if LL can do exactly what you did with the "if", and no ABA problem can break the atomic_add_unless() users,
then no atomic_cmpxchg() users should be broken for the same reason either.
So why not put an "if" into atomic_cmpxchg() itself, so that other users get the same benefit, given that data?

But this is out of the context for this patch.
I actually don't have any API usage preference now.

Thanks
-Lv
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--