Discussion:
[PATCH v3 2/2] irqchip/gic-v3-its: Balance initial LPI affinity across CPUs
John Garry
2020-03-16 13:02:48 UTC
When mapping an LPI, the ITS driver picks the first possible
affinity, which is in most cases CPU0, assuming that if
that's not suitable, someone will come and set the affinity
to something more interesting.

That apparently isn't the case, and people complain of poor
performance when many interrupts are glued to the same CPU.

So let's place the interrupts by finding the "least loaded"
CPU (that is, the one that has the fewest LPIs mapped to it).

So-called 'managed' interrupts are an interesting case where
the affinity is actually dictated by the kernel itself, and
we should honor this.
---
drivers/irqchip/irq-gic-v3-its.c | 118 ++++++++++++++++++++++++-------
1 file changed, 92 insertions(+), 26 deletions(-)
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 941786e1e8f7..7f1b731c04bb 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1531,31 +1531,107 @@ static void its_dec_lpi_count(struct irq_data *d, int cpu)
                atomic_dec(&per_cpu_ptr(&cpu_lpi_count, cpu)->unmanaged);
}
+static unsigned int cpumask_pick_least_loaded(struct irq_data *d,
+                                              const struct cpumask *cpu_mask)
+{
+        unsigned int cpu = nr_cpu_ids, tmp;
+        int count = S32_MAX;
+
+        for_each_cpu(tmp, cpu_mask) {
+                int this_count = its_read_lpi_count(d, tmp);

Hi Marc,

Not sure if it's intentional, but now there seems to be a subtle
difference to what Thomas described for non-managed interrupts - for
non-managed interrupts, x86 selects the CPU based on the total interrupt
load per CPU (or, more specifically, lowest vector allocation count),
and not just the non-managed load. Or maybe I misread it.

Anyway, we can test this now for NVMe with its managed interrupts.

Cheers,
John

+                if (this_count < count) {
+                        cpu = tmp;
+                        count = this_count;
+                }
+        }
+
+        return cpu;
+}
+
+/*
+ */
+static int its_select_cpu(struct irq_data *d,
+                          const struct cpumask *aff_mask)
+{
+        struct its_device *its_dev = irq_data_get_irq_chip_data(d);
+        cpumask_var_t tmpmask;
+        int cpu, node;
+
+        if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL))
+                return -ENOMEM;
+
+        node = its_dev->its->numa_node;
+
+        if (!irqd_affinity_is_managed(d)) {
+                /* First try the NUMA node */
+                if (node != NUMA_NO_NODE) {
+                        /*
+                         * Try the intersection of the affinity mask and the
+                         * node mask (and the online mask, just to be safe).
+                         */
+                        cpumask_and(tmpmask, cpumask_of_node(node), aff_mask);
+                        cpumask_and(tmpmask, tmpmask, cpu_online_mask);
+
+                        /* If that doesn't work, try the nodemask itself */
+                        if (cpumask_empty(tmpmask))
+                                cpumask_and(tmpmask, cpumask_of_node(node), cpu_online_mask);
+
+                        cpu = cpumask_pick_least_loaded(d, tmpmask);
+                        if (cpu < nr_cpu_ids)
+                                goto out;
+
+                        /* If we can't cross sockets, give up */
+                        if ((its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144))
+                                goto out;
+
+                        /* If the above failed, expand the search */
+                }
+
+                /* Try the intersection of the affinity and online masks */
+                cpumask_and(tmpmask, aff_mask, cpu_online_mask);
+
+                /* If that doesn't fly, the online mask is the last resort */
+                if (cpumask_empty(tmpmask))
+                        cpumask_copy(tmpmask, cpu_online_mask);
+
+                cpu = cpumask_pick_least_loaded(d, tmpmask);
+        } else {
+                cpumask_and(tmpmask, irq_data_get_affinity_mask(d), cpu_online_mask);
+
+                /* If we cannot cross sockets, limit the search to that node */
+                if ((its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) &&
+                    node != NUMA_NO_NODE)
+                        cpumask_and(tmpmask, tmpmask, cpumask_of_node(node));
+
+                cpu = cpumask_pick_least_loaded(d, tmpmask);
+        }
+out:
+        free_cpumask_var(tmpmask);
+
+        pr_debug("IRQ%d -> %*pbl CPU%d\n", d->irq, cpumask_pr_args(aff_mask), cpu);
+        return cpu;
+}
+
static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
                            bool force)
{
-        unsigned int cpu;
-        const struct cpumask *cpu_mask = cpu_online_mask;
        struct its_device *its_dev = irq_data_get_irq_chip_data(d);
        struct its_collection *target_col;
        u32 id = its_get_event_id(d);
+        int cpu;

        /* A forwarded interrupt should use irq_set_vcpu_affinity */
        if (irqd_is_forwarded_to_vcpu(d))
                return -EINVAL;

-        /* lpi cannot be routed to a redistributor that is on a foreign node */
-        if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) {
-                if (its_dev->its->numa_node >= 0) {
-                        cpu_mask = cpumask_of_node(its_dev->its->numa_node);
-                        if (!cpumask_intersects(mask_val, cpu_mask))
-                                return -EINVAL;
-                }
-        }
-
-        cpu = cpumask_any_and(mask_val, cpu_mask);
+        if (!force)
+                cpu = its_select_cpu(d, mask_val);
+        else
+                cpu = cpumask_pick_least_loaded(d, mask_val);

-        if (cpu >= nr_cpu_ids)
+        if (cpu < 0 || cpu >= nr_cpu_ids)
                return -EINVAL;

        /* don't set the affinity when the target cpu is same as current one */
@@ -3455,21 +3531,11 @@ static int its_irq_domain_activate(struct irq_domain *domain,
{
        struct its_device *its_dev = irq_data_get_irq_chip_data(d);
        u32 event = its_get_event_id(d);
-        const struct cpumask *cpu_mask = cpu_online_mask;
        int cpu;

-        /* get the cpu_mask of local node */
-        if (its_dev->its->numa_node >= 0)
-                cpu_mask = cpumask_of_node(its_dev->its->numa_node);
-
-        /* Bind the LPI to the first possible CPU */
-        cpu = cpumask_first_and(cpu_mask, cpu_online_mask);
-        if (cpu >= nr_cpu_ids) {
-                if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144)
-                        return -EINVAL;
-
-                cpu = cpumask_first(cpu_online_mask);
-        }
+        cpu = its_select_cpu(d, cpu_online_mask);
+        if (cpu < 0 || cpu >= nr_cpu_ids)
+                return -EINVAL;

        its_inc_lpi_count(d, cpu);
        its_dev->event_map.col_map[event] = cpu;
Marc Zyngier
2020-03-16 13:14:24 UTC
On 2020-03-16 13:02, John Garry wrote:

Hi John,
Post by John Garry
Hi Marc,
+ int this_count = its_read_lpi_count(d, tmp);
Not sure if it's intentional, but now there seems to be a subtle
difference to what Thomas described for non-managed interrupts - for
non-managed interrupts, x86 selects the CPU based on the total
interrupt load per CPU (or, more specifically, lowest vector
allocation count), and not just the non-managed load. Or maybe I
misread it.
So far, I'm trying to keep the two allocation paths separate, as the
two systems I have access to have very different behaviours: D05 has
no managed interrupts to speak of, and my top-secret work machine
has almost no unmanaged interrupts, so the two sets are almost
completely disjoint.

Also, it all depends on the interrupt allocation order, and whether
something will rebalance the non-managed interrupts at a later time.
At least, these two patches make it easy to alter the placement policy
(the behaviour you describe above is a 2 line change).
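
(For illustration, the behaviour John describes would roughly amount to
making its_read_lpi_count() weigh non-managed placement by the total
per-CPU LPI load. A minimal sketch, assuming patch 1/2 keeps a 'managed'
counter next to the 'unmanaged' one visible in the hunk above, and that
its_read_lpi_count() currently returns only the counter matching the
interrupt's class:)

static unsigned int its_read_lpi_count(struct irq_data *d, int cpu)
{
        if (irqd_affinity_is_managed(d))
                return atomic_read(&per_cpu_ptr(&cpu_lpi_count, cpu)->managed);

        /* Non-managed: weigh the CPU by its total LPI load, as on x86 */
        return atomic_read(&per_cpu_ptr(&cpu_lpi_count, cpu)->managed) +
               atomic_read(&per_cpu_ptr(&cpu_lpi_count, cpu)->unmanaged);
}
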
Post by John Garry
Anyway, we can test this now for NVMe with its managed interrupts.
Looking forward to hearing from you!

M.
--
Jazz is not dead. It just smells funny...
John Garry
2020-03-17 18:43:01 UTC
Post by Marc Zyngier
Post by John Garry
+        int this_count = its_read_lpi_count(d, tmp);
Not sure if it's intentional, but now there seems to be a subtle
difference to what Thomas described for non-managed interrupts - for
non-managed interrupts, x86 selects the CPU based on the total
interrupt load per CPU (or, more specifically, lowest vector
allocation count), and not just the non-managed load. Or maybe I
misread it.
So far, I'm trying to keep the two allocation paths separate, as the
two systems I have access to have very different behaviours: D05 has
no managed interrupts to speak of, and my top-secret work machine
has almost no unmanaged interrupts, so the two sets are almost
completely disjoint.
Sure, but I'd say that it would be a more common scenario to have a
mixture of both.
Post by Marc Zyngier
Also, it all depends on the interrupt allocation order, and whether
something will rebalance the non-managed interrupts at a later time.
At least, these two patches make it easy to alter the placement policy
(the behaviour you describe above is a 2 line change).
Post by John Garry
Anyway, we can test this now for NVMe with its managed interrupts.
Looking forward to hearing from you!
On my D06CS board (128 core), there seems to be something wrong, as the
q0 affinity mask looks incorrect:

PCI name is 81:00.0: nvme0n1
irq 322, cpu list 69, effective list 69
irq 325, cpu list 32-38, effective list 32
irq 326, cpu list 39-45, effective list 40
irq 327, cpu list 46-51, effective list 47
irq 328, cpu list 52-57, effective list 53
irq 329, cpu list 58-63, effective list 59

And something stranger for my colleague Luo Jiaxing, specifically the
effective affinity, which ends up outside the cpu list for irq 196:

PCI name is 85:00.0: nvme2n1
irq 196, cpu list 0-31, effective list 82
irq 377, cpu list 32-38, effective list 32
irq 378, cpu list 39-45, effective list 39
irq 379, cpu list 46-51, effective list 46

But then v5.6-rc5 vanilla also looks to have this issue when I tested on
my board:

***@ubuntu:~$ more /proc/irq/322/smp_affinity_list
69

My D06ES (96 core) board looks sensible for the affinity in this regard
(I did not try vanilla v5.6-rc5, but only with your patches on top).
I'll need to debug this.

Cheers,
John
