Discussion:
[PATCH 2/2] workqueue: update per cpu workqueue's numa affinity when cpu preparing online
Gu Zheng
2015-03-26 02:40:02 UTC
Update the per-cpu workqueue's numa affinity when the cpu is preparing to go online, so
that the worker is created on the correct node.

Signed-off-by: Gu Zheng <***@cn.fujitsu.com>
---
kernel/workqueue.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 41ff75b..4c65953 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4618,6 +4618,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
for_each_cpu_worker_pool(pool, cpu) {
+ pool->node = cpu_to_node(cpu);
if (pool->nr_workers)
continue;
if (!create_worker(pool))
--
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Gu Zheng
2015-03-26 02:40:02 UTC
Previously, we built the apicid <--> cpuid mapping when the cpu became present, but
the relationship changes across cpu/node hotplug, because we always choose the
first free cpuid for a hot-added cpu (whether it is newly added or re-added). So
the cpuid <--> node mapping changes when a node hotplug occurs, and that causes
allocation failures in the wq sub-system:
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
==
So here we build a persistent [lapic id] <--> cpuid mapping when the cpu first
becomes present, and never change it afterwards.

Suggested-by: KAMEZAWA Hiroyuki <***@jp.fujitsu.com>
Signed-off-by: Gu Zheng <***@cn.fujitsu.com>
---
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..d539ebc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}

+/*
+ * Logical cpu number (cpuid) to local APIC id persistent mapping.
+ * Do not clear the mapping even if the cpu is hot removed.
+ */
+static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
+ [0 ... MAX_LOCAL_APIC - 1] = -1,
+};
+
+/*
+ * Internal cpu id bitmap: set a cpu's bit once it becomes present; never clear it.
+ */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+ int cpuid;
+
+ cpuid = apicid_to_x86_cpu[apicid];
+ if (cpuid == -1)
+ cpuid = cpumask_next_zero(-1, &cpuid_mask);
+
+ return cpuid;
+}
+
int generic_processor_info(int apicid, int version)
{
int cpu, max = nr_cpu_ids;
@@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
*/
cpu = 0;
} else
- cpu = cpumask_next_zero(-1, cpu_present_mask);
+ cpu = get_cpuid(apicid);
+
+ /* Store the mapping */
+ apicid_to_x86_cpu[apicid] = cpu;

/*
* Validate version
@@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
apic->x86_32_early_logical_apicid(cpu);
#endif
+ /* Mark this cpu id as used (already mapped to a local apic id) */
+ cpumask_set_cpu(cpu, &cpuid_mask);
set_cpu_possible(cpu, true);
set_cpu_present(cpu, true);
--
1.7.7

Kamezawa Hiroyuki
2015-03-26 03:30:01 UTC
Post by Gu Zheng
Previously, we build the apicid <--> cpuid mapping when the cpu is present, but
the relationship will be changed if the cpu/node hotplug happenned, because we
always choose the first free cpuid for the hot added cpu (whether it is new-add
or re-add), so this the cpuid <--> node mapping changed if node hot plug
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default
1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
==
So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first
present, and never change it.
---
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..d539ebc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
+ [0 ... MAX_LOCAL_APIC - 1] = -1,
+};
This patch cannot handle x2apic, which is 32bit.

As far as I understand, it depends on the CPU's spec, and the newest cpus have at least 9-bit apicids.

But you can't create an infinite array.

If you can't allocate the array dynamically, how about adding

static int cpuid_to_apicid[MAX_CPU] = {};

or using the idr library? (please see lib/idr.c)

I guess you can update this map after boot (after mm initialization)
and make use of the idr library.

About this patch, Nack.

-Kame
Post by Gu Zheng
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+ int cpuid;
+
+ cpuid = apicid_to_x86_cpu[apicid];
+ if (cpuid == -1)
+ cpuid = cpumask_next_zero(-1, &cpuid_mask);
+
+ return cpuid;
+}
+
int generic_processor_info(int apicid, int version)
{
int cpu, max = nr_cpu_ids;
@@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
*/
cpu = 0;
} else
- cpu = cpumask_next_zero(-1, cpu_present_mask);
+ cpu = get_cpuid(apicid);
+
+ /* Store the mapping */
+ apicid_to_x86_cpu[apicid] = cpu;
/*
* Validate version
@@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
apic->x86_32_early_logical_apicid(cpu);
#endif
+ /* Mark this cpu id as uesed (already mapping a local apic id) */
+ cpumask_set_cpu(cpu, &cpuid_mask);
set_cpu_possible(cpu, true);
set_cpu_present(cpu, true);
Gu Zheng
2015-03-26 05:20:01 UTC
Hi Kame-san,
Post by Kamezawa Hiroyuki
Post by Gu Zheng
Previously, we build the apicid <--> cpuid mapping when the cpu is present, but
the relationship will be changed if the cpu/node hotplug happenned, because we
always choose the first free cpuid for the hot added cpu (whether it is new-add
or re-add), so this the cpuid <--> node mapping changed if node hot plug
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default
1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
==
So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first
present, and never change it.
---
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..d539ebc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
+ [0 ... MAX_LOCAL_APIC - 1] = -1,
+};
This patch cannot handle x2apic, which is 32bit.
IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
generating a logical cpu number for it, so there seems to be no problem here.
Post by Kamezawa Hiroyuki
As far as I understand, it depends on CPU's spec and the newest cpu has 9bit apicid, at least.
But you can't create inifinit array.
If you can't allocate the array dynamically, How about adding
static int cpuid_to_apicid[MAX_CPU] = {}
or using idr library ? (please see lib/idr.c)
I guess you can update this map after boot(after mm initialization)
and make use of idr library.
About this patch, Nack.
-Kame
Post by Gu Zheng
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+ int cpuid;
+
+ cpuid = apicid_to_x86_cpu[apicid];
+ if (cpuid == -1)
+ cpuid = cpumask_next_zero(-1, &cpuid_mask);
+
+ return cpuid;
+}
+
int generic_processor_info(int apicid, int version)
{
int cpu, max = nr_cpu_ids;
@@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
*/
cpu = 0;
} else
- cpu = cpumask_next_zero(-1, cpu_present_mask);
+ cpu = get_cpuid(apicid);
+
+ /* Store the mapping */
+ apicid_to_x86_cpu[apicid] = cpu;
/*
* Validate version
@@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
apic->x86_32_early_logical_apicid(cpu);
#endif
+ /* Mark this cpu id as uesed (already mapping a local apic id) */
+ cpumask_set_cpu(cpu, &cpuid_mask);
set_cpu_possible(cpu, true);
set_cpu_present(cpu, true);
.
Tejun Heo
2015-03-26 15:20:02 UTC
Post by Gu Zheng
Post by Kamezawa Hiroyuki
Post by Gu Zheng
+static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
+ [0 ... MAX_LOCAL_APIC - 1] = -1,
+};
This patch cannot handle x2apic, which is 32bit.
IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
generating a logic cpu number for it, so it seems no problem here.
You're defining a 128k array on 64bit, most of which will be unused on
most machines. That is a problem.

Thanks.
--
tejun
Gu Zheng
2015-03-30 10:20:02 UTC
Hi Kame-san,
Post by Gu Zheng
Hi Kame-san,
Post by Kamezawa Hiroyuki
Post by Gu Zheng
Previously, we build the apicid <--> cpuid mapping when the cpu is present, but
the relationship will be changed if the cpu/node hotplug happenned, because we
always choose the first free cpuid for the hot added cpu (whether it is new-add
or re-add), so this the cpuid <--> node mapping changed if node hot plug
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default
1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
==
So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first
present, and never change it.
---
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..d539ebc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
+ [0 ... MAX_LOCAL_APIC - 1] = -1,
+};
This patch cannot handle x2apic, which is 32bit.
IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
generating a logic cpu number for it, so it seems no problem here.
You mean MAX_LOCAL_APIC=32768? ... isn't that too wasteful?
I use the big array here to keep the same format as the existing ones:
int apic_version[MAX_LOCAL_APIC];

s16 __apicid_to_node[MAX_LOCAL_APIC] = {
[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
};
Or should we also say "NO" to them?

Regards,
Gu
Anyway, APIC IDs are sparse values. Please use a proper structure.
Thanks,
-Kame
Post by Gu Zheng
Post by Kamezawa Hiroyuki
As far as I understand, it depends on CPU's spec and the newest cpu has 9bit apicid, at least.
But you can't create inifinit array.
If you can't allocate the array dynamically, How about adding
static int cpuid_to_apicid[MAX_CPU] = {}
or using idr library ? (please see lib/idr.c)
I guess you can update this map after boot(after mm initialization)
and make use of idr library.
About this patch, Nack.
-Kame
Post by Gu Zheng
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+ int cpuid;
+
+ cpuid = apicid_to_x86_cpu[apicid];
+ if (cpuid == -1)
+ cpuid = cpumask_next_zero(-1, &cpuid_mask);
+
+ return cpuid;
+}
+
int generic_processor_info(int apicid, int version)
{
int cpu, max = nr_cpu_ids;
@@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
*/
cpu = 0;
} else
- cpu = cpumask_next_zero(-1, cpu_present_mask);
+ cpu = get_cpuid(apicid);
+
+ /* Store the mapping */
+ apicid_to_x86_cpu[apicid] = cpu;
/*
* Validate version
@@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
apic->x86_32_early_logical_apicid(cpu);
#endif
+ /* Mark this cpu id as uesed (already mapping a local apic id) */
+ cpumask_set_cpu(cpu, &cpuid_mask);
set_cpu_possible(cpu, true);
set_cpu_present(cpu, true);
.
.
Kamezawa Hiroyuki
2015-04-01 03:00:02 UTC
Post by Gu Zheng
Hi Kame-san,
Post by Gu Zheng
Hi Kame-san,
Post by Kamezawa Hiroyuki
Post by Gu Zheng
Previously, we build the apicid <--> cpuid mapping when the cpu is present, but
the relationship will be changed if the cpu/node hotplug happenned, because we
always choose the first free cpuid for the hot added cpu (whether it is new-add
or re-add), so this the cpuid <--> node mapping changed if node hot plug
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default
1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
==
So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first
present, and never change it.
---
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
1 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..d539ebc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
+ [0 ... MAX_LOCAL_APIC - 1] = -1,
+};
This patch cannot handle x2apic, which is 32bit.
IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
generating a logic cpu number for it, so it seems no problem here.
you mean MAX_LOCAL_APIC=32768 ? ....isn't it too wasting ?
int apic_version[MAX_LOCAL_APIC];
s16 __apicid_to_node[MAX_LOCAL_APIC] = {
[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
};
Or we should also say "NO" to them?
This will not survive when someone tries to enlarge MAX_LOCAL_APIC to
introduce a new cpu model, because x2apic ids are 32bit by definition.

Please create long-surviving infrastructure.

Thanks,
-Kame





Kamezawa Hiroyuki
2015-03-26 03:20:01 UTC
Yasuaki Ishimatsu found that with node online/offline, the cpu<->node
relationship changes. Workqueue uses info which was
established at boot time, but it may be changed by node hotplug.
Once pool->node points to a stale node, the following allocation failure
happens:
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
==
As the pxm <--> node relationship is persistent, the root cause is that the
cpu-id <-> lapicid mapping is not persistent (because the current implementation
always chooses the first free cpu id for a newly added cpu). So if we can build
a persistent cpu-id <-> lapicid relationship, this problem will be fixed.
Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous discussion.
x86/cpu hotplug: make lapicid <-> cpuid mapping persistent
workqueue: update per cpu workqueue's numa affinity when cpu preparing online
Why is patch (2/2) required?

Thanks,
-Kame
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
kernel/workqueue.c | 1 +
2 files changed, 31 insertions(+), 1 deletions(-)
Gu Zheng
2015-03-26 06:10:02 UTC
Hi Kame-san,
Post by Kamezawa Hiroyuki
Yasuaki Ishimatsu found that with node online/offline, cpu<->node
relationship is established. Because workqueue uses a info which was
established at boot time, but it may be changed by node hotpluging.
Once pool->node points to a stale node, following allocation failure
happens.
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default
1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
==
As the apicid <--> node relationship is persistent, so the root cause is the
     ^^^^^^^
pxm.
cpu-id <-> lapicid mapping is not persistent (because the currently implementation
always choose the first free cpu id for the new added cpu), so if we can build
persistent cpu-id <-> lapicid relationship, this problem will be fixed.
Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous discussion.
x86/cpu hotplug: make lapicid <-> cpuid mapping persistent
workqueue: update per cpu workqueue's numa affinity when cpu
preparing online
why patch(2/2) required ?
wq generates the numa affinity (pool->node) for every possible cpu's
per-cpu workqueue at init stage; that means the affinity of currently un-present
ones may be incorrect. So we need to update pool->node for the newly added cpu
to the correct node when it is preparing to go online, otherwise wq will try to
create a worker on an invalid node if a node hotplug occurred.

Regards,
Gu
Post by Kamezawa Hiroyuki
Thanks,
-Kame
arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++-
kernel/workqueue.c | 1 +
2 files changed, 31 insertions(+), 1 deletions(-)
.
Kamezawa Hiroyuki
2015-03-26 16:50:03 UTC
Hello,
Post by Gu Zheng
wq generates the numa affinity (pool->node) for all the possible cpu's
per cpu workqueue at init stage, that means the affinity of currently un-present
ones' may be incorrect, so we need to update the pool->node for the new added cpu
to the correct node when preparing online, otherwise it will try to create worker
on invalid node if node hotplug occurred.
If the mapping is gonna be static once the cpus show up, any chance we
can initialize that for all possible cpus during boot?
I think the kernel can define all possible

cpuid <-> lapicid <-> pxm <-> nodeid

mapping at boot using firmware table information.

One concern is the current x86 logic for memory-less nodes vs. memory hotplug.
(as I explained before)

My idea is
step 1: build all possible cpuid <-> apicid <-> pxm <-> node id mappings at boot.

But this may be overwritten by x86's memory-less node logic. So,
step 2: check whether the node is online before calling kmalloc; if offline, use -1,
rather than updating the workqueue's attribute.

Thanks,
-Kame

Gu Zheng
2015-03-30 10:10:02 UTC
Hi Kame-san,
Post by Kamezawa Hiroyuki
Hello,
Post by Gu Zheng
wq generates the numa affinity (pool->node) for all the possible cpu's
per cpu workqueue at init stage, that means the affinity of currently un-present
ones' may be incorrect, so we need to update the pool->node for the new added cpu
to the correct node when preparing online, otherwise it will try to create worker
on invalid node if node hotplug occurred.
If the mapping is gonna be static once the cpus show up, any chance we
can initialize that for all possible cpus during boot?
I think the kernel can define all possible
cpuid <-> lapicid <-> pxm <-> nodeid
mapping at boot with using firmware table information.
Could you explain more?
Post by Kamezawa Hiroyuki
One concern is current x86 logic for memory-less node v.s. memory hotplug.
(as I explained before)
My idea is
step1. build all possible mapping at boot cpuid <-> apicid <-> pxm <-> node id at boot.
But this may be overwritten by x86's memory less node logic. So,
step2. check node is online or not before calling kmalloc. If offline, use -1.
rather than updating workqueue's attribute.
Thanks,
-Kame
.
Kamezawa Hiroyuki
2015-03-31 06:10:02 UTC
Post by Gu Zheng
Hi Kame-san,
Post by Kamezawa Hiroyuki
Hello,
Post by Gu Zheng
wq generates the numa affinity (pool->node) for all the possible cpu's
per cpu workqueue at init stage, that means the affinity of currently un-present
ones' may be incorrect, so we need to update the pool->node for the new added cpu
to the correct node when preparing online, otherwise it will try to create worker
on invalid node if node hotplug occurred.
If the mapping is gonna be static once the cpus show up, any chance we
can initialize that for all possible cpus during boot?
I think the kernel can define all possible
cpuid <-> lapicid <-> pxm <-> nodeid
mapping at boot with using firmware table information.
Could you explain more?
You can scan the firmware's tables and store all the provided information for possible cpus/pxms
somewhere even while future cpus/mems are not yet onlined, rather than determining them
on demand.
But this may be considered an API change for most hot-add users.

So, for now, I vote for determining ids at online time but recording them; that is a good way.

Thanks,
-Kame










Tejun Heo
2015-03-31 15:30:01 UTC
Hello, Kamezawa.
Post by Kamezawa Hiroyuki
But this may be considered as API change for most hot-add users.
Hmm... Why would it be? What can that possibly break?
Post by Kamezawa Hiroyuki
So, for now, I vote for detemining ids at online but record it is a good way.
If we know the information during boot, let's please do it at boot
time by all means.

Thanks.
--
tejun
Tejun Heo
2015-04-01 03:10:01 UTC
Ugh... so, cpu number allocation on hot-add is part of the userland
interface that we're locked into? Tying hotplug and id allocation
order together usually isn't a good idea. What if the cpu up fails
while running the notifiers? The ID is already allocated and the next
cpu being brought up will be after a hole anyway. Is this even
actually gonna affect userland?
At any rate, this isn't that big a deal. If the mapping has to be dynamic,
it'd be great to move the mapping code from workqueue to cpu hotplug
proper tho.

Thanks.
--
tejun
Tejun Heo
2015-04-01 03:10:02 UTC
Now, hot-added cpus get the lowest free cpu id.
Because of this, on most systems which have only cpu hot-add, cpu-ids are always
contiguous even after cpu hot-add.
In enterprise, this would be considered an incompatibility.
Determining cpuid <-> lapicid at boot will make cpuids sparse. That may break
existing scripts or configuration/resource management software.
Ugh... so, cpu number allocation on hot-add is part of userland
interface that we're locked into? Tying hotplug and id allocation
order together usually isn't a good idea. What if the cpu up fails
while running the notifiers? The ID is already allocated and the next
cpu being brought up will be after a hole anyway. Is this even
actually gonna affect userland?
--
tejun
Kamezawa Hiroyuki
2015-04-01 08:40:02 UTC
Now, hot-added cpus will have the lowest free cpu id.
Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always
contiguous even after cpu hot add.
In enterprise, this would be considered as imcompatibility.
determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt
exisiting script or configuration/resource management software.
Ugh... so, cpu number allocation on hot-add is part of userland
interface that we're locked into?
We checked most RHEL7 packages and didn't find a problem yet.
But, for example, we know a performance test team's test program assumed contiguous
cpuids and failed. It was an easy case because we could ask them to fix the application,
but I guess there will be some amount of customers assuming cpuids are contiguous.
Tying hotplug and id allocation
order together usually isn't a good idea. What if the cpu up fails
while running the notifiers? The ID is already allocated and the next
cpu being brought up will be after a hole anyway. Is this even
actually gonna affect userland?
Maybe. It's not fail-safe but....

In general, all kernel engineers (and skilled userland engineers) know that
cpuids cannot always be contiguous and that cpuids/nodeids should be checked before
running programs. I think most engineers should be aware of that, but many
users have their own assumptions :(

Basically, I don't have strong objections, you're right technically.

In summary...
- users should not assume cpuids are contiguous.
- all possible ids should be fixed at boot time.
- For users, some clarifying documentation should be added somewhere in Documentation.

So, Gu-san
1) determine all possible ids at boot.
2) clarify in Documentation that cpuid/nodeid can have holes because of 1).
3) It would be good if other folks give us their acks.

In the future,
I myself think a naming system like udev for cpuid/numaid will eventually be necessary.
Could that renaming feature be a cgroup/namespace feature? If possible,
all containers could have cpuids starting from 0.

Thanks,
-Kame

Gu Zheng
2015-04-02 02:00:01 UTC
Hi Kame, TJ,
Post by Kamezawa Hiroyuki
Now, hot-added cpus will have the lowest free cpu id.
Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always
contiguous even after cpu hot add.
In enterprise, this would be considered as imcompatibility.
determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt
exisiting script or configuration/resource management software.
Ugh... so, cpu number allocation on hot-add is part of userland
interface that we're locked into?
We checked most of RHEL7 packages and didn't find a problem yet.
But, for examle, we know some performance test team's test program assumed contiguous
cpuids and it failed. It was an easy case because we can ask them to fix the application
but I guess there will be some amount of customers that cpuids are contiguous.
Tying hotplug and id allocation
order together usually isn't a good idea. What if the cpu up fails
while running the notifiers? The ID is already allocated and the next
cpu being brought up will be after a hole anyway. Is this even
actually gonna affect userland?
Maybe. It's not fail-safe but....
In general, all kernel engineers (and skilled userland engineers) knows that
cpuids cannot be always contiguous and cpuids/nodeids should be checked before
running programs. I think most of engineers should be aware of that but many
users have their own assumption :(
Basically, I don't have strong objections, you're right technically.
In summary...
- users should not assume cpuids are contiguous.
- all possible ids should be fixed at boot time.
- For uses, some clarification document should be somewhere in Documenatation.
Fine to me.
Post by Kamezawa Hiroyuki
So, Gu-san
1) determine all possible ids at boot.
2) clarify cpuid/nodeid can have hole because of 1) in Documenation.
3) It would be good if other guys give us ack.
Also fine.
But before going this way, could you please reconsider determining the ids when the cpu
first becomes present (the implementation in this patchset)?
Though it is not perfect in some respects, we can set aside the doubts
mentioned above, as cpu/node hotplug is not a frequent operation, and there seems
to be nothing harmful to us if we go this way.

Regards,
Gu
Post by Kamezawa Hiroyuki
In future,
I myself thinks naming system like udev for cpuid/numaid is necessary, at last.
Can that renaming feature can be cgroup/namespace feature ? If possible,
all container can have cpuids starting from 0.
Thanks,
-Kame
.
Kamezawa Hiroyuki
2015-04-02 03:00:02 UTC
Post by Gu Zheng
Hi Kame, TJ,
Post by Kamezawa Hiroyuki
Now, hot-added cpus will have the lowest free cpu id.
Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always
contiguous even after cpu hot add.
In enterprise, this would be considered as imcompatibility.
determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt
exisiting script or configuration/resource management software.
Ugh... so, cpu number allocation on hot-add is part of userland
interface that we're locked into?
We checked most of RHEL7 packages and didn't find a problem yet.
But, for examle, we know some performance test team's test program assumed contiguous
cpuids and it failed. It was an easy case because we can ask them to fix the application
but I guess there will be some amount of customers that cpuids are contiguous.
Tying hotplug and id allocation
order together usually isn't a good idea. What if the cpu up fails
while running the notifiers? The ID is already allocated and the next
cpu being brought up will be after a hole anyway. Is this even
actually gonna affect userland?
Maybe. It's not fail-safe but....
In general, all kernel engineers (and skilled userland engineers) knows that
cpuids cannot be always contiguous and cpuids/nodeids should be checked before
running programs. I think most of engineers should be aware of that but many
users have their own assumption :(
Basically, I don't have strong objections, you're right technically.
In summary...
- users should not assume cpuids are contiguous.
- all possible ids should be fixed at boot time.
- For uses, some clarification document should be somewhere in Documenatation.
Fine to me.
Post by Kamezawa Hiroyuki
So, Gu-san
1) determine all possible ids at boot.
2) clarify cpuid/nodeid can have hole because of 1) in Documenation.
3) It would be good if other guys give us ack.
Also fine.
But before this going, could you please reconsider determining the ids when firstly
present (the implementation on this patchset)?
Though it is not the perfect one in some words, but we can ignore the doubts that
mentioned above as the cpu/node hotplug is not frequent behaviours, and there seems
not anything harmful to us if we go this way.
Is it such heavy work? Hmm. My requests are:

Implement your patches as step 1:
- Please don't change the current behavior at boot.
- Remember all possible apicids and give them future cpuids if not already assigned.

Then please fix dynamic pxm<->node detection in step 2.

In the future, memory-less node handling in x86 should be revisited.

Thanks,
-Kame







