Joel Fernandes
2020-03-17 00:55:21 UTC
Hi Julien, Peter, all,
Yes, this makes sense, patch updated here; I put your name there if
you don't mind.
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2

Thanks Aubrey!

Just a quick note, I ran a very cpu-intensive benchmark (9x12 vcpus VMs
running linpack), all affined to an 18-core NUMA node (36 hardware
threads). Each VM is running in its own cgroup/tag with core scheduling
enabled. We know it already performed much better than nosmt, so for
this case I measured various co-scheduling statistics:
- how much time the process spends co-scheduled with an idle, a compatible,
  or an incompatible task
- how long the process spends running in an inefficient configuration
  (more than one thread running alone on a core)
And I am very happy to report that even though the 9 VMs were configured
to float on the whole NUMA node, the scheduler / load-balancer did a
very good job of keeping an efficient configuration:
- total runtime: 46451472309 ns
- unknown neighbors (total: 92042503 ns, 0.198 % of process runtime)
  - number of periods: 48
  - min period duration: 1424 ns
  - max period duration: 116988 ns
  - average period duration: 9684.000 ns
  - stdev: 19282.130
I thought you would enjoy seeing this :-)
Looks quite interesting. We are trying to apply this work to ChromeOS. What we
want to do is selectively mark tasks, instead of grouping sets of trusted
tasks. I have a patch that adds a prctl which a task can call, and it works
well (task calls prctl and gets a cookie which gives it a dedicated core).
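
To make the interface concrete, below is a minimal user-space sketch of how a
task might request its own core-scheduling cookie through such a prctl. The
command name PR_SET_CORE_SCHED and its numeric value are placeholders invented
for illustration, not the constants from the actual patch.

/*
 * Minimal sketch only: PR_SET_CORE_SCHED is a hypothetical prctl command
 * standing in for whatever the patch actually defines.  On a stock kernel
 * the call simply fails with EINVAL.
 */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_CORE_SCHED
#define PR_SET_CORE_SCHED 0x434f5245	/* placeholder value, not upstream */
#endif

int main(void)
{
	/* Ask the kernel to assign this task its own core-scheduling cookie. */
	if (prctl(PR_SET_CORE_SCHED, 1, 0, 0, 0) == -1) {
		perror("prctl(PR_SET_CORE_SCHED)");
		return 1;
	}

	printf("task now has a dedicated core-scheduling cookie\n");
	/* ... run the task's sensitive or untrusted work here ... */
	return 0;
}

The point of the cookie is that the core scheduler only co-schedules tasks
whose cookies match, so a task holding a unique cookie never shares a core
with differently-tagged tasks; the sibling is forced idle instead.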
However, I have the following questions. In particular, there are 4 scenarios
where I feel the current patches do not resolve MDS/L1TF; would you guys
please share your thoughts?
1. HT1 is running either hostile guest or host code.
   HT2 is running an interrupt handler (victim).

   In this case I see there is a possible MDS issue between HT1 and HT2.

2. HT1 is executing hostile host code, and gets interrupted by a victim
   interrupt. HT2 is idle.

   In this case, I see there is a possible MDS issue between the interrupt
   and the host code on the same HT1.

3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
   handler on the host.

   In this case, I see there is a possible L1TF issue between HT1 and HT2.
   This issue does not happen if HT1 is running host code, since the host
   kernel takes care of inverting PTE bits.
4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
   hostile code on guest or host. HT2 is being forced idle. However, there is
   an overlap between HT1 starting to execute hostile code and HT2's victim
   process getting scheduled out.

   Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
   counter of the sibling being idled to detect that it is now idle (a rough
   sketch of that idea follows after this list). However, we discussed today
   that, looking at this data, it is not really an issue since the window is
   so small.
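
To illustrate the monitoring idea from case 4, here is a small self-contained
user-space sketch of the wait-for-the-sibling's-sequence-counter pattern,
written with C11 atomics. The sibling_seq variable, the simulated sibling
thread, and the whole simulation are assumptions made for illustration; the
real counter (core_sched_seq) lives in the kernel's per-runqueue state and
the waiting would happen in the core-scheduling code, not in user space.

/*
 * Sketch of the "monitor the sibling's sequence counter" idea from case 4.
 * sibling_seq stands in for the sibling runqueue's core_sched_seq: the
 * simulated sibling bumps it when it switches away from the victim, and
 * the other thread waits for it to advance before treating the sibling
 * as idle and proceeding.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_ulong sibling_seq;	/* stand-in for core_sched_seq */

/* Simulated sibling HT: runs the victim for a while, then deschedules it. */
static void *sibling_thread(void *arg)
{
	(void)arg;
	usleep(1000);				/* victim still on CPU */
	atomic_fetch_add(&sibling_seq, 1);	/* victim switched out */
	return NULL;
}

int main(void)
{
	pthread_t tid;
	unsigned long seq_before = atomic_load(&sibling_seq);

	pthread_create(&tid, NULL, sibling_thread, NULL);

	/*
	 * Do not start the newly-tagged (potentially hostile) code until the
	 * sibling's sequence counter has moved past the value sampled when
	 * the forced-idle decision was made.
	 */
	while (atomic_load(&sibling_seq) == seq_before)
		;	/* in the kernel this would be cpu_relax() or similar */

	printf("sibling has rescheduled; overlap window closed\n");

	pthread_join(tid, NULL);
	return 0;
}

The intent is simply to close the small overlap window described above: the
hostile code does not begin executing until the victim on the sibling has
actually been switched out.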
My concern is now cases 1 and 2, for which there does not seem to be a good
solution short of disabling interrupts. For 3, we could still possibly do
something on the guest side, such as using shadow page tables. Any thoughts
on all this?
thanks,
- Joel