Discussion:
[RFC PATCH v4 00/19] Core scheduling v4
Joel Fernandes
2020-03-17 00:55:21 UTC
Hi Julien, Peter, all,
Yes, this makes sense; patch updated here. I put your name there if
you don't mind.
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2
Thanks Aubrey!
Just a quick note: I ran a very cpu-intensive benchmark (9 VMs with 12
vcpus each, running linpack), all affined to an 18-core NUMA node (36
hardware threads). Each VM is running in its own cgroup/tag with core
scheduling enabled. We know it already performed much better than nosmt,
so for this run I also measured:
- how much time the process spends co-scheduled with idle, a compatible
task or an incompatible task
- how long the process spends running in an inefficient
configuration (more than 1 thread running alone on a core)
And I am very happy to report that even though the 9 VMs were configured
to float on the whole NUMA node, the scheduler / load-balancer did a
very good job of keeping incompatible tasks apart:
- total runtime: 46451472309 ns,
- unknown neighbors (total: 92042503 ns, 0.198 % of process runtime)
- number of periods: 48
- min period duration: 1424 ns
- max period duration: 116988 ns
- average period duration: 9684.000 ns
- stdev: 19282.130
I thought you would enjoy seeing this :-)
Looks quite interesting. We are trying to apply this work to ChromeOS. What we
want to do is selectively mark tasks, instead of grouping sets of trusted
tasks. I have a patch that adds a prctl which a task can call, and it works
well (the task calls the prctl and gets a cookie which gives it a dedicated core).
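
For reference, the userspace side is tiny. A rough sketch of the idea
(the prctl name, number and argument convention below are placeholders,
not the actual interface in my patch):

/*
 * Sketch only: PR_SET_CORE_SCHED and its value are placeholders. The
 * idea is that a task opts in and receives a core-scheduling cookie,
 * so it only ever shares a core with tasks carrying the same cookie.
 */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_CORE_SCHED
#define PR_SET_CORE_SCHED 0x434f5245   /* placeholder, not an allocated number */
#endif

int main(void)
{
    if (prctl(PR_SET_CORE_SCHED, 1, 0, 0, 0) == -1) {
        perror("prctl(PR_SET_CORE_SCHED)");   /* EINVAL without the patch */
        return 1;
    }
    /* From here on, this task only shares a core with same-cookie tasks. */
    return 0;
}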

However, I have the following questions. In particular, there are 4 scenarios
where I feel the current patches do not resolve MDS/L1TF; would you guys
please share your thoughts?

1. HT1 is running either hostile guest or host code.
HT2 is running an interrupt handler (victim).

In this case I see there is a possible MDS issue between HT1 and HT2.

2. HT1 is executing hostile host code, and gets interrupted by a victim
interrupt. HT2 is idle.

In this case, I see there is a possible MDS issue between the interrupt and
the host code on the same HT1.

3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
handler on the host.

In this case, I see there is a possible L1TF issue between HT1 and HT2.
This issue does not happen if HT1 is running host code, since the host
kernel takes care of inverting PTE bits.

4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
hostile code on guest or host. HT2 is being forced idle. However, there is
an overlap between HT1 starting to execute hostile code and HT2's victim
process getting scheduled out.
Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
counter of the sibling being idled to detect that it is now idle.
However, we discussed today that, looking at this data, it is not really an
issue since the window is so small.

My concern now is cases 1 and 2, for which there does not seem to be a good
solution short of disabling interrupts. For 3, we could still possibly do
something on the guest side, such as using shadow page tables. Any thoughts
on all this?

thanks,

- Joel
Tim Chen
2020-03-17 19:07:43 UTC
Joel,
Post by Joel Fernandes
Looks quite interesting. We are trying to apply this work to ChromeOS. What we
want to do is selectively mark tasks, instead of grouping sets of trusted
tasks. I have a patch that adds a prctl which a task can call, and it works
well (the task calls the prctl and gets a cookie which gives it a dedicated core).
However, I have the following questions. In particular, there are 4 scenarios
where I feel the current patches do not resolve MDS/L1TF; would you guys
please share your thoughts?
1. HT1 is running either hostile guest or host code.
HT2 is running an interrupt handler (victim).
In this case I see there is a possible MDS issue between HT1 and HT2.
Core scheduling mitigates the userspace-to-userspace attacks via MDS between
the sibling HTs. It does not prevent the userspace-to-kernel-space attack.
That will have to be mitigated via other means, e.g. redirecting interrupts
to a core that doesn't run potentially unsafe code.
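
Just to illustrate what that redirection looks like from userspace: it
boils down to writing a hex CPU mask to /proc/irq/<irq>/smp_affinity.
A minimal sketch (the IRQ number and mask are made-up examples, and
kernel-managed interrupts or the per-cpu timer cannot be moved this way):

/*
 * Sketch: steer an IRQ to a trusted set of CPUs by writing a hex CPU
 * mask to /proc/irq/<irq>/smp_affinity (needs root). Managed
 * interrupts reject the write.
 */
#include <stdio.h>
#include <stdlib.h>

static int set_irq_affinity(int irq, const char *hexmask)
{
    char path[64];
    FILE *f;
    int ret = 0;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fprintf(f, "%s\n", hexmask) < 0)
        ret = -1;
    if (fclose(f) != 0)
        ret = -1;
    return ret;
}

int main(int argc, char **argv)
{
    int irq = argc > 1 ? atoi(argv[1]) : 30;     /* example IRQ number */
    const char *mask = argc > 2 ? argv[2] : "3"; /* example mask: CPUs 0-1 */

    if (set_irq_affinity(irq, mask)) {
        perror("smp_affinity");
        return 1;
    }
    return 0;
}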
Post by Joel Fernandes
2. HT1 is executing hostile host code, and gets interrupted by a victim
interrupt. HT2 is idle.
Similar to above.
Post by Joel Fernandes
In this case, I see there is a possible MDS issue between the interrupt and
the host code on the same HT1.
The cpu buffers are cleared before returning to the hostile host code. So
MDS shouldn't be an issue if the interrupt handler and the hostile code
run on the same HT.
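
For completeness, the clearing is done with the VERW instruction: with
the MD_CLEAR microcode update, the memory-operand form of VERW also
overwrites the affected CPU buffers. A standalone x86-64 illustration
of the instruction sequence (not the kernel's actual entry code):

/*
 * Illustration only: on CPUs enumerating MD_CLEAR, VERW with a memory
 * operand naming a valid writable data segment selector overwrites the
 * affected CPU buffers. The kernel issues the equivalent on the
 * return-to-user and VMENTER paths when the mitigation is enabled.
 */
#include <stdio.h>

static inline void clear_cpu_buffers(void)
{
    unsigned short sel;

    /* SS holds a valid writable data segment selector in user mode. */
    __asm__ volatile("mov %%ss, %0" : "=r"(sel));
    __asm__ volatile("verw %0" : : "m"(sel) : "cc");
}

int main(void)
{
    clear_cpu_buffers();
    puts("VERW issued (buffers overwritten on MD_CLEAR CPUs)");
    return 0;
}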
Post by Joel Fernandes
3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
handler on the host.
In this case, I see there is a possible L1TF issue between HT1 and HT2.
This issue does not happen if HT1 is running host code, since the host
kernel takes care of inverting PTE bits.
The interrupt handler will be run with PTE inverted. So I don't think
there's a leak via L1TF in this scenario.
Post by Joel Fernandes
4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
hostile code on guest or host. HT2 is being forced idle. However, there is
an overlap between HT1 starting to execute hostile code and HT2's victim
process getting scheduled out.
Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
counter of the sibling being idled to detect that it is now idle.
However, we discussed today that, looking at this data, it is not really an
issue since the window is so small.
My concern now is cases 1 and 2, for which there does not seem to be a good
solution short of disabling interrupts. For 3, we could still possibly do
something on the guest side, such as using shadow page tables. Any thoughts
on all this?
+ Tony who may have more insights on L1TF and MDS.

Thanks.

Tim
Tim Chen
2020-03-17 20:18:45 UTC
BTW Joel,

Did you guys have a chance to try out the v5 core scheduler?
- Joel (ChromeOS) found a deadlock and crash on PREEMPT kernel in the
coresched idle balance logic
We did some patches to fix a few stability issues in v4. I wonder
if v5 still has the deadlock that you saw before?

Tim
Thomas Gleixner
2020-03-17 21:17:47 UTC
Tim,
Post by Tim Chen
Post by Joel Fernandes
However, I have the following questions. In particular, there are 4 scenarios
where I feel the current patches do not resolve MDS/L1TF; would you guys
please share your thoughts?
1. HT1 is running either hostile guest or host code.
HT2 is running an interrupt handler (victim).
In this case I see there is a possible MDS issue between HT1 and HT2.
Core scheduling mitigates the userspace-to-userspace attacks via MDS between
the sibling HTs. It does not prevent the userspace-to-kernel-space attack.
That will have to be mitigated via other means, e.g. redirecting interrupts
to a core that doesn't run potentially unsafe code.
Which is in some cases simply impossible. Think multiqueue devices with
managed interrupts. You can't change the affinity of those. Neither can
you do that for the per cpu timer interrupt.
Post by Tim Chen
Post by Joel Fernandes
2. HT1 is executing hostile host code, and gets interrupted by a victim
interrupt. HT2 is idle.
Similar to above.
No. It's the same HT so not similar at all.
Post by Tim Chen
Post by Joel Fernandes
In this case, I see there is a possible MDS issue between the interrupt and
the host code on the same HT1.
The cpu buffers are cleared before returning to the hostile host code. So
MDS shouldn't be an issue if the interrupt handler and the hostile code
run on the same HT.
OTOH, that's mostly correct. Aside from the 'shouldn't' wording:

MDS _is_ no issue in this case when the full mitigation is enabled.

Assuming that I have no less information about MDS than you have :)
Post by Tim Chen
Post by Joel Fernandes
3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
handler on the host.
In this case, I see there is a possible L1TF issue between HT1 and HT2.
This issue does not happen if HT1 is running host code, since the host
kernel takes care of inverting PTE bits.
The interrupt handler will be run with PTE inverted. So I don't think
there's a leak via L1TF in this scenario.
How so?

Host memory is attackable when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context.

HT1 is in guest mode and attacking (has control over PTEs). HT2 is
running in host mode and executes an interrupt handler. The host PTE
inversion does not matter in this scenario at all.

So HT1 can very well see data which is brought into the shared L1 by
HT2.

The only way to mitigate that, aside from disabling HT, is disabling EPT.
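
For reference, whether EPT is in use can be checked via the kvm_intel
"ept" module parameter; disabling it (kvm_intel.ept=0) falls back to
shadow paging, at a hefty performance cost. A trivial illustration:

/*
 * Read the kvm_intel "ept" module parameter: prints 'Y' or 'N',
 * or '?' if kvm_intel is not loaded. With ept=0, KVM uses shadow
 * paging, so the guest no longer controls the hardware-walked PTEs.
 */
#include <stdio.h>

int main(void)
{
    int c = '?';
    FILE *f = fopen("/sys/module/kvm_intel/parameters/ept", "r");

    if (f) {
        c = fgetc(f);
        fclose(f);
    }
    printf("kvm_intel ept: %c\n", c);
    return 0;
}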
Post by Tim Chen
Post by Joel Fernandes
4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
hostile code on guest or host. HT2 is being forced idle. However, there is
an overlap between HT1 starting to execute hostile code and HT2's victim
process getting scheduled out.
Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
counter of the sibling being idled to detect that it is now idle.
However, we discussed today that, looking at this data, it is not really an
issue since the window is so small.
If the victim HT is kicked out of execution with an IPI then the overlap
depends on the contexts:

     HT1 (attack)          HT2 (victim)

  A  idle -> user space    user space -> idle

  B  idle -> user space    guest -> idle

  C  idle -> guest         user space -> idle

  D  idle -> guest         guest -> idle

The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
host user mode or brings it immediately into VMEXIT when HT2 is in guest
mode.

#A On return from handling the IPI HT2 immediately reschedules to idle.
To have an overlap the return to user space on HT1 must be faster.

#B Coming back from VMEXIT into schedule/idle might take slightly longer
than #A.

#C Similar to #A, but reentering guest mode in HT1 after sending the IPI
will probably take longer.

#D Similar to #C if you make the assumption that VMEXIT on HT2 and
rescheduling into idle is not significantly slower than reaching
VMENTER after sending the IPI.

In all cases the data exposed by a potential overlap shouldn't be that
interesting (e.g. scheduler state), but that obviously depends on what
the attacker is looking for.

But all of them are still problematic vs. interrupts / softinterrupts
which can happen on HT2 on the way to idle or while idling, i.e. #3 of
the original case list. #A and #B are only affected by MDS, #C and #D by
both MDS and L1TF (if EPT is in use).
Post by Tim Chen
Post by Joel Fernandes
My concern now is cases 1 and 2, for which there does not seem to be a good
solution short of disabling interrupts. For 3, we could still possibly do
something on the guest side, such as using shadow page tables. Any thoughts
on all this?
#1 can be partially mitigated by changing interrupt affinities, which is
not always possible and, in the case of the local timer interrupt,
completely impossible. It's not only the timer interrupt itself: the
timer callbacks, which can run in the softirq on return from interrupt,
might be valuable attack surface depending on the nature of the
callbacks, the random entropy timer being just a random example.

#2 is a non-issue if the MDS mitigation is on, i.e. buffers are flushed
before returning to user space. It's pretty much a non-SMT case,
i.e. a same-CPU user-to-kernel attack.

#3 Can only be fully mitigated by disabling EPT.

#4 Assuming that my assumptions about transition times are correct, which
I think they are, #4 pretty much reduces to #1.

Hope that helps.

Thanks,

tglx
Tim Chen
2020-03-17 21:58:28 UTC
Post by Thomas Gleixner
Post by Tim Chen
The interrupt handler will be run with PTE inverted. So I don't think
there's a leak via L1TF in this scenario.
How so?
Host memory is attackable when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context.
HT1 is in guest mode and attacking (has control over PTEs). HT2 is
running in host mode and executes an interrupt handler. The host PTE
inversion does not matter in this scenario at all.
So HT1 can very well see data which is brought into the shared L1 by
HT2.
The only way to mitigate that, aside from disabling HT, is disabling EPT.
I had a brain lapse. Yes, PTE inversion is for mitigating against malicious
user space code, not against a malicious guest.
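
For anyone following along: the inversion conceptually just flips the
physical-address bits of a not-present PTE so that a speculative L1D
lookup resolves to unpopulated memory. A toy sketch of the idea (not
the kernel's implementation):

/*
 * Toy sketch of the PTE-inversion idea: for a not-present PTE the
 * stored physical-address bits are inverted, so a speculative L1D
 * lookup points outside populated, cacheable memory. This only helps
 * when the host writes the PTE; a guest controls its own page tables,
 * hence the EPT discussion above.
 */
#include <stdio.h>
#include <stdint.h>

#define PTE_PFN_MASK 0x000ffffffffff000ULL  /* phys addr bits 12..51 */

static uint64_t store_nonpresent_pte(uint64_t payload)
{
    return payload ^ PTE_PFN_MASK;          /* invert the address bits */
}

int main(void)
{
    uint64_t payload = 0x0000000123456000ULL;   /* example swap entry */

    printf("stored PTE: %#llx\n",
           (unsigned long long)store_nonpresent_pte(payload));
    return 0;
}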

Thanks for the correction.

Tim
