Discussion:
[PATCH 0/4] Alter steal-time reporting in the guest
Michael Wolf
2013-02-05 21:50:01 UTC
In the case where you have a system that is running in a capped or
overcommitted environment, the user may see steal time being reported in
accounting tools such as top or vmstat. This can cause confusion for the
end user. To ease the confusion, this patch set adds the idea of consigned
(expected steal) time. The host will separate the consigned time from the
steal time. The steal time will only be altered if hard limits (cfs
bandwidth control) are used. The period and the quota used to separate the
consigned time (expected steal) from the steal time are taken from the cfs
bandwidth control settings. Any other steal time accruing during that
period will show as the traditional steal time.
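
As a rough illustration of the split described above (not code from the patch set), the sketch below charges run delay to consigned time until a per-period budget is used up and reports the rest as steal. The struct and function names are hypothetical, and the budget is assumed to be the cfs period minus the quota, which is one reading of "expected steal".

```c
#include <stdint.h>

/* Hypothetical per-vcpu accounting state; names are illustrative only. */
struct consign_state {
	uint64_t budget_ns;	/* expected steal per period (assumed: period - quota) */
	uint64_t used_ns;	/* consigned time already charged in this period */
};

/* Result of splitting one run-delay delta. */
struct split {
	uint64_t consigned_ns;
	uint64_t steal_ns;
};

/* Expected steal per period for a capped vcpu; 0 if the quota covers the period. */
static uint64_t expected_steal_ns(uint64_t period_ns, uint64_t quota_ns)
{
	return quota_ns >= period_ns ? 0 : period_ns - quota_ns;
}

/* Charge delta_ns to consigned time until the budget is used up, then to steal. */
static struct split split_delta(struct consign_state *s, uint64_t delta_ns)
{
	struct split out = { 0, 0 };
	uint64_t room = s->budget_ns > s->used_ns ? s->budget_ns - s->used_ns : 0;

	out.consigned_ns = delta_ns < room ? delta_ns : room;
	out.steal_ns = delta_ns - out.consigned_ns;
	s->used_ns += out.consigned_ns;
	return out;
}
```

With a 100 ms period and a 25 ms quota the budget is 75 ms, so a first delay of 50 ms is fully consigned and a following delay of 40 ms splits into 25 ms consigned and 15 ms steal. Resetting used_ns at each period boundary is left out of the sketch.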

Changes from V2:
* Dropped the ioctl that allowed qemu to send the entitlement value to
the guest.
* Added code to get the entitlement period and quota from cfs bandwidth.

Changes from V1:
* Removed the steal time allowed percentage from the guest
* Moved the separation of consigned (expected steal) and steal time to the
host.
* No longer include a sysctl interface.

---

Michael Wolf (4):
Alter the amount of steal time reported by the guest.
Expand the steal time msr to also contain the consigned time.
Add the code to send the consigned time from the host to the guest
Add a timer to allow the separation of consigned from steal time.


arch/x86/include/asm/kvm_host.h | 10 +++++
arch/x86/include/asm/paravirt.h | 4 +-
arch/x86/include/asm/paravirt_types.h | 2 +
arch/x86/include/uapi/asm/kvm_para.h | 3 +-
arch/x86/kernel/kvm.c | 8 ++--
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kvm/x86.c | 64 ++++++++++++++++++++++++++++++++-
fs/proc/stat.c | 9 ++++-
include/linux/kernel_stat.h | 2 +
kernel/sched/core.c | 30 +++++++++++++++
kernel/sched/cputime.c | 21 ++++++++++-
kernel/sched/sched.h | 2 +
12 files changed, 142 insertions(+), 17 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Michael Wolf
2013-02-05 21:50:01 UTC
Modify the amount of steal time that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or to cpu hard capping.

Signed-off-by: Michael Wolf <***@linux.vnet.ibm.com>
---
fs/proc/stat.c | 9 +++++++--
include/linux/kernel_stat.h | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
- u64 guest, guest_nice;
+ u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)

user = nice = system = idle = iowait =
irq = softirq = steal = 0;
- guest = guest_nice = 0;
+ guest = guest_nice = consign = 0;
getboottime(&boottime);
jif = boottime.tv_sec;

+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+ consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);

@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+ seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');

for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+ consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+ seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..e352052 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+ CPUTIME_CONSIGN,
NR_STATS,
};
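
For readers of /proc/stat, the diff above appends the consigned value as an eleventh number on each cpu line. The following is a minimal userspace sketch of a parser that tolerates both the existing ten-column format and the proposed eleven-column one. The struct and function names are hypothetical, and the eleventh column exists only in kernels carrying this patch.

```c
#include <stdio.h>
#include <string.h>

struct cpu_times {
	unsigned long long user, nice, system, idle, iowait;
	unsigned long long irq, softirq, steal, guest, guest_nice;
	unsigned long long consign;	/* assumed 11th column; 0 when absent */
};

/*
 * Parse the aggregate "cpu ..." line. Returns the number of numeric
 * fields found (10 or 11), or -1 if the line is not an aggregate cpu
 * line or has fewer than ten values.
 */
static int parse_cpu_line(const char *line, struct cpu_times *t)
{
	int n;

	/* Per-cpu lines such as "cpu0 ..." are skipped by this sketch. */
	if (strncmp(line, "cpu ", 4) != 0)
		return -1;

	t->consign = 0;
	n = sscanf(line + 3,
		   "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &t->user, &t->nice, &t->system, &t->idle, &t->iowait,
		   &t->irq, &t->softirq, &t->steal, &t->guest,
		   &t->guest_nice, &t->consign);
	return n >= 10 ? n : -1;
}
```

A tool built this way would keep working on an unpatched kernel and simply report zero consigned time there.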


Michael Wolf
2013-02-05 21:50:01 UTC
Expand the steal time msr to also contain the consigned time.

Signed-off-by: Michael Wolf <***@linux.vnet.ibm.com>
---
arch/x86/include/asm/paravirt.h | 4 ++--
arch/x86/include/asm/paravirt_types.h | 2 +-
arch/x86/kernel/kvm.c | 7 ++-----
kernel/sched/core.c | 10 +++++++++-
kernel/sched/cputime.c | 2 +-
5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;

-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
{
- return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+ PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
}

static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {

struct pv_time_ops {
unsigned long long (*sched_clock)(void);
- unsigned long long (*steal_clock)(int cpu);
+ void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
};

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index fe75a28..89e5468 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,9 +386,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
};

-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
{
- u64 steal;
struct kvm_steal_time *src;
int version;

@@ -396,11 +395,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src->version;
rmb();
- steal = src->steal;
+ *steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));
-
- return steal;
}

void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..efc2652 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -757,6 +757,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
*/
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+ u64 consigned = 0;
#endif
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -785,8 +786,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
u64 st;
+ u64 cs;

- steal = paravirt_steal_clock(cpu_of(rq));
+ paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+ /*
+ * since we are not assigning the steal time to cpustats
+ * here, just combine the steal and consigned times to
+ * do the rest of the calculations.
+ */
+ steal += consigned;
steal -= rq->prev_steal_time_rq;

if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 825a956..0b4f1ec 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(&paravirt_steal_enabled)) {
u64 steal, st = 0;

- steal = paravirt_steal_clock(smp_processor_id());
+ paravirt_steal_clock(smp_processor_id(), &steal);
steal -= this_rq()->prev_steal_time;

st = steal_ticks(steal);

Rik van Riel
2013-02-06 21:20:01 UTC
Post by Michael Wolf
Expand the steal time msr to also contain the consigned time.
---
arch/x86/include/asm/paravirt.h | 4 ++--
arch/x86/include/asm/paravirt_types.h | 2 +-
arch/x86/kernel/kvm.c | 7 ++-----
kernel/sched/core.c | 10 +++++++++-
kernel/sched/cputime.c | 2 +-
5 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;
-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
{
- return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+ PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
}
This may be a stupid question, but what happens if a KVM
guest with this change runs on a host kernel that still has
the old steal time interface?

What happens if the host has the new steal time interface,
but the guest uses the old interface?

Will both cases continue to work as expected with your
patch series?

If so, could you document (in the source code) why things
continue to work?

Michael Wolf
2013-02-07 14:30:02 UTC
Post by Rik van Riel
Post by Michael Wolf
Expand the steal time msr to also contain the consigned time.
---
arch/x86/include/asm/paravirt.h | 4 ++--
arch/x86/include/asm/paravirt_types.h | 2 +-
arch/x86/kernel/kvm.c | 7 ++-----
kernel/sched/core.c | 10 +++++++++-
kernel/sched/cputime.c | 2 +-
5 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;
-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
{
- return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+ PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
}
This may be a stupid question, but what happens if a KVM
guest with this change runs on a host kernel that still has
the old steal time interface?
What happens if the host has the new steal time interface,
but the guest uses the old interface?
Will both cases continue to work as expected with your
patch series?
If so, could you document (in the source code) why things
continue to work?
I will test the scenarios you suggest and will report back the results.

Michael Wolf
2013-02-05 21:50:02 UTC
Change the paravirt calls that retrieve the steal-time information
from the host so that they return the consigned value as well as
the steal time.

Signed-off-by: Michael Wolf <***@linux.vnet.ibm.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/paravirt.h | 4 ++--
arch/x86/include/uapi/asm/kvm_para.h | 3 ++-
arch/x86/kernel/kvm.c | 3 ++-
arch/x86/kernel/paravirt.c | 4 ++--
arch/x86/kvm/x86.c | 2 ++
include/linux/kernel_stat.h | 1 +
kernel/sched/cputime.c | 21 +++++++++++++++++++--
kernel/sched/sched.h | 2 ++
9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc87b65..fe5a37b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -428,6 +428,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+ u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9b753ea..77f05e7 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;

-static inline void paravirt_steal_clock(int cpu, u64 *steal)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
{
- PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+ PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
}

static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@

struct kvm_steal_time {
__u64 steal;
+ __u64 consigned;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u32 pad[10];
};

#define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 89e5468..fb52f8a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,7 +386,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
};

-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
{
struct kvm_steal_time *src;
int version;
@@ -396,6 +396,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
version = src->version;
rmb();
*steal = src->steal;
+ *consigned = src->consigned;
rmb();
} while ((version & 1) || (version != src->version));
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;

-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
{
- return 0;
+ *steal = *consigned = 0;
}

/* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c243b81..51b63d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1867,8 +1867,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;

vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+ vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;
+ vcpu->arch.st.accum_consigned = 0;

kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e352052..f58ed0f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *);
extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
extern void account_idle_time(cputime_t);

#ifdef CONFIG_VIRT_CPU_ACCOUNTING
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0b4f1ec..2a2d4be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
}

/*
+ * This accounts for the time that is split out of steal time.
+ * Consigned time represents the amount of time that the cpu was
+ * expected to be somewhere else.
+ */
+void account_consigned_time(cputime_t cputime)
+{
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
+
+ cpustat[CPUTIME_CONSIGN] += (__force u64) cputime;
+}
+
+/*
* Account for involuntary wait time.
* @cputime: the cpu time spent in involuntary wait
*/
@@ -274,15 +286,20 @@ static __always_inline bool steal_account_process_tick(void)
#ifdef CONFIG_PARAVIRT
if (static_key_false(&paravirt_steal_enabled)) {
u64 steal, st = 0;
+ u64 consigned, cs = 0;

- paravirt_steal_clock(smp_processor_id(), &steal);
+ paravirt_steal_clock(smp_processor_id(), &steal, &consigned);
steal -= this_rq()->prev_steal_time;
+ consigned -= this_rq()->prev_consigned_time;

st = steal_ticks(steal);
+ cs = steal_ticks(consigned);
this_rq()->prev_steal_time += st * TICK_NSEC;
+ this_rq()->prev_consigned_time += cs * TICK_NSEC;

account_steal_time(st);
- return st;
+ account_consigned_time(cs);
+ return st || cs;
}
#endif
return false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..73a9ef2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -442,9 +442,11 @@ struct rq {
#endif
#ifdef CONFIG_PARAVIRT
u64 prev_steal_time;
+ u64 prev_consigned_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
u64 prev_steal_time_rq;
+ u64 prev_consigned_time_rq;
#endif

/* calc_load related fields */

Rik van Riel
2013-02-06 21:20:02 UTC
Post by Michael Wolf
Change the paravirt calls that retrieve the steal-time information
from the host so that they return the consigned value as well as
the steal time.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@
struct kvm_steal_time {
__u64 steal;
+ __u64 consigned;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u32 pad[10];
};
The function kvm_register_steal_time passes the address of such
a structure to the host kernel, which then does something with
it.

Could running a guest with the above patch, on top of a host
with the old code, result in the values for "version" and
"flags" being written into "consigned"?

Could that result in confusing the guest kernel to no end,
and generally breaking things?
Michael Wolf
2013-02-07 14:30:02 UTC
Post by Rik van Riel
Post by Michael Wolf
Change the paravirt calls that retrieve the steal-time information
from the host so that they return the consigned value as well as
the steal time.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@
struct kvm_steal_time {
__u64 steal;
+ __u64 consigned;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u32 pad[10];
};
The function kvm_register_steal_time passes the address of such
a structure to the host kernel, which then does something with
it.
Could running a guest with the above patch, on top of a host
with the old code, result in the values for "version" and
"flags" being written into "consigned"?
Yes, good point.
Post by Rik van Riel
Could that result in confusing the guest kernel to no end,
and generally breaking things?
OK, I will move the consigned field to be after the flags.
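
The layout concern raised in this exchange can be checked mechanically. The sketch below is a userspace approximation that uses fixed-width types from stdint.h in place of the kernel's __u64 and __u32, and the three struct names are hypothetical. It compares the layout before the series, the layout as posted, and the layout with consigned moved after flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout before the series. */
struct steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Layout as posted in this patch: consigned inserted before version. */
struct steal_time_posted {
	uint64_t steal;
	uint64_t consigned;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[10];
};

/* Layout with consigned moved after flags, as proposed in the reply. */
struct steal_time_moved {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint64_t consigned;
	uint32_t pad[10];
};

/* All three variants keep the 64-byte size of the shared area. */
_Static_assert(sizeof(struct steal_time_old) == 64, "old size");
_Static_assert(sizeof(struct steal_time_posted) == 64, "posted size");
_Static_assert(sizeof(struct steal_time_moved) == 64, "moved size");

/* Posted layout: an old host writes version/flags where the guest reads consigned. */
_Static_assert(offsetof(struct steal_time_posted, consigned) ==
	       offsetof(struct steal_time_old, version), "overlap");

/* Moved layout: version and flags stay put; consigned takes over former padding. */
_Static_assert(offsetof(struct steal_time_moved, version) ==
	       offsetof(struct steal_time_old, version), "version unchanged");
_Static_assert(offsetof(struct steal_time_moved, flags) ==
	       offsetof(struct steal_time_old, flags), "flags unchanged");
_Static_assert(offsetof(struct steal_time_moved, consigned) ==
	       offsetof(struct steal_time_old, pad), "consigned in old padding");
```

The checks only cover field offsets. What a patched guest would read from the consigned slot on an unpatched host depends on how the padding is initialised, which this sketch does not model.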

Michael Wolf
2013-02-05 22:00:02 UTC
Add a helper routine to kernel/sched/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.

Signed-off-by: Michael Wolf <***@linux.vnet.ibm.com>
---
arch/x86/include/asm/kvm_host.h | 9 ++++++
arch/x86/kvm/x86.c | 62 ++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 20 +++++++++++++
3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fe5a37b..9518613 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -355,6 +355,15 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;

/*
+ * timer used to determine if the time should be counted as
+ * steal time or consigned time.
+ */
+ struct hrtimer steal_timer;
+ u64 current_consigned;
+ s64 consigned_quota;
+ s64 consigned_period;
+
+ /*
* Paging state of the vcpu
*
* If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51b63d1..79d144d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1848,13 +1848,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
static void accumulate_steal_time(struct kvm_vcpu *vcpu)
{
u64 delta;
+ u64 steal_delta;
+ u64 consigned_delta;

if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
- vcpu->arch.st.accum_steal = delta;
+
+ /* split the delta into steal and consigned */
+ if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) {
+ vcpu->arch.current_consigned += delta;
+ if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) {
+ steal_delta = vcpu->arch.current_consigned
+ - vcpu->arch.consigned_quota;
+ consigned_delta = delta - steal_delta;
+ } else {
+ consigned_delta = delta;
+ steal_delta = 0;
+ }
+ } else {
+ consigned_delta = 0;
+ steal_delta = delta;
+ }
+ vcpu->arch.st.accum_steal = steal_delta;
+ vcpu->arch.st.accum_consigned = consigned_delta;
}

static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -2629,8 +2648,35 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
!(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
}

+extern int sched_use_hard_capping(int cpuid, int num_vcpus, s64 *quota,
+ s64 *period);
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+ struct kvm_vcpu *vcpu;
+ struct kvm *kvm;
+ int num_vcpus;
+ ktime_t now;
+
+ vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+ kvm = vcpu->kvm;
+ num_vcpus = atomic_read(&kvm->online_vcpus);
+ sched_use_hard_capping(vcpu->cpu, num_vcpus,
+ &vcpu->arch.consigned_quota,
+ &vcpu->arch.consigned_period);
+ vcpu->arch.current_consigned = 0;
+ now = ktime_get();
+ hrtimer_forward(&vcpu->arch.steal_timer, now,
+ ktime_set(0, vcpu->arch.consigned_period));
+
+ return HRTIMER_RESTART;
+}
+
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
+ struct kvm *kvm;
+ int num_vcpus;
+ ktime_t ktime;
+
/* Address WBINVD may be executed by guest */
if (need_emulate_wbinvd(vcpu)) {
if (kvm_x86_ops->has_wbinvd_exit())
@@ -2670,6 +2716,18 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
}
+ /* Initialize and start a timer to capture steal and consigned time */
+ kvm = vcpu->kvm;
+ num_vcpus = atomic_read(&kvm->online_vcpus);
+ num_vcpus = (num_vcpus == 0) ? 1 : num_vcpus;
+ sched_use_hard_capping(vcpu->cpu, num_vcpus,
+ &vcpu->arch.consigned_quota,
+ &vcpu->arch.consigned_period);
+ hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ vcpu->arch.steal_timer.function = &steal_timer_fn;
+ ktime = ktime_set(0, vcpu->arch.consigned_period);
+ hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);

accumulate_steal_time(vcpu);
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
@@ -2680,6 +2738,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = native_read_tsc();
+ hrtimer_cancel(&vcpu->arch.steal_timer);
}

static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
@@ -6685,6 +6744,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
{
int idx;

+ hrtimer_cancel(&vcpu->arch.steal_timer);
kvm_pmu_destroy(vcpu);
kfree(vcpu->arch.mce_banks);
kvm_free_lapic(vcpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efc2652..133ee47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8154,6 +8154,26 @@ void cpuacct_charge(struct task_struct *tsk, u64 cputime)

rcu_read_unlock();
}
+/*
+ * return 1 if the scheduler is using some form of hard capping
+ * return 0 if there is no capping configured.
+ */
+int sched_use_hard_capping(int cpuid, int num_cpus, long *quota, long *period)
+{
+ struct rq *rq = cpu_rq(cpuid);
+ struct task_struct *curr = rq->curr;
+ struct task_group *tg = curr->sched_task_group;
+ long total_time;
+
+ *period = tg_get_cfs_period(tg);
+ if (*quota == RUNTIME_INF || *quota == -1)
+ return 0;
+ *quota = jiffies_to_usecs(tg_get_cfs_quota(tg)) / num_cpus;
+ total_time = jiffies_to_usecs(*period);
+ *quota = total_time - *quota;
+ return 1;
+}
+EXPORT_SYMBOL_GPL(sched_use_hard_capping);

struct cgroup_subsys cpuacct_subsys = {
.name = "cpuacct",

Glauber Costa
2013-02-06 14:40:02 UTC
Post by Michael Wolf
Add a helper routine to kernel/sched/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.
Sorry: What is the business of a timer in here?
Whenever we read steal time, we know how much time has passed and with
that information we can know the entitlement for the period. This breaks
if we suspend, but we know that we suspended, so this is not a problem.

Everything bigger than the entitlement is steal time.


Michael Wolf
2013-02-06 18:10:01 UTC
Post by Glauber Costa
Post by Michael Wolf
Add a helper routine to kernel/sched/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.
Sorry: What is the business of a timer in here?
Whenever we read steal time, we know how much time has passed and with
that information we can know the entitlement for the period. This breaks
if we suspend, but we know that we suspended, so this is not a problem.
I may be missing something, but how do we know how much time has
passed? That is why I had the timer in there. I will go look again at
the code, but I thought the data was collected as ticks and passed at
random times. The ticks are also accumulating, so we are looking at the
difference in the count between reads.
Post by Glauber Costa
Everything bigger than the entitlement is steal time.
I agree, provided I know the total amount of time over which the steal
time was accumulated.
Glauber Costa
2013-02-07 08:50:02 UTC
Post by Michael Wolf
Post by Glauber Costa
Post by Michael Wolf
Add a helper routine to kernel/sched/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.
Sorry: What is the business of a timer in here?
Whenever we read steal time, we know how much time has passed and with
that information we can know the entitlement for the period. This breaks
if we suspend, but we know that we suspended, so this is not a problem.
I may be missing something, but how do we know how much time has
passed? That is why I had the timer in there. I will go look again at
the code, but I thought the data was collected as ticks and passed at
random times. The ticks are also accumulating, so we are looking at the
difference in the count between reads.
They can be collected at random times, but you can of course record the
time at which it happened.

Michael Wolf
2013-02-07 14:30:02 UTC
Post by Glauber Costa
Post by Michael Wolf
Post by Glauber Costa
Post by Michael Wolf
Add a helper routine to kernel/sched/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.
Sorry: What is the business of a timer in here?
Whenever we read steal time, we know how much time has passed and with
that information we can know the entitlement for the period. This breaks
if we suspend, but we know that we suspended, so this is not a problem.
I may be missing something, but how do we know how much time has
passed? That is why I had the timer in there. I will go look again at
the code, but I thought the data was collected as ticks and passed at
random times. The ticks are also accumulating, so we are looking at the
difference in the count between reads.
They can be collected at random times, but you can of course record the
time at which it happened.
ok. Let me add a previous_read field and take out the timer.
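
One way the timer-free approach discussed here might look is sketched below. It is only an illustration under stated assumptions: the struct, the prev_read_ns field (standing in for the "previous_read" field mentioned above) and the function name are hypothetical, and the expected steal is assumed to scale linearly with elapsed time as (period - quota) / period.

```c
#include <stdint.h>

/* Hypothetical state kept per vcpu between steal-time reads. */
struct steal_split_state {
	uint64_t prev_read_ns;	/* time of the previous read */
	uint64_t period_ns;	/* cfs bandwidth period */
	uint64_t quota_ns;	/* cfs bandwidth quota for this vcpu */
};

struct steal_split {
	uint64_t consigned_ns;
	uint64_t steal_ns;
};

/*
 * Split the run delay accumulated since the previous read. The
 * entitlement to expected steal grows with elapsed time; anything
 * above it is reported as steal. The arithmetic assumes period_ns is
 * small enough that period_ns * period_ns fits in 64 bits.
 */
static struct steal_split split_since_last_read(struct steal_split_state *s,
						uint64_t now_ns,
						uint64_t delta_ns)
{
	struct steal_split out = { 0, 0 };
	uint64_t elapsed = now_ns - s->prev_read_ns;
	uint64_t expected = 0;

	if (s->period_ns != 0 && s->quota_ns < s->period_ns) {
		uint64_t unused = s->period_ns - s->quota_ns;

		/* Whole periods first, then the partial one, to limit overflow. */
		expected = (elapsed / s->period_ns) * unused +
			   (elapsed % s->period_ns) * unused / s->period_ns;
	}

	out.consigned_ns = delta_ns < expected ? delta_ns : expected;
	out.steal_ns = delta_ns - out.consigned_ns;
	s->prev_read_ns = now_ns;
	return out;
}
```

With a 100 ms period and a 25 ms quota, a read 200 ms after the previous one has an entitlement of 150 ms, so a run delay of 180 ms would be reported as 150 ms consigned and 30 ms steal. Suspend handling, which the thread mentions, is not covered.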

Marcelo Tosatti
2013-02-19 01:00:02 UTC
Post by Michael Wolf
Add a helper routine to kernel/sched/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.
1) Can you please describe, in English, how the mechanics of subtracting cpu
hardlimit values from the steal time reported via run_delay are supposed to
work?

"The period and the quota used to separate the consigned time
(expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time
accruing during that period will show as the traditional steal time."

There is no "expected steal time" over a fixed period of real time.

2) From the description of patch 1: "In the case of where you have
a system that is running in a capped or overcommitted environment
the user may see steal time being reported in accounting tools
such as top or vmstat."

This is outdated, right? Because an overcommitted environment is exactly
what steal time should report.


Thanks

Michael Wolf
2013-03-05 20:20:01 UTC
Sorry for the delay in the response. I did not see your question.
Post by Marcelo Tosatti
Post by Michael Wolf
Add a helper routine to scheduler/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.
1) Can you please describe, in english, the mechanics of subtracting cpu
hardlimit values from steal time reported via run_delay supposed to
work?
"The period and the quota used to separate the consigned time
(expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time
accruing during that period will show as the traditional steal time."
There is no "expected steal time" over a fixed period of real time.
There is expected steal time in the sense that the administrator of the
system sets up guests on the host so that there will be cpu
overcommitment. The end user who is using the guest does not know this,
they only know they have been guaranteed a certain level of performance.
So if steal time shows up the end user typically thinks they are not
getting their guaranteed performance. So this patchset is meant to allow
top to show 100% utilization and ONLY show steal time if it is over the
level of steal time that the host administrator set up. Take a simple
example of a host with 1 cpu and two guests on it. If each guest is
fully utilized, a user will see 50% utilization and 50% steal in either
of the guests. In this case the amount of steal time that the host
administrator would expect to see is 50%. As long as the steal in the
guest does not exceed 50%, the guest is running as expected. If for some
reason the steal increases to 60%, something is wrong: the steal time
needs to be reported so that the end user can make inquiries.
Post by Marcelo Tosatti
2) From the description of patch 1: "In the case of where you have
a system that is running in a capped or overcommitted environment
the user may see steal time being reported in accounting tools
such as top or vmstat."
This is outdated, right? Because overcommitted environment is exactly
what steal time should report.
I hope I'm not missing your point here. But again this comes down to
the point of view. The end user is guaranteed a capability/level of
performance that may not be a whole cpu. So only show steal time if the
amount of steal time exceeds what the host admin expected when the guest
was set up.
Post by Marcelo Tosatti
Thanks
thanks
Mike Wolf

Marcelo Tosatti
2013-03-06 02:00:01 UTC
Post by Michael Wolf
Sorry for the delay in the response. I did not see your question.
Post by Marcelo Tosatti
Post by Michael Wolf
Add a helper routine to scheduler/core.c to allow the kvm module
to retrieve the cpu hardlimit settings. The values will be used
to set up a timer that is used to separate the consigned from the
steal time.
1) Can you please describe, in english, the mechanics of subtracting cpu
hardlimit values from steal time reported via run_delay supposed to
work?
"The period and the quota used to separate the consigned time
(expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time
accruing during that period will show as the traditional steal time."
There is no "expected steal time" over a fixed period of real time.
There is expected steal time in the sense that the administrator of the
system sets up guests on the host so that there will be cpu
overcommitment.
I refer to

+    /* split the delta into steal and consigned */
+    if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) {
+        vcpu->arch.current_consigned += delta;
+        if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) {
+            steal_delta = vcpu->arch.current_consigned
+                    - vcpu->arch.consigned_quota;
+            consigned_delta = delta - steal_delta;
+        } else {

You can't expect there to be any amount of stolen time over a fixed
period of time.
Post by Michael Wolf
The end user who is using the guest does not know this,
they only know they have been guaranteed a certain level of performance.
So if steal time shows up the end user typically thinks they are not
getting their guaranteed performance. So this patchset is meant to allow
top to show 100% utilization and ONLY show steal time if it is over the
level of steal time that the host administrator setup. So take a simple
example of a host with 1 cpu and two guest on it. If each guest is
fully utilized a user will see 50% utilization and 50% steal in either
of the guests. In this case the amount of steal time that the host
administrator would expect to see is 50%. As long as the steal in the
guest does not exceed 50% the guest is running as expected. If for some
reason the steal increases to 60%, now something is wrong and the steal
time needs to be reported and the end user will make inquiries?
This is the purpose of stolen time: to report the amount of time guest
vcpu was runnable, but not running (IOW: starved).
Post by Michael Wolf
Post by Marcelo Tosatti
2) From the description of patch 1: "In the case of where you have
a system that is running in a capped or overcommitted environment
the user may see steal time being reported in accounting tools
such as top or vmstat."
This is outdated, right? Because overcommitted environment is exactly
what steal time should report.
I hope I'm not missing your point here. But again this comes down to
the point of view. The end user is guaranteed a capability/level of
performance that may not be a whole cpu. So only show steal time if the
amount of steal time exceeds what the host admin expected when the guest
was set up.
The real values must be reported. If the host system becomes suddenly
loaded beyond what the host can provide to the guest, should the system
report an incorrect value, to keep users from complaining? That sounds
incorrect.

Frederic Weisbecker
2013-02-18 16:50:02 UTC
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
Post by Michael Wolf
To ease the confusion this patch set
adds the idea of consigned (expected steal) time. The host will separate
the consigned time from the steal time. The steal time will only be altered
if hard limits (cfs bandwidth control) are used. The period and the quota used
to separate the consigned time (expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time accruing during
that period will show as the traditional steal time.
I'm also a bit confused here. Will steal time then only account for the
cpu time lost due to quotas from cfs bandwidth control? Also, what
exactly do you mean by "expected steal time"? Is it steal time due to
overcommitting minus scheduler quotas?

Thanks.
Marcelo Tosatti
2013-02-19 01:30:01 UTC
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.

But yes, a description of the scenario that is being dealt with, with
details, is important.
Post by Frederic Weisbecker
Post by Michael Wolf
To ease the confusion this patch set
adds the idea of consigned (expected steal) time. The host will separate
the consigned time from the steal time. The steal time will only be altered
if hard limits (cfs bandwidth control) are used. The period and the quota used
to separate the consigned time (expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time accruing during
that period will show as the traditional steal time.
I'm also a bit confused here. steal time will then only account the
cpu time lost due to quotas from cfs bandwidth control? Also what do
you exactly mean by "expected steal time"? Is it steal time due to
overcommitting minus scheduler quotas?
Thanks.
Michael Wolf
2013-03-05 20:40:02 UTC
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes, exactly. The end user often did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like: "In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." Or does that just muddy the picture
more?
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
To ease the confusion this patch set
adds the idea of consigned (expected steal) time. The host will separate
the consigned time from the steal time. The steal time will only be altered
if hard limits (cfs bandwidth control) are used. The period and the quota used
to separate the consigned time (expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time accruing during
that period will show as the traditional steal time.
I'm also a bit confused here. steal time will then only account the
cpu time lost due to quotas from cfs bandwidth control? Also what do
you exactly mean by "expected steal time"? Is it steal time due to
overcommitting minus scheduler quotas?
Thanks.
Thanks
Mike Wolf

Marcelo Tosatti
2013-03-06 02:00:01 UTC
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So the feature aims to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?

Probably don't need to report new data to the guest for that.

Glauber Costa
2013-03-06 08:20:02 UTC
Post by Marcelo Tosatti
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?
Probably don't need to report new data to the guest for that.
If we take into account that 1 second always has one second, I believe
that you can just subtract the consigned time from the steal time the
host passes to the guest.

During each second, the numbers won't sum up to 100. The delta to 100 is
the consigned time, if anyone cares.

Adopting this would simplify this a lot. All you need to do, really, is
to get your calculation right from the bandwidth given by the cpu
controller. Subtract it in the host, and voila.
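
A minimal sketch of this suggestion in C, under assumptions: the function
names steal_to_report and unaccounted_ns are hypothetical, and the consigned
amount is assumed to have already been derived from the cpu controller
bandwidth. The host passes only the steal beyond the consigned share; the
guest can recover the consigned share as the part of the interval that no
category accounts for.

```c
#include <assert.h>
#include <stdint.h>

/* Host side (hypothetical helper): steal passed to the guest after the
 * consigned share has been subtracted, never going below zero. */
static uint64_t steal_to_report(uint64_t run_delay_delta_ns,
                                uint64_t consigned_ns)
{
    return run_delay_delta_ns > consigned_ns
               ? run_delay_delta_ns - consigned_ns
               : 0;
}

/* Guest side (hypothetical helper): time in the interval that no reported
 * category covers, i.e. the "delta to 100" mentioned above. */
static uint64_t unaccounted_ns(uint64_t elapsed_ns, uint64_t accounted_ns)
{
    assert(accounted_ns <= elapsed_ns);
    return elapsed_ns - accounted_ns;
}
```

The trade-off of this design is that the reported categories no longer sum to
the elapsed time, which matters for any tool that assumes they do.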


Michael Wolf
2013-03-06 16:40:01 UTC
Post by Glauber Costa
Post by Marcelo Tosatti
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?
Probably don't need to report new data to the guest for that.
If we take into account that 1 second always have one second, I believe
that you can just subtract the consigned time from the steal time the
host passes to the guest.
During each second, the numbers won't sum up to 100. The delta to 100 is
the consigned time, if anyone cares.
Adopting this would simplify this a lot. All you need to do, really, is
to get your calculation right from the bandwidth given by the cpu
controller. Subtract it in the host, and voila.
I looked at doing that once but was told that I was changing the
interface in an unacceptable way, because now I was not reporting all of
the elapsed time. I agree it would make things simpler.
Marcelo Tosatti
2013-03-07 02:40:01 UTC
Post by Michael Wolf
Post by Glauber Costa
Post by Marcelo Tosatti
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?
Probably don't need to report new data to the guest for that.
If we take into account that 1 second always have one second, I believe
that you can just subtract the consigned time from the steal time the
host passes to the guest.
During each second, the numbers won't sum up to 100. The delta to 100 is
the consigned time, if anyone cares.
Adopting this would simplify this a lot. All you need to do, really, is
to get your calculation right from the bandwidth given by the cpu
controller. Subtract it in the host, and voila.
I looked at doing that once but was told that I was changing the
interface in an unacceptable way, because now I was not reporting all of
the elapsed time. I agree it would make things simpler.
Pointer to that claim, please?

Paul Mackerras
2013-03-07 03:20:02 UTC
Post by Marcelo Tosatti
Post by Michael Wolf
I looked at doing that once but was told that I was changing the
interface in an unacceptable way, because now I was not reporting all of
the elapsed time. I agree it would make things simpler.
Pointer to that claim, please?
Back in about 2004 or 2005 or so I was looking at changing how user
and system times were calculated (in the context of trying to find a
better way to report resources used by a thread in an SMT processor).
I found that utilities such as top expected the deltas in the
/proc/stat numbers to add up to elapsed time, and would report strange
and inconsistent results if that wasn't the case. Unfortunately at
this distance I don't recall the exact details. I don't know whether
the expectation that the deltas in the /proc/stat numbers over a
period of time add up to the elapsed real time is documented anywhere,
but I wouldn't be at all surprised if some programs depend on it, so
it's better to maintain that property.
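
The property described above can be checked with a small C sketch. It is an
illustration only: cpu_line_total and accounted_delta are hypothetical helper
names, and the sketch assumes the aggregate "cpu" line of /proc/stat carries
at least eight fields (user, nice, system, idle, iowait, irq, softirq, steal).
The guest fields that follow are left out on the assumption that guest time is
already included in user and nice. A tool relying on the property would expect
the delta between two samples to match the elapsed ticks summed over the
online CPUs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sum the first eight fields of an aggregate "cpu" line.
 * Returns -1 for per-cpu lines ("cpu0 ...") or lines with fewer fields. */
static long long cpu_line_total(const char *line)
{
    unsigned long long f[8];
    long long total = 0;
    int i;

    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    if (sscanf(line + 4, "%llu %llu %llu %llu %llu %llu %llu %llu",
               &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7]) != 8)
        return -1;
    for (i = 0; i < 8; i++)
        total += (long long)f[i];
    return total;
}

/* Ticks accounted for between two samples of the aggregate line. */
static long long accounted_delta(const char *before, const char *after)
{
    long long a = cpu_line_total(before);
    long long b = cpu_line_total(after);

    assert(a >= 0 && b >= a); /* both lines parsed; counters are cumulative */
    return b - a;
}
```

If part of the steal time were silently dropped rather than moved to another
field, this delta would fall short of the elapsed time by that amount.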

Paul.
Michael Wolf
2013-03-07 20:30:02 UTC
Post by Paul Mackerras
Post by Marcelo Tosatti
Post by Michael Wolf
I looked at doing that once but was told that I was changing the
interface in an unacceptable way, because now I was not reporting all of
the elapsed time. I agree it would make things simpler.
Pointer to that claim, please?
Back in about 2004 or 2005 or so I was looking at changing how user
and system times were calculated (in the context of trying to find a
better way to report resources used by a thread in an SMT processor).
I found that utilities such as top expected the deltas in the
/proc/stat numbers to add up to elapsed time, and would report strange
and inconsistent results if that wasn't the case. Unfortunately at
this distance I don't recall the exact details. I don't know whether
the expectation that the deltas in the /proc/stat numbers over a
period of time add up to the elapsed real time is documented anywhere,
but I wouldn't be at all surprised if some programs depend on it, so
it's better to maintain that property.
I will have to look at this again. When looking at the cpu data where
steal time is reported there isn't a problem today. I will have to run
it and see if there is anything incorrect with the time being reported
for the individual processes.

My real concern here was whether, in changing the /proc/stat interface,
I am going to mess up private tools that look at that information. When
I've looked at vmstat and top they report the cpu information fine, but
I may end up creating problems for homegrown scripts and tools.

Michael Wolf
2013-03-06 16:30:01 UTC
Post by Marcelo Tosatti
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?
Yes, that is the goal.
Post by Marcelo Tosatti
Probably don't need to report new data to the guest for that.
Not sure I understand what you are saying here. Do you mean that I don't
need to report the expected steal from the guest? If I don't do that
then I'm not reporting all of the time and changing /proc/stat in a
bigger way than adding another category. Also I thought I would need to
provide the consigned time and the steal time for debugging purposes.
Maybe I'm missing your point.....
Marcelo Tosatti
2013-03-07 02:40:01 UTC
Post by Michael Wolf
Post by Marcelo Tosatti
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?
Yes, that is the goal.
Post by Marcelo Tosatti
Probably don't need to report new data to the guest for that.
Not sure I understand what you are saying here. Do you mean that I don't
need to report the expected steal from the guest? If I don't do that
then I'm not reporting all of the time and changing /proc/stat in a
bigger way than adding another category. Also I thought I would need to
provide the consigned time and the steal time for debugging purposes.
Maybe I'm missing your point.....
OK so the usefulness of steal time comes from the ability to measure
CPU cycles that the guest is being deprived of, relative to some unit
(implicitly the CPU frequency presented to the VM). That way, it becomes
easier to properly allocate resources.

From the top man page:
st : time stolen from this vm by the hypervisor

Not only is it a problem for the lender, it is also confusing for the
user, who has to subtract the hard capping from the reported steal time
himself.


The problem with the algorithm in the patchset is the following
(practical example):

- Hard capping set to 80% of available CPU.
- vcpu does not exceed its threshold, say workload with 40%
CPU utilization.
- Under this scenario it is possible for the vcpu to be deprived
of cycles (because out of the 40% that the workload uses, only 30% of
actual CPU time is being provided).
- The algorithm in this patchset will not report any stolen time
because it assumes 20% of stolen time reported via 'run_delay'
is fixed at all times (which is false), therefore any valid
stolen time below 20% will not be reported.

Makes sense?
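
The example above can be worked through with a simplified percentage model in
C. This is a sketch of the behavior being criticized, not the patch code:
patch_style_split and struct pct_split are hypothetical names, and the model
assumes that run_delay is absorbed by the consigned allowance (100 minus the
cap) before anything is reported as steal.

```c
#include <assert.h>

struct pct_split {
    unsigned consigned; /* percent of the period hidden as expected steal */
    unsigned steal;     /* percent of the period reported as steal */
};

/* Hypothetical model: fill the consigned allowance first, report the rest. */
static struct pct_split patch_style_split(unsigned run_delay_pct,
                                          unsigned cap_pct)
{
    struct pct_split out;
    unsigned quota;

    assert(cap_pct <= 100 && run_delay_pct <= 100);
    quota = 100 - cap_pct; /* allowance implied by the hard cap */
    out.consigned = run_delay_pct < quota ? run_delay_pct : quota;
    out.steal = run_delay_pct - out.consigned;
    return out;
}
```

With an 80% cap, a workload that wants 40% but receives only 30% has 10% of
run_delay. The model files all 10% under consigned and reports 0% steal, even
though the vcpu never came near its cap.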

Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).

Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via a new field.


Michael Wolf
2013-03-07 21:20:01 UTC
Post by Marcelo Tosatti
Post by Michael Wolf
Post by Marcelo Tosatti
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So what the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?
Yes, that is the goal.
Post by Marcelo Tosatti
Probably don't need to report new data to the guest for that.
Not sure I understand what you are saying here. Do you mean that I don't
need to report the expected steal from the guest? If I don't do that
then I'm not reporting all of the time and changing /proc/stat in a
bigger way than adding another category. Also I thought I would need to
provide the consigned time and the steal time for debugging purposes.
Maybe I'm missing your point.....
OK so the usefulness of steal time comes from the ability to measure
CPU cycles that the guest is being deprived of, relative to some unit
(implicitly the CPU frequency presented to the VM). That way, it becomes
easier to properly allocate resources.
st : time stolen from this vm by the hypervisor
Not only is it a problem for the lender, it is also confusing for the user,
who has to subtract the hard capping from the reported steal time himself.
The problem with the algorithm in the patchset is the following
- Hard capping set to 80% of available CPU.
- vcpu does not exceed its threshold, say workload with 40%
CPU utilization.
- Under this scenario it is possible for vcpu to be deprived
of cycles (because out of the 40% that workload uses, only 30% of
actual CPU time are being provided).
- The algorithm in this patchset will not report any stolen time
because it assumes 20% of stolen time reported via 'run_delay'
is fixed at all times (which is false), therefore any valid
stolen time below 20% will not be reported.
Makes sense?
I understand the scenario. I will have to go back and look at the
CFS bandwidth code and run some tests. The question I have to look at is
how everything is reported in your scenario above.

This will depend on how the cfs bandwidth is configured, whether there
are uncapped processes on the system, and how cpu intensive they are.

I will run some tests and report back.
Post by Marcelo Tosatti
Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).
Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via new field.
I looked at doing something like this. If bandwidth controls are
configured there is a throttled flag. So in effect if the throttled
flag is set, don't add the time spent on the runqueue. But this will
fail to work in some cases.

For example:

- You set up cfs bandwidth controls so that the group gets 50% of the
processor.
- Have 1 physical cpu.
- Have 2 guests, each with 1 vcpu.
- Have each guest running to its full entitlement.

So in this case each guest will have time on the runqueue but neither
will ever be throttled since they will not exceed their quota in the
defined period. So now just trying to do this in the scheduler doesn't
work because you cannot rely on the throttled flag. In either case the
time is accumulated as time on the runqueue.

This is why my patchset had included a timer. It was basically
mimicking the bandwidth controller by using a timer set to the same
period. So in a given period of time a fixed quota of time on the
runqueue can be expected. If the amount of time on the runqueue exceeds
the expected, then report it.
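
The two approaches described above can be compared with a small sketch. This is a simplified model, not the scheduler or patchset code: both function names, the 100 ms period, and the rule that at most (period - quota) of runqueue wait is expected per period are assumptions made for illustration.

```python
def split_by_throttled_flag(wait_ms, throttled):
    """Discount runqueue wait only while the group is flagged as throttled."""
    return (wait_ms, 0) if throttled else (0, wait_ms)  # (consigned, steal)


def split_by_period_budget(wait_ms, period_ms, quota_ms):
    """Discount up to (period - quota) of runqueue wait in each period."""
    expected_ms = period_ms - quota_ms
    consigned_ms = min(wait_ms, expected_ms)
    return consigned_ms, wait_ms - consigned_ms  # (consigned, steal)


# Two guests share one physical cpu; each is entitled to 50 ms per 100 ms
# period and uses exactly that, so each waits 50 ms and is never throttled.
period_ms, quota_ms = 100, 50
wait_ms = period_ms - quota_ms

flag_result = split_by_throttled_flag(wait_ms, throttled=False)
budget_result = split_by_period_budget(wait_ms, period_ms, quota_ms)
print(flag_result)    # (consigned, steal) when relying on the throttled flag
print(budget_result)  # (consigned, steal) when mimicking the period and quota
```

In this model the flag-based split reports the whole wait as steal, while the period-based split reports it as consigned, which is the behaviour the timer was meant to provide.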
Michael Wolf
2013-03-07 21:20:02 UTC
Post by Marcelo Tosatti
Post by Michael Wolf
Post by Marcelo Tosatti
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
So what the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?
Yes, that is the goal.
Post by Marcelo Tosatti
Probably don't need to report new data to the guest for that.
Not sure I understand what you are saying here. Do you mean that I don't
need to report the expected steal from the guest? If I don't do that
then I'm not reporting all of the time and changing /proc/stat in a
bigger way than adding another category. Also I thought I would need to
provide the consigned time and the steal time for debugging purposes.
Maybe I'm missing your point.....
OK so the usefulness of steal time comes from the ability to measure
CPU cycles that the guest is being deprived of, relative to some unit
(implicitly the CPU frequency presented to the VM). That way, it becomes
easier to properly allocate resources.
st : time stolen from this vm by the hypervisor
Not only is it a problem for the lender, it is also confusing for the user,
who has to subtract the hard capping from the reported steal time himself.
The problem with the algorithm in the patchset is the following
- Hard capping set to 80% of available CPU.
- vcpu does not exceed its threshold, say workload with 40%
CPU utilization.
- Under this scenario it is possible for vcpu to be deprived
of cycles (because out of the 40% that workload uses, only 30% of
actual CPU time are being provided).
- The algorithm in this patchset will not report any stolen time
because it assumes 20% of stolen time reported via 'run_delay'
is fixed at all times (which is false), therefore any valid
stolen time below 20% will not be reported.
Makes sense?
Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).
Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via new field.
I didn't respond to this in the previous response. I'm not sure I'm
following you here. I thought this is what I was doing by having a
consigned (expected steal) field added to the /proc/stat output. Are you
looking for something else or a better naming convention?
Marcelo Tosatti
2013-03-07 21:30:02 UTC
Post by Michael Wolf
Post by Marcelo Tosatti
Makes sense?
Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).
Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via new field.
I didn't respond to this in the previous response. I'm not sure I'm
following you here. I thought this is what I was doing by having a
consigned (expected steal) field added to the /proc/stat output. Are you
looking for something else or a better naming convention?
Expected steal is not a good measure to use (because as mentioned in the
previous email there is no expected steal over a fixed period of time).

It is fine to report 'maximum percentage of underlying physical CPU'
(what percentage of the physical CPU time the guest VM is allowed to make
use of).

And then steal time is relative to maximum percentage of underlying
physical CPU time allowed.

Michael Wolf
2013-03-07 22:40:01 UTC
Post by Marcelo Tosatti
Post by Michael Wolf
Post by Marcelo Tosatti
Makes sense?
Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).
Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via new field.
I didn't respond to this in the previous response. I'm not sure I'm
following you here. I thought this is what I was doing by having a
consigned (expected steal) field added to the /proc/stat output. Are you
looking for something else or a better naming convention?
Expected steal is not a good measure to use (because as mentioned in the
previous email there is no expected steal over a fixed period of time).
It is fine to report 'maximum percentage of underlying physical CPU'
(what percentage of the physical CPU time guest VM is allowed to make
use of).
And then steal time is relative to maximum percentage of underlying
physical CPU time allowed.
So last August I had sent out an RFC set of patches to do this. That
patchset was meant to handle the general overcommit case as well as the
capping case by having qemu pass a percentage to the host that would
then be passed onto the guest and used to adjust the steal time.
Here is the link to the discussion
http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html

As you will see there, Avi didn't like the idea of a percentage down in
the guest; among other reasons, he was concerned about migration. Also
in the email thread you will see that Anthony Liguori was opposed to the
idea of just changing the steal time; he wanted it split out.

What Glauber has suggested and I am working on implementing is taking
out the timer and adding a last read field in the host. So in the host
I can determine the total time that has passed and compute a percentage
and apply that percentage to the steal time while the info is still on
the host. Then pass the steal and consigned time to the guest.
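
One possible reading of the last-read idea above can be sketched as follows. This is only an interpretation under stated assumptions, not Glauber's proposal or the patch code: the class, its field names, and the rule that the expected share of each interval is (period - quota) / period are all hypothetical.

```python
class StealSplitter:
    """Host-side sketch: split run_delay growth into consigned and steal."""

    def __init__(self, period_ns, quota_ns, now_ns=0):
        self.period_ns = period_ns
        self.quota_ns = quota_ns
        self.last_read_ns = now_ns   # the "last read" field
        self.last_run_delay_ns = 0
        self.consigned_ns = 0
        self.steal_ns = 0

    def read(self, now_ns, run_delay_ns):
        # Time since the previous read replaces the fixed-period timer.
        elapsed_ns = now_ns - self.last_read_ns
        delta_ns = run_delay_ns - self.last_run_delay_ns
        # Share of the elapsed time the cap alone would be expected to take.
        expected_ns = elapsed_ns * (self.period_ns - self.quota_ns) // self.period_ns
        consigned_ns = min(delta_ns, expected_ns)
        self.consigned_ns += consigned_ns
        self.steal_ns += delta_ns - consigned_ns
        self.last_read_ns = now_ns
        self.last_run_delay_ns = run_delay_ns
        return self.consigned_ns, self.steal_ns


splitter = StealSplitter(period_ns=100, quota_ns=80)
first = splitter.read(now_ns=1000, run_delay_ns=300)
second = splitter.read(now_ns=2000, run_delay_ns=400)
print(first, second)
```

Both totals are computed on the host in this sketch, so only the two resulting counters would need to be passed to the guest.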

Does that address your concerns?

Marcelo Tosatti
2013-03-08 02:10:01 UTC
Post by Michael Wolf
Post by Marcelo Tosatti
Post by Michael Wolf
Post by Marcelo Tosatti
Makes sense?
Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).
Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via new field.
I didn't respond to this in the previous response. I'm not sure I'm
following you here. I thought this is what I was doing by having a
consigned (expected steal) field added to the /proc/stat output. Are you
looking for something else or a better naming convention?
Expected steal is not a good measure to use (because as mentioned in the
previous email there is no expected steal over a fixed period of time).
It is fine to report 'maximum percentage of underlying physical CPU'
(what percentage of the physical CPU time guest VM is allowed to make
use of).
And then steal time is relative to maximum percentage of underlying
physical CPU time allowed.
So last August I had sent out an RFC set of patches to do this. That
patchset was meant to handle the general overcommit case as well as the
capping case by having qemu pass a percentage to the host that would
then be passed onto the guest and used to adjust the steal time.
Here is the link to the discussion
http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html
As you will see there Avi didn't like the idea of a percentage down in
the guest, among other reasons he was concerned about migration. Also
in the email thread you will see that Anthony Liguori was opposed to the
idea of just changing the steal time, he wanted it split out.
What Glauber has suggested and I am working on implementing is taking
out the timer and adding a last read field in the host. So in the host
I can determine the total time that has passed and compute a percentage
and apply that percentage to the steal time while the info is still on
the host. Then pass the steal and consigned time to the guest.
Does that address your concerns?
I am not asking about passing a percentage down from the host - just
pointing out a counter example to the correctness of the current algorithm.

I cannot see how you can report a proper steal time value relative to
hard cap without having that number calculated in the scheduler. IOW,
"run_delay" must be split in two: you want to differentiate whether run
delay was due to hard cap exhaustion or due to other reasons. Without
that, steal time reporting is incorrect (as the example details). Now
the question is, how to do that separation.
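
The shape of such a split can be sketched as below. It deliberately leaves the open question unanswered: the `due_to_cap` argument is supplied by the caller, and how the scheduler would determine it is not specified here. The class and field names are hypothetical and do not correspond to existing kernel structures.

```python
from dataclasses import dataclass


@dataclass
class SplitRunDelay:
    """Hypothetical two-way split of run_delay; field names are made up."""
    cap_ns: int = 0    # waiting because the hard cap was exhausted
    other_ns: int = 0  # waiting for any other reason

    def account(self, wait_ns: int, due_to_cap: bool) -> None:
        # The caller decides the cause; this sketch only keeps the books.
        if due_to_cap:
            self.cap_ns += wait_ns
        else:
            self.other_ns += wait_ns

    @property
    def total_ns(self) -> int:
        # Sum of both buckets, i.e. what an unsplit counter would hold.
        return self.cap_ns + self.other_ns


rd = SplitRunDelay()
rd.account(20_000_000, due_to_cap=True)
rd.account(10_000_000, due_to_cap=False)
print(rd.cap_ns, rd.other_ns, rd.total_ns)
```

With two buckets, steal time could be reported from `other_ns` alone, without assuming any fixed amount of expected steal per period.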

Marcelo Tosatti
2013-03-08 02:30:02 UTC
Post by Marcelo Tosatti
Post by Michael Wolf
Post by Marcelo Tosatti
Post by Michael Wolf
Post by Marcelo Tosatti
Makes sense?
Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).
Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via new field.
I didn't respond to this in the previous response. I'm not sure I'm
following you here. I thought this is what I was doing by having a
consigned (expected steal) field added to the /proc/stat output. Are you
looking for something else or a better naming convention?
Expected steal is not a good measure to use (because as mentioned in the
previous email there is no expected steal over a fixed period of time).
It is fine to report 'maximum percentage of underlying physical CPU'
(what percentage of the physical CPU time guest VM is allowed to make
use of).
And then steal time is relative to maximum percentage of underlying
physical CPU time allowed.
So last August I had sent out an RFC set of patches to do this. That
patchset was meant to handle the general overcommit case as well as the
capping case by having qemu pass a percentage to the host that would
then be passed onto the guest and used to adjust the steal time.
Here is the link to the discussion
http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html
As you will see there Avi didn't like the idea of a percentage down in
the guest, among other reasons he was concerned about migration.
OK.
Post by Marcelo Tosatti
Post by Michael Wolf
Also in the email thread you will see that Anthony Liguori was
opposed to the idea of just changing the steal time, he wanted it
split out.
"What I had previously suggested what splitting entitlement loss out of
steal time and reporting it as a separate metric (but not reporting a
fixed notion of entitlement).

You're missing the entitlement loss bit above. But you need to call
out entitlement loss in order to report idle time correctly.

I think changing steal time (as this patch does) is wrong.

Regards,

Anthony Liguori"

This is what is suggested below. What you mentioned earlier

"So in this case each guest will have time on the runqueue but neither
will ever be throttled since they will not exceed their quota in the
defined period. So now just trying to do this in the scheduler doesn't
work because you cannot rely on the throttled flag. In either case the
time is accumulated as time on the runqueue.

This is why my patchset had included a timer. It was basically
mimicking the bandwidth controller by using a timer set to the same
period. So in a given period of time a fixed quota of time on the
runqueue can be expected. If the amount of time on the runqueue exceeds
the expected, then report it."

Understood, but it's problematic: it is possible for a vcpu to be
deprived of cycles even if it did not exceed its quota. Did you
investigate whether it's possible to split run_delay?
Post by Marcelo Tosatti
Post by Michael Wolf
What Glauber has suggested and I am working on implementing is taking
out the timer and adding a last read field in the host. So in the host
I can determine the total time that has passed and compute a percentage
and apply that percentage to the steal time while the info is still on
the host. Then pass the steal and consigned time to the guest.
Or maybe I missed why the suggestion above is immune to this problem?
Post by Marcelo Tosatti
Post by Michael Wolf
Does that address your concerns?
I am not asking about passing percentage down the host - just pointing
out a counter example to the correctness of the current algorithm.
I cannot see how you can report proper steal time value relative to
hard cap without having that number calculated in the scheduler. IOW,
"run_delay" must be split in two: you want to differentiate whether run
delay was due to hard cap exhaustion or due to other reasons. Without
that, steal time reporting is incorrect (as the example details). Now
the question is, how to do that separation.

Frederic Weisbecker
2013-03-06 13:40:02 UTC
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
That alone is probably not enough. But yeah, make sure you clearly
state the difference between expected (caps, sched bandwidth...) and
unexpected (overcommitting, competing load...) stolen time. Then add a
practical example as you made above that explains why it matters to
make that distinction and why you want to report it.
Michael Wolf
2013-03-06 16:40:01 UTC
Post by Frederic Weisbecker
Post by Michael Wolf
Sorry for the delay in the response. I did not see the email
right away.
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Yes exactly and the end user many times did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
I will add more detail to the description next time I submit the
patches. How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." or does that just muddy the picture
more??
That alone is probably not enough. But yeah, make sure you clearly
state the difference between expected (caps, sched bandwidth...) and
unexpected (overcommitting, competing load...) stolen time. Then add a
practical example as you made above that explains why it matters to
make that distinction and why you want to report it.
Ok, I understand what you are requesting. I will make sure to add it to
the description the next time I submit the patches.

Frederic Weisbecker
2013-03-06 13:30:02 UTC
Post by Marcelo Tosatti
Post by Frederic Weisbecker
Post by Michael Wolf
In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user.
Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.
I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.
Ok, that's a good explanation to add to make that subtle steal time
issue clearer.
Post by Marcelo Tosatti
But yes, a description of the scenario that is being dealt with, with
details, is important.
Yeah especially for such a significant user ABI change.