Discussion:
[PATCH 6/8] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate
Peter Zijlstra
2012-11-12 16:30:02 UTC
Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.

- it creates performance problems for tasks with very
large working sets

- it over-samples processes with large address spaces
that only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it by a constant rate (in a CPU cycles execution
proportional manner). If the offset reaches the last mapped
address of the mm it starts over at the first
address.

The per-task nature of the working set sampling functionality
in this tree allows constant-rate, per-task,
execution-weight-proportional sampling of the working set,
with an adaptive sampling interval/frequency that ranges from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.

As tasks mature and their working set converges, the
sampling rate slows down to just a trickle: 256 MB per 1.6
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits the sampling
of rarely executing tasks and does not over-sample on
overloaded systems.
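As a rough illustration, the rotating-offset scan can be sketched in user space; the `struct vma` array, `scan_once()` and the marking counter below are stand-ins for the kernel's VMA list, `task_numa_work()` and `change_prot_none()`, not code from the patch:

```c
#include <assert.h>

/* Illustrative stand-in for a task's mapped ranges: [start, end). */
struct vma { unsigned long start, end; };

#define SCAN_CHUNK (256UL << 20)        /* 256 MB sampled per interval */

/*
 * Advance a rotating offset through the mappings, "marking" at most
 * SCAN_CHUNK bytes per call and wrapping to the first mapping once the
 * last mapped address is passed.  Returns the updated offset.
 */
static unsigned long scan_once(const struct vma *vmas, int nr,
                               unsigned long offset, unsigned long *marked)
{
    long budget = SCAN_CHUNK;
    int i = 0;

    while (i < nr && vmas[i].end <= offset)     /* find_vma() analogue */
        i++;
    if (i == nr) {                              /* wrapped: start over */
        offset = 0;
        i = 0;
    }
    for (; i < nr && budget > 0; i++) {
        unsigned long s = offset > vmas[i].start ? offset : vmas[i].start;
        unsigned long e = s + (unsigned long)budget < vmas[i].end
                        ? s + (unsigned long)budget : vmas[i].end;

        *marked += e - s;       /* stand-in for change_prot_none() */
        budget -= (long)(e - s);
        offset = e;
    }
    return offset;
}
```

Each call samples a constant 256 MB of address space, and a long-running task simply cycles around its mappings at that fixed rate.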

[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.

So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <***@redhat.com>
Bug-Found-By: Dan Carpenter <***@oracle.com>
Signed-off-by: Peter Zijlstra <***@chello.nl>
Cc: Linus Torvalds <***@linux-foundation.org>
Cc: Andrew Morton <***@linux-foundation.org>
Cc: Peter Zijlstra <***@chello.nl>
Cc: Andrea Arcangeli <***@redhat.com>
Cc: Rik van Riel <***@redhat.com>
Cc: Mel Gorman <***@suse.de>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <***@kernel.org>
---
include/linux/mm_types.h | 1 +
include/linux/sched.h | 1 +
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++-------------
kernel/sysctl.c | 7 +++++++
4 files changed, 39 insertions(+), 13 deletions(-)

Index: linux/include/linux/mm_types.h
===================================================================
--- linux.orig/include/linux/mm_types.h
+++ linux/include/linux/mm_types.h
@@ -405,6 +405,7 @@ struct mm_struct {
#endif
#ifdef CONFIG_SCHED_NUMA
unsigned long numa_next_scan;
+ unsigned long numa_scan_offset;
int numa_scan_seq;
#endif
struct uprobes_state uprobes_state;
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -2047,6 +2047,7 @@ extern enum sched_tunable_scaling sysctl

extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_scan_size;
extern unsigned int sysctl_sched_numa_settle_count;

#ifdef CONFIG_SCHED_DEBUG
Index: linux/kernel/sched/fair.c
===================================================================
--- linux.orig/kernel/sched/fair.c
+++ linux/kernel/sched/fair.c
@@ -825,8 +825,9 @@ static void account_numa_dequeue(struct
/*
* numa task sample period in ms: 5s
*/
-unsigned int sysctl_sched_numa_scan_period_min = 5000;
-unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+unsigned int sysctl_sched_numa_scan_period_min = 100;
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */

/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -912,6 +913,9 @@ void task_numa_work(struct callback_head
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long offset, end;
+ long length;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -938,18 +942,31 @@ void task_numa_work(struct callback_head
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- ACCESS_ONCE(mm->numa_scan_seq)++;
- {
- struct vm_area_struct *vma;
-
- down_write(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
- continue;
- change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
- }
- up_write(&mm->mmap_sem);
+ offset = mm->numa_scan_offset;
+ length = sysctl_sched_numa_scan_size;
+ length <<= 20;
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, offset);
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
+ }
+ for (; vma && length > 0; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+
+ offset = max(offset, vma->vm_start);
+ end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+ length -= end - offset;
+
+ change_prot_none(vma, offset, end);
+
+ offset = end;
}
+ mm->numa_scan_offset = offset;
+ up_write(&mm->mmap_sem);
}

/*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_numa_scan_size_mb",
+ .data = &sysctl_sched_numa_scan_size,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_settle_count",
.data = &sysctl_sched_numa_settle_count,
.maxlen = sizeof(unsigned int),


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Peter Zijlstra
2012-11-12 16:30:03 UTC
Introduce a per-page last_cpu field, fold this into the struct
page::flags field whenever possible.

The unlikely/rare 32-bit NUMA configs will likely grow the page frame.

[ Completely dropping 32bit support for CONFIG_SCHED_NUMA would simplify
things, but it would also remove the warning if we grow enough 64bit
only page-flags to push the last-cpu out. ]
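Folding the field into page::flags amounts to a cmpxchg loop that rewrites only the packed bits while leaving every other flag intact. A user-space analogue with C11 atomics (the bit position and width here are made up for the demo; the real ones come from page-flags-layout.h):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative layout: an 8-bit last_cpu field packed at bit 16. */
#define DEMO_LAST_CPU_SHIFT 16
#define DEMO_LAST_CPU_MASK  0xffUL

/*
 * Atomically swap in a new last_cpu while preserving every other bit,
 * mirroring the cmpxchg loop of page_xchg_last_cpu().  Returns the
 * previous value of the field.
 */
static unsigned long flags_xchg_last_cpu(_Atomic unsigned long *flags,
                                         unsigned long cpu)
{
    unsigned long old, new;

    do {
        old = atomic_load(flags);
        new = old & ~(DEMO_LAST_CPU_MASK << DEMO_LAST_CPU_SHIFT);
        new |= (cpu & DEMO_LAST_CPU_MASK) << DEMO_LAST_CPU_SHIFT;
    } while (!atomic_compare_exchange_weak(flags, &old, new));

    return (old >> DEMO_LAST_CPU_SHIFT) & DEMO_LAST_CPU_MASK;
}
```

The retry loop is needed because other flag bits may be updated concurrently; the store only succeeds when no other writer raced in between the load and the compare-and-exchange.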

Suggested-by: Rik van Riel <***@redhat.com>
Signed-off-by: Peter Zijlstra <***@chello.nl>
Cc: Linus Torvalds <***@linux-foundation.org>
Cc: Andrew Morton <***@linux-foundation.org>
Cc: Peter Zijlstra <***@chello.nl>
Cc: Andrea Arcangeli <***@redhat.com>
Cc: Rik van Riel <***@redhat.com>
Cc: Mel Gorman <***@suse.de>
Signed-off-by: Ingo Molnar <***@kernel.org>
---
include/linux/mm.h | 90 ++++++++++++++++++++------------------
include/linux/mm_types.h | 9 +++
include/linux/mmzone.h | 14 -----
include/linux/page-flags-layout.h | 83 +++++++++++++++++++++++++++++++++++
kernel/bounds.c | 2
mm/huge_memory.c | 3 +
mm/memory.c | 4 +
7 files changed, 151 insertions(+), 54 deletions(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -594,50 +594,11 @@ static inline pte_t maybe_mkwrite(pte_t
* sets it, so none of the operations on it need to be atomic.
*/

-
-/*
- * page->flags layout:
- *
- * There are three possibilities for how page->flags get
- * laid out. The first is for the normal case, without
- * sparsemem. The second is for sparsemem when there is
- * plenty of space for node and section. The last is when
- * we have run out of space and have to fall back to an
- * alternate (slower) way of determining the node.
- *
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
- */
-#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-#define SECTIONS_WIDTH SECTIONS_SHIFT
-#else
-#define SECTIONS_WIDTH 0
-#endif
-
-#define ZONES_WIDTH ZONES_SHIFT
-
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define NODES_WIDTH NODES_SHIFT
-#else
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-#error "Vmemmap: No space for nodes field in page flags"
-#endif
-#define NODES_WIDTH 0
-#endif
-
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPU] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-
-/*
- * We are going to use the flags for the page to node mapping if its in
- * there. This includes the case where there is no node, so it is implicit.
- */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
-#endif
+#define LAST_CPU_PGOFF (ZONES_PGOFF - LAST_CPU_WIDTH)

/*
* Define the bit shifts to access each section. For non-existent
@@ -647,6 +608,7 @@ static inline pte_t maybe_mkwrite(pte_t
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_CPU_PGSHIFT (LAST_CPU_PGOFF * (LAST_CPU_WIDTH != 0))

/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -668,6 +630,7 @@ static inline pte_t maybe_mkwrite(pte_t
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_CPU_MASK ((1UL << LAST_CPU_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)

static inline enum zone_type page_zonenum(const struct page *page)
@@ -706,6 +669,51 @@ static inline int page_to_nid(const stru
}
#endif

+#ifdef CONFIG_SCHED_NUMA
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return xchg(&page->_last_cpu, cpu);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page->_last_cpu;
+}
+#else
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ unsigned long old_flags, flags;
+ int last_cpu;
+
+ do {
+ old_flags = flags = page->flags;
+ last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+
+ flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
+ flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
+ } while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+ return last_cpu;
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+}
+#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_SCHED_NUMA */
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return page_to_nid(page);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page_to_nid(page);
+}
+#endif /* CONFIG_SCHED_NUMA */
+
static inline struct zone *page_zone(const struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
Index: linux/include/linux/mm_types.h
===================================================================
--- linux.orig/include/linux/mm_types.h
+++ linux/include/linux/mm_types.h
@@ -12,6 +12,7 @@
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
+#include <linux/page-flags-layout.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -175,6 +176,10 @@ struct page {
*/
void *shadow;
#endif
+
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+ int _last_cpu;
+#endif
}
/*
* The struct page can be forced to be double word aligned so that atomic ops
@@ -398,6 +403,10 @@ struct mm_struct {
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long numa_next_scan;
+ int numa_scan_seq;
+#endif
struct uprobes_state uprobes_state;
};

Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -15,7 +15,7 @@
#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
-#include <generated/bounds.h>
+#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
#include <asm/page.h>

@@ -318,16 +318,6 @@ enum zone_type {
* match the requested limits. See gfp_zone() in include/linux/gfp.h
*/

-#if MAX_NR_ZONES < 2
-#define ZONES_SHIFT 0
-#elif MAX_NR_ZONES <= 2
-#define ZONES_SHIFT 1
-#elif MAX_NR_ZONES <= 4
-#define ZONES_SHIFT 2
-#else
-#error ZONES_SHIFT -- too many zones configured adjust calculation
-#endif
-
struct zone {
/* Fields commonly accessed by the page allocator */

@@ -1030,8 +1020,6 @@ static inline unsigned long early_pfn_to
* PA_SECTION_SHIFT physical address to/from section number
* PFN_SECTION_SHIFT pfn to/from section number
*/
-#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
-
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

Index: linux/include/linux/page-flags-layout.h
===================================================================
--- /dev/null
+++ linux/include/linux/page-flags-layout.h
@@ -0,0 +1,83 @@
+#ifndef _LINUX_PAGE_FLAGS_LAYOUT
+#define _LINUX_PAGE_FLAGS_LAYOUT
+
+#include <linux/numa.h>
+#include <generated/bounds.h>
+
+#if MAX_NR_ZONES < 2
+#define ZONES_SHIFT 0
+#elif MAX_NR_ZONES <= 2
+#define ZONES_SHIFT 1
+#elif MAX_NR_ZONES <= 4
+#define ZONES_SHIFT 2
+#else
+#error ZONES_SHIFT -- too many zones configured adjust calculation
+#endif
+
+#ifdef CONFIG_SPARSEMEM
+#include <asm/sparsemem.h>
+
+/*
+ * SECTION_SHIFT #bits space required to store a section #
+ */
+#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
+#endif
+
+/*
+ * page->flags layout:
+ *
+ * There are five possibilities for how page->flags get laid out. The first
+ * (and second) is for the normal case, without sparsemem. The third is for
+ * sparsemem when there is plenty of space for node and section. The last is
+ * when we have run out of space and have to fall back to an alternate (slower)
+ * way of determining the node.
+ *
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| SECTION | NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
+ */
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+
+#define SECTIONS_WIDTH SECTIONS_SHIFT
+#else
+#define SECTIONS_WIDTH 0
+#endif
+
+#define ZONES_WIDTH ZONES_SHIFT
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define NODES_WIDTH NODES_SHIFT
+#else
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#error "Vmemmap: No space for nodes field in page flags"
+#endif
+#define NODES_WIDTH 0
+#endif
+
+#ifdef CONFIG_SCHED_NUMA
+#define LAST_CPU_SHIFT NR_CPUS_BITS
+#else
+#define LAST_CPU_SHIFT 0
+#endif
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPU_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPU_WIDTH LAST_CPU_SHIFT
+#else
+#define LAST_CPU_WIDTH 0
+#endif
+
+/*
+ * We are going to use the flags for the page to node mapping if its in
+ * there. This includes the case where there is no node, so it is implicit.
+ */
+#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#define NODE_NOT_IN_PAGE_FLAGS
+#endif
+
+#if defined(CONFIG_SCHED_NUMA) && LAST_CPU_WIDTH == 0
+#define LAST_CPU_NOT_IN_PAGE_FLAGS
+#endif
+
+#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
Index: linux/kernel/bounds.c
===================================================================
--- linux.orig/kernel/bounds.c
+++ linux/kernel/bounds.c
@@ -10,6 +10,7 @@
#include <linux/mmzone.h>
#include <linux/kbuild.h>
#include <linux/page_cgroup.h>
+#include <linux/log2.h>

void foo(void)
{
@@ -17,5 +18,6 @@ void foo(void)
DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+ DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
/* End of constants */
}
Index: linux/mm/huge_memory.c
===================================================================
--- linux.orig/mm/huge_memory.c
+++ linux/mm/huge_memory.c
@@ -746,6 +746,7 @@ void do_huge_pmd_numa_page(struct mm_str
struct page *new_page = NULL;
struct page *page = NULL;
int node, lru;
+ int last_cpu;

spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry)))
@@ -760,6 +761,7 @@ void do_huge_pmd_numa_page(struct mm_str
page = pmd_page(entry);
if (page) {
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ last_cpu = page_last_cpu(page);

get_page(page);
node = mpol_misplaced(page, vma, haddr);
@@ -1441,6 +1443,7 @@ static void __split_huge_page_refcount(s
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
+ page_xchg_last_cpu(page, page_last_cpu(page_tail));

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -70,6 +70,10 @@

#include "internal.h"

+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA config, growing page-frame for last_cpu.
+#endif
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;


Ingo Molnar
2012-11-13 12:00:02 UTC
A cleanliness side note, this bit does not belong into this patch:
Peter Zijlstra wrote:
Index: linux/include/linux/mm_types.h
===================================================================
--- linux.orig/include/linux/mm_types.h
+++ linux/include/linux/mm_types.h
@@ -398,6 +403,10 @@ struct mm_struct {
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long numa_next_scan;
+ int numa_scan_seq;
+#endif
struct uprobes_state uprobes_state;
};
I've moved it over into the 5th patch.

Thanks,

Ingo
Rik van Riel
2012-11-13 16:10:01 UTC
Peter Zijlstra wrote:
@@ -706,6 +669,51 @@ static inline int page_to_nid(const stru
}
#endif
+#ifdef CONFIG_SCHED_NUMA
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return xchg(&page->_last_cpu, cpu);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page->_last_cpu;
+}
+#else
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ unsigned long old_flags, flags;
+ int last_cpu;
+
+ do {
+ old_flags = flags = page->flags;
+ last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+
+ flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
+ flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
+ } while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+ return last_cpu;
+}
These functions, and the accompanying config option, could
use some comments and documentation, explaining why things
are done this way, why it is safe, and what (if any) constraints
it places on other users of page.flags ...
Peter Zijlstra wrote:
+static inline int page_last_cpu(struct page *page)
+{
+ return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+}
+#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_SCHED_NUMA */
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return page_to_nid(page);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page_to_nid(page);
+}
+#endif /* CONFIG_SCHED_NUMA */
+
--
All rights reversed
Peter Zijlstra
2012-11-12 16:30:04 UTC
Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch
the initial scan would happen much later still, in effect that
patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they had better stick to the node
they were started on. As tasks mature and rebalance to other CPUs
and nodes, their NUMA placement has to change and it starts to
matter more and more.

In practice this change fixes an observable kbuild regression:

# [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

!NUMA:
45.291088843 seconds time elapsed ( +- 0.40% )
45.154231752 seconds time elapsed ( +- 0.36% )

+NUMA, no slow start:
46.172308123 seconds time elapsed ( +- 0.30% )
46.343168745 seconds time elapsed ( +- 0.25% )

+NUMA, 1 sec slow start:
45.224189155 seconds time elapsed ( +- 0.25% )
45.160866532 seconds time elapsed ( +- 0.17% )

and it also fixes an observable perf bench (hackbench) regression:

# perf stat --null --repeat 10 perf bench sched messaging

-NUMA: 0.246225691 seconds time elapsed ( +- 1.31% )
+NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )

+NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% )

The implementation is simple and straightforward, most of the patch
deals with adding the /proc/sys/kernel/sched_numa_scan_delay_ms tunable
knob.
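The trick is that initializing numa_scan_period to the delay makes the first period act as the slow start; every later trigger runs at the normal minimum period. A user-space sketch of that trigger logic, with illustrative `demo_*` names and times in ns of task execution as in task_tick_numa():

```c
#include <assert.h>

struct demo_task {
    unsigned long long node_stamp;      /* runtime at last trigger */
    unsigned int scan_period_ms;
};

#define DEMO_SCAN_DELAY_MS      1000
#define DEMO_SCAN_PERIOD_MIN_MS 100

static void demo_fork(struct demo_task *t)
{
    t->node_stamp = 0;
    t->scan_period_ms = DEMO_SCAN_DELAY_MS;   /* first period == delay */
}

/* Returns 1 when a scan (task_numa_work() analogue) should be queued. */
static int demo_tick(struct demo_task *t, unsigned long long now_ns)
{
    unsigned long long period =
        (unsigned long long)t->scan_period_ms * 1000000ULL;

    if (now_ns - t->node_stamp > period) {
        t->node_stamp += period;    /* keep cadence; don't reset to now */
        t->scan_period_ms = DEMO_SCAN_PERIOD_MIN_MS;
        return 1;
    }
    return 0;
}
```

Advancing node_stamp by the period rather than resetting it to `now` (as the patch does) keeps the cadence stable even when ticks arrive late.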

Signed-off-by: Peter Zijlstra <***@chello.nl>
Cc: Linus Torvalds <***@linux-foundation.org>
Cc: Andrew Morton <***@linux-foundation.org>
Cc: Peter Zijlstra <***@chello.nl>
Cc: Andrea Arcangeli <***@redhat.com>
Cc: Rik van Riel <***@redhat.com>
Cc: Mel Gorman <***@suse.de>
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <***@kernel.org>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 16 ++++++++++------
kernel/sysctl.c | 7 +++++++
4 files changed, 19 insertions(+), 7 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -2045,6 +2045,7 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_sched_numa_scan_delay;
extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
extern unsigned int sysctl_sched_numa_scan_size;
Index: linux/kernel/sched/core.c
===================================================================
--- linux.orig/kernel/sched/core.c
+++ linux/kernel/sched/core.c
@@ -1556,7 +1556,7 @@ static void __sched_fork(struct task_str
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = 2;
p->numa_faults = NULL;
- p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_scan_period = sysctl_sched_numa_scan_delay;
p->numa_work.next = &p->numa_work;
#endif /* CONFIG_SCHED_NUMA */
}
Index: linux/kernel/sched/fair.c
===================================================================
--- linux.orig/kernel/sched/fair.c
+++ linux/kernel/sched/fair.c
@@ -823,11 +823,12 @@ static void account_numa_dequeue(struct
}

/*
- * numa task sample period in ms: 5s
+ * Scan @scan_size MB every @scan_period after an initial @scan_delay.
*/
-unsigned int sysctl_sched_numa_scan_period_min = 100;
-unsigned int sysctl_sched_numa_scan_period_max = 100*16;
-unsigned int sysctl_sched_numa_scan_size = 256; /* MB */
+unsigned int sysctl_sched_numa_scan_delay = 1000; /* ms */
+unsigned int sysctl_sched_numa_scan_period_min = 100; /* ms */
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;/* ms */
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */

/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -938,10 +939,12 @@ void task_numa_work(struct callback_head
if (time_before(now, migrate))
return;

- next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ next_scan = now + msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

+ current->numa_scan_period += jiffies_to_msecs(2);
+
start = mm->numa_scan_offset;
pages = sysctl_sched_numa_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
@@ -998,7 +1001,8 @@ void task_tick_numa(struct rq *rq, struc
period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;

if (now - curr->node_stamp > period) {
- curr->node_stamp = now;
+ curr->node_stamp += period;
+ curr->numa_scan_period = sysctl_sched_numa_scan_period_min;

/*
* We are comparing runtime to wall clock time here, which
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
#endif /* CONFIG_SMP */
#ifdef CONFIG_SCHED_NUMA
{
+ .procname = "sched_numa_scan_delay_ms",
+ .data = &sysctl_sched_numa_scan_delay,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_scan_period_min_ms",
.data = &sysctl_sched_numa_scan_period_min,
.maxlen = sizeof(unsigned int),


Peter Zijlstra
2012-11-12 16:30:03 UTC
Avoid a few #ifdef's later on.

Signed-off-by: Peter Zijlstra <***@chello.nl>
Cc: Paul Turner <***@google.com>
Cc: Lee Schermerhorn <***@hp.com>
Cc: Christoph Lameter <***@linux.com>
Cc: Rik van Riel <***@redhat.com>
Cc: Mel Gorman <***@suse.de>
Cc: Andrew Morton <***@linux-foundation.org>
Cc: Linus Torvalds <***@linux-foundation.org>
Signed-off-by: Ingo Molnar <***@kernel.org>
---
kernel/sched/sched.h | 6 ++++++
1 file changed, 6 insertions(+)

Index: linux/kernel/sched/sched.h
===================================================================
--- linux.orig/kernel/sched/sched.h
+++ linux/kernel/sched/sched.h
@@ -663,6 +663,12 @@ extern struct static_key sched_feat_keys
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */

+#ifdef CONFIG_SCHED_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;


Peter Zijlstra
2012-11-12 16:30:03 UTC
The principal ideas behind this patch are the fundamental
difference between shared and privately used memory and the very
strong desire to only rely on per-task behavioral state for
scheduling decisions.

We define 'shared memory' as all user memory that is frequently
accessed by multiple tasks and conversely 'private memory' is
the user memory used predominantly by a single task.

To approximate the above strict definition we recognise that
task placement is dominantly per cpu and thus using cpu granular
page access state is a natural fit. Thus we introduce
page::last_cpu as the cpu that last accessed a page.

Using this, we can construct two per-task node-vectors, 'S_i'
and 'P_i' reflecting the amount of shared and privately used
pages of this task respectively. Pages for which two consecutive
'hits' are of the same cpu are assumed private and the others
are shared.

[ This means that we will start evaluating this state when the
task has not migrated for at least 2 scans, see NUMA_SETTLE ]
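The two-hit heuristic can be sketched as follows; all names are illustrative, and the real accounting lives in the hinting-fault path:

```c
#include <assert.h>

#define DEMO_NR_NODES 2

/* A page remembers only the cpu of its previous fault. */
struct demo_page { int last_cpu; };

/*
 * Account one hinting fault: two consecutive hits from the same cpu
 * count toward the private vector P_i of the faulting node, anything
 * else toward the shared vector S_i.
 */
static void demo_fault(struct demo_page *p, int cpu, int nid,
                       long priv[DEMO_NR_NODES], long shared[DEMO_NR_NODES])
{
    int last = p->last_cpu;

    p->last_cpu = cpu;          /* page_xchg_last_cpu() analogue */
    if (last == cpu)
        priv[nid]++;
    else
        shared[nid]++;
}
```

Note that a page touched only from one cpu still contributes one "shared" hit on its very first fault; it is the steady state of consecutive same-cpu hits that marks memory as private.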

Using these vectors we can compute the total number of
shared/private pages of this task and determine which dominates.

[ Note that for shared tasks we only see '1/n'-th of the total
number of shared pages; the other tasks will take the other
faults, where 'n' is the number of tasks sharing the memory.
So for an equal comparison we should divide total private by
'n' as well, but we don't have 'n', so we pick 2. ]

We can also compute which node holds most of our memory, running
on this node will be called 'ideal placement' (As per previous
patches we will prefer to pull memory towards wherever we run.)

We change the load-balancer to prefer moving tasks in order of:

1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse

This order ensures we prefer increasing memory locality but when
we do have to make hard decisions we prefer spreading private
over shared, because spreading shared tasks significantly
increases the interconnect bandwidth since not all memory can
follow.
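That preference order can be expressed as a ranking function; the enum and names below are a hypothetical sketch, not the patch's actual load-balancer code:

```c
#include <assert.h>

/* Hypothetical classification of a migration candidate. */
enum demo_numa_class {
    DEMO_NUMA_NONE,        /* task has no NUMA placement state */
    DEMO_NUMA_NOT_IDEAL,   /* numa task not on its ideal node */
    DEMO_NUMA_PRIVATE,     /* ideally placed, mostly private memory */
    DEMO_NUMA_SHARED,      /* ideally placed, mostly shared memory */
};

/*
 * Lower rank == preferred for migration.  Moves toward more faults are
 * always fine (rank 0); moves that make locality worse are taken in
 * the order !ideal < private < shared, so shared tasks are spread last.
 */
static int demo_move_rank(enum demo_numa_class c, int toward_more_faults)
{
    if (c == DEMO_NUMA_NONE || toward_more_faults)
        return 0;
    switch (c) {
    case DEMO_NUMA_NOT_IDEAL: return 1;
    case DEMO_NUMA_PRIVATE:   return 2;
    default:                  return 3;   /* DEMO_NUMA_SHARED */
    }
}
```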

We also add an extra 'lateral' force to the load balancer that
perturbs the state when otherwise 'fairly' balanced. This
ensures we don't get 'stuck' in a state which is fair but
undesired from a memory location POV (see can_do_numa_run()).

Lastly, we allow shared tasks to defeat the default spreading of
tasks such that, when possible, they can aggregate on a single
node.

Shared tasks aggregate for the very simple reason that there has
to be a single node that holds most of their memory and a second
most, etc.. and tasks want to move up the faults ladder.

Signed-off-by: Peter Zijlstra <***@chello.nl>
Cc: Linus Torvalds <***@linux-foundation.org>
Cc: Andrew Morton <***@linux-foundation.org>
Cc: Peter Zijlstra <***@chello.nl>
Cc: Andrea Arcangeli <***@redhat.com>
Cc: Rik van Riel <***@redhat.com>
Cc: Mel Gorman <***@suse.de>
Signed-off-by: Ingo Molnar <***@kernel.org>
---
Documentation/scheduler/numa-problem.txt | 20
arch/sh/mm/Kconfig | 1
include/linux/init_task.h | 8
include/linux/sched.h | 44 +
include/uapi/linux/mempolicy.h | 1
init/Kconfig | 14
kernel/sched/core.c | 68 ++
kernel/sched/fair.c | 983 +++++++++++++++++++++++++------
kernel/sched/features.h | 9
kernel/sched/sched.h | 32 -
kernel/sysctl.c | 31
mm/huge_memory.c | 7
mm/memory.c | 6
mm/mempolicy.c | 95 ++
mm/migrate.c | 6
15 files changed, 1097 insertions(+), 228 deletions(-)

Index: linux/Documentation/scheduler/numa-problem.txt
===================================================================
--- linux.orig/Documentation/scheduler/numa-problem.txt
+++ linux/Documentation/scheduler/numa-problem.txt
@@ -133,6 +133,8 @@ XXX properties of this M vs a potential

2b) migrate memory towards 'n_i' using 2 samples.

+XXX include the statistical babble on double sampling somewhere near
+
This separates pages into those that will migrate and those that will not due
to the two samples not matching. We could consider the first to be of 'p_i'
(private) and the second to be of 's_i' (shared).
@@ -142,7 +144,17 @@ This interpretation can be motivated by
's_i' (shared). (here we loose the need for memory limits again, since it
becomes indistinguishable from shared).

-XXX include the statistical babble on double sampling somewhere near
+ 2c) use cpu samples instead of node samples
+
+The problem with sampling on node granularity is that one loses 's_i' for
+the local node, since one cannot distinguish between two accesses from the
+same node.
+
+By increasing the granularity to per-cpu we gain the ability to have both an
+'s_i' and 'p_i' per node. Since we do all task placement per-cpu as well this
+seems like a natural match. The line where we overcommit cpus is where we lose
+granularity again, but when we lose overcommit we naturally spread tasks.
+Therefore it should work out nicely.

This reduces the problem further; we loose 'M' as per 2a, it further reduces
the 'T_k,l' (interconnect traffic) term to only include shared (since per the
@@ -150,12 +162,6 @@ above all private will be local):

T_k,l = \Sum_i bs_i,l for every n_i = k, l != k

-[ more or less matches the state of sched/numa and describes its remaining
- problems and assumptions. It should work well for tasks without significant
- shared memory usage between tasks. ]
-
-Possible future directions:
-
Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
can evaluate it;

Index: linux/arch/sh/mm/Kconfig
===================================================================
--- linux.orig/arch/sh/mm/Kconfig
+++ linux/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
config NUMA
bool "Non Uniform Memory Access (NUMA) Support"
depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+ select EMBEDDED_NUMA
default n
help
Some SH systems have many various memories scattered around
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h
+++ linux/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group

#define INIT_TASK_COMM "swapper"

+#ifdef CONFIG_SCHED_NUMA
+# define INIT_TASK_NUMA(tsk) \
+ .numa_shared = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
INIT_CPUSET_SEQ \
+ INIT_TASK_NUMA(tsk) \
}
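
The INIT_TASK_NUMA() hunk above uses a common kernel pattern: a macro
that expands to a designated initializer when the feature is compiled
in, and to nothing otherwise, so the aggregate initializer stays valid
either way. A standalone illustration with hypothetical names:

```c
#include <assert.h>

#define CONFIG_DEMO_NUMA 1	/* pretend the feature is enabled */

struct demo_task {
	int pid;
#ifdef CONFIG_DEMO_NUMA
	int numa_shared;
#endif
};

#ifdef CONFIG_DEMO_NUMA
# define INIT_DEMO_NUMA		.numa_shared = -1,
#else
# define INIT_DEMO_NUMA
#endif

/* The trailing comma inside INIT_DEMO_NUMA keeps this well-formed
 * whether or not the field exists. */
#define INIT_DEMO_TASK(p)	{ .pid = (p), INIT_DEMO_NUMA }

static struct demo_task init_demo_task = INIT_DEMO_TASK(0);
```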


Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */

extern int __weak arch_sd_sibiling_asym_packing(void);

@@ -1501,6 +1502,18 @@ struct task_struct {
short il_next;
short pref_node_fork;
#endif
+#ifdef CONFIG_SCHED_NUMA
+ int numa_shared;
+ int numa_max_node;
+ int numa_scan_seq;
+ int numa_migrate_seq;
+ unsigned int numa_scan_period;
+ u64 node_stamp; /* migration stamp */
+ unsigned long numa_weight;
+ unsigned long *numa_faults;
+ struct callback_head numa_work;
+#endif /* CONFIG_SCHED_NUMA */
+
struct rcu_head rcu;

/*
@@ -1575,6 +1588,26 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+#ifdef CONFIG_SCHED_NUMA
+extern void task_numa_fault(int node, int cpu, int pages);
+#else
+static inline void task_numa_fault(int node, int cpu, int pages) { }
+#endif /* CONFIG_SCHED_NUMA */
+
+/*
+ * -1: non-NUMA task
+ * 0: NUMA task with a dominantly 'private' working set
+ * 1: NUMA task with a dominantly 'shared' working set
+ */
+static inline int task_numa_shared(struct task_struct *p)
+{
+#ifdef CONFIG_SCHED_NUMA
+ return p->numa_shared;
+#else
+ return -1;
+#endif
+}
+
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -2012,6 +2045,10 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_sched_numa_scan_period_min;
+extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_settle_count;
+
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
@@ -2022,18 +2059,17 @@ extern unsigned int sysctl_sched_shares_
int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
static inline unsigned int get_sysctl_timer_migration(void)
{
return sysctl_timer_migration;
}
-#else
+#else /* CONFIG_SCHED_DEBUG */
static inline unsigned int get_sysctl_timer_migration(void)
{
return 1;
}
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
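
The task_numa_shared() tri-state added above (-1 unknown, 0 private,
1 shared) lets callers skip NUMA accounting entirely for tasks that
have not been classified yet, and fold 0/1 straight into a shared
count. A toy model of the account_numa_enqueue() idea, with assumed
names:

```c
#include <assert.h>

struct demo_task { int numa_shared; };	/* -1, 0 or 1 */
struct demo_rq { int nr_numa_running; int nr_shared_running; };

/* -1 tasks are invisible to the accounting; for 0/1 tasks the
 * tri-state value itself is added to the shared count. */
static void demo_account_enqueue(struct demo_rq *rq, struct demo_task *p)
{
	if (p->numa_shared != -1) {
		rq->nr_numa_running++;
		rq->nr_shared_running += p->numa_shared;
	}
}
```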

Index: linux/include/uapi/linux/mempolicy.h
===================================================================
--- linux.orig/include/uapi/linux/mempolicy.h
+++ linux/include/uapi/linux/mempolicy.h
@@ -69,6 +69,7 @@ enum mpol_rebind_step {
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_HOME (1 << 4) /* this is the home-node policy */


#endif /* _UAPI_LINUX_MEMPOLICY_H */
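
MPOL_F_HOME joins the existing single-bit policy flags, so it
combines with them through the usual bitwise operations. A quick
illustration using the flag values from the header above (the helper
name is made up):

```c
#include <assert.h>

#define MPOL_F_MOF	(1 << 3)	/* migrate on fault */
#define MPOL_F_HOME	(1 << 4)	/* home-node policy */

static int wants_migrate_on_fault(unsigned int flags)
{
	return !!(flags & MPOL_F_MOF);
}
```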
Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig
+++ linux/init/Kconfig
@@ -696,6 +696,20 @@ config LOG_BUF_SHIFT
config HAVE_UNSTABLE_SCHED_CLOCK
bool

+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config EMBEDDED_NUMA
+ bool
+
+config SCHED_NUMA
+ bool "Memory placement aware NUMA scheduler"
+ default n
+ depends on SMP && NUMA && MIGRATION && !EMBEDDED_NUMA
+ help
+ This option adds support for automatic NUMA aware memory/task placement.
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
Index: linux/kernel/sched/core.c
===================================================================
--- linux.orig/kernel/sched/core.c
+++ linux/kernel/sched/core.c
@@ -1544,6 +1544,21 @@ static void __sched_fork(struct task_str
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
+
+#ifdef CONFIG_SCHED_NUMA
+ if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+ p->mm->numa_next_scan = jiffies;
+ p->mm->numa_scan_seq = 0;
+ }
+
+ p->numa_shared = -1;
+ p->node_stamp = 0ULL;
+ p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+ p->numa_migrate_seq = 2;
+ p->numa_faults = NULL;
+ p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_SCHED_NUMA */
}

/*
@@ -1785,6 +1800,7 @@ static void finish_task_switch(struct rq
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
@@ -5495,7 +5511,9 @@ static void destroy_sched_domains(struct
DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_id);

-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
{
struct sched_domain *sd;
int id = cpu;
@@ -5506,6 +5524,15 @@ static void update_top_cache_domain(int

rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_id, cpu) = id;
+
+ for_each_domain(cpu, sd) {
+ if (cpumask_equal(sched_domain_span(sd),
+ cpumask_of_node(cpu_to_node(cpu))))
+ goto got_node;
+ }
+ sd = NULL;
+got_node:
+ rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
}

/*
@@ -5548,7 +5575,7 @@ cpu_attach_domain(struct sched_domain *s
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);

- update_top_cache_domain(cpu);
+ update_domain_cache(cpu);
}

/* cpus with isolated domains */
@@ -5970,6 +5997,37 @@ static struct sched_domain_topology_leve

static struct sched_domain_topology_level *sched_domain_topology = default_topology;

+#ifdef CONFIG_SCHED_NUMA
+
+/*
+ * Update a task's NUMA placement with the task properly dequeued,
+ * so the per-rq NUMA accounting stays consistent.
+ */
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+ unsigned long flags;
+ int on_rq, running;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+ on_rq = p->on_rq;
+ running = task_current(rq, p);
+
+ if (on_rq)
+ dequeue_task(rq, p, 0);
+ if (running)
+ p->sched_class->put_prev_task(rq, p);
+
+ p->numa_shared = shared;
+ p->numa_max_node = node;
+
+ if (running)
+ p->sched_class->set_curr_task(rq);
+ if (on_rq)
+ enqueue_task(rq, p, 0);
+ task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_SCHED_NUMA */
+
#ifdef CONFIG_NUMA

static int sched_domains_numa_levels;
@@ -6015,6 +6073,7 @@ sd_numa_init(struct sched_domain_topolog
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
@@ -6869,7 +6928,6 @@ void __init sched_init(void)
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
- rq->push_cpu = 0;
rq->cpu = i;
rq->online = 0;
rq->idle_stamp = 0;
@@ -6877,6 +6935,10 @@ void __init sched_init(void)

INIT_LIST_HEAD(&rq->cfs_tasks);

+#ifdef CONFIG_SCHED_NUMA
+ rq->nr_shared_running = 0;
+#endif
+
rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ
rq->nohz_flags = 0;
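
sched_setnuma() above follows the scheduler's standard pattern for
changing a field that per-rq bookkeeping depends on: dequeue the
task, change the field, then requeue it so the aggregate counters
are recomputed with the new value. A toy model of that pattern,
under assumed names:

```c
#include <assert.h>

struct toy_task { int shared; };	/* 0 or 1 once classified */
struct toy_rq { int nr_running; int nr_shared; };

static void toy_enqueue(struct toy_rq *rq, struct toy_task *p)
{
	rq->nr_running++;
	rq->nr_shared += p->shared;
}

static void toy_dequeue(struct toy_rq *rq, struct toy_task *p)
{
	rq->nr_running--;
	rq->nr_shared -= p->shared;
}

/* Change p->shared only while p is off the queue, so the
 * aggregate counters never go stale (the sched_setnuma() idea). */
static void toy_set_shared(struct toy_rq *rq, struct toy_task *p, int shared)
{
	toy_dequeue(rq, p);
	p->shared = shared;
	toy_enqueue(rq, p);
}
```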
Index: linux/kernel/sched/fair.c
===================================================================
--- linux.orig/kernel/sched/fair.c
+++ linux/kernel/sched/fair.c
@@ -29,6 +29,9 @@
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/random.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>

#include <trace/events/sched.h>

@@ -774,6 +777,235 @@ update_stats_curr_start(struct cfs_rq *c
}

/**************************************************
+ * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits is to maintain compute (task) and data
+ * (memory) locality.
+ *
+ * We keep a faults vector per task and use periodic fault scans to try and
+ * establish a task<->page relation. This assumes the task<->page relation is
+ * a compute<->data relation, which is false for things like virtualization
+ * and n:m threading solutions, but it's the best we can do given the
+ * information we have.
+ *
+ * We try to migrate such that we improve along the order provided by this
+ * vector while maintaining fairness.
+ *
+ * Tasks start out with their NUMA status unset (-1); this effectively means
+ * they act as !NUMA until we've established that the task is busy enough to
+ * bother with placement.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_SCHED_NUMA
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ p->numa_weight = task_h_load(p);
+ rq->nr_numa_running++;
+ rq->nr_shared_running += task_numa_shared(p);
+ rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight += p->numa_weight;
+ }
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ rq->nr_numa_running--;
+ rq->nr_shared_running -= task_numa_shared(p);
+ rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight -= p->numa_weight;
+ }
+}
+
+/*
+ * numa task sample period in msecs; initially 5s:
+ */
+unsigned int sysctl_sched_numa_scan_period_min = 5000;
+unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
+
+static void task_numa_migrate(struct task_struct *p, int next_cpu)
+{
+ p->numa_migrate_seq = 0;
+}
+
+static void task_numa_placement(struct task_struct *p)
+{
+ int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+ unsigned long total[2] = { 0, 0 };
+ unsigned long faults, max_faults = 0;
+ int node, priv, shared, max_node = -1;
+
+ if (p->numa_scan_seq == seq)
+ return;
+
+ p->numa_scan_seq = seq;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ faults = 0;
+ for (priv = 0; priv < 2; priv++) {
+ faults += p->numa_faults[2*node + priv];
+ total[priv] += p->numa_faults[2*node + priv];
+ p->numa_faults[2*node + priv] /= 2;
+ }
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_node = node;
+ }
+ }
+
+ if (max_node != p->numa_max_node)
+ sched_setnuma(p, max_node, task_numa_shared(p));
+
+ p->numa_migrate_seq++;
+ if (sched_feat(NUMA_SETTLE) &&
+ p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+ return;
+
+ /*
+ * Note: shared is spread across multiple tasks and in the future
+ * we might want to consider a different equation below to reduce
+ * the impact of a small number of private memory accesses.
+ */
+ shared = (total[0] >= total[1] / 4);
+ if (shared != task_numa_shared(p)) {
+ sched_setnuma(p, p->numa_max_node, shared);
+ p->numa_migrate_seq = 0;
+ }
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int last_cpu, int pages)
+{
+ struct task_struct *p = current;
+ int priv = (task_cpu(p) == last_cpu);
+
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL);
+ if (!p->numa_faults)
+ return;
+ }
+
+ task_numa_placement(p);
+ p->numa_faults[2*node + priv] += pages;
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+ unsigned long migrate, next_scan, now = jiffies;
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+ work->next = work; /* protect against double add */
+ /*
+ * Who cares about NUMA placement when they're dying.
+ *
+ * NOTE: make sure not to dereference p->mm before this check,
+ * exit_task_work() happens _after_ exit_mm() so we could be called
+ * without p->mm even though we still had it when we enqueued this
+ * work.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Enforce maximal scan/migration frequency..
+ */
+ migrate = mm->numa_next_scan;
+ if (time_before(now, migrate))
+ return;
+
+ next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+ return;
+
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ {
+ struct vm_area_struct *vma;
+
+ down_write(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+ change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
+ }
+ up_write(&mm->mmap_sem);
+ }
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->numa_work;
+ u64 period, now;
+
+ /*
+ * We don't care about NUMA placement if we don't have memory.
+ */
+ if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+ return;
+
+ /*
+ * Using runtime rather than walltime has the dual advantage that
+ * we (mostly) drive the selection from busy threads and that the
+ * task needs to have done some actual work before we bother with
+ * NUMA placement.
+ */
+ now = curr->se.sum_exec_runtime;
+ period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+ if (now - curr->node_stamp > period) {
+ curr->node_stamp = now;
+
+ /*
+ * We are comparing runtime to wall clock time here, which
+ * puts a maximum scan frequency limit on the task work.
+ *
+ * This, together with the limits in task_numa_work(),
+ * keeps us from over-sampling if there are many threads:
+ * if all threads happen to come in at the same time we
+ * don't create a spike in overhead.
+ *
+ * We also avoid multiple threads scanning in parallel
+ * with each other.
+ */
+ if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+ init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+ task_work_add(curr, work, true);
+ }
+ }
+}
+#else /* !CONFIG_SCHED_NUMA: */
+#ifdef CONFIG_SMP
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p) { }
+#endif
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void task_tick_numa(struct rq *rq, struct task_struct *curr) { }
+static inline void task_numa_migrate(struct task_struct *p, int next_cpu) { }
+#endif /* CONFIG_SCHED_NUMA */
+
+/**************************************************
* Scheduling class queueing methods:
*/

@@ -784,9 +1016,13 @@ account_entity_enqueue(struct cfs_rq *cf
if (!parent_entity(se))
update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
- if (entity_is_task(se))
- list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+ if (entity_is_task(se)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ account_numa_enqueue(rq, task_of(se));
+ list_add(&se->group_node, &rq->cfs_tasks);
+ }
+#endif /* CONFIG_SMP */
cfs_rq->nr_running++;
}

@@ -796,8 +1032,10 @@ account_entity_dequeue(struct cfs_rq *cf
update_load_sub(&cfs_rq->load, se->load.weight);
if (!parent_entity(se))
update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
- if (entity_is_task(se))
+ if (entity_is_task(se)) {
list_del_init(&se->group_node);
+ account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+ }
cfs_rq->nr_running--;
}

@@ -3177,20 +3415,8 @@ unlock:
return new_cpu;
}

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
- * cfs_rq_of(p) references at time of call are still valid and identify the
- * previous cpu. However, the caller only guarantees p->pi_lock is held; no
- * other assumptions, including the state of rq->lock, should be made.
- */
-static void
-migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -3206,7 +3432,27 @@ migrate_task_rq_fair(struct task_struct
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
+#else
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu) { }
#endif
+
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+/*
+ * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call are still valid and identify the
+ * previous cpu. However, the caller only guarantees p->pi_lock is held; no
+ * other assumptions, including the state of rq->lock, should be made.
+ */
+static void
+migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+{
+ migrate_task_rq_entity(p, next_cpu);
+ task_numa_migrate(p, next_cpu);
+}
#endif /* CONFIG_SMP */

static unsigned long
@@ -3580,7 +3826,10 @@ static unsigned long __read_mostly max_l

#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_SOME_PINNED 0x04
+#define LBF_NUMA_RUN 0x08
+#define LBF_NUMA_SHARED 0x10
+#define LBF_KEEP_SHARED 0x20

struct lb_env {
struct sched_domain *sd;
@@ -3599,6 +3848,8 @@ struct lb_env {
struct cpumask *cpus;

unsigned int flags;
+ unsigned int failed;
+ unsigned int iteration;

unsigned int loop;
unsigned int loop_break;
@@ -3620,11 +3871,87 @@ static void move_task(struct task_struct
check_preempt_curr(env->dst_rq, p, 0);
}

+#ifdef CONFIG_SCHED_NUMA
+
+static inline unsigned long task_node_faults(struct task_struct *p, int node)
+{
+ return p->numa_faults[2*node] + p->numa_faults[2*node + 1];
+}
+
+static int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ int src_node, dst_node, node, down_node = -1;
+ unsigned long faults, src_faults, max_faults = 0;
+
+ if (!sched_feat_numa(NUMA_FAULTS_DOWN) || !p->numa_faults)
+ return 1;
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 1;
+
+ src_faults = task_node_faults(p, src_node);
+
+ for (node = 0; node < nr_node_ids; node++) {
+ if (node == src_node)
+ continue;
+
+ faults = task_node_faults(p, node);
+
+ if (faults > max_faults && faults <= src_faults) {
+ max_faults = faults;
+ down_node = node;
+ }
+ }
+
+ if (down_node == dst_node)
+ return 1; /* move towards the next node down */
+
+ return 0; /* stay here */
+}
+
+static int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ unsigned long src_faults, dst_faults;
+ int src_node, dst_node;
+
+ if (!sched_feat_numa(NUMA_FAULTS_UP) || !p->numa_faults)
+ return 0; /* can't say it improved */
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 0; /* pointless, don't do that */
+
+ src_faults = task_node_faults(p, src_node);
+ dst_faults = task_node_faults(p, dst_node);
+
+ if (dst_faults > src_faults)
+ return 1; /* move to dst */
+
+ return 0; /* stay where we are */
+}
+
+#else /* !CONFIG_SCHED_NUMA: */
+static inline int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+
+static inline int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+#endif
+
/*
* Is this task likely cache-hot:
*/
static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;

@@ -3647,80 +3974,153 @@ task_hot(struct task_struct *p, u64 now,
if (sysctl_sched_migration_cost == 0)
return 0;

- delta = now - p->se.exec_start;
+ delta = env->src_rq->clock_task - p->se.exec_start;

return delta < (s64)sysctl_sched_migration_cost;
}

/*
- * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ * We do not migrate tasks that cannot be migrated to this CPU
+ * due to cpus_allowed.
+ *
+ * NOTE: this function has env-> side effects, to help the balancing
+ * of pinned tasks.
*/
-static
-int can_migrate_task(struct task_struct *p, struct lb_env *env)
+static bool can_migrate_pinned_task(struct task_struct *p, struct lb_env *env)
{
- int tsk_cache_hot = 0;
- /*
- * We do not migrate tasks that are:
- * 1) running (obviously), or
- * 2) cannot be migrated to this CPU due to cpus_allowed, or
- * 3) are cache-hot on their current CPU.
- */
- if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
- int new_dst_cpu;
+ int new_dst_cpu;

- schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+ if (cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p)))
+ return true;

- /*
- * Remember if this task can be migrated to any other cpu in
- * our sched_group. We may want to revisit it if we couldn't
- * meet load balance goals by pulling other tasks on src_cpu.
- *
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
- */
- if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
- return 0;
+ schedstat_inc(p, se.statistics.nr_failed_migrations_affine);

- new_dst_cpu = cpumask_first_and(env->dst_grpmask,
- tsk_cpus_allowed(p));
- if (new_dst_cpu < nr_cpu_ids) {
- env->flags |= LBF_SOME_PINNED;
- env->new_dst_cpu = new_dst_cpu;
- }
- return 0;
+ /*
+ * Remember if this task can be migrated to any other cpu in
+ * our sched_group. We may want to revisit it if we couldn't
+ * meet load balance goals by pulling other tasks on src_cpu.
+ *
+ * Also avoid computing new_dst_cpu if we have already computed
+ * one in current iteration.
+ */
+ if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+ return false;
+
+ new_dst_cpu = cpumask_first_and(env->dst_grpmask, tsk_cpus_allowed(p));
+ if (new_dst_cpu < nr_cpu_ids) {
+ env->flags |= LBF_SOME_PINNED;
+ env->new_dst_cpu = new_dst_cpu;
}
+ return false;
+}

- /* Record that we found atleast one task that could run on dst_cpu */
- env->flags &= ~LBF_ALL_PINNED;
+/*
+ * We cannot (easily) migrate tasks that are currently running:
+ */
+static bool can_migrate_running_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!task_running(env->src_rq, p))
+ return true;

- if (task_running(env->src_rq, p)) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_running);
- return 0;
- }
+ schedstat_inc(p, se.statistics.nr_failed_migrations_running);
+ return false;
+}

+/*
+ * Can we migrate a NUMA task? The rules are rather involved:
+ */
+static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
+{
/*
- * Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * iteration:
+ * 0 -- only allow improvement, or !numa
+ * 1 -- + worsen !ideal
+ * 2 priv
+ * 3 shared (everything)
+ *
+ * NUMA_HOT_DOWN:
+ * 1 .. nodes -- allow getting worse by step
+ * nodes+1 -- punt, everything goes!
+ *
+ * LBF_NUMA_RUN -- numa only, only allow improvement
+ * LBF_NUMA_SHARED -- shared only
+ *
+ * LBF_KEEP_SHARED -- do not touch shared tasks
*/

- tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
- if (!tsk_cache_hot ||
- env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-#ifdef CONFIG_SCHEDSTATS
- if (tsk_cache_hot) {
- schedstat_inc(env->sd, lb_hot_gained[env->idle]);
- schedstat_inc(p, se.statistics.nr_forced_migrations);
- }
+ /* a numa run can only move numa tasks about to improve things */
+ if (env->flags & LBF_NUMA_RUN) {
+ if (task_numa_shared(p) < 0)
+ return false;
+ /* can only pull shared tasks */
+ if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
+ return false;
+ } else {
+ if (task_numa_shared(p) < 0)
+ goto try_migrate;
+ }
+
+ /* cannot move shared tasks */
+ if ((env->flags & LBF_KEEP_SHARED) && task_numa_shared(p) == 1)
+ return false;
+
+ if (task_faults_up(p, env))
+ return true; /* memory locality beats cache hotness */
+
+ if (env->iteration < 1)
+ return false;
+
+#ifdef CONFIG_SCHED_NUMA
+ if (p->numa_max_node != cpu_to_node(task_cpu(p))) /* !ideal */
+ goto demote;
#endif
- return 1;
- }

- if (tsk_cache_hot) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
- return 0;
- }
- return 1;
+ if (env->iteration < 2)
+ return false;
+
+ if (task_numa_shared(p) == 0) /* private */
+ goto demote;
+
+ if (env->iteration < 3)
+ return false;
+
+demote:
+ if (env->iteration < 5)
+ return task_faults_down(p, env);
+
+try_migrate:
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ return !task_hot(p, env);
+}
+
+/*
+ * can_migrate_task() - may task p from runqueue rq be migrated to this_cpu?
+ */
+static int can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!can_migrate_pinned_task(p, env))
+ return false;
+
+ /* Record that we found at least one task that could run on dst_cpu */
+ env->flags &= ~LBF_ALL_PINNED;
+
+ if (!can_migrate_running_task(p, env))
+ return false;
+
+ if (env->sd->flags & SD_NUMA)
+ return can_migrate_numa_task(p, env);
+
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ if (!task_hot(p, env))
+ return true;
+
+ schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
+
+ return false;
}

/*
@@ -3735,6 +4135,7 @@ static int move_one_task(struct lb_env *
struct task_struct *p, *n;

list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+
if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
continue;

@@ -3742,6 +4143,7 @@ static int move_one_task(struct lb_env *
continue;

move_task(p, env);
+
/*
* Right now, this is only the second place move_task()
* is called, so we can safely collect move_task()
@@ -3753,8 +4155,6 @@ static int move_one_task(struct lb_env *
return 0;
}

-static unsigned long task_h_load(struct task_struct *p);
-
static const unsigned int sched_nr_migrate_break = 32;

/*
@@ -3766,7 +4166,6 @@ static const unsigned int sched_nr_migra
*/
static int move_tasks(struct lb_env *env)
{
- struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
unsigned long load;
int pulled = 0;
@@ -3774,8 +4173,8 @@ static int move_tasks(struct lb_env *env
if (env->imbalance <= 0)
return 0;

- while (!list_empty(tasks)) {
- p = list_first_entry(tasks, struct task_struct, se.group_node);
+ while (!list_empty(&env->src_rq->cfs_tasks)) {
+ p = list_first_entry(&env->src_rq->cfs_tasks, struct task_struct, se.group_node);

env->loop++;
/* We've more or less seen every task there is, call it quits */
@@ -3786,7 +4185,7 @@ static int move_tasks(struct lb_env *env
if (env->loop > env->loop_break) {
env->loop_break += sched_nr_migrate_break;
env->flags |= LBF_NEED_BREAK;
- break;
+ goto out;
}

if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3794,7 +4193,7 @@ static int move_tasks(struct lb_env *env

load = task_h_load(p);

- if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
+ if (sched_feat(LB_MIN) && load < 16 && !env->failed)
goto next;

if ((load / 2) > env->imbalance)
@@ -3814,7 +4213,7 @@ static int move_tasks(struct lb_env *env
* the critical section.
*/
if (env->idle == CPU_NEWLY_IDLE)
- break;
+ goto out;
#endif

/*
@@ -3822,13 +4221,13 @@ static int move_tasks(struct lb_env *env
* weighted load.
*/
if (env->imbalance <= 0)
- break;
+ goto out;

continue;
next:
- list_move_tail(&p->se.group_node, tasks);
+ list_move_tail(&p->se.group_node, &env->src_rq->cfs_tasks);
}
-
+out:
/*
* Right now, this is one of only two places move_task() is called,
* so we can safely collect move_task() stats here rather than
@@ -3953,12 +4352,13 @@ static inline void update_blocked_averag
static inline void update_h_load(long cpu)
{
}
-
+#ifdef CONFIG_SMP
static unsigned long task_h_load(struct task_struct *p)
{
return p->se.load.weight;
}
#endif
+#endif

/********** Helpers for find_busiest_group ************************/
/*
@@ -3976,7 +4376,7 @@ struct sd_lb_stats {
unsigned long this_load;
unsigned long this_load_per_task;
unsigned long this_nr_running;
- unsigned long this_has_capacity;
+ unsigned int this_has_capacity;
unsigned int this_idle_cpus;

/* Statistics of the busiest group */
@@ -3985,10 +4385,28 @@ struct sd_lb_stats {
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
+ unsigned int busiest_has_capacity;
unsigned int busiest_group_weight;

int group_imb; /* Is there imbalance in this sd */
+
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long this_numa_running;
+ unsigned long this_numa_weight;
+ unsigned long this_shared_running;
+ unsigned long this_ideal_running;
+ unsigned long this_group_capacity;
+
+ struct sched_group *numa;
+ unsigned long numa_load;
+ unsigned long numa_nr_running;
+ unsigned long numa_numa_running;
+ unsigned long numa_shared_running;
+ unsigned long numa_ideal_running;
+ unsigned long numa_numa_weight;
+ unsigned long numa_group_capacity;
+ unsigned int numa_has_capacity;
+#endif
};

/*
@@ -4004,6 +4422,13 @@ struct sg_lb_stats {
unsigned long group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long sum_ideal_running;
+ unsigned long sum_numa_running;
+ unsigned long sum_numa_weight;
+#endif
+ unsigned long sum_shared_running; /* 0 on non-NUMA */
};

/**
@@ -4032,6 +4457,160 @@ static inline int get_sd_load_idx(struct
return load_idx;
}

+#ifdef CONFIG_SCHED_NUMA
+
+static inline bool pick_numa_rand(int n)
+{
+ return !(get_random_int() % n);
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+ sgs->sum_ideal_running += rq->nr_ideal_running;
+ sgs->sum_shared_running += rq->nr_shared_running;
+ sgs->sum_numa_running += rq->nr_numa_running;
+ sgs->sum_numa_weight += rq->numa_weight;
+}
+
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+ if (!(sd->flags & SD_NUMA))
+ return;
+
+ if (local_group) {
+ sds->this_numa_running = sgs->sum_numa_running;
+ sds->this_numa_weight = sgs->sum_numa_weight;
+ sds->this_shared_running = sgs->sum_shared_running;
+ sds->this_ideal_running = sgs->sum_ideal_running;
+ sds->this_group_capacity = sgs->group_capacity;
+
+ } else if (sgs->sum_numa_running - sgs->sum_ideal_running) {
+ if (!sds->numa || pick_numa_rand(sd->span_weight / sg->group_weight)) {
+ sds->numa = sg;
+ sds->numa_load = sgs->avg_load;
+ sds->numa_nr_running = sgs->sum_nr_running;
+ sds->numa_numa_running = sgs->sum_numa_running;
+ sds->numa_shared_running = sgs->sum_shared_running;
+ sds->numa_ideal_running = sgs->sum_ideal_running;
+ sds->numa_numa_weight = sgs->sum_numa_weight;
+ sds->numa_has_capacity = sgs->group_has_capacity;
+ sds->numa_group_capacity = sgs->group_capacity;
+ }
+ }
+}
+
+static struct rq *
+find_busiest_numa_queue(struct lb_env *env, struct sched_group *sg)
+{
+ struct rq *rq, *busiest = NULL;
+ int cpu;
+
+ for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
+ rq = cpu_rq(cpu);
+
+ if (!rq->nr_numa_running)
+ continue;
+
+ if (!(rq->nr_numa_running - rq->nr_ideal_running))
+ continue;
+
+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
+ if (!busiest || pick_numa_rand(sg->group_weight))
+ busiest = rq;
+ }
+
+ return busiest;
+}
+
+#define TP_SG(_sg) \
+ (_sg) ? cpumask_first(sched_group_cpus(_sg)) : -1, \
+ (_sg) ? (_sg)->group_weight : -1
+
+static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ /*
+ * If we're overloaded, don't pull when:
+ * - the other guy isn't
+ * - imbalance would become too great
+ */
+ if (!sds->this_has_capacity) {
+ if (sds->numa_has_capacity)
+ return false;
+
+#if 0
+ if (sds->this_load * env->sd->imbalance_pct > sds->numa_load * 100)
+ return false;
+#endif
+ }
+
+ /*
+ * pull if we got easy trade
+ */
+ if (sds->this_nr_running - sds->this_numa_running)
+ return true;
+
+ /*
+ * If we got capacity allow stacking up on shared tasks.
+ */
+ if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+ env->flags |= LBF_NUMA_SHARED;
+ return true;
+ }
+
+ /*
+ * pull if we could possibly trade
+ */
+ if (sds->this_numa_running - sds->this_ideal_running)
+ return true;
+
+ return false;
+}
+
+/*
+ * Introduce some controlled imbalance to perturb the state so we allow it
+ * to improve; this should be tightly controlled/co-ordinated with
+ * can_migrate_task().
+ */
+static int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ if (!sds->numa || !sds->numa_numa_running)
+ return 0;
+
+ if (!can_do_numa_run(env, sds))
+ return 0;
+
+ env->flags |= LBF_NUMA_RUN;
+ env->flags &= ~LBF_KEEP_SHARED;
+ env->imbalance = sds->numa_numa_weight / sds->numa_numa_running;
+ sds->busiest = sds->numa;
+ env->find_busiest_queue = find_busiest_numa_queue;
+
+ return 1;
+}
+
+#else /* !CONFIG_SCHED_NUMA: */
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ return 0;
+}
+#endif
+
unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
{
return SCHED_POWER_SCALE;
@@ -4245,6 +4824,9 @@ static inline void update_sg_lb_stats(st
sgs->group_load += load;
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
+
+ update_sg_numa_stats(sgs, rq);
+
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -4336,6 +4918,13 @@ static bool update_sd_pick_busiest(struc
return false;
}

+static void update_src_keep_shared(struct lb_env *env, bool keep_shared)
+{
+ env->flags &= ~LBF_KEEP_SHARED;
+ if (keep_shared)
+ env->flags |= LBF_KEEP_SHARED;
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -4368,6 +4957,7 @@ static inline void update_sd_lb_stats(st
sds->total_load += sgs.group_load;
sds->total_pwr += sg->sgp->power;

+#ifdef CONFIG_SCHED_NUMA
/*
* In case the child domain prefers tasks go to siblings
* first, lower the sg capacity to one so that we'll try
@@ -4378,8 +4968,11 @@ static inline void update_sd_lb_stats(st
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
- sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ if (prefer_sibling && !local_group && sds->this_has_capacity) {
+ sgs.group_capacity = clamp_val(sgs.sum_shared_running,
+ 1UL, sgs.group_capacity);
+ }
+#endif

if (local_group) {
sds->this_load = sgs.avg_load;
@@ -4398,8 +4991,13 @@ static inline void update_sd_lb_stats(st
sds->busiest_has_capacity = sgs.group_has_capacity;
sds->busiest_group_weight = sgs.group_weight;
sds->group_imb = sgs.group_imb;
+
+ update_src_keep_shared(env,
+ sgs.sum_shared_running <= sgs.group_capacity);
}

+ update_sd_numa_stats(env->sd, sg, sds, &sgs, local_group);
+
sg = sg->next;
} while (sg != env->sd->groups);
}
@@ -4652,14 +5250,14 @@ find_busiest_group(struct lb_env *env, i
* don't try and pull any tasks.
*/
if (sds.this_load >= sds.max_load)
- goto out_balanced;
+ goto out_imbalanced;

/*
* Don't pull any tasks if this group is already above the domain
* average load.
*/
if (sds.this_load >= sds.avg_load)
- goto out_balanced;
+ goto out_imbalanced;

if (env->idle == CPU_IDLE) {
/*
@@ -4685,7 +5283,15 @@ force_balance:
calculate_imbalance(env, &sds);
return sds.busiest;

+out_imbalanced:
+ /* if we've got capacity allow for secondary placement preference */
+ if (!sds.this_has_capacity)
+ goto ret;
+
out_balanced:
+ if (check_numa_busiest_group(env, &sds))
+ return sds.busiest;
+
ret:
env->imbalance = 0;
return NULL;
@@ -4723,6 +5329,9 @@ static struct rq *find_busiest_queue(str
if (capacity && rq->nr_running == 1 && wl > env->imbalance)
continue;

+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
/*
* For the load comparisons with the other cpu's, consider
* the weighted_cpuload() scaled with the cpu power, so that
@@ -4749,25 +5358,40 @@ static struct rq *find_busiest_queue(str
/* Working cpumask for load_balance and load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);

-static int need_active_balance(struct lb_env *env)
-{
- struct sched_domain *sd = env->sd;
-
- if (env->idle == CPU_NEWLY_IDLE) {
+static int active_load_balance_cpu_stop(void *data);

+static void update_sd_failed(struct lb_env *env, int ld_moved)
+{
+ if (!ld_moved) {
+ schedstat_inc(env->sd, lb_failed[env->idle]);
/*
- * ASYM_PACKING needs to force migrate tasks from busy but
- * higher numbered CPUs in order to pack all tasks in the
- * lowest numbered CPUs.
+ * Increment the failure counter only on periodic balance.
+ * We do not want newidle balance, which can be very
+ * frequent, pollute the failure counter causing
+ * excessive cache_hot migrations and active balances.
*/
- if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
- return 1;
- }
-
- return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
+ if (env->idle != CPU_NEWLY_IDLE && !(env->flags & LBF_NUMA_RUN))
+ env->sd->nr_balance_failed++;
+ } else
+ env->sd->nr_balance_failed = 0;
}

-static int active_load_balance_cpu_stop(void *data);
+/*
+ * See can_migrate_numa_task()
+ */
+static int lb_max_iteration(struct lb_env *env)
+{
+ if (!(env->sd->flags & SD_NUMA))
+ return 0;
+
+ if (env->flags & LBF_NUMA_RUN)
+ return 0; /* NUMA_RUN may only improve */
+
+ if (sched_feat_numa(NUMA_FAULTS_DOWN))
+ return 5; /* nodes^2 would suck */
+
+ return 3;
+}

/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
@@ -4793,6 +5417,8 @@ static int load_balance(int this_cpu, st
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
.find_busiest_queue = find_busiest_queue,
+ .failed = sd->nr_balance_failed,
+ .iteration = 0,
};

cpumask_copy(cpus, cpu_active_mask);
@@ -4816,6 +5442,8 @@ redo:
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
}
+ env.src_rq = busiest;
+ env.src_cpu = busiest->cpu;

BUG_ON(busiest == env.dst_rq);

@@ -4895,92 +5523,72 @@ more_balance:
}

/* All tasks on this runqueue were pinned by CPU affinity */
- if (unlikely(env.flags & LBF_ALL_PINNED)) {
- cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
- env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
- goto redo;
- }
- goto out_balanced;
+ if (unlikely(env.flags & LBF_ALL_PINNED))
+ goto out_pinned;
+
+ if (!ld_moved && env.iteration < lb_max_iteration(&env)) {
+ env.iteration++;
+ env.loop = 0;
+ goto more_balance;
}
}

- if (!ld_moved) {
- schedstat_inc(sd, lb_failed[idle]);
+ if (!ld_moved && idle != CPU_NEWLY_IDLE) {
+ raw_spin_lock_irqsave(&busiest->lock, flags);
+
/*
- * Increment the failure counter only on periodic balance.
- * We do not want newidle balance, which can be very
- * frequent, pollute the failure counter causing
- * excessive cache_hot migrations and active balances.
+ * Don't kick the active_load_balance_cpu_stop,
+ * if the curr task on busiest cpu can't be
+ * moved to this_cpu
*/
- if (idle != CPU_NEWLY_IDLE)
- sd->nr_balance_failed++;
-
- if (need_active_balance(&env)) {
- raw_spin_lock_irqsave(&busiest->lock, flags);
-
- /* don't kick the active_load_balance_cpu_stop,
- * if the curr task on busiest cpu can't be
- * moved to this_cpu
- */
- if (!cpumask_test_cpu(this_cpu,
- tsk_cpus_allowed(busiest->curr))) {
- raw_spin_unlock_irqrestore(&busiest->lock,
- flags);
- env.flags |= LBF_ALL_PINNED;
- goto out_one_pinned;
- }
-
- /*
- * ->active_balance synchronizes accesses to
- * ->active_balance_work. Once set, it's cleared
- * only after active load balance is finished.
- */
- if (!busiest->active_balance) {
- busiest->active_balance = 1;
- busiest->push_cpu = this_cpu;
- active_balance = 1;
- }
+ if (!cpumask_test_cpu(this_cpu, tsk_cpus_allowed(busiest->curr))) {
raw_spin_unlock_irqrestore(&busiest->lock, flags);
-
- if (active_balance) {
- stop_one_cpu_nowait(cpu_of(busiest),
- active_load_balance_cpu_stop, busiest,
- &busiest->active_balance_work);
- }
-
- /*
- * We've kicked active balancing, reset the failure
- * counter.
- */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
+ env.flags |= LBF_ALL_PINNED;
+ goto out_pinned;
}
- } else
- sd->nr_balance_failed = 0;

- if (likely(!active_balance)) {
- /* We were unbalanced, so reset the balancing interval */
- sd->balance_interval = sd->min_interval;
- } else {
/*
- * If we've begun active balancing, start to back off. This
- * case may not be covered by the all_pinned logic if there
- * is only 1 task on the busy runqueue (because we don't call
- * move_tasks).
- */
- if (sd->balance_interval < sd->max_interval)
- sd->balance_interval *= 2;
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
+ */
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->ab_dst_cpu = this_cpu;
+ busiest->ab_flags = env.flags;
+ busiest->ab_failed = env.failed;
+ busiest->ab_idle = env.idle;
+ active_balance = 1;
+ }
+ raw_spin_unlock_irqrestore(&busiest->lock, flags);
+
+ if (active_balance) {
+ stop_one_cpu_nowait(cpu_of(busiest),
+ active_load_balance_cpu_stop, busiest,
+ &busiest->ab_work);
+ }
}

- goto out;
+ if (!active_balance)
+ update_sd_failed(&env, ld_moved);
+
+ sd->balance_interval = sd->min_interval;
+out:
+ return ld_moved;
+
+out_pinned:
+ cpumask_clear_cpu(cpu_of(busiest), cpus);
+ if (!cpumask_empty(cpus)) {
+ env.loop = 0;
+ env.loop_break = sched_nr_migrate_break;
+ goto redo;
+ }

out_balanced:
schedstat_inc(sd, lb_balanced[idle]);

sd->nr_balance_failed = 0;

-out_one_pinned:
/* tune up the balancing interval */
if (((env.flags & LBF_ALL_PINNED) &&
sd->balance_interval < MAX_PINNED_INTERVAL) ||
@@ -4988,8 +5596,8 @@ out_one_pinned:
sd->balance_interval *= 2;

ld_moved = 0;
-out:
- return ld_moved;
+
+ goto out;
}

/*
@@ -5060,7 +5668,7 @@ static int active_load_balance_cpu_stop(
{
struct rq *busiest_rq = data;
int busiest_cpu = cpu_of(busiest_rq);
- int target_cpu = busiest_rq->push_cpu;
+ int target_cpu = busiest_rq->ab_dst_cpu;
struct rq *target_rq = cpu_rq(target_cpu);
struct sched_domain *sd;

@@ -5098,17 +5706,23 @@ static int active_load_balance_cpu_stop(
.sd = sd,
.dst_cpu = target_cpu,
.dst_rq = target_rq,
- .src_cpu = busiest_rq->cpu,
+ .src_cpu = busiest_cpu,
.src_rq = busiest_rq,
- .idle = CPU_IDLE,
+ .flags = busiest_rq->ab_flags,
+ .failed = busiest_rq->ab_failed,
+ .idle = busiest_rq->ab_idle,
};
+ env.iteration = lb_max_iteration(&env);

schedstat_inc(sd, alb_count);

- if (move_one_task(&env))
+ if (move_one_task(&env)) {
schedstat_inc(sd, alb_pushed);
- else
+ update_sd_failed(&env, 1);
+ } else {
schedstat_inc(sd, alb_failed);
+ update_sd_failed(&env, 0);
+ }
}
rcu_read_unlock();
double_unlock_balance(busiest_rq, target_rq);
@@ -5508,6 +6122,9 @@ static void task_tick_fair(struct rq *rq
}

update_rq_runnable_avg(rq, 1);
+
+ if (sched_feat_numa(NUMA) && nr_node_ids > 1)
+ task_tick_numa(rq, curr);
}

/*
@@ -5902,9 +6519,7 @@ const struct sched_class fair_sched_clas

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

Index: linux/kernel/sched/features.h
===================================================================
--- linux.orig/kernel/sched/features.h
+++ linux/kernel/sched/features.h
@@ -66,3 +66,12 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_SCHED_NUMA
+/* Do the working set probing faults: */
+SCHED_FEAT(NUMA, true)
+SCHED_FEAT(NUMA_FAULTS_UP, true)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_SETTLE, true)
+#endif
+
Index: linux/kernel/sched/sched.h
===================================================================
--- linux.orig/kernel/sched/sched.h
+++ linux/kernel/sched/sched.h
@@ -3,6 +3,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/slab.h>

#include "cpupri.h"

@@ -420,17 +421,29 @@ struct rq {
unsigned long cpu_power;

unsigned char idle_balance;
- /* For active balancing */
int post_schedule;
+
+ /* For active balancing */
int active_balance;
- int push_cpu;
- struct cpu_stop_work active_balance_work;
+ int ab_dst_cpu;
+ int ab_flags;
+ int ab_failed;
+ int ab_idle;
+ struct cpu_stop_work ab_work;
+
/* cpu of this runqueue: */
int cpu;
int online;

struct list_head cfs_tasks;

+#ifdef CONFIG_SCHED_NUMA
+ unsigned long numa_weight;
+ unsigned long nr_numa_running;
+ unsigned long nr_ideal_running;
+#endif
+ unsigned long nr_shared_running; /* 0 on non-NUMA */
+
u64 rt_avg;
u64 age_stamp;
u64 idle_stamp;
@@ -501,6 +514,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))

+#ifdef CONFIG_SCHED_NUMA
+extern void sched_setnuma(struct task_struct *p, int node, int shared);
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_SCHED_NUMA */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_SCHED_NUMA */
+
#ifdef CONFIG_SMP

#define rcu_dereference_check_sched_domain(p) \
@@ -544,6 +569,7 @@ static inline struct sched_domain *highe

DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);

extern int group_balance_cpu(struct sched_group *sg);

Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 10
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
static int min_wakeup_granularity_ns; /* 0 usecs */
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
+#ifdef CONFIG_SMP
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */

#ifdef CONFIG_COMPACTION
static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SMP
{
.procname = "sched_tunable_scaling",
.data = &sysctl_sched_tunable_scaling,
@@ -347,7 +350,31 @@ static struct ctl_table kern_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_SCHED_NUMA
+ {
+ .procname = "sched_numa_scan_period_min_ms",
+ .data = &sysctl_sched_numa_scan_period_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_scan_period_max_ms",
+ .data = &sysctl_sched_numa_scan_period_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_settle_count",
+ .data = &sysctl_sched_numa_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif /* CONFIG_SCHED_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
{
.procname = "sched_rt_period_us",
.data = &sysctl_sched_rt_period,
Index: linux/mm/huge_memory.c
===================================================================
--- linux.orig/mm/huge_memory.c
+++ linux/mm/huge_memory.c
@@ -777,9 +777,10 @@ fixup:

unlock:
spin_unlock(&mm->page_table_lock);
- if (page)
+ if (page) {
+ task_numa_fault(page_to_nid(page), last_cpu, HPAGE_PMD_NR);
put_page(page);
-
+ }
return;

migrate:
@@ -848,6 +849,8 @@ migrate:

put_page(page); /* Drop the rmap reference */

+ task_numa_fault(node, last_cpu, HPAGE_PMD_NR);
+
if (lru)
put_page(page); /* drop the LRU isolation reference */

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -3484,6 +3484,7 @@ static int do_numa_page(struct mm_struct
{
struct page *page = NULL;
int node, page_nid = -1;
+ int last_cpu = -1;
spinlock_t *ptl;

ptl = pte_lockptr(mm, pmd);
@@ -3495,6 +3496,7 @@ static int do_numa_page(struct mm_struct
if (page) {
get_page(page);
page_nid = page_to_nid(page);
+ last_cpu = page_last_cpu(page);
node = mpol_misplaced(page, vma, address);
if (node != -1)
goto migrate;
@@ -3514,8 +3516,10 @@ out_pte_upgrade_unlock:
out_unlock:
pte_unmap_unlock(ptep, ptl);
out:
- if (page)
+ if (page) {
+ task_numa_fault(page_nid, last_cpu, 1);
put_page(page);
+ }

return 0;

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c
+++ linux/mm/mempolicy.c
@@ -2194,12 +2194,70 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}

+/*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By
+ * using a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a task's usage of a particular page (n_p) per total usage
+ * of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability; getting
+ * the same result twice in a row, given these samples are fully
+ * independent, is then given by P(p)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadratic squishes small probabilities, making it less likely
+ * we act on an unlikely task<->page relation.
+ *
+ * Return the best node ID this page should be on, or -1 if it should
+ * stay where it is.
+ */
+static int
+numa_migration_target(struct page *page, int page_nid,
+ struct task_struct *p, int this_cpu,
+ int cpu_last_access)
+{
+ int nid_last_access;
+ int this_nid;
+
+ if (task_numa_shared(p) < 0)
+ return -1;
+
+ /*
+ * Possibly migrate towards the current node, depends on
+ * task_numa_placement() and access details.
+ */
+ nid_last_access = cpu_to_node(cpu_last_access);
+ this_nid = cpu_to_node(this_cpu);
+
+ if (nid_last_access != this_nid) {
+ /*
+ * 'Access miss': the page got last accessed from a remote node.
+ */
+ return -1;
+ }
+ /*
+ * 'Access hit': the page got last accessed from our node.
+ *
+ * Migrate the page if needed.
+ */
+
+ /* The page is already on this node: */
+ if (page_nid == this_nid)
+ return -1;
+
+ return this_nid;
+}
+
/**
* mpol_misplaced - check whether current page node is valid in policy
*
* @page - page to be checked
* @vma - vm area where page mapped
* @addr - virtual address where page mapped
+ * @multi - use multi-stage node binding
*
* Lookup current policy node id for vma,addr and "compare to" page's
* node id.
@@ -2213,18 +2271,22 @@ static void sp_free(struct sp_node *n)
*/
int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
{
+ int best_nid = -1, page_nid;
+ int cpu_last_access, this_cpu;
struct mempolicy *pol;
- struct zone *zone;
- int curnid = page_to_nid(page);
unsigned long pgoff;
- int polnid = -1;
- int ret = -1;
+ struct zone *zone;

BUG_ON(!vma);

+ this_cpu = raw_smp_processor_id();
+ page_nid = page_to_nid(page);
+
+ cpu_last_access = page_xchg_last_cpu(page, this_cpu);
+
pol = get_vma_policy(current, vma, addr);
- if (!(pol->flags & MPOL_F_MOF))
- goto out;
+ if (!(pol->flags & MPOL_F_MOF) && !(task_numa_shared(current) >= 0))
+ goto out_keep_page;

switch (pol->mode) {
case MPOL_INTERLEAVE:
@@ -2233,14 +2295,14 @@ int mpol_misplaced(struct page *page, st

pgoff = vma->vm_pgoff;
pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
- polnid = offset_il_node(pol, vma, pgoff);
+ best_nid = offset_il_node(pol, vma, pgoff);
break;

case MPOL_PREFERRED:
if (pol->flags & MPOL_F_LOCAL)
- polnid = numa_node_id();
+ best_nid = numa_node_id();
else
- polnid = pol->v.preferred_node;
+ best_nid = pol->v.preferred_node;
break;

case MPOL_BIND:
@@ -2250,24 +2312,25 @@ int mpol_misplaced(struct page *page, st
* else select nearest allowed node, if any.
* If no allowed nodes, use current [!misplaced].
*/
- if (node_isset(curnid, pol->v.nodes))
- goto out;
+ if (node_isset(page_nid, pol->v.nodes))
+ goto out_keep_page;
(void)first_zones_zonelist(
node_zonelist(numa_node_id(), GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
&pol->v.nodes, &zone);
- polnid = zone->node;
+ best_nid = zone->node;
break;

default:
BUG();
}
- if (curnid != polnid)
- ret = polnid;
-out:
+
+ best_nid = numa_migration_target(page, page_nid, current, this_cpu, cpu_last_access);
+
+out_keep_page:
mpol_cond_put(pol);

- return ret;
+ return best_nid;
}

static void sp_delete(struct shared_policy *sp, struct sp_node *n)
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c
+++ linux/mm/migrate.c
@@ -1427,12 +1427,6 @@ int migrate_misplaced_page(struct page *
gfp_t gfp = GFP_HIGHUSER_MOVABLE;

/*
- * Don't migrate pages that are mapped in multiple processes.
- */
- if (page_mapcount(page) != 1)
- goto out;
-
- /*
* Never wait for allocations just to migrate on fault, but don't dip
* into reserves. And, only accept pages from the specified node. No
* sense migrating to a different "misplaced" page!


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Christoph Lameter
2012-11-13 00:10:01 UTC
Post by Peter Zijlstra
We define 'shared memory' as all user memory that is frequently
accessed by multiple tasks and conversely 'private memory' is
the user memory used predominantly by a single task.
"All"? Should that not be "a memory segment that is frequently..."?
Post by Peter Zijlstra
Using this, we can construct two per-task node-vectors, 'S_i'
and 'P_i' reflecting the amount of shared and privately used
pages of this task respectively. Pages for which two consecutive
'hits' are of the same cpu are assumed private and the others
are shared.
The classification is per task? But most tasks have memory areas
that are private and other areas where shared accesses occur. Can that be
per memory area? Private areas need to be kept with the process. Shared
areas may have to be spread across nodes if the memory area is too large.

Guess that is too complicated to determine unless we would be using vmas
which may only roughly correlate to the memory regions for which memory
policies are currently manually setup.

But then this is rather different from my expectations that I had after
reading the intro.
Post by Peter Zijlstra
We also add an extra 'lateral' force to the load balancer that
perturbs the state when otherwise 'fairly' balanced. This
ensures we don't get 'stuck' in a state which is fair but
undesired from a memory location POV (see can_do_numa_run()).
We do useless moves and create additional overhead?

Ingo Molnar
2012-11-13 08:20:01 UTC
Post by Peter Zijlstra
Using this, we can construct two per-task node-vectors,
'S_i' and 'P_i' reflecting the amount of shared and
privately used pages of this task respectively. Pages for
which two consecutive 'hits' are of the same cpu are assumed
private and the others are shared.
The classification is per task? [...]
Yes, exactly - access patterns are fundamentally and physically
per task, as a task can execute only on a single CPU at once. (I
say 'task' instead of 'thread' or 'process' because the new code
makes no distinction between threads and processes.)

The new code maps out inter-task relationships, statistically.

So we are basically able to (statistically) approximate which
task relates to which other task in the system, based on their
memory access patterns alone: using a very compact metric and
matching scheduler rules and a (lazy) memory placement machinery
on the VM side.

Say consider the following 10-task workload, where the scheduler
is able to figure out these relationships:

{ A, B, C, D } dominantly share memory X with each other
{ E, F, G, H } dominantly share memory Y with each other
{ I } uses memory privately
{ J } uses memory privately

and the scheduler rules then try to converge these groups of
tasks ideally.

[ The 'role' and grouping of tasks is not static but sampled and
average based - so if a worker thread changes its role, the
scheduler will adapt placement to that. ]

[ A 'private task' is basically a special case for sharing
memory: if a task only shares memory with itself. Its
placement and spreading is easy. ]
[...] But most tasks have memory areas that are private and
other areas where shared accesses occur. Can that be per
memory area? [...]
Do you mean per vma, and/or per mm?

How would that work? Consider the above case:

- 12x CPU-intense threads and a large piece of shared memory

- 2x 4 threads are using two large shared memory areas to
calculate (one area for each group of threads)

- the 4 remaining processes aggregate and sort the results from
the 8 threads, in their own dominantly 'private' working set.

how does per vma or per mm describe that properly? The whole
workload might be just within a single large vma within a JVM.
Or it might be implemented using processes and anonymous shared
memory.

If you look at this from a 'per task access pattern and
inter-task working set relationship' perspective then the
resolution and optimization is natural: the 2x 4 threads should
be grouped together modulo capacity constraints, while the
remaining 4 'private memory' threads should be spread out over
the remaining capacity of the system.

What matters is how tasks relate to each other as they perform
processing, not which APIs the workload uses to create tasks and
memory areas.

The main constraint from a placement optimization complexity POV
is task->CPU placement: for NUMA workloads the main challenge -
and 80% of the code and much of the real meat of the feature -
is to categorize and place tasks properly.

There might be much discussion about PROT_NONE and memory
migration details, but that is because the VM code is 5 times
larger than the scheduler code and due to that there's 5 times
more VM hackers than scheduler hackers ;-)

In reality the main complexity of this problem [the placement
optimization problem portion] is a dominantly CPU/task scheduler
feature, and IMO rather fundamentally so: it's not an
implementation choice but derives from the Von Neumann model of
computing in essence.

And that is why IMO the task-based access pattern metric
implementation is such a good fit in practice as well - and that
is why other approaches struggled to get a hold of the NUMA
problem.

Thanks,

Ingo
Rik van Riel
2012-11-13 23:00:02 UTC
Post by Peter Zijlstra
The principal ideas behind this patch are the fundamental
difference between shared and privately used memory and the very
strong desire to only rely on per-task behavioral state for
scheduling decisions.
We define 'shared memory' as all user memory that is frequently
accessed by multiple tasks and conversely 'private memory' is
the user memory used predominantly by a single task.
To approximate the above strict definition we recognise that
task placement is dominantly per cpu and thus using cpu granular
page access state is a natural fit. Thus we introduce
page::last_cpu as the cpu that last accessed a page.
Using this, we can construct two per-task node-vectors, 'S_i'
and 'P_i' reflecting the amount of shared and privately used
pages of this task respectively. Pages for which two consecutive
'hits' are of the same cpu are assumed private and the others
are shared.
That is an intriguing idea. It will be interesting to see how
well it works with various workloads.
Post by Peter Zijlstra
[ Note that for shared tasks we only see '1/n' the total number
of shared pages for the other tasks will take the other
faults; where 'n' is the number of tasks sharing the memory.
So for an equal comparison we should divide total private by
'n' as well, but we don't have 'n' so we pick 2. ]
Unless I am misreading the code (it is a little hard to read in places,
more on that further down), the number picked appears to be 4.
Post by Peter Zijlstra
We can also compute which node holds most of our memory, running
on this node will be called 'ideal placement' (As per previous
patches we will prefer to pull memory towards wherever we run.)
1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse
This reflects some of the things autonuma does, so I
suspect it will work in your code too :)

It is interesting to see how sched/numa has moved from
the homenodes-through-syscalls concepts to something so
close to what autonuma does.
Post by Peter Zijlstra
Index: linux/Documentation/scheduler/numa-problem.txt
===================================================================
--- linux.orig/Documentation/scheduler/numa-problem.txt
+++ linux/Documentation/scheduler/numa-problem.txt
@@ -133,6 +133,8 @@ XXX properties of this M vs a potential
2b) migrate memory towards 'n_i' using 2 samples.
+XXX include the statistical babble on double sampling somewhere near
+
This document is becoming less and less reflective of what
the code actually does :)
Post by Peter Zijlstra
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1501,6 +1502,18 @@ struct task_struct {
short il_next;
short pref_node_fork;
#endif
+#ifdef CONFIG_SCHED_NUMA
+ int numa_shared;
+ int numa_max_node;
+ int numa_scan_seq;
+ int numa_migrate_seq;
+ unsigned int numa_scan_period;
+ u64 node_stamp; /* migration stamp */
+ unsigned long numa_weight;
+ unsigned long *numa_faults;
+ struct callback_head numa_work;
+#endif /* CONFIG_SCHED_NUMA */
+
All these struct members could use comments explaining what
they are. Having a struct as central to the operation of
the kernel as task_struct full of undocumented members is a
bad idea - lets not make it worse.
Post by Peter Zijlstra
+/*
+ * -1: non-NUMA task
+ * 0: NUMA task with a dominantly 'private' working set
+ * 1: NUMA task with a dominantly 'shared' working set
+ */
+static inline int task_numa_shared(struct task_struct *p)
+{
+#ifdef CONFIG_SCHED_NUMA
+ return p->numa_shared;
+#else
+ return -1;
+#endif
+}
Just what is a "non-NUMA task"? That is not at all obvious, and
could use a better comment.
Post by Peter Zijlstra
Index: linux/include/uapi/linux/mempolicy.h
===================================================================
--- linux.orig/include/uapi/linux/mempolicy.h
+++ linux/include/uapi/linux/mempolicy.h
@@ -69,6 +69,7 @@ enum mpol_rebind_step {
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_HOME (1 << 4) /* this is the home-node policy */
What does that imply?

How is it different from migrate on fault?
Post by Peter Zijlstra
Index: linux/kernel/sched/core.c
===================================================================
--- linux.orig/kernel/sched/core.c
+++ linux/kernel/sched/core.c
@@ -1544,6 +1544,21 @@ static void __sched_fork(struct task_str
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
+
+#ifdef CONFIG_SCHED_NUMA
+ if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+ p->mm->numa_next_scan = jiffies;
+ p->mm->numa_scan_seq = 0;
+ }
+
+ p->numa_shared = -1;
+ p->node_stamp = 0ULL;
+ p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+ p->numa_migrate_seq = 2;
Why is it set to 2?

What happens when the number overflows? (can it?)

This kind of thing is just begging for a comment...
Post by Peter Zijlstra
@@ -5970,6 +5997,37 @@ static struct sched_domain_topology_leve
static struct sched_domain_topology_level *sched_domain_topology = default_topology;
+#ifdef CONFIG_SCHED_NUMA
+
+/*
* Set the preferred home node for a task. Hopefully the load
* balancer will move it later.
Post by Peter Zijlstra
+ */
Excellent, this function has a comment. Too bad it's empty.
You may want to fix that :)
Post by Peter Zijlstra
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+ unsigned long flags;
+ int on_rq, running;
+ struct rq *rq;
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_sched_numa_scan_period_min = 5000;
+unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
These two could do with longer comments, explaining why these
defaults are set to these values.
Post by Peter Zijlstra
+static void task_numa_placement(struct task_struct *p)
+{
+ int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+ unsigned long total[2] = { 0, 0 };
+ unsigned long faults, max_faults = 0;
+ int node, priv, shared, max_node = -1;
+
+ if (p->numa_scan_seq == seq)
+ return;
+
+ p->numa_scan_seq = seq;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ faults = 0;
+ for (priv = 0; priv < 2; priv++) {
+ faults += p->numa_faults[2*node + priv];
+ total[priv] += p->numa_faults[2*node + priv];
+ p->numa_faults[2*node + priv] /= 2;
+ }
What is "priv"?

If it is the fault type (not sure, but from reading the rest of
the code it looks like it might be), would it be better to do
this with an enum?

That way we can see some of the symbolic names of what we are
iterating over, and figure out what is going on.
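One possible shape for such an enum (purely illustrative — the patch itself uses bare 0/1 indices; the shared-first ordering is my inference from how task_numa_fault computes the index further down):

```c
/* Index into the per-node fault counter pairs in p->numa_faults. */
enum numa_fault_type {
	NUMA_FAULT_SHARED  = 0,	/* hit from a different cpu than the last one */
	NUMA_FAULT_PRIVATE = 1,	/* two consecutive hits from the same cpu */
	NUMA_FAULT_TYPES
};
```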
Post by Peter Zijlstra
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_node = node;
+ }
+ }
+
+ if (max_node != p->numa_max_node)
+ sched_setnuma(p, max_node, task_numa_shared(p));
+
+ p->numa_migrate_seq++;
+ if (sched_feat(NUMA_SETTLE) &&
+ p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+ return;
+
+ /*
+ * Note: shared is spread across multiple tasks and in the future
+ * we might want to consider a different equation below to reduce
+ * the impact of a small number of private memory accesses.
+ */
+ shared = (total[0] >= total[1] / 4);
That would also allow us to use the enum here, which would allow
me to figure out which of these indexes is used for shared faults,
and which for private ones.

Btw, is the 4 above the factor 2 from the changelog? :)
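For what it's worth, here is that comparison as a standalone helper, with the divisor as a parameter so the factor-2-in-the-changelog versus factor-4-in-the-code question is easy to play with (my naming, not the kernel's):

```c
#include <assert.h>

/*
 * total_shared = shared faults this task observed (only ~1/n of the
 * real total, n being the number of sharers); total_private = its
 * private faults. The private total is scaled down by 'factor' to
 * compensate. Returns 1 if the task is classified as shared.
 */
static int classify_shared(unsigned long total_shared,
			   unsigned long total_private, int factor)
{
	return total_shared >= total_private / factor;
}
```

With 10 shared and 30 private faults the task is shared under factor 4 but private under factor 2, so the choice of constant does matter for borderline workloads.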
Post by Peter Zijlstra
+ if (shared != task_numa_shared(p)) {
+ sched_setnuma(p, p->numa_max_node, shared);
+ p->numa_migrate_seq = 0;
+ }
+}
+
+/*
+ */
+void task_numa_fault(int node, int last_cpu, int pages)
Neither the comment nor the function name hint at the primary function
of this function: updating the numa fault statistics.
Post by Peter Zijlstra
+{
+ struct task_struct *p = current;
+ int priv = (task_cpu(p) == last_cpu);
One quick question: why are you using last_cpu and not simply the last
node, since the load balancer is free to move tasks around inside each
NUMA node?

I have some ideas on why you are doing it, but it would be good to
explicitly document it.
Post by Peter Zijlstra
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL);
+ if (!p->numa_faults)
+ return;
+ }
+
+ task_numa_placement(p);
+ p->numa_faults[2*node + priv] += pages;
+}
Ahhh, so private faults are the second number, and shared faults
the first one. Would have been nice if that had been documented
somewhere...
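So the layout appears to be a flat array of nr_node_ids pairs, shared counter first, private counter second. A hypothetical index helper would make that self-documenting:

```c
#include <assert.h>

/*
 * p->numa_faults is laid out as [nr_node_ids][2] counters:
 * slot 2*node + 0 holds shared faults, 2*node + 1 private faults.
 */
static int numa_fault_index(int node, int priv)
{
	return 2 * node + priv;
}
```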
Post by Peter Zijlstra
+/*
+ */
Yes, they are. I have read this part of the patch several times,
and am still not sure what exactly the code is doing, or why.
Post by Peter Zijlstra
+static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
+{
/*
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * 0 -- only allow improvement, or !numa
+ * 1 -- + worsen !ideal
+ * 2 priv
+ * 3 shared (everything)
+ *
+ * 1 .. nodes -- allow getting worse by step
+ * nodes+1 -- punt, everything goes!
+ *
+ * LBF_NUMA_RUN -- numa only, only allow improvement
+ * LBF_NUMA_SHARED -- shared only
+ *
+ * LBF_KEEP_SHARED -- do not touch shared tasks
*/
These comments do not explain why things are done this way,
nor are they verbose enough to even explain what they are doing.

It is taking a lot of scrolling through the patch to find where
this function is invoked with different iteration values. Documenting
that here would be nice.
Post by Peter Zijlstra
- tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
- if (!tsk_cache_hot ||
- env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-#ifdef CONFIG_SCHEDSTATS
- if (tsk_cache_hot) {
- schedstat_inc(env->sd, lb_hot_gained[env->idle]);
- schedstat_inc(p, se.statistics.nr_forced_migrations);
- }
+ /* a numa run can only move numa tasks about to improve things */
+ if (env->flags & LBF_NUMA_RUN) {
+ if (task_numa_shared(p) < 0)
+ return false;
What does <0 mean again? A comment would be good.
Post by Peter Zijlstra
+ /* can only pull shared tasks */
+ if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
+ return false;
Why?
Post by Peter Zijlstra
+ } else {
+ if (task_numa_shared(p) < 0)
+ goto try_migrate;
+ }
+
+ /* can not move shared tasks */
+ if ((env->flags & LBF_KEEP_SHARED) && task_numa_shared(p) == 1)
+ return false;
+
+ if (task_faults_up(p, env))
+ return true; /* memory locality beats cache hotness */
Does "task_faults_up" mean "move to a node with better memory locality"?
Post by Peter Zijlstra
+
+ if (env->iteration < 1)
+ return false;
+
+#ifdef CONFIG_SCHED_NUMA
+ if (p->numa_max_node != cpu_to_node(task_cpu(p))) /* !ideal */
+ goto demote;
#endif
- return 1;
- }
- if (tsk_cache_hot) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
- return 0;
- }
- return 1;
+ if (env->iteration < 2)
+ return false;
+
+ if (task_numa_shared(p) == 0) /* private */
+ goto demote;
It would be good to document why we are demoting in this case.
Post by Peter Zijlstra
+
+ if (env->iteration < 3)
+ return false;
+
+ if (env->iteration < 5)
+ return task_faults_down(p, env);
And why we are demoting if env->iteration is 3 or 4...
Post by Peter Zijlstra
@@ -3976,7 +4376,7 @@ struct sd_lb_stats {
unsigned long this_load;
unsigned long this_load_per_task;
unsigned long this_nr_running;
- unsigned long this_has_capacity;
+ unsigned int this_has_capacity;
unsigned int this_idle_cpus;
/* Statistics of the busiest group */
@@ -3985,10 +4385,28 @@ struct sd_lb_stats {
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
+ unsigned int busiest_has_capacity;
unsigned int busiest_group_weight;
int group_imb; /* Is there imbalance in this sd */
+
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long this_numa_running;
+ unsigned long this_numa_weight;
+ unsigned long this_shared_running;
+ unsigned long this_ideal_running;
+ unsigned long this_group_capacity;
+
+ struct sched_group *numa;
+ unsigned long numa_load;
+ unsigned long numa_nr_running;
+ unsigned long numa_numa_running;
+ unsigned long numa_shared_running;
+ unsigned long numa_ideal_running;
+ unsigned long numa_numa_weight;
+ unsigned long numa_group_capacity;
+ unsigned int numa_has_capacity;
+#endif
Same comment as for task_struct. It would be most useful to have
the members of this structure documented.
Post by Peter Zijlstra
@@ -4723,6 +5329,9 @@ static struct rq *find_busiest_queue(str
if (capacity && rq->nr_running == 1 && wl > env->imbalance)
continue;
+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
If the runqueue struct entries were documented, we would know what
this condition was testing. Please add documentation.
Post by Peter Zijlstra
+/*
+ * See can_migrate_numa_task()
+ */
Wait a moment. When I read that function, I wondered why it was
called with certain parameters, and the function setting that
parameter is referring me to the function that is being called?

Having some comment explaining what the strategy is would be useful,
to say the least.
Post by Peter Zijlstra
+static int lb_max_iteration(struct lb_env *env)
+{
+ if (!(env->sd->flags & SD_NUMA))
+ return 0;
+
+ if (env->flags & LBF_NUMA_RUN)
+ return 0; /* NUMA_RUN may only improve */
+
+ if (sched_feat_numa(NUMA_FAULTS_DOWN))
+ return 5; /* nodes^2 would suck */
+
+ return 3;
+}
}
/* All tasks on this runqueue were pinned by CPU affinity */
- if (unlikely(env.flags & LBF_ALL_PINNED)) {
- cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
- env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
- goto redo;
- }
- goto out_balanced;
+ if (unlikely(env.flags & LBF_ALL_PINNED))
+ goto out_pinned;
+
+ if (!ld_moved && env.iteration < lb_max_iteration(&env)) {
+ env.iteration++;
+ env.loop = 0;
+ goto more_balance;
}
Things are starting to make some sense. Overall, this code could use
better comments explaining why it does things the way it does.
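After piecing it together, the escalation seems to be: each failed balance pass relaxes one more constraint. A rough model of the gating (not the function itself, which additionally consults task_faults_up/task_faults_down):

```c
#include <assert.h>

/*
 * Which tasks may be moved to a worse node at a given balance
 * iteration:
 *   0  - none, only locality-improving moves
 *   1  - also tasks not on their ideal node
 *   2  - also private tasks
 *   3+ - also shared tasks (everything goes)
 */
static int may_demote(int iteration, int on_ideal_node, int is_shared)
{
	if (iteration >= 3)
		return 1;
	if (iteration >= 2 && !is_shared)
		return 1;
	if (iteration >= 1 && !on_ideal_node)
		return 1;
	return 0;
}
```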
Post by Peter Zijlstra
===================================================================
--- linux.orig/kernel/sched/features.h
+++ linux/kernel/sched/features.h
@@ -66,3 +66,12 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_SCHED_NUMA
+/* Do the working set probing faults: */
+SCHED_FEAT(NUMA, true)
+SCHED_FEAT(NUMA_FAULTS_UP, true)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_SETTLE, true)
+#endif
Are these documented somewhere?
--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Rik van Riel
2012-11-16 18:10:02 UTC
Post by Peter Zijlstra
1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse
This order ensures we prefer increasing memory locality but when
we do have to make hard decisions we prefer spreading private
over shared, because spreading shared tasks significantly
increases interconnect traffic, since not all memory can
follow.
Combined with the fact that we only turn a certain amount
of memory into NUMA ptes each second, could this result in
a program being classified as a private task one second,
and a shared task a few seconds later?

What does the code do to prevent such an oscillating of
task classification? (which would have consequences for
the way the task's NUMA placement is handled, and might
result in the task moving from node to node needlessly)
--
All rights reversed
Ingo Molnar
2012-11-16 18:20:02 UTC
Post by Peter Zijlstra
1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse
This order ensures we prefer increasing memory locality but when
we do have to make hard decisions we prefer spreading private
over shared, because spreading shared tasks significantly
increases interconnect traffic, since not all memory can
follow.
Combined with the fact that we only turn a certain amount of
memory into NUMA ptes each second, could this result in a
program being classified as a private task one second, and a
shared task a few seconds later?
It's a statistical method, like most of scheduling.

It's as prone to oscillation as tasks already are to being
moved spuriously by the load balancer today, due to the per-CPU
load average being statistical and tasks sitting slightly above
or below a critical load-average value.

Higher-frequency oscillation should not normally happen though;
we dampen these metrics and have per-CPU hysteresis.

( We can also add explicit hysteresis if anyone demonstrates
real oscillation with a real workload - wanted to keep it
simple first and change it only as-needed. )
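Part of that damping is visible in task_numa_placement, which halves every per-node fault counter each scan epoch before new faults accumulate — effectively an exponential moving average that weights recent epochs more. A sketch of that decay:

```c
#include <assert.h>

/* One scan epoch: the old counter is halved, then fresh faults add. */
static unsigned long decay_faults(unsigned long counter,
				  unsigned long new_faults)
{
	return counter / 2 + new_faults;
}
```

A node that stops faulting loses half its weight per epoch, so a one-off burst fades quickly instead of flipping the classification permanently.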

Thanks,

Ingo
Rik van Riel
2012-11-16 18:30:02 UTC
Post by Ingo Molnar
Post by Peter Zijlstra
1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse
This order ensures we prefer increasing memory locality but when
we do have to make hard decisions we prefer spreading private
over shared, because spreading shared tasks significantly
increases the interconnect bandwidth since not all memory can
follow.
Combined with the fact that we only turn a certain amount of
memory into NUMA ptes each second, could this result in a
program being classified as a private task one second, and a
shared task a few seconds later?
It's a statistical method, like most of scheduling.
It's as prone to oscillation as tasks are already prone to being
moved spuriously by the load balancer today, due to the per CPU
load average being statistical and them being slightly above or
below a critical load average value.
Higher freq oscillation should not happen normally though, we
dampen these metrics and have per CPU hysteresis.
( We can also add explicit hysteresis if anyone demonstrates
real oscillation with a real workload - wanted to keep it
simple first and change it only as-needed. )
This heuristic is by no means simple, and there still is no
explanation for the serious performance degradations that
were seen on a 4-node system running specjbb in four
node-sized JVMs.

I asked a number of questions on this patch yesterday, and
am hoping to get explanations at some point :)
--
All rights reversed
Andi Kleen
2012-11-29 19:40:02 UTC
Post by Peter Zijlstra
+
+ down_write(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+ change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
+ }
What happens if I have a 1TB process? Will you really unmap all of the
1TB in that timer?
Post by Peter Zijlstra
if (pol->flags & MPOL_F_LOCAL)
- polnid = numa_node_id();
+ best_nid = numa_node_id();
else
- polnid = pol->v.preferred_node;
+ best_nid = pol->v.preferred_node;
So that's not the local node anymore? That will change behaviour for
people using the NUMA affinity APIs explicitly. I don't think that's a
good idea; if someone set the affinity explicitly, the kernel had
better follow that.

If you want to change behaviour for non-DEFAULT policies like this,
please use a new policy type.

-Andi
--
***@linux.intel.com -- Speaking for myself only
Peter Zijlstra
2012-11-12 16:30:03 UTC
By accounting against the present PTEs, scanning speed reflects the
actual present (mapped) memory.

Suggested-by: Ingo Molnar <***@kernel.org>
Signed-off-by: Peter Zijlstra <***@chello.nl>
Cc: Linus Torvalds <***@linux-foundation.org>
Cc: Andrew Morton <***@linux-foundation.org>
Cc: Peter Zijlstra <***@chello.nl>
Cc: Andrea Arcangeli <***@redhat.com>
Cc: Rik van Riel <***@redhat.com>
Cc: Mel Gorman <***@suse.de>
Signed-off-by: Ingo Molnar <***@kernel.org>
---
include/linux/hugetlb.h | 8 ++++++--
include/linux/mm.h | 6 +++---
kernel/sched/fair.c | 37 +++++++++++++++++++++----------------
mm/hugetlb.c | 10 ++++++++--
mm/mprotect.c | 41 ++++++++++++++++++++++++++++++-----------
5 files changed, 68 insertions(+), 34 deletions(-)

Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_s
pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

#else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct
{
}

-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+ unsigned long address, unsigned long end, pgprot_t newprot)
+{
+ return 0;
+}

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1099,7 +1099,7 @@ extern unsigned long move_page_tables(st
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
-extern void change_protection(struct vm_area_struct *vma, unsigned long start,
+extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
int dirty_accountable);
extern int mprotect_fixup(struct vm_area_struct *vma,
@@ -1581,10 +1581,10 @@ static inline pgprot_t vma_prot_none(str
return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
}

-static inline void
+static inline unsigned long
change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
- change_protection(vma, start, end, vma_prot_none(vma), 0);
+ return change_protection(vma, start, end, vma_prot_none(vma), 0);
}

struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
Index: linux/kernel/sched/fair.c
===================================================================
--- linux.orig/kernel/sched/fair.c
+++ linux/kernel/sched/fair.c
@@ -914,8 +914,8 @@ void task_numa_work(struct callback_head
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
- unsigned long offset, end;
- long length;
+ unsigned long start, end;
+ long pages;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -942,30 +942,35 @@ void task_numa_work(struct callback_head
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- offset = mm->numa_scan_offset;
- length = sysctl_sched_numa_scan_size;
- length <<= 20;
+ start = mm->numa_scan_offset;
+ pages = sysctl_sched_numa_scan_size;
+ pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+ if (!pages)
+ return;

down_write(&mm->mmap_sem);
- vma = find_vma(mm, offset);
+ vma = find_vma(mm, start);
if (!vma) {
ACCESS_ONCE(mm->numa_scan_seq)++;
- offset = 0;
+ start = 0;
vma = mm->mmap;
}
- for (; vma && length > 0; vma = vma->vm_next) {
+ for (; vma; vma = vma->vm_next) {
if (!vma_migratable(vma))
continue;

- offset = max(offset, vma->vm_start);
- end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
- length -= end - offset;
-
- change_prot_none(vma, offset, end);
-
- offset = end;
+ do {
+ start = max(start, vma->vm_start);
+ end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+ end = min(end, vma->vm_end);
+ pages -= change_prot_none(vma, start, end);
+ start = end;
+ if (pages <= 0)
+ goto out;
+ } while (end != vma->vm_end);
}
- mm->numa_scan_offset = offset;
+out:
+ mm->numa_scan_offset = start;
up_write(&mm->mmap_sem);
}

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -3014,7 +3014,7 @@ same_page:
return i ? i : -EFAULT;
}

-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot)
{
struct mm_struct *mm = vma->vm_mm;
@@ -3022,6 +3022,7 @@ void hugetlb_change_protection(struct vm
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
+ unsigned long pages = 0;

BUG_ON(address >= end);
flush_cache_range(vma, address, end);
@@ -3032,12 +3033,15 @@ void hugetlb_change_protection(struct vm
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
- if (huge_pmd_unshare(mm, &address, ptep))
+ if (huge_pmd_unshare(mm, &address, ptep)) {
+ pages++;
continue;
+ }
if (!huge_pte_none(huge_ptep_get(ptep))) {
pte = huge_ptep_get_and_clear(mm, address, ptep);
pte = pte_mkhuge(pte_modify(pte, newprot));
set_huge_pte_at(mm, address, ptep, pte);
+ pages++;
}
}
spin_unlock(&mm->page_table_lock);
@@ -3049,6 +3053,8 @@ void hugetlb_change_protection(struct vm
*/
flush_tlb_range(vma, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+
+ return pages << h->order;
}

int hugetlb_reserve_pages(struct inode *inode,
Index: linux/mm/mprotect.c
===================================================================
--- linux.orig/mm/mprotect.c
+++ linux/mm/mprotect.c
@@ -28,12 +28,13 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
+ unsigned long pages = 0;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -53,6 +54,7 @@ static void change_pte_range(struct mm_s
ptent = pte_mkwrite(ptent);

ptep_modify_prot_commit(mm, addr, pte, ptent);
+ pages++;
} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);

@@ -65,18 +67,22 @@ static void change_pte_range(struct mm_s
set_pte_at(mm, addr, pte,
swp_entry_to_pte(entry));
}
+ pages++;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
+
+ return pages;
}

-static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pmd_t *pmd;
unsigned long next;
+ unsigned long pages = 0;

pmd = pmd_offset(pud, addr);
do {
@@ -84,35 +90,42 @@ static inline void change_pmd_range(stru
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma->vm_mm, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot))
+ else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+ pages += HPAGE_PMD_NR;
continue;
+ }
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
continue;
- change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+ pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
dirty_accountable);
} while (pmd++, addr = next, addr != end);
+
+ return pages;
}

-static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pud_t *pud;
unsigned long next;
+ unsigned long pages = 0;

pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(vma, pud, addr, next, newprot,
dirty_accountable);
} while (pud++, addr = next, addr != end);
+
+ return pages;
}

-static void change_protection_range(struct vm_area_struct *vma,
+static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
@@ -120,6 +133,7 @@ static void change_protection_range(stru
pgd_t *pgd;
unsigned long next;
unsigned long start = addr;
+ unsigned long pages = 0;

BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
@@ -128,24 +142,29 @@ static void change_protection_range(stru
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- change_pud_range(vma, pgd, addr, next, newprot,
+ pages += change_pud_range(vma, pgd, addr, next, newprot,
dirty_accountable);
} while (pgd++, addr = next, addr != end);
flush_tlb_range(vma, start, end);
+
+ return pages;
}

-void change_protection(struct vm_area_struct *vma, unsigned long start,
+unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
struct mm_struct *mm = vma->vm_mm;
+ unsigned long pages;

mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
- hugetlb_change_protection(vma, start, end, newprot);
+ pages = hugetlb_change_protection(vma, start, end, newprot);
else
- change_protection_range(vma, start, end, newprot, dirty_accountable);
+ pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
mmu_notifier_invalidate_range_end(mm, start, end);
+
+ return pages;
}

int


Peter Zijlstra
2012-11-12 16:30:03 UTC
The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <***@redhat.com>
Acked-by: Peter Zijlstra <***@chello.nl>
Cc: Peter Zijlstra <***@chello.nl>
Cc: Andrea Arcangeli <***@redhat.com>
Cc: Rik van Riel <***@redhat.com>
Cc: Mel Gorman <***@suse.de>
Cc: Linus Torvalds <***@linux-foundation.org>
Cc: Andrew Morton <***@linux-foundation.org>
Signed-off-by: Ingo Molnar <***@kernel.org>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+)

Index: linux/CREDITS
===================================================================
--- linux.orig/CREDITS
+++ linux/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
Index: linux/kernel/sched/fair.c
===================================================================
--- linux.orig/kernel/sched/fair.c
+++ linux/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <***@redhat.com>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -36,6 +36,8 @@
* (***@pdb.siemens.de)
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>


Peter Zijlstra
2012-11-12 16:30:04 UTC
Add THP migration for the NUMA working set scanning fault case.

It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.

Signed-off-by: Peter Zijlstra <***@chello.nl>
Cc: Johannes Weiner <***@cmpxchg.org>
Cc: Mel Gorman <***@suse.de>
Cc: Andrea Arcangeli <***@redhat.com>
Cc: Andrew Morton <***@linux-foundation.org>
Cc: Linus Torvalds <***@linux-foundation.org>
[ Significant fixes and changelog. ]
Signed-off-by: Ingo Molnar <***@kernel.org>
---
mm/huge_memory.c | 133 ++++++++++++++++++++++++++++++++++++++++++-------------
mm/migrate.c | 2
2 files changed, 104 insertions(+), 31 deletions(-)

Index: linux/mm/huge_memory.c
===================================================================
--- linux.orig/mm/huge_memory.c
+++ linux/mm/huge_memory.c
@@ -743,12 +743,13 @@ void do_huge_pmd_numa_page(struct mm_str
unsigned int flags, pmd_t entry)
{
unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *new_page = NULL;
struct page *page = NULL;
- int node;
+ int node, lru;

spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry)))
- goto out_unlock;
+ goto unlock;

if (unlikely(pmd_trans_splitting(entry))) {
spin_unlock(&mm->page_table_lock);
@@ -756,45 +757,117 @@ void do_huge_pmd_numa_page(struct mm_str
return;
}

-#ifdef CONFIG_NUMA
page = pmd_page(entry);
- VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ if (page) {
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));

- get_page(page);
+ get_page(page);
+ node = mpol_misplaced(page, vma, haddr);
+ if (node != -1)
+ goto migrate;
+ }
+
+fixup:
+ /* change back to regular protection */
+ entry = pmd_modify(entry, vma->vm_page_prot);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+
+unlock:
spin_unlock(&mm->page_table_lock);
+ if (page)
+ put_page(page);

- /*
- * XXX should we serialize against split_huge_page ?
- */
-
- node = mpol_misplaced(page, vma, haddr);
- if (node == -1)
- goto do_fixup;
-
- /*
- * Due to lacking code to migrate thp pages, we'll split
- * (which preserves the special PROT_NONE) and re-take the
- * fault on the normal pages.
- */
- split_huge_page(page);
- put_page(page);
return;

-do_fixup:
+migrate:
+ spin_unlock(&mm->page_table_lock);
+
+ lock_page(page);
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(*pmd, entry)))
- goto out_unlock;
-#endif
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(page);
+ put_page(page);
+ return;
+ }
+ spin_unlock(&mm->page_table_lock);

- /* change back to regular protection */
- entry = pmd_modify(entry, vma->vm_page_prot);
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
- update_mmu_cache_pmd(vma, address, entry);
+ new_page = alloc_pages_node(node,
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
+ HPAGE_PMD_ORDER);
+
+ if (!new_page)
+ goto alloc_fail;
+
+ lru = PageLRU(page);
+
+ if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+ goto alloc_fail;
+
+ if (!trylock_page(new_page))
+ BUG();
+
+ /* anon mapping, we can simply copy page->mapping to the new page: */
+ new_page->mapping = page->mapping;
+ new_page->index = page->index;

-out_unlock:
+ migrate_page_copy(new_page, page);
+
+ WARN_ON(PageLRU(new_page));
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ if (lru)
+ putback_lru_page(page);
+
+ unlock_page(new_page);
+ ClearPageActive(new_page); /* Set by migrate_page_copy() */
+ new_page->mapping = NULL;
+ put_page(new_page); /* Free it */
+
+ unlock_page(page);
+ put_page(page); /* Drop the local reference */
+
+ return;
+ }
+
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+
+ page_add_new_anon_rmap(new_page, vma, haddr);
+
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+ page_remove_rmap(page);
spin_unlock(&mm->page_table_lock);
- if (page)
+
+ put_page(page); /* Drop the rmap reference */
+
+ if (lru)
+ put_page(page); /* drop the LRU isolation reference */
+
+ unlock_page(new_page);
+ unlock_page(page);
+ put_page(page); /* Drop the local reference */
+
+ return;
+
+alloc_fail:
+ if (new_page)
+ put_page(new_page);
+
+ unlock_page(page);
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
put_page(page);
+ page = NULL;
+ goto unlock;
+ }
+ goto fixup;
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c
+++ linux/mm/migrate.c
@@ -417,7 +417,7 @@ int migrate_huge_page_move_mapping(struc
*/
void migrate_page_copy(struct page *newpage, struct page *page)
{
- if (PageHuge(page))
+ if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);
else
copy_highpage(newpage, page);


Ingo Molnar
2012-11-12 18:50:02 UTC
Permalink
Hi,
This series implements an improved version of NUMA scheduling,
based on the review and testing feedback we got.
[...]
This new scheduler code is then able to group tasks that are
"memory related" via their memory access patterns together: in
the NUMA context moving them on the same node if possible, and
spreading them amongst nodes if they use private memory.
Here are some preliminary performance figures, comparing the
vanilla kernel against the CONFIG_SCHED_NUMA=y kernel.

Java SPEC benchmark, running on a 4 node, 64 GB, 32-way server
system (higher numbers are better):

v3.7-vanilla: run #1: 475630
run #2: 538271
run #3: 533888
run #4: 431525
----------------------------------
avg: 494828 transactions/sec

v3.7-NUMA: run #1: 626692
run #2: 622069
run #3: 630335
run #4: 629817
----------------------------------
avg: 627228 transactions/sec [ +26.7% ]

Beyond the +26.7% performance improvement in throughput, the
standard deviation of the results is much lower as well with
NUMA scheduling enabled, by about an order of magnitude.

[ That is probably so because memory and task placement is more
balanced with NUMA scheduling enabled - while with the vanilla
kernel initial placement of the working set determines the
final performance figure. ]
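
The quoted averages and the +26.7% gain can be reproduced from the four raw runs above. A quick sanity-check sketch (the exact rounding of the quoted figures is an assumption; the mail appears to truncate):

```python
# Sanity-check the SPECjbb averages and the quoted gain from the
# four raw runs of each kernel (transactions/sec).
vanilla = [475630, 538271, 533888, 431525]
numa = [626692, 622069, 630335, 629817]

avg_vanilla = sum(vanilla) / len(vanilla)   # 494828.5
avg_numa = sum(numa) / len(numa)            # 627228.25

# ~26.76%; the mail quotes this truncated to +26.7%
gain = (avg_numa / avg_vanilla - 1) * 100
print(int(avg_vanilla), int(avg_numa), round(gain, 2))
```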

I've also tested Andrea's 'autonumabench' benchmark suite
against vanilla and the NUMA kernel, because Mel reported that
the CONFIG_SCHED_NUMA=y code regressed. It does not regress
anymore:

#
# NUMA01
#
perf stat --null --repeat 3 ./numa01

v3.7-vanilla: 340.3 seconds ( +/- 0.31% )
v3.7-NUMA: 216.9 seconds [ +56% ] ( +/- 8.32% )
-------------------------------------
v3.7-HARD_BIND: 166.6 seconds

Here the new NUMA code is faster than vanilla by 56% - that is
because with the vanilla kernel all memory is allocated on
node0, overloading that node's memory bandwidth.

[ Standard deviation on the vanilla kernel is low, because the
autonuma test causes close to the worst-case placement for the
vanilla kernel - and there's not much space to deviate away
from the worst-case. Despite that, stddev in the NUMA seems a
tad high, suggesting further room for improvement. ]

#
# NUMA01_THREAD_ALLOC
#
perf stat --null --repeat 3 ./numa01_THREAD_ALLOC

v3.7-vanilla: 425.1 seconds ( +/- 1.04% )
v3.7-NUMA: 118.7 seconds [ +250% ] ( +/- 0.49% )
-------------------------------------
v3.7-HARD_BIND: 200.56 seconds

Here the NUMA kernel was able to go beyond the (naive)
hard-binding result and achieved 3.5x the performance of the
vanilla kernel, with a low stddev.

#
# NUMA02
#
perf stat --null --repeat 3 ./numa02

v3.7-vanilla: 56.1 seconds ( +/- 0.72% )
v3.7-NUMA: 17.0 seconds [ +230% ] ( +/- 0.18% )
-------------------------------------
v3.7-HARD_BIND: 14.9 seconds

Here the NUMA kernel runs the test much (3.3x) faster than the
vanilla kernel. The workload is able to converge very quickly
and approximate the hard-binding ideal number very closely. If
runtime was a bit longer it would approximate it even closer.

Standard deviation is also 3 times lower than vanilla,
suggesting stable NUMA convergence.

#
# NUMA02_SMT
#
perf stat --null --repeat 3 ./numa02_SMT
v3.7-vanilla: 56.1 seconds ( +- 0.42% )
v3.7-NUMA: 17.3 seconds [ +220% ] ( +- 0.88% )
-------------------------------------
v3.7-HARD_BIND: 14.6 seconds

In this test too the NUMA kernel outperforms the vanilla kernel,
by a factor of 3.2x. It comes very close to the ideal
hard-binding convergence result. Standard deviation is a bit
high.
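
The speedup factors quoted for the four autonumabench tests follow directly from the elapsed times above. A small sanity-check sketch (elapsed seconds copied from the results; lower is better):

```python
# Check the quoted speedups from the autonumabench elapsed times.
times = {
    # test: (vanilla, sched_numa, hard_bind) in seconds
    "numa01":              (340.3, 216.9, 166.6),
    "numa01_THREAD_ALLOC": (425.1, 118.7, 200.56),
    "numa02":              (56.1,  17.0,  14.9),
    "numa02_SMT":          (56.1,  17.3,  14.6),
}
for test, (vanilla, numa, hard) in times.items():
    speedup = vanilla / numa   # e.g. numa02: 56.1/17.0 ~ 3.3x
    vs_bind = hard / numa      # >1.0 means NUMA beat the hard-bound run
    print(f"{test}: {speedup:.2f}x vs vanilla, {vs_bind:.2f} vs hard-bind")
```

Note that only numa01_THREAD_ALLOC has a ratio above 1.0 against hard binding, matching the observation that it went beyond the naive hard-binding result.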

I have also created a new perf benchmarking and workload
generation tool: 'perf bench numa' (I'll post it later in a
separate reply).

Via 'perf bench numa' we can generate arbitrary process and
thread layouts, with arbitrary memory sharing arrangements
between them.

Here are various comparisons to the vanilla kernel (higher
numbers are better):

#
# 4 processes with 4 threads per process, sharing 4x 1GB of
# process-wide memory:
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 1024 -T 0
#
v3.7-vanilla: 14.8 GB/sec
v3.7-NUMA: 32.9 GB/sec [ +122.3% ]

2.2 times faster.

#
# 4 processes with 4 threads per process, each thread using 1 GB
# of thread-local memory:
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 0 -T 1024
#

v3.7-vanilla: 17.0 GB/sec
v3.7-NUMA: 36.3 GB/sec [ +113.5% ]

2.1 times faster.

So it's a nice improvement all around. With this version the
regressions that Mel Gorman reported a week ago appear to be
fixed as well.

Thanks,

Ingo

ps. If anyone is curious about further details, let me know.
The base kernel I used for measurement was commit
02743c9c03f1 + the 8 patches Peter sent out.
Mel Gorman
2012-11-15 10:10:01 UTC
Permalink
Post by Ingo Molnar
Hi,
This series implements an improved version of NUMA scheduling,
based on the review and testing feedback we got.
[...]
This new scheduler code is then able to group tasks that are
"memory related" via their memory access patterns together: in
the NUMA context moving them on the same node if possible, and
spreading them amongst nodes if they use private memory.
Here are some preliminary performance figures, comparing the
vanilla kernel against the CONFIG_SCHED_NUMA=y kernel.
Java SPEC benchmark, running on a 4 node, 64 GB, 32-way server
Ok, I used a 4-node, 64G, 48-way server system. We have different CPUs
but the same number of nodes. In case it makes a difference, each of my
machine's nodes is the same size.
Post by Ingo Molnar
v3.7-vanilla: run #1: 475630
run #2: 538271
run #3: 533888
run #4: 431525
----------------------------------
avg: 494828 transactions/sec
v3.7-NUMA: run #1: 626692
run #2: 622069
run #3: 630335
run #4: 629817
----------------------------------
avg: 627228 transactions/sec [ +26.7% ]
Beyond the +26.7% performance improvement in throughput, the
standard deviation of the results is much lower as well with
NUMA scheduling enabled, by about an order of magnitude.
[ That is probably so because memory and task placement is more
balanced with NUMA scheduling enabled - while with the vanilla
kernel initial placement of the working set determines the
final performance figure. ]
I did not see the same results. I used 3.7-rc4 as a baseline as it's what
I'm developing against currently. For your patches I pulled tip/sched/core
and then applied the patches you posted to the mailing list on top. It
means my tree looks different to yours but it was necessary if I was going
to do a like-with-like comparison. I also rebased Andrea's autonuma28fast
branch from his git tree onto 3.7-rc4 (some mess, but nothing very serious).

As before, I'm cutting this report short

SPECJBB BOPS
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34 rc4-schednuma-v2r3 rc4-autonuma-v28fast
Mean 1 25034.25 ( 0.00%) 20598.50 (-17.72%) 25192.25 ( 0.63%)
Mean 2 53176.00 ( 0.00%) 43906.50 (-17.43%) 55508.25 ( 4.39%)
Mean 3 77350.50 ( 0.00%) 60342.75 (-21.99%) 82122.50 ( 6.17%)
Mean 4 99919.50 ( 0.00%) 80781.75 (-19.15%) 107233.25 ( 7.32%)
Mean 5 119797.00 ( 0.00%) 97870.00 (-18.30%) 131016.00 ( 9.37%)
Mean 6 135858.00 ( 0.00%) 123912.50 ( -8.79%) 152444.75 ( 12.21%)
Mean 7 136074.00 ( 0.00%) 126574.25 ( -6.98%) 157372.75 ( 15.65%)
Mean 8 132426.25 ( 0.00%) 121766.00 ( -8.05%) 161655.25 ( 22.07%)
Mean 9 129432.75 ( 0.00%) 114224.25 (-11.75%) 160530.50 ( 24.03%)
Mean 10 118399.75 ( 0.00%) 109040.50 ( -7.90%) 158692.00 ( 34.03%)
Mean 11 119604.00 ( 0.00%) 105566.50 (-11.74%) 154462.00 ( 29.14%)
Mean 12 112742.25 ( 0.00%) 101728.75 ( -9.77%) 149546.00 ( 32.64%)
Mean 13 109480.75 ( 0.00%) 103737.50 ( -5.25%) 144929.25 ( 32.38%)
Mean 14 109724.00 ( 0.00%) 103516.00 ( -5.66%) 143804.50 ( 31.06%)
Mean 15 109111.75 ( 0.00%) 100817.00 ( -7.60%) 141878.00 ( 30.03%)
Mean 16 105385.75 ( 0.00%) 99327.25 ( -5.75%) 140156.75 ( 32.99%)
Mean 17 101903.50 ( 0.00%) 96464.50 ( -5.34%) 138402.00 ( 35.82%)
Mean 18 103632.50 ( 0.00%) 95632.50 ( -7.72%) 137781.50 ( 32.95%)
Stddev 1 1195.76 ( 0.00%) 358.07 ( 70.06%) 861.97 ( 27.91%)
Stddev 2 883.39 ( 0.00%) 1203.29 (-36.21%) 855.08 ( 3.20%)
Stddev 3 997.25 ( 0.00%) 3755.67 (-276.60%) 545.50 ( 45.30%)
Stddev 4 1115.16 ( 0.00%) 6390.65 (-473.07%) 1183.49 ( -6.13%)
Stddev 5 1367.09 ( 0.00%) 9710.70 (-610.32%) 1022.09 ( 25.24%)
Stddev 6 1125.22 ( 0.00%) 1097.83 ( 2.43%) 1013.52 ( 9.93%)
Stddev 7 3211.72 ( 0.00%) 1533.62 ( 52.25%) 512.61 ( 84.04%)
Stddev 8 4194.96 ( 0.00%) 1518.26 ( 63.81%) 493.64 ( 88.23%)
Stddev 9 6175.10 ( 0.00%) 2648.75 ( 57.11%) 2109.83 ( 65.83%)
Stddev 10 4754.87 ( 0.00%) 1941.47 ( 59.17%) 2948.98 ( 37.98%)
Stddev 11 2706.18 ( 0.00%) 1247.95 ( 53.89%) 5907.16 (-118.28%)
Stddev 12 3607.76 ( 0.00%) 663.63 ( 81.61%) 9063.28 (-151.22%)
Stddev 13 2771.67 ( 0.00%) 1447.87 ( 47.76%) 8716.51 (-214.49%)
Stddev 14 2522.18 ( 0.00%) 1510.28 ( 40.12%) 9286.98 (-268.21%)
Stddev 15 2711.16 ( 0.00%) 1719.54 ( 36.58%) 9895.88 (-265.01%)
Stddev 16 2797.21 ( 0.00%) 983.63 ( 64.84%) 9302.92 (-232.58%)
Stddev 17 4019.85 ( 0.00%) 1927.25 ( 52.06%) 9998.34 (-148.72%)
Stddev 18 3332.20 ( 0.00%) 1401.68 ( 57.94%) 12056.08 (-261.80%)
TPut 1 100137.00 ( 0.00%) 82394.00 (-17.72%) 100769.00 ( 0.63%)
TPut 2 212704.00 ( 0.00%) 175626.00 (-17.43%) 222033.00 ( 4.39%)
TPut 3 309402.00 ( 0.00%) 241371.00 (-21.99%) 328490.00 ( 6.17%)
TPut 4 399678.00 ( 0.00%) 323127.00 (-19.15%) 428933.00 ( 7.32%)
TPut 5 479188.00 ( 0.00%) 391480.00 (-18.30%) 524064.00 ( 9.37%)
TPut 6 543432.00 ( 0.00%) 495650.00 ( -8.79%) 609779.00 ( 12.21%)
TPut 7 544296.00 ( 0.00%) 506297.00 ( -6.98%) 629491.00 ( 15.65%)
TPut 8 529705.00 ( 0.00%) 487064.00 ( -8.05%) 646621.00 ( 22.07%)
TPut 9 517731.00 ( 0.00%) 456897.00 (-11.75%) 642122.00 ( 24.03%)
TPut 10 473599.00 ( 0.00%) 436162.00 ( -7.90%) 634768.00 ( 34.03%)
TPut 11 478416.00 ( 0.00%) 422266.00 (-11.74%) 617848.00 ( 29.14%)
TPut 12 450969.00 ( 0.00%) 406915.00 ( -9.77%) 598184.00 ( 32.64%)
TPut 13 437923.00 ( 0.00%) 414950.00 ( -5.25%) 579717.00 ( 32.38%)
TPut 14 438896.00 ( 0.00%) 414064.00 ( -5.66%) 575218.00 ( 31.06%)
TPut 15 436447.00 ( 0.00%) 403268.00 ( -7.60%) 567512.00 ( 30.03%)
TPut 16 421543.00 ( 0.00%) 397309.00 ( -5.75%) 560627.00 ( 32.99%)
TPut 17 407614.00 ( 0.00%) 385858.00 ( -5.34%) 553608.00 ( 35.82%)
TPut 18 414530.00 ( 0.00%) 382530.00 ( -7.72%) 551126.00 ( 32.95%)

It is important to know how this was configured. I was running one JVM
per node and the JVMs were sized so that each fits within its node. This
is a semi-ideal configuration because it could also be hard-bound for
best performance on the vanilla kernel. You did not say whether you ran
with a single JVM or multiple JVMs, and it matters.

The mean values are based on the individual throughput figures reported
by each JVM. schednuma regresses against mainline quite badly. For low
numbers of warehouses it also deviates more but it's much steadier for
higher numbers of warehouses. In terms of overall throughput though,
it's worse.

autonuma deviates a *lot* with massive variances between the JVMs.
However, the average and total throughput is very high.
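
The relationship between the Mean and TPut rows can be reconstructed as follows. A sketch assuming four JVMs, one per node, as described; the individual per-JVM figures are not given in the mail, so the example values are illustrative:

```python
# With one JVM per node on a 4-node box, the per-warehouse "TPut"
# row is the sum of the four per-JVM throughput figures, and
# "Mean" is that sum divided by the number of JVMs.
NUM_JVMS = 4

def summarize(per_jvm_bops):
    tput = sum(per_jvm_bops)
    mean = tput / NUM_JVMS
    return mean, tput

# e.g. the 7-warehouse vanilla row: Mean 136074.00, TPut 544296.00
# (any four per-JVM values with this sum reproduce the row)
mean, tput = summarize([136074.0] * NUM_JVMS)
print(mean, tput)   # 136074.0 544296.0
```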

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34 rc4-schednuma-v2r3 rc4-autonuma-v28fast
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 450969.00 ( 0.00%) 406915.00 ( -9.77%) 598184.00 ( 32.64%)
Actual Warehouse 7.00 ( 0.00%) 7.00 ( 0.00%) 8.00 ( 14.29%)
Actual Peak Bops 544296.00 ( 0.00%) 506297.00 ( -6.98%) 646621.00 ( 18.80%)

There is no major difference in terms of scalability. Both peak at
around the 7 warehouse mark; autonuma peaked at 8, but you can see from
the figures that it was not by a whole lot. autonuma's actual peak
operations figure was very high (an 18% gain) whereas schednuma
regressed by close to 7%.
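
The "Actual Warehouse" and "Actual Peak Bops" rows follow mechanically from the TPut series above; a sketch using the vanilla column:

```python
# Find the actual peak from the vanilla TPut series
# (warehouses 1..18, values copied from the table above).
tput_vanilla = [
    100137, 212704, 309402, 399678, 479188, 543432, 544296, 529705,
    517731, 473599, 478416, 450969, 437923, 438896, 436447, 421543,
    407614, 414530,
]
peak_bops = max(tput_vanilla)
peak_warehouse = tput_vanilla.index(peak_bops) + 1  # 1-based warehouse count
print(peak_warehouse, peak_bops)   # 7 544296
```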

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34rc4-schednuma-v2r3rc4-autonuma-v28fast
User 101949.84 86817.79 101748.80
System 66.05 13094.99 191.40
Elapsed 2456.35 2459.16 2451.96

system CPU time is high for schednuma. autonuma reports low system CPU
usage, but since it does much of its work in kernel threads, the figure
cannot be considered reliable: kernel-thread time is not captured here.
Post by Ingo Molnar
I've also tested Andrea's 'autonumabench' benchmark suite
against vanilla and the NUMA kernel, because Mel reported that
the CONFIG_SCHED_NUMA=y code regressed. It does not regress anymore:
#
# NUMA01
#
perf stat --null --repeat 3 ./numa01
v3.7-vanilla: 340.3 seconds ( +/- 0.31% )
v3.7-NUMA: 216.9 seconds [ +56% ] ( +/- 8.32% )
-------------------------------------
v3.7-HARD_BIND: 166.6 seconds
Here the new NUMA code is faster than vanilla by 56% - that is
because with the vanilla kernel all memory is allocated on
node0, overloading that node's memory bandwidth.
[ Standard deviation on the vanilla kernel is low, because the
autonuma test causes close to the worst-case placement for the
vanilla kernel - and there's not much space to deviate away
from the worst-case. Despite that, stddev in the NUMA seems a
tad high, suggesting further room for improvement. ]
For machines with more than 2 nodes, numa01 is an adverse workload.
Post by Ingo Molnar
#
# NUMA01_THREAD_ALLOC
#
perf stat --null --repeat 3 ./numa01_THREAD_ALLOC
v3.7-vanilla: 425.1 seconds ( +/- 1.04% )
v3.7-NUMA: 118.7 seconds [ +250% ] ( +/- 0.49% )
-------------------------------------
v3.7-HARD_BIND: 200.56 seconds
Here the NUMA kernel was able to go beyond the (naive)
hard-binding result and achieved 3.5x the performance of the
vanilla kernel, with a low stddev.
#
# NUMA02
#
perf stat --null --repeat 3 ./numa02
v3.7-vanilla: 56.1 seconds ( +/- 0.72% )
v3.7-NUMA: 17.0 seconds [ +230% ] ( +/- 0.18% )
-------------------------------------
v3.7-HARD_BIND: 14.9 seconds
Here the NUMA kernel runs the test much (3.3x) faster than the
vanilla kernel. The workload is able to converge very quickly
and approximate the hard-binding ideal number very closely. If
runtime was a bit longer it would approximate it even closer.
Standard deviation is also 3 times lower than vanilla,
suggesting stable NUMA convergence.
#
# NUMA02_SMT
#
perf stat --null --repeat 3 ./numa02_SMT
v3.7-vanilla: 56.1 seconds ( +- 0.42% )
v3.7-NUMA: 17.3 seconds [ +220% ] ( +- 0.88% )
-------------------------------------
v3.7-HARD_BIND: 14.6 seconds
In this test too the NUMA kernel outperforms the vanilla kernel,
by a factor of 3.2x. It comes very close to the ideal
hard-binding convergence result. Standard deviation is a bit
high.
With this benchark, I'm generally seeing very good results in terms of
elapsed time.

AUTONUMA BENCH
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34 rc4-schednuma-v2r3 rc4-autonuma-v28fast
User NUMA01 67351.66 ( 0.00%) 47146.57 ( 30.00%) 30273.64 ( 55.05%)
User NUMA01_THEADLOCAL 54788.28 ( 0.00%) 17198.99 ( 68.61%) 17039.73 ( 68.90%)
User NUMA02 7179.87 ( 0.00%) 2096.07 ( 70.81%) 2099.85 ( 70.75%)
User NUMA02_SMT 3028.11 ( 0.00%) 998.22 ( 67.03%) 1052.97 ( 65.23%)
System NUMA01 45.68 ( 0.00%) 3531.04 (-7629.95%) 423.91 (-828.00%)
System NUMA01_THEADLOCAL 40.92 ( 0.00%) 926.72 (-2164.71%) 188.15 (-359.80%)
System NUMA02 1.72 ( 0.00%) 23.64 (-1274.42%) 27.37 (-1491.28%)
System NUMA02_SMT 0.92 ( 0.00%) 8.18 (-789.13%) 18.43 (-1903.26%)
Elapsed NUMA01 1514.61 ( 0.00%) 1122.78 ( 25.87%) 722.66 ( 52.29%)
Elapsed NUMA01_THEADLOCAL 1264.08 ( 0.00%) 393.79 ( 68.85%) 391.48 ( 69.03%)
Elapsed NUMA02 181.88 ( 0.00%) 49.44 ( 72.82%) 61.55 ( 66.16%)
Elapsed NUMA02_SMT 168.41 ( 0.00%) 47.49 ( 71.80%) 54.72 ( 67.51%)
CPU NUMA01 4449.00 ( 0.00%) 4513.00 ( -1.44%) 4247.00 ( 4.54%)
CPU NUMA01_THEADLOCAL 4337.00 ( 0.00%) 4602.00 ( -6.11%) 4400.00 ( -1.45%)
CPU NUMA02 3948.00 ( 0.00%) 4287.00 ( -8.59%) 3455.00 ( 12.49%)
CPU NUMA02_SMT 1798.00 ( 0.00%) 2118.00 (-17.80%) 1957.00 ( -8.84%)

On NUMA01, I'm seeing a large gain for schednuma. The test was not run
multiple times so I do not know how much it deviates by on each run.
However, the system CPU usage was again very high.

NUMA01_THEADLOCAL figures were comparable with autonuma. The system CPU
usage was high. As before, autonuma's looks low, but with the kernel
threads we cannot be sure.

schednuma was a clear winner on NUMA02 and NUMA02_SMT.

So for the synthetic benchmarks, schednuma looks good in terms of
elapsed time. On specjbb though, it is not looking great and this may be
due to differences in how we configured the JVMs.

I would have some comparison data with my own stuff but unfortunately
the machine crashed when running tests with schednuma. That said, I
expect the figures would have been bad had the tests run. With V2, the
CPU-follows placement policy is broken, as is PMD handling. In my
current tree I'm expecting the system CPU usage to also be high, but I
won't know for sure until later today.

The machine was meant to test all this overnight but unfortunately,
while running a kernel build benchmark on the schednuma patches, the
machine hung while downloading the tarball with this oops:

[ 73.863226] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 73.871062] IP: [<ffffffff8146feaa>] skb_gro_receive+0xaa/0x590
[ 73.876983] PGD 0
[ 73.878998] Oops: 0002 [#1] PREEMPT SMP
[ 73.882938] Modules linked in: af_packet mperf kvm_intel coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd sr_mod lrw cdrom aes_x86_64 ses pcspkr xts i7core_edac ata_piix enclosure lpc_ich dcdbas sg gf128mul mfd_core bnx2 edac_core wmi acpi_power_meter button serio_raw joydev microcode autofs4 processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_alua scsi_dh_emc scsi_dh ata_generic megaraid_sas pata_atiixp [last unloaded: oprofile]
[ 73.924659] CPU 0
[ 73.926493] Pid: 0, comm: swapper/0 Not tainted 3.7.0-rc4-schednuma-v2r3 #1 Dell Inc. PowerEdge R810/0TT6JF
[ 73.936380] RIP: 0010:[<ffffffff8146feaa>] [<ffffffff8146feaa>] skb_gro_receive+0xaa/0x590
[ 73.944714] RSP: 0018:ffff88047f803b50 EFLAGS: 00010282
[ 73.950004] RAX: 0000000000000000 RBX: ffff88046c2bdbc0 RCX: 0000000000000900
[ 73.957113] RDX: 00000000000005a8 RSI: ffff88046c2bdbc0 RDI: ffff88046eadb800
[ 73.964221] RBP: ffff88047f803bb0 R08: 00000000000005dc R09: ffff88046ddeccc0
[ 73.971328] R10: ffff88086d795d78 R11: 0000000000000001 R12: ffff880462b282c0
[ 73.978436] R13: 0000000000000034 R14: 00000000000005a8 R15: ffff88046eadbec0
[ 73.985543] FS: 0000000000000000(0000) GS:ffff88047f800000(0000) knlGS:0000000000000000
[ 73.993602] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 73.999326] CR2: 0000000000000000 CR3: 0000000001a0c000 CR4: 00000000000007f0
[ 74.006435] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 74.013543] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 74.020651] Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a14420)
[ 74.028883] Stack:
[ 74.030885] 0000000000000060 ffff880462b282c0 ffff88086d795d78 ffffffff000005dc
[ 74.038300] ffff88046e5f46c0 000000606a275ec0 0000000000000000 ffff88046c2bdbc0
[ 74.045715] 00000000000005a8 ffff88086d795d78 00000000000005a8 000000006c001080
[ 74.053131] Call Trace:
[ 74.055567] <IRQ>
[ 74.057486] [<ffffffff814b9573>] tcp_gro_receive+0x213/0x2b0
[ 74.063419] [<ffffffff814cff49>] tcp4_gro_receive+0x99/0x110
[ 74.069150] [<ffffffff814e096d>] inet_gro_receive+0x1cd/0x200
[ 74.074965] [<ffffffff8147b30a>] dev_gro_receive+0x1ba/0x2b0
[ 74.080691] [<ffffffff8147b6e3>] napi_gro_receive+0xe3/0x130
[ 74.086426] [<ffffffffa009fda8>] bnx2_rx_int+0x3e8/0xf10 [bnx2]
[ 74.092416] [<ffffffffa00a0cbd>] bnx2_poll_work+0x3ed/0x450 [bnx2]
[ 74.098666] [<ffffffffa00a0d5e>] bnx2_poll_msix+0x3e/0xc0 [bnx2]
[ 74.104739] [<ffffffff8147b969>] net_rx_action+0x159/0x290
[ 74.110298] [<ffffffff8104d148>] __do_softirq+0xc8/0x250
[ 74.115682] [<ffffffff8107bf9e>] ? sched_clock_idle_wakeup_event+0x1e/0x20
[ 74.122625] [<ffffffff81577c9c>] call_softirq+0x1c/0x30
[ 74.127922] [<ffffffff8100470d>] do_softirq+0x6d/0xa0
[ 74.133041] [<ffffffff8104d44d>] irq_exit+0xad/0xc0
[ 74.137996] [<ffffffff8107779d>] scheduler_ipi+0x5d/0x110
[ 74.143469] [<ffffffff8102b7a4>] ? native_apic_msr_eoi_write+0x14/0x20
[ 74.150060] [<ffffffff810257d5>] smp_reschedule_interrupt+0x25/0x30
[ 74.156394] [<ffffffff8157785d>] reschedule_interrupt+0x6d/0x80
[ 74.162376] <EOI>
[ 74.164295] [<ffffffff81316798>] ? intel_idle+0xe8/0x150
[ 74.169875] [<ffffffff81316779>] ? intel_idle+0xc9/0x150
[ 74.175259] [<ffffffff8143de99>] cpuidle_enter+0x19/0x20
[ 74.180642] [<ffffffff8143e522>] cpuidle_idle_call+0xa2/0x340
[ 74.186458] [<ffffffff8100baca>] cpu_idle+0x7a/0xf0
[ 74.191410] [<ffffffff8154b44b>] rest_init+0x7b/0x80
[ 74.196447] [<ffffffff81ac3be2>] start_kernel+0x38f/0x39c
[ 74.201913] [<ffffffff81ac3652>] ? repair_env_string+0x5e/0x5e
[ 74.207815] [<ffffffff81ac3335>] x86_64_start_reservations+0x131/0x135
[ 74.214407] [<ffffffff81ac3439>] x86_64_start_kernel+0x100/0x10f
[ 74.220475] Code: 8b e8 00 00 00 0f 87 86 00 00 00 8b 53 68 8b 43 6c 44 29 ea 39 d0 89 53 68 0f 87 c7 04 00 00 4c 01 ab e0 00 00 00 49 8b 44 24 08 <48> 89 18 49 89 5c 24 08 0f b6 43 7c a8 10 0f 85 ac 04 00 00 83
[ 74.240051] RIP [<ffffffff8146feaa>] skb_gro_receive+0xaa/0x590
[ 74.246046] RSP <ffff88047f803b50>
[ 74.249518] CR2: 0000000000000000
[ 74.252821] ---[ end trace 97cb529523f52c9b ]---
[ 74.258895] Kernel panic - not syncing: Fatal exception in interrupt
-- 0:console -- time-stamp -- Nov/15/12 3:09:06 --

I've no idea if it is directly related to your patches and I didn't try
to reproduce it yet.
Post by Ingo Molnar
generation tool: 'perf bench numa' (I'll post it later in a
separate reply).
Via 'perf bench numa' we can generate arbitrary process and
thread layouts, with arbitrary memory sharing arrangements
between them.
Here are various comparisons to the vanilla kernel (higher
#
# 4 processes with 4 threads per process, sharing 4x 1GB of
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 1024 -T 0
#
v3.7-vanilla: 14.8 GB/sec
v3.7-NUMA: 32.9 GB/sec [ +122.3% ]
2.2 times faster.
#
# 4 processes with 4 threads per process, sharing 4x 1GB of
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 0 -T 1024
#
v3.7-vanilla: 17.0 GB/sec
v3.7-NUMA: 36.3 GB/sec [ +113.5% ]
2.1 times faster.
That is really cool.
Post by Ingo Molnar
So it's a nice improvement all around. With this version the
regressions that Mel Gorman reported a week ago appear to be
fixed as well.
Unfortunately I cannot concur. I'm still seeing high system CPU usage in
places and the specjbb figures are rather unfortunate.
--
Mel Gorman
SUSE Labs
Rik van Riel
2012-11-15 19:00:02 UTC
Permalink
Post by Mel Gorman
Post by Ingo Molnar
Here are some preliminary performance figures, comparing the
vanilla kernel against the CONFIG_SCHED_NUMA=y kernel.
Java SPEC benchmark, running on a 4 node, 64 GB, 32-way server
Ok, I used a 4-node, 64G, 48-way server system. We have different CPUs
but the same number of nodes. In case it makes a difference each of my
machines nodes are the same size.
Mel, do you have info on exactly what model system you
were running these tests on?

Obviously your results are very different from the ones
that Ingo saw. It would be most helpful if we could find
a similar system in one of the Red Hat labs, so Ingo can
play around with it and see what's going on :)
Post by Mel Gorman
Post by Ingo Molnar
Beyond the +26.7% performance improvement in throughput, the
standard deviation of the results is much lower as well with
NUMA scheduling enabled, by about an order of magnitude.
I did not see the same results. I used 3.7-rc4 as a baseline as it's what
I'm developing against currently. For your patches I pulled tip/sched/core
and then applied the patches you posted to the mailing list on top. It
means my tree looks different to yours but it was necessary if I was going
to do a like-with-like comparison. I also rebased Andrea'a autonuma28fast
branch from his git tree onto 3.7-rc4 (some mess, but nothing very serious).
As before, I'm cutting this report short
[ ... SPECJBB BOPS Mean/Stddev/TPut tables snipped; identical to the
figures in Mel's mail above ... ]
It is important to know how this was configured. I was running one JVM
per node and the JVMs were sized so that they would fit in the node. This
is a semi-ideal configuration because it could also be hard-bound for
best performance on the vanilla kernel. You did not say whether you ran with
a single JVM or multiple JVMs, and it's important.
The mean values are based on the individual throughput figures reported
by each JVM. schednuma regresses against mainline quite badly. For low
numbers of warehouses it also deviates more but it's much steadier for
higher numbers of warehouses. In terms of overall throughput though,
it's worse.
autonuma deviates a *lot* with massive variances between the JVMs.
However, the average and total throughput is very high.
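The relationship between the Mean, Stddev and TPut rows can be sketched as follows. This is an illustrative reduction with made-up per-JVM figures (chosen so that it reproduces the "Mean 1"/"TPut 1" row: Mean is the average over the 4 JVMs and TPut their sum); it is not code taken from MMTests itself.

```python
from statistics import mean, stdev

def summarise(per_jvm_bops):
    """Collapse per-JVM throughput for one warehouse count into the
    Mean/Stddev/TPut rows reported in the tables above."""
    return {
        "Mean": mean(per_jvm_bops),     # average across the JVMs
        "Stddev": stdev(per_jvm_bops),  # deviation between the JVMs
        "TPut": sum(per_jvm_bops),      # total throughput, i.e. Mean * nr_jvms
    }

# Hypothetical per-JVM figures for one warehouse count; they sum to the
# "TPut 1" value (100137.00) and average to the "Mean 1" value (25034.25).
rows = summarise([25000.0, 25100.0, 24950.0, 25087.0])
```

A large Stddev alongside a high Mean is exactly the "deviates a lot between the JVMs but averages high" behaviour described for autonuma above.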
SPECJBB PEAKS
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34 rc4-schednuma-v2r3 rc4-autonuma-v28fast
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 450969.00 ( 0.00%) 406915.00 ( -9.77%) 598184.00 ( 32.64%)
Actual Warehouse 7.00 ( 0.00%) 7.00 ( 0.00%) 8.00 ( 14.29%)
Actual Peak Bops 544296.00 ( 0.00%) 506297.00 ( -6.98%) 646621.00 ( 18.80%)
There is no major difference in terms of scalability. They peak at
around the 7 warehouse mark. autonuma peaked at 8 but you can see from
the figures that it was not by a whole lot. autonuma's actual peak
operations figure was very high (an 18% gain) whereas schednuma regressed
by close to 7%.
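The "Actual" rows in the peaks table can be read as a simple argmax over the per-warehouse throughput, while the "Expected" rows use the warehouse count the benchmark was configured for (12 here). A minimal sketch, using the vanilla-kernel TPut figures from the table above:

```python
def find_peak(tput_by_warehouse):
    """Return the warehouse count with the highest throughput and that
    throughput, i.e. the 'Actual Warehouse'/'Actual Peak Bops' pair."""
    actual_wh = max(tput_by_warehouse, key=tput_by_warehouse.get)
    return actual_wh, tput_by_warehouse[actual_wh]

# Vanilla-kernel TPut figures around the peak (taken from the table above).
vanilla = {6: 543432.0, 7: 544296.0, 8: 529705.0}
wh, bops = find_peak(vanilla)  # -> (7, 544296.0)
```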
MMTests Statistics: duration
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34 rc4-schednuma-v2r3 rc4-autonuma-v28fast
User 101949.84 86817.79 101748.80
System 66.05 13094.99 191.40
Elapsed 2456.35 2459.16 2451.96
System CPU time is high for schednuma. autonuma reports low system CPU
usage but, as it does much of its work in kernel threads, the figure
cannot be considered reliable because that time would not be captured here.
Post by Ingo Molnar
I've also tested Andrea's 'autonumabench' benchmark suite
against vanilla and the NUMA kernel, because Mel reported that
the CONFIG_SCHED_NUMA=y code regressed. It does not regress
#
# NUMA01
#
perf stat --null --repeat 3 ./numa01
v3.7-vanilla: 340.3 seconds ( +/- 0.31% )
v3.7-NUMA: 216.9 seconds [ +56% ] ( +/- 8.32% )
-------------------------------------
v3.7-HARD_BIND: 166.6 seconds
Here the new NUMA code is faster than vanilla by 56% - that is
because with the vanilla kernel all memory is allocated on
node0, overloading that node's memory bandwidth.
[ Standard deviation on the vanilla kernel is low, because the
autonuma test causes close to the worst-case placement for the
vanilla kernel - and there's not much space to deviate away
from the worst-case. Despite that, stddev in the NUMA seems a
tad high, suggesting further room for improvement. ]
For machines with more than 2 nodes, numa01 is an adverse workload.
Post by Ingo Molnar
#
# NUMA01_THREAD_ALLOC
#
perf stat --null --repeat 3 ./numa01_THREAD_ALLOC
v3.7-vanilla: 425.1 seconds ( +/- 1.04% )
v3.7-NUMA: 118.7 seconds [ +250% ] ( +/- 0.49% )
-------------------------------------
v3.7-HARD_BIND: 200.56 seconds
Here the NUMA kernel was able to go beyond the (naive)
hard-binding result and achieved 3.5x the performance of the
vanilla kernel, with a low stddev.
#
# NUMA02
#
perf stat --null --repeat 3 ./numa02
v3.7-vanilla: 56.1 seconds ( +/- 0.72% )
v3.7-NUMA: 17.0 seconds [ +230% ] ( +/- 0.18% )
-------------------------------------
v3.7-HARD_BIND: 14.9 seconds
Here the NUMA kernel runs the test much (3.3x) faster than the
vanilla kernel. The workload is able to converge very quickly
and approximate the hard-binding ideal number very closely. If
runtime was a bit longer it would approximate it even closer.
Standard deviation is also 3 times lower than vanilla,
suggesting stable NUMA convergence.
#
# NUMA02_SMT
#
perf stat --null --repeat 3 ./numa02_SMT
v3.7-vanilla: 56.1 seconds ( +- 0.42% )
v3.7-NUMA: 17.3 seconds [ +220% ] ( +- 0.88% )
-------------------------------------
v3.7-HARD_BIND: 14.6 seconds
In this test too the NUMA kernel outperforms the vanilla kernel,
by a factor of 3.2x. It comes very close to the ideal
hard-binding convergence result. Standard deviation is a bit
high.
With this benchmark, I'm generally seeing very good results in terms of
elapsed time.
AUTONUMA BENCH
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34 rc4-schednuma-v2r3 rc4-autonuma-v28fast
User NUMA01 67351.66 ( 0.00%) 47146.57 ( 30.00%) 30273.64 ( 55.05%)
User NUMA01_THEADLOCAL 54788.28 ( 0.00%) 17198.99 ( 68.61%) 17039.73 ( 68.90%)
User NUMA02 7179.87 ( 0.00%) 2096.07 ( 70.81%) 2099.85 ( 70.75%)
User NUMA02_SMT 3028.11 ( 0.00%) 998.22 ( 67.03%) 1052.97 ( 65.23%)
System NUMA01 45.68 ( 0.00%) 3531.04 (-7629.95%) 423.91 (-828.00%)
System NUMA01_THEADLOCAL 40.92 ( 0.00%) 926.72 (-2164.71%) 188.15 (-359.80%)
System NUMA02 1.72 ( 0.00%) 23.64 (-1274.42%) 27.37 (-1491.28%)
System NUMA02_SMT 0.92 ( 0.00%) 8.18 (-789.13%) 18.43 (-1903.26%)
Elapsed NUMA01 1514.61 ( 0.00%) 1122.78 ( 25.87%) 722.66 ( 52.29%)
Elapsed NUMA01_THEADLOCAL 1264.08 ( 0.00%) 393.79 ( 68.85%) 391.48 ( 69.03%)
Elapsed NUMA02 181.88 ( 0.00%) 49.44 ( 72.82%) 61.55 ( 66.16%)
Elapsed NUMA02_SMT 168.41 ( 0.00%) 47.49 ( 71.80%) 54.72 ( 67.51%)
CPU NUMA01 4449.00 ( 0.00%) 4513.00 ( -1.44%) 4247.00 ( 4.54%)
CPU NUMA01_THEADLOCAL 4337.00 ( 0.00%) 4602.00 ( -6.11%) 4400.00 ( -1.45%)
CPU NUMA02 3948.00 ( 0.00%) 4287.00 ( -8.59%) 3455.00 ( 12.49%)
CPU NUMA02_SMT 1798.00 ( 0.00%) 2118.00 (-17.80%) 1957.00 ( -8.84%)
On NUMA01, I'm seeing a large gain for schednuma. The test was not run
multiple times so I do not know how much it deviates by on each run.
However, the system CPU usage was again very high.
NUMA01_THEADLOCAL figures were comparable with autonuma. The system CPU
usage was high. As before, autonuma's looks low but with the kernel
threads we cannot be sure.
schednuma was a clear winner on NUMA02 and NUMA02_SMT.
So for the synthetic benchmarks, schednuma looks good in terms of
elapsed time. On specjbb though, it is not looking great and this may be
due to differences in how we configured the JVMs.
I would have some comparison data with my own stuff but unfortunately
the machine crashed when running tests with schednuma. That said, I
expect the figures to be bad if they had run. With V2, the CPU-follows
placement policy is broken as is PMD handling. In my current tree I'm
expecting the system CPU usage to be also high but I won't know for sure
until later today.
The machine was meant to test all this overnight but unfortunately when
running a kernel build benchmark on the schednuma patches the machine
hung while downloading the tarball with this
[ 73.863226] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 73.871062] IP: [<ffffffff8146feaa>] skb_gro_receive+0xaa/0x590
[ 73.876983] PGD 0
[ 73.878998] Oops: 0002 [#1] PREEMPT SMP
[ 73.882938] Modules linked in: af_packet mperf kvm_intel coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd sr_mod lrw cdrom aes_x86_64 ses pcspkr xts i7core_edac ata_piix enclosure lpc_ich dcdbas sg gf128mul mfd_core bnx2 edac_core wmi acpi_power_meter button serio_raw joydev microcode autofs4 processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_alua scsi_dh_emc scsi_dh ata_generic megaraid_sas pata_atiixp [last unloaded: oprofile]
[ 73.924659] CPU 0
[ 73.926493] Pid: 0, comm: swapper/0 Not tainted 3.7.0-rc4-schednuma-v2r3 #1 Dell Inc. PowerEdge R810/0TT6JF
[ 73.936380] RIP: 0010:[<ffffffff8146feaa>] [<ffffffff8146feaa>] skb_gro_receive+0xaa/0x590
[ 73.944714] RSP: 0018:ffff88047f803b50 EFLAGS: 00010282
[ 73.950004] RAX: 0000000000000000 RBX: ffff88046c2bdbc0 RCX: 0000000000000900
[ 73.957113] RDX: 00000000000005a8 RSI: ffff88046c2bdbc0 RDI: ffff88046eadb800
[ 73.964221] RBP: ffff88047f803bb0 R08: 00000000000005dc R09: ffff88046ddeccc0
[ 73.971328] R10: ffff88086d795d78 R11: 0000000000000001 R12: ffff880462b282c0
[ 73.978436] R13: 0000000000000034 R14: 00000000000005a8 R15: ffff88046eadbec0
[ 73.985543] FS: 0000000000000000(0000) GS:ffff88047f800000(0000) knlGS:0000000000000000
[ 73.993602] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 73.999326] CR2: 0000000000000000 CR3: 0000000001a0c000 CR4: 00000000000007f0
[ 74.006435] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 74.013543] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 74.020651] Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a14420)
[ 74.030885] 0000000000000060 ffff880462b282c0 ffff88086d795d78 ffffffff000005dc
[ 74.038300] ffff88046e5f46c0 000000606a275ec0 0000000000000000 ffff88046c2bdbc0
[ 74.045715] 00000000000005a8 ffff88086d795d78 00000000000005a8 000000006c001080
[ 74.055567] <IRQ>
[ 74.057486] [<ffffffff814b9573>] tcp_gro_receive+0x213/0x2b0
[ 74.063419] [<ffffffff814cff49>] tcp4_gro_receive+0x99/0x110
[ 74.069150] [<ffffffff814e096d>] inet_gro_receive+0x1cd/0x200
[ 74.074965] [<ffffffff8147b30a>] dev_gro_receive+0x1ba/0x2b0
[ 74.080691] [<ffffffff8147b6e3>] napi_gro_receive+0xe3/0x130
[ 74.086426] [<ffffffffa009fda8>] bnx2_rx_int+0x3e8/0xf10 [bnx2]
[ 74.092416] [<ffffffffa00a0cbd>] bnx2_poll_work+0x3ed/0x450 [bnx2]
[ 74.098666] [<ffffffffa00a0d5e>] bnx2_poll_msix+0x3e/0xc0 [bnx2]
[ 74.104739] [<ffffffff8147b969>] net_rx_action+0x159/0x290
[ 74.110298] [<ffffffff8104d148>] __do_softirq+0xc8/0x250
[ 74.115682] [<ffffffff8107bf9e>] ? sched_clock_idle_wakeup_event+0x1e/0x20
[ 74.122625] [<ffffffff81577c9c>] call_softirq+0x1c/0x30
[ 74.127922] [<ffffffff8100470d>] do_softirq+0x6d/0xa0
[ 74.133041] [<ffffffff8104d44d>] irq_exit+0xad/0xc0
[ 74.137996] [<ffffffff8107779d>] scheduler_ipi+0x5d/0x110
[ 74.143469] [<ffffffff8102b7a4>] ? native_apic_msr_eoi_write+0x14/0x20
[ 74.150060] [<ffffffff810257d5>] smp_reschedule_interrupt+0x25/0x30
[ 74.156394] [<ffffffff8157785d>] reschedule_interrupt+0x6d/0x80
[ 74.162376] <EOI>
[ 74.164295] [<ffffffff81316798>] ? intel_idle+0xe8/0x150
[ 74.169875] [<ffffffff81316779>] ? intel_idle+0xc9/0x150
[ 74.175259] [<ffffffff8143de99>] cpuidle_enter+0x19/0x20
[ 74.180642] [<ffffffff8143e522>] cpuidle_idle_call+0xa2/0x340
[ 74.186458] [<ffffffff8100baca>] cpu_idle+0x7a/0xf0
[ 74.191410] [<ffffffff8154b44b>] rest_init+0x7b/0x80
[ 74.196447] [<ffffffff81ac3be2>] start_kernel+0x38f/0x39c
[ 74.201913] [<ffffffff81ac3652>] ? repair_env_string+0x5e/0x5e
[ 74.207815] [<ffffffff81ac3335>] x86_64_start_reservations+0x131/0x135
[ 74.214407] [<ffffffff81ac3439>] x86_64_start_kernel+0x100/0x10f
[ 74.220475] Code: 8b e8 00 00 00 0f 87 86 00 00 00 8b 53 68 8b 43 6c 44 29 ea 39 d0 89 53 68 0f 87 c7 04 00 00 4c 01 ab e0 00 00 00 49 8b 44 24 08 <48> 89 18 49 89 5c 24 08 0f b6 43 7c a8 10 0f 85 ac 04 00 00 83
[ 74.240051] RIP [<ffffffff8146feaa>] skb_gro_receive+0xaa/0x590
[ 74.246046] RSP <ffff88047f803b50>
[ 74.249518] CR2: 0000000000000000
[ 74.252821] ---[ end trace 97cb529523f52c9b ]---
[ 74.258895] Kernel panic - not syncing: Fatal exception in interrupt
-- 0:console -- time-stamp -- Nov/15/12 3:09:06 --
I've no idea if it is directly related to your patches and I didn't try
to reproduce it yet.
Post by Ingo Molnar
generation tool: 'perf bench numa' (I'll post it later in a
separate reply).
Via 'perf bench numa' we can generate arbitrary process and
thread layouts, with arbitrary memory sharing arrangements
between them.
Here are various comparisons to the vanilla kernel (higher is better):
#
# 4 processes with 4 threads per process, sharing 4x 1GB of
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 1024 -T 0
#
v3.7-vanilla: 14.8 GB/sec
v3.7-NUMA: 32.9 GB/sec [ +122.3% ]
2.2 times faster.
#
# 4 processes with 4 threads per process, sharing 4x 1GB of
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 0 -T 1024
#
v3.7-vanilla: 17.0 GB/sec
v3.7-NUMA: 36.3 GB/sec [ +113.5% ]
2.1 times faster.
That is really cool.
Post by Ingo Molnar
So it's a nice improvement all around. With this version the
regressions that Mel Gorman reported a week ago appear to be
fixed as well.
Unfortunately I cannot concur. I'm still seeing high system CPU usage in
places and the specjbb figures are rather unfortunate.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Mel Gorman
2012-11-15 21:30:04 UTC
Permalink
Post by Rik van Riel
Post by Mel Gorman
Post by Ingo Molnar
Here are some preliminary performance figures, comparing the
vanilla kernel against the CONFIG_SCHED_NUMA=y kernel.
Java SPEC benchmark, running on a 4 node, 64 GB, 32-way server
Ok, I used a 4-node, 64G, 48-way server system. We have different CPUs
but the same number of nodes. In case it makes a difference each of my
machines nodes are the same size.
Mel, do you have info on exactly what model system you
were running these tests on?
Dell PowerEdge R810
CPU Intel(R) Xeon(R) CPU E7- 4807 @ 1.87GHz
RAM 64G
Single disk

4 JVMs, one per node
SpecJBB configured to run in multi JVM configuration
No special binding
JVM switches -Xmx12882m

All run through an unreleased version of MMTests. I'll make a release of
mmtests either tomorrow or Monday when I get the chance.
Post by Rik van Riel
Obviously your results are very different from the ones
that Ingo saw. It would be most helpful if we could find
a similar system in one of the Red Hat labs, so Ingo can
play around with it and see what's going on :)
Also compare how the benchmark is actually configured and which figures
he's reporting. I'm posting up the throughput for each warehouse and the
peak throughput.

It is possible Ingo's figures are based on other patches in the tip tree
that have not been identified. If that's the case it's interesting in
itself.
--
Mel Gorman
SUSE Labs
Linus Torvalds
2012-11-15 20:40:03 UTC
Permalink
Ugh.

According to these numbers, the latest sched-numa actually regresses
against mainline on Specjbb.

No way is this even close to ready for merging in the 3.8 timeframe.

I would ask the involved people to please come up with a set of
initial patches that people agree on, so that we can at least start
merging some of the infrastructure, and see how far we can get on at
least getting *started*. As I mentioned to Andrew and Mel separately,
nobody seems to disagree with the TLB optimization patches. What else?
Is Mel's set of early patches still considered a reasonable starting
point for everybody?

Ingo? Andrea? With the understanding that we're not going to merge the
actual full schednuma/autonuma, what are the initial parts we can
*agree* on?

Linus
Post by Mel Gorman
SPECJBB BOPS
3.7.0 3.7.0 3.7.0
rc4-stats-v2r34 rc4-schednuma-v2r3 rc4-autonuma-v28fast
Mean 1 25034.25 ( 0.00%) 20598.50 (-17.72%) 25192.25 ( 0.63%)
Mean 2 53176.00 ( 0.00%) 43906.50 (-17.43%) 55508.25 ( 4.39%)
Mean 3 77350.50 ( 0.00%) 60342.75 (-21.99%) 82122.50 ( 6.17%)
Mean 4 99919.50 ( 0.00%) 80781.75 (-19.15%) 107233.25 ( 7.32%)
Mean 5 119797.00 ( 0.00%) 97870.00 (-18.30%) 131016.00 ( 9.37%)
Mean 6 135858.00 ( 0.00%) 123912.50 ( -8.79%) 152444.75 ( 12.21%)
Mean 7 136074.00 ( 0.00%) 126574.25 ( -6.98%) 157372.75 ( 15.65%)
Mean 8 132426.25 ( 0.00%) 121766.00 ( -8.05%) 161655.25 ( 22.07%)
Mean 9 129432.75 ( 0.00%) 114224.25 (-11.75%) 160530.50 ( 24.03%)
Mean 10 118399.75 ( 0.00%) 109040.50 ( -7.90%) 158692.00 ( 34.03%)
Mean 11 119604.00 ( 0.00%) 105566.50 (-11.74%) 154462.00 ( 29.14%)
Mean 12 112742.25 ( 0.00%) 101728.75 ( -9.77%) 149546.00 ( 32.64%)
Mean 13 109480.75 ( 0.00%) 103737.50 ( -5.25%) 144929.25 ( 32.38%)
Mean 14 109724.00 ( 0.00%) 103516.00 ( -5.66%) 143804.50 ( 31.06%)
Mean 15 109111.75 ( 0.00%) 100817.00 ( -7.60%) 141878.00 ( 30.03%)
Mean 16 105385.75 ( 0.00%) 99327.25 ( -5.75%) 140156.75 ( 32.99%)
Mean 17 101903.50 ( 0.00%) 96464.50 ( -5.34%) 138402.00 ( 35.82%)
Mean 18 103632.50 ( 0.00%) 95632.50 ( -7.72%) 137781.50 ( 32.95%)
Rik van Riel
2012-11-15 22:10:02 UTC
Permalink
Post by Linus Torvalds
Ugh.
According to these numbers, the latest sched-numa actually regresses
against mainline on Specjbb.
No way is this even close to ready for merging in the 3.8 timeframe.
I would ask the involved people to please come up with a set of
initial patches that people agree on, so that we can at least start
merging some of the infrastructure, and see how far we can get on at
least getting *started*. As I mentioned to Andrew and Mel separately,
nobody seems to disagree with the TLB optimization patches. What else?
Is Mel's set of early patches still considered a reasonable starting
point for everybody?
Mel's infrastructure patches, 1-14 and 17 out
of his latest series, could be a great starting
point.

Ingo is trying to get the mm/ code in his tree
to be mostly the same to Mel's code anyway, so
that is the infrastructure everybody wants.

At that point, we can focus our discussions on
just the policy side, which could help us zoom in
on the issues.

It would also make it possible for us to do apple
to apple comparisons between the various policy
decisions, allowing us to reach a decision based
on data, not just gut feel.

As long as each tree has its own basic infrastructure,
we cannot do apples to apples comparisons; this has
frustrated the discussion for months.

Having all that basic infrastructure upstream should
short-circuit that part of the discussion.

Mel Gorman
2012-11-16 14:20:01 UTC
Permalink
Post by Rik van Riel
Post by Linus Torvalds
Ugh.
According to these numbers, the latest sched-numa actually regresses
against mainline on Specjbb.
No way is this even close to ready for merging in the 3.8 timeframe.
I would ask the involved people to please come up with a set of
initial patches that people agree on, so that we can at least start
merging some of the infrastructure, and see how far we can get on at
least getting *started*. As I mentioned to Andrew and Mel separately,
nobody seems to disagree with the TLB optimization patches. What else?
Is Mel's set of early patches still considered a reasonable starting
point for everybody?
Mel's infrastructure patches, 1-14 and 17 out
of his latest series, could be a great starting
point.
V3 increased a lot in size due to the rate-limiting of migration which was
yanked out of autonuma. The rate limiting has two obvious purposes. One,
during periods of fast convergence it will prevent the memory bus from being
saturated with traffic and causing stalls. As a side effect it should
decrease system CPU usage in some cases. Two, if the placement policy
completely breaks down, it will help contain the damage. If we added a vmstat
that increments when the rate limiting kicked in then users could report
broken policies by checking if the migration and rate-limited counter are
increasing. If they are both increasing rapidly then the placement policy
is broken. I think identifying when it's broken is just as important as
identifying when it's working.
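The windowed rate limiting described above, together with the suggested counter for spotting a broken placement policy, can be sketched in userspace terms as follows. The class name, window sizes and counter name are illustrative assumptions, not taken from the actual patches:

```python
class MigrationRateLimiter:
    """Allow at most pages_per_window pages to migrate per window_len
    seconds, and count how often the limit kicks in (the proposed
    vmstat-style statistic)."""

    def __init__(self, pages_per_window, window_len):
        self.pages_per_window = pages_per_window
        self.window_len = window_len
        self.window_start = 0.0
        self.pages_this_window = 0
        self.nr_ratelimited = 0  # rising rapidly => policy may be broken

    def try_migrate(self, now, nr_pages):
        # Open a fresh window once the current one has expired.
        if now - self.window_start >= self.window_len:
            self.window_start = now
            self.pages_this_window = 0
        if self.pages_this_window + nr_pages > self.pages_per_window:
            self.nr_ratelimited += 1  # caller defers or skips the migration
            return False
        self.pages_this_window += nr_pages
        return True
```

Watching nr_ratelimited alongside the raw migration count gives the "both increasing rapidly" signal described above.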

The equivalent numbered patches in the new series to match what Rik suggests
above are Patches 1-17, 19. I'll swap patches 19 and 18 to avoid this mess.
The TLB patches are 33-35 but are not contested. I am going to move them
to the start of the series.

With some shuffling the question on what to consider for merging
becomes

1. TLB optimisation patches 1-3? Patches 1-3
2. Stats for migration? Patches 4-6
3. Common NUMA infrastructure? Patches 7-21
4. Basic fault-driven policy, stats, ratelimits Patches 22-35

Patches 36-43 are complete cabbage and should not be considered at this
stage. It should be possible to build the placement policies and the
scheduling decisions from schednuma, autonuma, some combination of the
above or something completely different on top of patches 1-35.

Peter, Ingo, Andrea?

I know that other common patches should exist, but they are
optimisations to the policies and not a fundamental design choice.
Post by Rik van Riel
Ingo is trying to get the mm/ code in his tree
to be mostly the same to Mel's code anyway, so
that is the infrastructure everybody wants.
At that point, we can focus our discussions on
just the policy side, which could help us zoom in
on the issues.
Preferably yes and we'd have a comparison points of mainline and the most
basic of placement policies to work with that should be bisectable as a
last resort.
Post by Rik van Riel
It would also make it possible for us to do apple
to apple comparisons between the various policy
decisions, allowing us to reach a decision based
on data, not just gut feel.
As long as each tree has its own basic infrastructure,
we cannot do apples to apples comparisons; this has
frustrated the discussion for months.
Having all that basic infrastructure upstream should
short-circuit that part of the discussion.
Agreed.
--
Mel Gorman
SUSE Labs
Andrea Arcangeli
2012-11-16 20:00:02 UTC
Permalink
Hi,
Post by Mel Gorman
With some shuffling the question on what to consider for merging
becomes
1. TLB optimisation patches 1-3? Patches 1-3
I assume you mean simply reshuffling 33-35 as 1-3.
Post by Mel Gorman
2. Stats for migration? Patches 4-6
3. Common NUMA infrastructure? Patches 7-21
4. Basic fault-driven policy, stats, ratelimits Patches 22-35
Patches 36-43 are complete cabbage and should not be considered at this
stage. It should be possible to build the placement policies and the
scheduling decisions from schednuma, autonuma, some combination of the
above or something completely different on top of patches 1-35.
Peter, Ingo, Andrea?
Patches 1-35 look like a great foundation, so I think they'd be an
ideal candidate for a first upstream inclusion.

Thanks,
Andrea
Mel Gorman
2012-11-16 20:10:02 UTC
Permalink
Post by Andrea Arcangeli
Hi,
Post by Mel Gorman
With some shuffling the question on what to consider for merging
becomes
1. TLB optimisation patches 1-3? Patches 1-3
I assume you mean simply reshuffling 33-35 as 1-3.
Yes.
Post by Andrea Arcangeli
Post by Mel Gorman
2. Stats for migration? Patches 4-6
3. Common NUMA infrastructure? Patches 7-21
4. Basic fault-driven policy, stats, ratelimits Patches 22-35
Patches 36-43 are complete cabbage and should not be considered at this
stage. It should be possible to build the placement policies and the
scheduling decisions from schednuma, autonuma, some combination of the
above or something completely different on top of patches 1-35.
Peter, Ingo, Andrea?
The patches 1-35 looks a great foundation so I think they'd be an
ideal candidate for a first upstream inclusion.
Thanks.
--
Mel Gorman
SUSE Labs
Ingo Molnar
2012-11-16 16:20:01 UTC
Permalink
[...]
I would ask the involved people to please come up with a set
of initial patches that people agree on, so that we can at
least start merging some of the infrastructure, and see how
far we can get on at least getting *started*.
That would definitely be a step forward.
[...] As I mentioned to Andrew and Mel separately, nobody
seems to disagree with the TLB optimization patches. What
else? Is Mel's set of early patches still considered a
reasonable starting point for everybody?
My suggestion for a 'foundation' would be all the non-policy
bits in numa/core:

c740b1cccdcb x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
02743c9c03f1 mm/mpol: Use special PROT_NONE to migrate pages
b33467764d8a mm/migrate: Introduce migrate_misplaced_page()
db4aa58db59a numa, mm: Support NUMA hinting page faults from gup/gup_fast
ca2ea0747a5b mm/mpol: Add MPOL_MF_LAZY
f05ea0948708 mm/mpol: Create special PROT_NONE infrastructure
37081a3de2bf mm/mpol: Check for misplaced page
cd203e33c39d mm/mpol: Add MPOL_MF_NOOP
88f4670789e3 mm/mpol: Make MPOL_LOCAL a real policy
83babc0d2944 mm/pgprot: Move the pgprot_modify() fallback definition to mm.h
536165ead34b sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation
6fe64360a759 mm: Only flush the TLB when clearing an accessible pte
e9df40bfeb25 x86/mm: Introduce pte_accessible()
3f2b613771ec mm/thp: Preserve pgprot across huge page split
a5a608d83e0e sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
995334a2ee83 sched, numa, mm: Describe the NUMA scheduling problem formally
7ee9d9209c57 sched, numa, mm: Make find_busiest_queue() a method
4fd98847ba5c x86/mm: Only do a local tlb flush in ptep_set_access_flags()
d24fc0571afb mm/generic: Only flush the local TLB in ptep_set_access_flags()

Which I've pushed out into the separate numa/base tree:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/base

These are just the minimal set of patches needed to get to be
able to concentrate on the real details.

AFAICS Mel started going in this design direction as well in his
latest patches, so there should be no real technical objections
to this other than any details I might have missed: and I'll
rebase this tree if the mm/ folks have any other suggestions for
improvement, as that seems the be the preferred mm workflow.

Andrea, Mel?

Getting this out of the way would be a big help.

Thanks,

Ingo
Ingo Molnar
2012-11-16 16:00:03 UTC
Permalink
Post by Mel Gorman
It is important to know how this was configured. I was running
one JVM per node and the JVMs were sized so that they would fit
in the node. [...]
That is not what I tested: as I described it in the mail I
tested 32 warehouses: i.e. spanning the whole system.

You tested 4 parallel JVMs running one per node, right?

Thanks,

Ingo
Mel Gorman
2012-11-16 16:30:03 UTC
Permalink
Post by Ingo Molnar
Post by Mel Gorman
It is important to know how this was configured. I was running
one JVM per node and the JVMs were sized so that they would fit
in the node. [...]
That is not what I tested: as I described it in the mail I
tested 32 warehouses: i.e. spanning the whole system.
Good (sortof) because that's my preferred explanation as to why we are
seeing different results. Different machines and different kernels would
be a lot more problematic.
Post by Ingo Molnar
You tested 4 parallel JVMs running one per node, right?
4 parallel JVMs sized so that they could fit one-per-node. However, I did
*not* bind them to nodes because that would be completely pointless for
this type of test.

I've queued up another set of tests and added a single-JVM configuration
to the mix. The kernels will have debugging, lockstat enabled and will
be running two passes with the second pass running profiling so the
results will not be directly comparable. However, I'll keep a close eye
on the Single vs Multi JVM results.

Thanks.
--
Mel Gorman
SUSE Labs
Ingo Molnar
2012-11-16 17:50:02 UTC
Permalink
Post by Ingo Molnar
Post by Mel Gorman
It is important to know how this was configured. I was running
one JVM per node and the JVMs were sized that they should fit
in the node. [...]
That is not what I tested: as I described it in the mail I
tested 32 warehouses: i.e. spanning the whole system.
Good (sortof) [...]
Not just 'sortof' good but it appears it's unconditionally good:
meanwhile other testers have reproduced the single-JVM speedup
with the latest numa/core code as well, so the speedup is not
just on my system.

Please post your kernel .config so I can check why the 4x JVM
test does not perform so well on your system. Maybe there's
something special to your system.

Thanks,

Ingo
Mel Gorman
2012-11-16 19:10:01 UTC
Permalink
Post by Ingo Molnar
Post by Ingo Molnar
Post by Mel Gorman
It is important to know how this was configured. I was running
one JVM per node and the JVMs were sized that they should fit
in the node. [...]
That is not what I tested: as I described it in the mail I
tested 32 warehouses: i.e. spanning the whole system.
Good (sortof) [...]
I was referring to the very strong likelihood that this was our major
source of difference. I would not go as far as "unconditionally good" but
I'm very happy that this was the major difference so it was "sortof good"
:) The alternatives completely sucked.
Post by Ingo Molnar
meanwhile other testers have reproduced the single-JVM speedup
with the latest numa/core code as well, so the speedup is not
just on my system.
Good. Did they also test kernel building? That showed around a 50%
regression for me, as posted elsewhere. It's for catching things like this
that I wanted the logical progression, so we can catch exactly where problems
were introduced and figure out what was missed each time.
Post by Ingo Molnar
Please post your kernel .config so I can check why the 4x JVM
test does not perform so well on your system. Maybe there's
something special to your system.
I do not store configs from tests but this is the config that should
have been generated for the test.

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 3.7.0-rc4 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_FHANDLE=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
# CONFIG_AUDIT_LOGINUID_IMMUTABLE is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_CHIP=y
CONFIG_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_RCU_FAST_NO_HZ=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=1
CONFIG_RCU_BOOST_DELAY=500
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=18
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
# CONFIG_MEMCG is not set
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EXPERT is not set
CONFIG_HAVE_UID16=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
# CONFIG_COMPAT_BRK is not set
CONFIG_SLAB=y
# CONFIG_SLUB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=m
# CONFIG_OPROFILE_EVENT_MULTIPLEX is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_GENERIC_KERNEL_THREAD=y
CONFIG_GENERIC_KERNEL_EXECVE=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_RCU_USER_QS=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_MODULES_USE_ELF_RELA=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
# CONFIG_MODULE_SIG is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_THROTTLING=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
# CONFIG_AMIGA_PARTITION is not set
CONFIG_ATARI_PARTITION=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
CONFIG_ULTRIX_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_SYSV68_PARTITION=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_PADATA=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
CONFIG_X86_UV=y
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
# CONFIG_XEN is not set
# CONFIG_XEN_PRIVILEGED_GUEST is not set
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
CONFIG_MEMTEST=y
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=512
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_I8K=m
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=m
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
CONFIG_NUMA_EMU=y
CONFIG_NODES_SHIFT=9
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_MEMORY_PROBE=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_HWPOISON_INJECT=m
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_CLEANCACHE=y
# CONFIG_FRONTSWAP is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
# CONFIG_KEXEC_JUMP is not set
CONFIG_PHYSICAL_START=0x200000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM_RUNTIME=y
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_EC_DEBUGFS=m
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=m
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_IPMI=m
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_PROCESSOR_AGGREGATOR=m
CONFIG_ACPI_THERMAL=m
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_CUSTOM_DSDT_FILE=""
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
CONFIG_ACPI_DEBUG=y
# CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
CONFIG_ACPI_PCI_SLOT=m
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=m
CONFIG_ACPI_HOTPLUG_MEMORY=m
CONFIG_ACPI_SBS=m
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
CONFIG_ACPI_APEI_EINJ=m
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_SFI is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# x86 CPU frequency scaling drivers
#
# CONFIG_X86_PCC_CPUFREQ is not set
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=m
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y

#
# Memory power savings
#
CONFIG_I7300_IDLE_IOAT_CHANNEL=y
CONFIG_I7300_IDLE=m

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
# CONFIG_PCIE_ECRC is not set
CONFIG_PCIEAER_INJECT=m
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_IOAPIC=m
CONFIG_PCI_LABEL=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# CONFIG_PCCARD is not set
CONFIG_HOTPLUG_PCI=m
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
CONFIG_HOTPLUG_PCI_CPCI=y
CONFIG_HOTPLUG_PCI_CPCI_ZT5550=m
CONFIG_HOTPLUG_PCI_CPCI_GENERIC=m
CONFIG_HOTPLUG_PCI_SHPC=m
# CONFIG_RAPIDIO is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=m
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
CONFIG_IA32_AOUT=m
CONFIG_X86_X32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=m
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_DIAG=m
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=m
CONFIG_XFRM_USER=m
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
# CONFIG_XFRM_STATISTICS is not set
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_FIB_TRIE_STATS is not set
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_ROUTE_CLASSID=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_IP_PNP_RARP=y
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_NET_IPVTI is not set
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_INET_UDP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
CONFIG_TCP_CONG_YEAH=m
CONFIG_TCP_CONG_ILLINOIS=m
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_IPV6_MIP6=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_IPV6_SIT=m
# CONFIG_IPV6_SIT_6RD is not set
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_GRE is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_ACCT=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_ZONES=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CONNTRACK_TIMEOUT=y
CONFIG_NF_CONNTRACK_TIMESTAMP=y
CONFIG_NF_CT_PROTO_DCCP=m
CONFIG_NF_CT_PROTO_GRE=m
CONFIG_NF_CT_PROTO_SCTP=m
CONFIG_NF_CT_PROTO_UDPLITE=m
CONFIG_NF_CONNTRACK_AMANDA=m
CONFIG_NF_CONNTRACK_FTP=m
CONFIG_NF_CONNTRACK_H323=m
CONFIG_NF_CONNTRACK_IRC=m
CONFIG_NF_CONNTRACK_BROADCAST=m
CONFIG_NF_CONNTRACK_NETBIOS_NS=m
CONFIG_NF_CONNTRACK_SNMP=m
CONFIG_NF_CONNTRACK_PPTP=m
CONFIG_NF_CONNTRACK_SANE=m
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
CONFIG_NF_CT_NETLINK=m
CONFIG_NF_CT_NETLINK_TIMEOUT=m
# CONFIG_NETFILTER_NETLINK_QUEUE_CT is not set
CONFIG_NETFILTER_TPROXY=m
CONFIG_NETFILTER_XTABLES=m

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m
CONFIG_NETFILTER_XT_CONNMARK=m
CONFIG_NETFILTER_XT_SET=m

#
# Xtables targets
#
CONFIG_NETFILTER_XT_TARGET_AUDIT=m
CONFIG_NETFILTER_XT_TARGET_CHECKSUM=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=m
CONFIG_NETFILTER_XT_TARGET_CT=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_HL=m
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
CONFIG_NETFILTER_XT_TARGET_IDLETIMER=m
CONFIG_NETFILTER_XT_TARGET_LED=m
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
CONFIG_NETFILTER_XT_TARGET_TEE=m
CONFIG_NETFILTER_XT_TARGET_TPROXY=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE=m
CONFIG_NETFILTER_XT_MATCH_CLUSTER=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_CPU=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DEVGROUP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ECN=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_HL=m
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_IPVS=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_NFACCT=m
CONFIG_NETFILTER_XT_MATCH_OSF=m
CONFIG_NETFILTER_XT_MATCH_OWNER=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_RECENT=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_SOCKET=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
CONFIG_IP_SET=m
CONFIG_IP_SET_MAX=256
CONFIG_IP_SET_BITMAP_IP=m
CONFIG_IP_SET_BITMAP_IPMAC=m
CONFIG_IP_SET_BITMAP_PORT=m
CONFIG_IP_SET_HASH_IP=m
CONFIG_IP_SET_HASH_IPPORT=m
CONFIG_IP_SET_HASH_IPPORTIP=m
CONFIG_IP_SET_HASH_IPPORTNET=m
CONFIG_IP_SET_HASH_NET=m
CONFIG_IP_SET_HASH_NETPORT=m
CONFIG_IP_SET_HASH_NETIFACE=m
CONFIG_IP_SET_LIST_SET=m
CONFIG_IP_VS=m
CONFIG_IP_VS_IPV6=y
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y
CONFIG_IP_VS_PROTO_SCTP=y

#
# IPVS scheduler
#
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m

#
# IPVS SH scheduler
#
CONFIG_IP_VS_SH_TAB_BITS=8

#
# IPVS application helper
#
CONFIG_IP_VS_NFCT=y
CONFIG_IP_VS_PE_SIP=m

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
# CONFIG_NF_CONNTRACK_PROC_COMPAT is not set
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_RPFILTER=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_ULOG=m
# CONFIG_NF_NAT_IPV4 is not set
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_SECURITY=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV6=m
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RPFILTER=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_RAW=m
CONFIG_IP6_NF_SECURITY=m
# CONFIG_NF_NAT_IPV6 is not set
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
CONFIG_BRIDGE_EBT_IP6=m
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_ULOG=m
CONFIG_BRIDGE_EBT_NFLOG=m
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP_CCID2_DEBUG is not set
CONFIG_IP_DCCP_CCID3=y
# CONFIG_IP_DCCP_CCID3_DEBUG is not set
CONFIG_IP_DCCP_TFRC_LIB=y

#
# DCCP Kernel Hacking
#
# CONFIG_IP_DCCP_DEBUG is not set
# CONFIG_NET_DCCPPROBE is not set
CONFIG_IP_SCTP=m
CONFIG_NET_SCTPPROBE=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
CONFIG_RDS=m
CONFIG_RDS_TCP=m
# CONFIG_RDS_DEBUG is not set
# CONFIG_TIPC is not set
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
# CONFIG_ATM_CLIP_NO_ICMP is not set
CONFIG_ATM_LANE=m
CONFIG_ATM_MPOA=m
CONFIG_ATM_BR2684=m
# CONFIG_ATM_BR2684_IPFILTER is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_GARP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_DSA=y
# CONFIG_NET_DSA_TAG_DSA is not set
# CONFIG_NET_DSA_TAG_EDSA is not set
# CONFIG_NET_DSA_TAG_TRAILER is not set
CONFIG_VLAN_8021Q=m
CONFIG_VLAN_8021Q_GVRP=y
# CONFIG_DECNET is not set
CONFIG_LLC=m
CONFIG_LLC2=m
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
CONFIG_IEEE802154=m
CONFIG_IEEE802154_6LOWPAN=m
# CONFIG_MAC802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_MULTIQ=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFB=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_DRR=m
CONFIG_NET_SCH_MQPRIO=m
CONFIG_NET_SCH_CHOKE=m
CONFIG_NET_SCH_QFQ=m
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_PLUG=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_FLOW=m
CONFIG_NET_CLS_CGROUP=y
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
# CONFIG_NET_EMATCH_IPSET is not set
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_NAT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
CONFIG_NET_ACT_SKBEDIT=m
CONFIG_NET_ACT_CSUM=m
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
CONFIG_OPENVSWITCH=m
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
CONFIG_NETPRIO_CGROUP=m
CONFIG_BQL=y
CONFIG_BPF_JIT=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
CONFIG_NET_TCPPROBE=m
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
CONFIG_RFKILL=m
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
CONFIG_CEPH_LIB=m
CONFIG_CEPH_LIB_PRETTYDEBUG=y
# CONFIG_CEPH_LIB_USE_DNS_RESOLVER is not set
CONFIG_NFC=m
CONFIG_NFC_NCI=m
# CONFIG_NFC_HCI is not set
CONFIG_NFC_LLCP=y

#
# Near Field Communication (NFC) devices
#
CONFIG_NFC_PN533=m
CONFIG_NFC_WILINK=m
CONFIG_HAVE_BPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
# CONFIG_STANDALONE is not set
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=m
CONFIG_DMA_SHARED_BUFFER=y

#
# Bus devices
#
# CONFIG_OMAP_OCP2SCP is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
CONFIG_BLK_DEV_PCIESSD_MTIP32XX=m
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLK_DEV_CRYPTOLOOP=m
# CONFIG_BLK_DEV_DRBD is not set
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_BLK_DEV_OSD is not set
CONFIG_BLK_DEV_SX8=m
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=131072
CONFIG_BLK_DEV_XIP=y
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
CONFIG_CDROM_PKTCDVD_WCACHE=y
CONFIG_ATA_OVER_ETH=m
CONFIG_VIRTIO_BLK=m
# CONFIG_BLK_DEV_HD is not set
CONFIG_BLK_DEV_RBD=m

#
# Misc devices
#
CONFIG_SENSORS_LIS3LV02D=m
CONFIG_AD525X_DPOT=m
CONFIG_AD525X_DPOT_I2C=m
CONFIG_AD525X_DPOT_SPI=m
CONFIG_IBM_ASM=m
CONFIG_PHANTOM=m
# CONFIG_INTEL_MID_PTI is not set
CONFIG_SGI_IOC4=m
CONFIG_TIFM_CORE=m
CONFIG_TIFM_7XX1=m
CONFIG_ICS932S401=m
CONFIG_ENCLOSURE_SERVICES=m
CONFIG_SGI_XP=m
CONFIG_HP_ILO=m
CONFIG_SGI_GRU=m
# CONFIG_SGI_GRU_DEBUG is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
CONFIG_ISL29020=m
CONFIG_SENSORS_TSL2550=m
CONFIG_SENSORS_BH1780=m
CONFIG_SENSORS_BH1770=m
CONFIG_SENSORS_APDS990X=m
CONFIG_HMC6352=m
CONFIG_DS1682=m
CONFIG_TI_DAC7512=m
CONFIG_VMWARE_BALLOON=m
# CONFIG_BMP085_I2C is not set
# CONFIG_BMP085_SPI is not set
CONFIG_PCH_PHUB=m
CONFIG_USB_SWITCH_FSA9480=m
CONFIG_C2PORT=m
CONFIG_C2PORT_DURAMAR_2150=m

#
# EEPROM support
#
CONFIG_EEPROM_AT24=m
CONFIG_EEPROM_AT25=m
CONFIG_EEPROM_LEGACY=m
CONFIG_EEPROM_MAX6875=m
CONFIG_EEPROM_93CX6=m
CONFIG_EEPROM_93XX46=m
CONFIG_CB710_CORE=m
# CONFIG_CB710_DEBUG is not set
CONFIG_CB710_DEBUG_ASSUMPTIONS=y

#
# Texas Instruments shared transport line discipline
#
CONFIG_TI_ST=m
CONFIG_SENSORS_LIS3_I2C=m

#
# Altera FPGA firmware download module
#
CONFIG_ALTERA_STAPL=m
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_ENCLOSURE=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_FC_TGT_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
CONFIG_SCSI_SRP_ATTRS=m
CONFIG_SCSI_SRP_TGT_ATTRS=y
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_ISCSI_BOOT_SYSFS=m
CONFIG_SCSI_CXGB3_ISCSI=m
CONFIG_SCSI_CXGB4_ISCSI=m
CONFIG_SCSI_BNX2_ISCSI=m
CONFIG_SCSI_BNX2X_FCOE=m
CONFIG_BE2ISCSI=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_HPSA=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_3W_SAS=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=32
CONFIG_AIC79XX_RESET_DELAY_MS=5000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
CONFIG_AIC79XX_REG_PRETTY_PRINT=y
CONFIG_SCSI_AIC94XX=m
# CONFIG_AIC94XX_DEBUG is not set
CONFIG_SCSI_MVSAS=m
# CONFIG_SCSI_MVSAS_DEBUG is not set
CONFIG_SCSI_MVSAS_TASKLET=y
CONFIG_SCSI_MVUMI=m
CONFIG_SCSI_DPT_I2O=m
CONFIG_SCSI_ADVANSYS=m
CONFIG_SCSI_ARCMSR=m
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_MPT2SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
# CONFIG_SCSI_MPT2SAS_LOGGING is not set
CONFIG_SCSI_UFSHCD=m
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
CONFIG_VMWARE_PVSCSI=m
CONFIG_HYPERV_STORAGE=m
CONFIG_LIBFC=m
CONFIG_LIBFCOE=m
CONFIG_FCOE=m
CONFIG_FCOE_FNIC=m
CONFIG_SCSI_DMX3191D=m
CONFIG_SCSI_EATA=m
CONFIG_SCSI_EATA_TAGGED_QUEUE=y
CONFIG_SCSI_EATA_LINKED_COMMANDS=y
CONFIG_SCSI_EATA_MAX_TAGS=16
CONFIG_SCSI_FUTURE_DOMAIN=m
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_ISCI=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_STEX=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
CONFIG_SCSI_IPR=m
CONFIG_SCSI_IPR_TRACE=y
CONFIG_SCSI_IPR_DUMP=y
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
# CONFIG_TCM_QLA2XXX is not set
CONFIG_SCSI_QLA_ISCSI=m
CONFIG_SCSI_LPFC=m
# CONFIG_SCSI_LPFC_DEBUG_FS is not set
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
CONFIG_SCSI_DEBUG=m
CONFIG_SCSI_PMCRAID=m
CONFIG_SCSI_PM8001=m
CONFIG_SCSI_SRP=m
CONFIG_SCSI_BFA_FC=m
CONFIG_SCSI_VIRTIO=m
CONFIG_SCSI_DH=m
CONFIG_SCSI_DH_RDAC=m
CONFIG_SCSI_DH_HP_SW=m
CONFIG_SCSI_DH_EMC=m
CONFIG_SCSI_DH_ALUA=m
CONFIG_SCSI_OSD_INITIATOR=m
CONFIG_SCSI_OSD_ULD=m
CONFIG_SCSI_OSD_DPRINT_SENSE=1
# CONFIG_SCSI_OSD_DEBUG is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_AHCI_PLATFORM=m
CONFIG_SATA_INIC162X=m
CONFIG_SATA_ACARD_AHCI=m
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_SX4=m
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=m
# CONFIG_SATA_HIGHBANK is not set
CONFIG_SATA_MV=m
CONFIG_SATA_NV=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIS=m
CONFIG_SATA_SVW=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m

#
# PATA SFF controllers with BMDMA
#
CONFIG_PATA_ALI=m
CONFIG_PATA_AMD=m
CONFIG_PATA_ARASAN_CF=m
CONFIG_PATA_ARTOP=m
CONFIG_PATA_ATIIXP=m
CONFIG_PATA_ATP867X=m
CONFIG_PATA_CMD64X=m
CONFIG_PATA_CS5520=m
CONFIG_PATA_CS5530=m
CONFIG_PATA_CS5536=m
CONFIG_PATA_CYPRESS=m
CONFIG_PATA_EFAR=m
CONFIG_PATA_HPT366=m
CONFIG_PATA_HPT37X=m
CONFIG_PATA_HPT3X2N=m
CONFIG_PATA_HPT3X3=m
# CONFIG_PATA_HPT3X3_DMA is not set
CONFIG_PATA_IT8213=m
CONFIG_PATA_IT821X=m
CONFIG_PATA_JMICRON=m
CONFIG_PATA_MARVELL=m
CONFIG_PATA_NETCELL=m
CONFIG_PATA_NINJA32=m
CONFIG_PATA_NS87415=m
CONFIG_PATA_OLDPIIX=m
CONFIG_PATA_OPTIDMA=m
CONFIG_PATA_PDC2027X=m
CONFIG_PATA_PDC_OLD=m
CONFIG_PATA_RADISYS=m
CONFIG_PATA_RDC=m
CONFIG_PATA_SC1200=m
CONFIG_PATA_SCH=m
CONFIG_PATA_SERVERWORKS=m
CONFIG_PATA_SIL680=m
CONFIG_PATA_SIS=m
CONFIG_PATA_TOSHIBA=m
CONFIG_PATA_TRIFLEX=m
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=m

#
# PIO-only SFF controllers
#
CONFIG_PATA_CMD640_PCI=m
CONFIG_PATA_MPIIX=m
CONFIG_PATA_NS87410=m
CONFIG_PATA_OPTI=m
CONFIG_PATA_RZ1000=m

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=m
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
# CONFIG_MULTICORE_RAID456 is not set
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=m
CONFIG_DM_BIO_PRISON=m
CONFIG_DM_PERSISTENT_DATA=m
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_THIN_PROVISIONING=m
# CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is not set
CONFIG_DM_MIRROR=m
CONFIG_DM_RAID=m
CONFIG_DM_LOG_USERSPACE=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_QL=m
CONFIG_DM_MULTIPATH_ST=m
CONFIG_DM_DELAY=m
CONFIG_DM_UEVENT=y
CONFIG_DM_FLAKEY=m
CONFIG_DM_VERITY=m
CONFIG_TARGET_CORE=m
CONFIG_TCM_IBLOCK=m
CONFIG_TCM_FILEIO=m
CONFIG_TCM_PSCSI=m
CONFIG_LOOPBACK_TARGET=m
CONFIG_TCM_FC=m
CONFIG_ISCSI_TARGET=m
# CONFIG_SBP_TARGET is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_SBP2=m
CONFIG_FIREWIRE_NET=m
CONFIG_FIREWIRE_NOSY=m
CONFIG_I2O=m
CONFIG_I2O_LCT_NOTIFY_ON_CHANGES=y
CONFIG_I2O_EXT_ADAPTEC=y
CONFIG_I2O_EXT_ADAPTEC_DMA64=y
CONFIG_I2O_CONFIG=m
CONFIG_I2O_CONFIG_OLD_IOCTL=y
CONFIG_I2O_BUS=m
CONFIG_I2O_BLOCK=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
CONFIG_BONDING=m
CONFIG_DUMMY=m
CONFIG_EQUALIZER=m
CONFIG_NET_FC=y
CONFIG_MII=y
CONFIG_IFB=m
CONFIG_NET_TEAM=m
# CONFIG_NET_TEAM_MODE_BROADCAST is not set
CONFIG_NET_TEAM_MODE_ROUNDROBIN=m
CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=m
# CONFIG_NET_TEAM_MODE_LOADBALANCE is not set
CONFIG_MACVLAN=m
CONFIG_MACVTAP=m
# CONFIG_VXLAN is not set
CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_TUN=m
CONFIG_VETH=m
CONFIG_VIRTIO_NET=m
# CONFIG_ARCNET is not set
CONFIG_ATM_DRIVERS=y
CONFIG_ATM_DUMMY=m
CONFIG_ATM_TCP=m
CONFIG_ATM_LANAI=m
CONFIG_ATM_ENI=m
# CONFIG_ATM_ENI_DEBUG is not set
CONFIG_ATM_ENI_TUNE_BURST=y
CONFIG_ATM_ENI_BURST_TX_16W=y
CONFIG_ATM_ENI_BURST_TX_8W=y
CONFIG_ATM_ENI_BURST_TX_4W=y
CONFIG_ATM_ENI_BURST_TX_2W=y
CONFIG_ATM_ENI_BURST_RX_16W=y
CONFIG_ATM_ENI_BURST_RX_8W=y
CONFIG_ATM_ENI_BURST_RX_4W=y
CONFIG_ATM_ENI_BURST_RX_2W=y
CONFIG_ATM_FIRESTREAM=m
CONFIG_ATM_ZATM=m
# CONFIG_ATM_ZATM_DEBUG is not set
CONFIG_ATM_NICSTAR=m
CONFIG_ATM_NICSTAR_USE_SUNI=y
CONFIG_ATM_NICSTAR_USE_IDT77105=y
CONFIG_ATM_IDT77252=m
# CONFIG_ATM_IDT77252_DEBUG is not set
# CONFIG_ATM_IDT77252_RCV_ALL is not set
CONFIG_ATM_IDT77252_USE_SUNI=y
CONFIG_ATM_AMBASSADOR=m
# CONFIG_ATM_AMBASSADOR_DEBUG is not set
CONFIG_ATM_HORIZON=m
# CONFIG_ATM_HORIZON_DEBUG is not set
CONFIG_ATM_IA=m
# CONFIG_ATM_IA_DEBUG is not set
CONFIG_ATM_FORE200E=m
CONFIG_ATM_FORE200E_USE_TASKLET=y
CONFIG_ATM_FORE200E_TX_RETRY=16
CONFIG_ATM_FORE200E_DEBUG=0
CONFIG_ATM_HE=m
CONFIG_ATM_HE_USE_SUNI=y
CONFIG_ATM_SOLOS=m

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
# CONFIG_NET_DSA_MV88E6XXX is not set
# CONFIG_NET_DSA_MV88E6060 is not set
# CONFIG_NET_DSA_MV88E6XXX_NEED_PPU is not set
# CONFIG_NET_DSA_MV88E6131 is not set
# CONFIG_NET_DSA_MV88E6123_61_65 is not set
CONFIG_ETHERNET=y
CONFIG_MDIO=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=m
CONFIG_TYPHOON=m
# CONFIG_NET_VENDOR_ADAPTEC is not set
# CONFIG_NET_VENDOR_ALTEON is not set
CONFIG_NET_VENDOR_AMD=y
CONFIG_AMD8111_ETH=m
CONFIG_PCNET32=m
# CONFIG_NET_VENDOR_ATHEROS is not set
CONFIG_NET_VENDOR_BROADCOM=y
CONFIG_B44=m
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_BNX2=m
CONFIG_CNIC=m
CONFIG_TIGON3=m
CONFIG_BNX2X=m
# CONFIG_NET_VENDOR_BROCADE is not set
# CONFIG_NET_CALXEDA_XGMAC is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3=m
CONFIG_CHELSIO_T4=m
# CONFIG_CHELSIO_T4VF is not set
# CONFIG_NET_VENDOR_CISCO is not set
# CONFIG_DNET is not set
# CONFIG_NET_VENDOR_DEC is not set
# CONFIG_NET_VENDOR_DLINK is not set
# CONFIG_NET_VENDOR_EMULEX is not set
# CONFIG_NET_VENDOR_EXAR is not set
# CONFIG_NET_VENDOR_HP is not set
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=m
CONFIG_E1000=m
CONFIG_E1000E=m
CONFIG_IGB=m
CONFIG_IGB_DCA=y
# CONFIG_IGB_PTP is not set
CONFIG_IGBVF=m
CONFIG_IXGB=m
CONFIG_IXGBE=m
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBE_DCA=y
CONFIG_IXGBE_DCB=y
# CONFIG_IXGBE_PTP is not set
CONFIG_IXGBEVF=m
CONFIG_NET_VENDOR_I825XX=y
CONFIG_ZNET=m
CONFIG_IP1000=m
# CONFIG_JME is not set
# CONFIG_NET_VENDOR_MARVELL is not set
# CONFIG_NET_VENDOR_MELLANOX is not set
# CONFIG_NET_VENDOR_MICREL is not set
CONFIG_NET_VENDOR_MICROCHIP=y
# CONFIG_ENC28J60 is not set
# CONFIG_NET_VENDOR_MYRI is not set
# CONFIG_FEALNX is not set
# CONFIG_NET_VENDOR_NATSEMI is not set
CONFIG_NET_VENDOR_NVIDIA=y
CONFIG_FORCEDETH=m
# CONFIG_NET_VENDOR_OKI is not set
# CONFIG_ETHOC is not set
CONFIG_NET_PACKET_ENGINE=y
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
# CONFIG_NET_VENDOR_QLOGIC is not set
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_8139CP=m
CONFIG_8139TOO=m
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
CONFIG_R8169=m
# CONFIG_NET_VENDOR_RDC is not set
# CONFIG_NET_VENDOR_SEEQ is not set
# CONFIG_NET_VENDOR_SILAN is not set
# CONFIG_NET_VENDOR_SIS is not set
# CONFIG_SFC is not set
# CONFIG_NET_VENDOR_SMSC is not set
# CONFIG_NET_VENDOR_STMICRO is not set
# CONFIG_NET_VENDOR_SUN is not set
# CONFIG_NET_VENDOR_TEHUTI is not set
# CONFIG_NET_VENDOR_TI is not set
# CONFIG_NET_VENDOR_VIA is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
# CONFIG_FDDI is not set
CONFIG_HIPPI=y
# CONFIG_ROADRUNNER is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_AT803X_PHY is not set
CONFIG_AMD_PHY=m
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
# CONFIG_BCM87XX_PHY is not set
CONFIG_ICPLUS_PHY=m
CONFIG_REALTEK_PHY=m
CONFIG_NATIONAL_PHY=m
CONFIG_STE10XP=m
CONFIG_LSI_ET1011C_PHY=m
CONFIG_MICREL_PHY=m
CONFIG_FIXED_PHY=y
CONFIG_MDIO_BITBANG=m
CONFIG_MDIO_GPIO=m
# CONFIG_MICREL_KS8995MA is not set
CONFIG_PPP=m
# CONFIG_PPP_BSDCOMP is not set
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_FILTER=y
CONFIG_PPP_MPPE=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPPOATM=m
CONFIG_PPPOE=m
CONFIG_PPTP=m
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_SLIP=m
CONFIG_SLHC=m
CONFIG_SLIP_COMPRESSED=y
# CONFIG_SLIP_SMART is not set
# CONFIG_SLIP_MODE_SLIP6 is not set

#
# USB Network Adapters
#
CONFIG_USB_CATC=m
CONFIG_USB_KAWETH=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_CDC_EEM=m
CONFIG_USB_NET_CDC_NCM=m
CONFIG_USB_NET_DM9601=m
CONFIG_USB_NET_SMSC75XX=m
CONFIG_USB_NET_SMSC95XX=m
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
CONFIG_USB_NET_PLUSB=m
CONFIG_USB_NET_MCS7830=m
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
CONFIG_USB_NET_ZAURUS=m
CONFIG_USB_NET_CX82310_ETH=m
CONFIG_USB_NET_KALMIA=m
CONFIG_USB_NET_QMI_WWAN=m
CONFIG_USB_HSO=m
CONFIG_USB_NET_INT51X1=m
CONFIG_USB_IPHETH=m
CONFIG_USB_SIERRA_NET=m
CONFIG_USB_VL600=m
# CONFIG_WLAN is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
CONFIG_IEEE802154_DRIVERS=m
CONFIG_IEEE802154_FAKEHARD=m
# CONFIG_VMXNET3 is not set
# CONFIG_HYPERV_NET is not set
CONFIG_ISDN=y
CONFIG_ISDN_I4L=m
CONFIG_ISDN_PPP=y
CONFIG_ISDN_PPP_VJ=y
CONFIG_ISDN_MPP=y
CONFIG_IPPP_FILTER=y
CONFIG_ISDN_PPP_BSDCOMP=m
CONFIG_ISDN_AUDIO=y
CONFIG_ISDN_TTY_FAX=y

#
# ISDN feature submodules
#
CONFIG_ISDN_DIVERSION=m

#
# ISDN4Linux hardware drivers
#

#
# Passive cards
#
CONFIG_ISDN_DRV_HISAX=m

#
# D-channel protocol features
#
CONFIG_HISAX_EURO=y
CONFIG_DE_AOC=y
# CONFIG_HISAX_NO_SENDCOMPLETE is not set
# CONFIG_HISAX_NO_LLC is not set
# CONFIG_HISAX_NO_KEYPAD is not set
CONFIG_HISAX_1TR6=y
CONFIG_HISAX_NI1=y
CONFIG_HISAX_MAX_CARDS=8

#
# HiSax supported cards
#
CONFIG_HISAX_16_3=y
CONFIG_HISAX_TELESPCI=y
CONFIG_HISAX_S0BOX=y
CONFIG_HISAX_FRITZPCI=y
CONFIG_HISAX_AVM_A1_PCMCIA=y
CONFIG_HISAX_ELSA=y
CONFIG_HISAX_DIEHLDIVA=y
CONFIG_HISAX_SEDLBAUER=y
CONFIG_HISAX_NETJET=y
CONFIG_HISAX_NETJET_U=y
CONFIG_HISAX_NICCY=y
CONFIG_HISAX_BKM_A4T=y
CONFIG_HISAX_SCT_QUADRO=y
CONFIG_HISAX_GAZEL=y
CONFIG_HISAX_HFC_PCI=y
CONFIG_HISAX_W6692=y
CONFIG_HISAX_HFC_SX=y
CONFIG_HISAX_ENTERNOW_PCI=y
# CONFIG_HISAX_DEBUG is not set

#
# HiSax PCMCIA card service modules
#

#
# HiSax sub driver modules
#
CONFIG_HISAX_ST5481=m
CONFIG_HISAX_HFCUSB=m
CONFIG_HISAX_HFC4S8S=m
CONFIG_HISAX_FRITZ_PCIPNP=m

#
# Active cards
#
CONFIG_ISDN_CAPI=m
CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON=y
CONFIG_CAPI_TRACE=y
CONFIG_ISDN_CAPI_MIDDLEWARE=y
CONFIG_ISDN_CAPI_CAPI20=m
CONFIG_ISDN_CAPI_CAPIDRV=m

#
# CAPI hardware drivers
#
CONFIG_CAPI_AVM=y
CONFIG_ISDN_DRV_AVMB1_B1PCI=m
CONFIG_ISDN_DRV_AVMB1_B1PCIV4=y
CONFIG_ISDN_DRV_AVMB1_T1PCI=m
CONFIG_ISDN_DRV_AVMB1_C4=m
CONFIG_CAPI_EICON=y
CONFIG_ISDN_DIVAS=m
CONFIG_ISDN_DIVAS_BRIPCI=y
CONFIG_ISDN_DIVAS_PRIPCI=y
CONFIG_ISDN_DIVAS_DIVACAPI=m
CONFIG_ISDN_DIVAS_USERIDI=m
CONFIG_ISDN_DIVAS_MAINT=m
CONFIG_ISDN_DRV_GIGASET=m
CONFIG_GIGASET_CAPI=y
# CONFIG_GIGASET_I4L is not set
# CONFIG_GIGASET_DUMMYLL is not set
CONFIG_GIGASET_BASE=m
CONFIG_GIGASET_M105=m
CONFIG_GIGASET_M101=m
# CONFIG_GIGASET_DEBUG is not set
CONFIG_HYSDN=m
CONFIG_HYSDN_CAPI=y
CONFIG_MISDN=m
CONFIG_MISDN_DSP=m
CONFIG_MISDN_L1OIP=m

#
# mISDN hardware drivers
#
CONFIG_MISDN_HFCPCI=m
CONFIG_MISDN_HFCMULTI=m
CONFIG_MISDN_HFCUSB=m
CONFIG_MISDN_AVMFRITZ=m
CONFIG_MISDN_SPEEDFAX=m
CONFIG_MISDN_INFINEON=m
CONFIG_MISDN_W6692=m
CONFIG_MISDN_NETJET=m
CONFIG_MISDN_IPAC=m
CONFIG_MISDN_ISAR=m
CONFIG_ISDN_HDLC=m

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=m
CONFIG_INPUT_SPARSEKMAP=m
CONFIG_INPUT_MATRIXKMAP=m

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ADP5588=m
CONFIG_KEYBOARD_ADP5589=m
CONFIG_KEYBOARD_ATKBD=y
CONFIG_KEYBOARD_QT1070=m
CONFIG_KEYBOARD_QT2160=m
# CONFIG_KEYBOARD_LKKBD is not set
CONFIG_KEYBOARD_GPIO=m
CONFIG_KEYBOARD_GPIO_POLLED=m
CONFIG_KEYBOARD_TCA6416=m
CONFIG_KEYBOARD_TCA8418=m
CONFIG_KEYBOARD_MATRIX=m
CONFIG_KEYBOARD_LM8323=m
# CONFIG_KEYBOARD_LM8333 is not set
CONFIG_KEYBOARD_MAX7359=m
CONFIG_KEYBOARD_MCS=m
CONFIG_KEYBOARD_MPR121=m
CONFIG_KEYBOARD_NEWTON=m
CONFIG_KEYBOARD_OPENCORES=m
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_KEYBOARD_SUNKBD=m
CONFIG_KEYBOARD_OMAP4=m
CONFIG_KEYBOARD_XTKBD=m
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_SENTELIC=y
CONFIG_MOUSE_PS2_TOUCHKIT=y
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_APPLETOUCH=m
CONFIG_MOUSE_BCM5974=m
CONFIG_MOUSE_VSXXXAA=m
CONFIG_MOUSE_GPIO=m
CONFIG_MOUSE_SYNAPTICS_I2C=m
CONFIG_MOUSE_SYNAPTICS_USB=m
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_USB=y
CONFIG_JOYSTICK_IFORCE_232=y
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
CONFIG_JOYSTICK_ZHENHUA=m
CONFIG_JOYSTICK_AS5011=m
CONFIG_JOYSTICK_JOYDUMP=m
CONFIG_JOYSTICK_XPAD=m
CONFIG_JOYSTICK_XPAD_FF=y
CONFIG_JOYSTICK_XPAD_LEDS=y
CONFIG_INPUT_TABLET=y
CONFIG_TABLET_USB_ACECAD=m
CONFIG_TABLET_USB_AIPTEK=m
CONFIG_TABLET_USB_GTCO=m
CONFIG_TABLET_USB_HANWANG=m
CONFIG_TABLET_USB_KBTAB=m
CONFIG_TABLET_USB_WACOM=m
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_TOUCHSCREEN_ADS7846=m
CONFIG_TOUCHSCREEN_AD7877=m
CONFIG_TOUCHSCREEN_AD7879=m
CONFIG_TOUCHSCREEN_AD7879_I2C=m
CONFIG_TOUCHSCREEN_AD7879_SPI=m
CONFIG_TOUCHSCREEN_ATMEL_MXT=m
CONFIG_TOUCHSCREEN_AUO_PIXCIR=m
CONFIG_TOUCHSCREEN_BU21013=m
CONFIG_TOUCHSCREEN_CY8CTMG110=m
CONFIG_TOUCHSCREEN_CYTTSP_CORE=m
CONFIG_TOUCHSCREEN_CYTTSP_I2C=m
CONFIG_TOUCHSCREEN_CYTTSP_SPI=m
CONFIG_TOUCHSCREEN_DYNAPRO=m
CONFIG_TOUCHSCREEN_HAMPSHIRE=m
CONFIG_TOUCHSCREEN_EETI=m
CONFIG_TOUCHSCREEN_FUJITSU=m
CONFIG_TOUCHSCREEN_ILI210X=m
CONFIG_TOUCHSCREEN_GUNZE=m
CONFIG_TOUCHSCREEN_ELO=m
CONFIG_TOUCHSCREEN_WACOM_W8001=m
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
CONFIG_TOUCHSCREEN_MAX11801=m
CONFIG_TOUCHSCREEN_MCS5000=m
# CONFIG_TOUCHSCREEN_MMS114 is not set
CONFIG_TOUCHSCREEN_MTOUCH=m
CONFIG_TOUCHSCREEN_INEXIO=m
CONFIG_TOUCHSCREEN_MK712=m
CONFIG_TOUCHSCREEN_PENMOUNT=m
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
CONFIG_TOUCHSCREEN_TOUCHRIGHT=m
CONFIG_TOUCHSCREEN_TOUCHWIN=m
CONFIG_TOUCHSCREEN_PIXCIR=m
CONFIG_TOUCHSCREEN_USB_COMPOSITE=m
CONFIG_TOUCHSCREEN_USB_EGALAX=y
CONFIG_TOUCHSCREEN_USB_PANJIT=y
CONFIG_TOUCHSCREEN_USB_3M=y
CONFIG_TOUCHSCREEN_USB_ITM=y
CONFIG_TOUCHSCREEN_USB_ETURBO=y
CONFIG_TOUCHSCREEN_USB_GUNZE=y
CONFIG_TOUCHSCREEN_USB_DMC_TSC10=y
CONFIG_TOUCHSCREEN_USB_IRTOUCH=y
CONFIG_TOUCHSCREEN_USB_IDEALTEK=y
CONFIG_TOUCHSCREEN_USB_GENERAL_TOUCH=y
CONFIG_TOUCHSCREEN_USB_GOTOP=y
CONFIG_TOUCHSCREEN_USB_JASTEC=y
CONFIG_TOUCHSCREEN_USB_ELO=y
CONFIG_TOUCHSCREEN_USB_E2I=y
CONFIG_TOUCHSCREEN_USB_ZYTRONIC=y
CONFIG_TOUCHSCREEN_USB_ETT_TC45USB=y
CONFIG_TOUCHSCREEN_USB_NEXIO=y
CONFIG_TOUCHSCREEN_USB_EASYTOUCH=y
CONFIG_TOUCHSCREEN_TOUCHIT213=m
CONFIG_TOUCHSCREEN_TSC_SERIO=m
CONFIG_TOUCHSCREEN_TSC2005=m
CONFIG_TOUCHSCREEN_TSC2007=m
CONFIG_TOUCHSCREEN_PCAP=m
CONFIG_TOUCHSCREEN_ST1232=m
CONFIG_TOUCHSCREEN_TPS6507X=m
CONFIG_INPUT_MISC=y
CONFIG_INPUT_AD714X=m
CONFIG_INPUT_AD714X_I2C=m
CONFIG_INPUT_AD714X_SPI=m
CONFIG_INPUT_BMA150=m
CONFIG_INPUT_PCSPKR=m
CONFIG_INPUT_MMA8450=m
CONFIG_INPUT_MPU3050=m
CONFIG_INPUT_APANEL=m
CONFIG_INPUT_GP2A=m
CONFIG_INPUT_GPIO_TILT_POLLED=m
CONFIG_INPUT_ATLAS_BTNS=m
CONFIG_INPUT_ATI_REMOTE2=m
CONFIG_INPUT_KEYSPAN_REMOTE=m
CONFIG_INPUT_KXTJ9=m
# CONFIG_INPUT_KXTJ9_POLLED_MODE is not set
CONFIG_INPUT_POWERMATE=m
CONFIG_INPUT_YEALINK=m
CONFIG_INPUT_CM109=m
CONFIG_INPUT_UINPUT=m
CONFIG_INPUT_PCF8574=m
CONFIG_INPUT_GPIO_ROTARY_ENCODER=m
CONFIG_INPUT_PCAP=m
CONFIG_INPUT_ADXL34X=m
CONFIG_INPUT_ADXL34X_I2C=m
CONFIG_INPUT_ADXL34X_SPI=m
CONFIG_INPUT_CMA3000=m
CONFIG_INPUT_CMA3000_I2C=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=m
CONFIG_SERIO_CT82C710=m
CONFIG_SERIO_PCIPS2=m
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_SERIO_ALTERA_PS2=m
CONFIG_SERIO_PS2MULT=m
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
CONFIG_GAMEPORT_EMU10K1=m
CONFIG_GAMEPORT_FM801=m

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=0
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_ROCKETPORT=m
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
CONFIG_MOXA_INTELLIO=m
CONFIG_MOXA_SMARTIO=m
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_NOZOMI=m
CONFIG_ISI=m
CONFIG_N_HDLC=m
CONFIG_N_GSM=m
CONFIG_TRACE_ROUTER=m
CONFIG_TRACE_SINK=m
CONFIG_DEVKMEM=y
CONFIG_STALDRV=y

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_NR_UARTS=16
CONFIG_SERIAL_8250_RUNTIME_UARTS=8
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_MAX3100 is not set
# CONFIG_SERIAL_MAX310X is not set
# CONFIG_SERIAL_MRST_MAX3110 is not set
# CONFIG_SERIAL_MFD_HSU is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
CONFIG_SERIAL_JSM=m
# CONFIG_SERIAL_SCCNXP is not set
CONFIG_SERIAL_TIMBERDALE=m
CONFIG_SERIAL_ALTERA_JTAGUART=m
CONFIG_SERIAL_ALTERA_UART=m
CONFIG_SERIAL_ALTERA_UART_MAXPORTS=4
CONFIG_SERIAL_ALTERA_UART_BAUDRATE=115200
CONFIG_SERIAL_IFX6X60=m
CONFIG_SERIAL_PCH_UART=m
CONFIG_SERIAL_XILINX_PS_UART=m
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=m
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_PANIC_EVENT=y
# CONFIG_IPMI_PANIC_STRING is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
# CONFIG_IPMI_WATCHDOG is not set
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_TIMERIOMEM=m
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_HW_RANDOM_VIA=m
CONFIG_HW_RANDOM_VIRTIO=m
CONFIG_HW_RANDOM_TPM=m
CONFIG_NVRAM=y
CONFIG_R3964=m
CONFIG_APPLICOM=m
CONFIG_MWAVE=m
CONFIG_RAW_DRIVER=m
CONFIG_MAX_RAW_DEVS=4096
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
CONFIG_HANGCHECK_TIMER=m
CONFIG_UV_MMTIMER=m
CONFIG_TCG_TPM=m
CONFIG_TCG_TIS=m
# CONFIG_TCG_TIS_I2C_INFINEON is not set
CONFIG_TCG_NSC=m
CONFIG_TCG_ATMEL=m
CONFIG_TCG_INFINEON=m
CONFIG_TELCLOCK=m
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_MUX=m

#
# Multiplexer I2C Chip support
#
CONFIG_I2C_MUX_GPIO=m
CONFIG_I2C_MUX_PCA9541=m
CONFIG_I2C_MUX_PCA954x=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
CONFIG_I2C_ISCH=m
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
CONFIG_I2C_NFORCE2_S4985=m
CONFIG_I2C_SIS5595=m
CONFIG_I2C_SIS630=m
CONFIG_I2C_SIS96X=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# ACPI drivers
#
CONFIG_I2C_SCMI=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
CONFIG_I2C_DESIGNWARE_CORE=m
CONFIG_I2C_DESIGNWARE_PCI=m
CONFIG_I2C_EG20T=m
CONFIG_I2C_GPIO=m
# CONFIG_I2C_INTEL_MID is not set
CONFIG_I2C_OCORES=m
CONFIG_I2C_PCA_PLATFORM=m
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_DIOLAN_U2C=m
CONFIG_I2C_PARPORT_LIGHT=m
CONFIG_I2C_TAOS_EVM=m
CONFIG_I2C_TINY_USB=m

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_STUB=m
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
CONFIG_SPI=y
# CONFIG_SPI_DEBUG is not set
CONFIG_SPI_MASTER=y

#
# SPI Master Controller Drivers
#
CONFIG_SPI_ALTERA=m
CONFIG_SPI_BITBANG=m
CONFIG_SPI_GPIO=m
CONFIG_SPI_OC_TINY=m
# CONFIG_SPI_PXA2XX_PCI is not set
# CONFIG_SPI_SC18IS602 is not set
CONFIG_SPI_TOPCLIFF_PCH=m
# CONFIG_SPI_XCOMM is not set
CONFIG_SPI_XILINX=m
CONFIG_SPI_DESIGNWARE=y
CONFIG_SPI_DW_PCI=m
# CONFIG_SPI_DW_MID_DMA is not set

#
# SPI Protocol Masters
#
CONFIG_SPI_SPIDEV=m
CONFIG_SPI_TLE62X0=m
# CONFIG_HSI is not set

#
# PPS support
#
# CONFIG_PPS is not set

#
# PPS generators support
#

#
# PTP clock support
#

#
# Enable Device Drivers -> PPS to see the PTP clock options.
#
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
# CONFIG_DEBUG_GPIO is not set
CONFIG_GPIO_SYSFS=y
CONFIG_GPIO_GENERIC=m
CONFIG_GPIO_MAX730X=m

#
# Memory mapped GPIO drivers:
#
CONFIG_GPIO_GENERIC_PLATFORM=m
CONFIG_GPIO_IT8761E=m
# CONFIG_GPIO_SCH is not set
# CONFIG_GPIO_ICH is not set
CONFIG_GPIO_VX855=m

#
# I2C GPIO expanders:
#
CONFIG_GPIO_MAX7300=m
CONFIG_GPIO_MAX732X=m
CONFIG_GPIO_PCA953X=m
CONFIG_GPIO_PCF857X=m
# CONFIG_GPIO_SX150X is not set
CONFIG_GPIO_ADP5588=m

#
# PCI GPIO expanders:
#
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_LANGWELL is not set
# CONFIG_GPIO_PCH is not set
CONFIG_GPIO_ML_IOH=m
# CONFIG_GPIO_RDC321X is not set

#
# SPI GPIO expanders:
#
CONFIG_GPIO_MAX7301=m
CONFIG_GPIO_MCP23S08=m
CONFIG_GPIO_MC33880=m
CONFIG_GPIO_74X164=m

#
# AC97 GPIO expanders:
#

#
# MODULbus GPIO expanders:
#
CONFIG_W1=m
CONFIG_W1_CON=y

#
# 1-wire Bus Masters
#
CONFIG_W1_MASTER_MATROX=m
CONFIG_W1_MASTER_DS2490=m
CONFIG_W1_MASTER_DS2482=m
CONFIG_W1_MASTER_DS1WM=m
CONFIG_W1_MASTER_GPIO=m
# CONFIG_HDQ_MASTER_OMAP is not set

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
CONFIG_W1_SLAVE_SMEM=m
CONFIG_W1_SLAVE_DS2408=m
CONFIG_W1_SLAVE_DS2423=m
CONFIG_W1_SLAVE_DS2431=m
CONFIG_W1_SLAVE_DS2433=m
CONFIG_W1_SLAVE_DS2433_CRC=y
CONFIG_W1_SLAVE_DS2760=m
CONFIG_W1_SLAVE_DS2780=m
CONFIG_W1_SLAVE_DS2781=m
# CONFIG_W1_SLAVE_DS28E04 is not set
CONFIG_W1_SLAVE_BQ27000=m
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_PDA_POWER=m
# CONFIG_TEST_POWER is not set
CONFIG_BATTERY_DS2760=m
CONFIG_BATTERY_DS2780=m
CONFIG_BATTERY_DS2781=m
CONFIG_BATTERY_DS2782=m
CONFIG_BATTERY_SBS=m
CONFIG_BATTERY_BQ27x00=m
CONFIG_BATTERY_BQ27X00_I2C=y
CONFIG_BATTERY_BQ27X00_PLATFORM=y
CONFIG_BATTERY_MAX17040=m
CONFIG_BATTERY_MAX17042=m
CONFIG_CHARGER_MAX8903=m
CONFIG_CHARGER_LP8727=m
CONFIG_CHARGER_GPIO=m
CONFIG_CHARGER_SMB347=m
# CONFIG_POWER_AVS is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=m
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
CONFIG_SENSORS_ABITUGURU=m
CONFIG_SENSORS_ABITUGURU3=m
CONFIG_SENSORS_AD7314=m
CONFIG_SENSORS_AD7414=m
CONFIG_SENSORS_AD7418=m
CONFIG_SENSORS_ADCXX=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1029=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
# CONFIG_SENSORS_ADT7410 is not set
CONFIG_SENSORS_ADT7411=m
CONFIG_SENSORS_ADT7462=m
CONFIG_SENSORS_ADT7470=m
CONFIG_SENSORS_ADT7475=m
CONFIG_SENSORS_ASC7621=m
CONFIG_SENSORS_K8TEMP=m
CONFIG_SENSORS_K10TEMP=m
CONFIG_SENSORS_FAM15H_POWER=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS620=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_I5K_AMB=m
CONFIG_SENSORS_F71805F=m
CONFIG_SENSORS_F71882FG=m
CONFIG_SENSORS_F75375S=m
CONFIG_SENSORS_FSCHMD=m
CONFIG_SENSORS_G760A=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
CONFIG_SENSORS_GPIO_FAN=m
# CONFIG_SENSORS_HIH6130 is not set
CONFIG_SENSORS_CORETEMP=m
CONFIG_SENSORS_IBMAEM=m
CONFIG_SENSORS_IBMPEX=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_JC42=m
CONFIG_SENSORS_LINEAGE=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM70=m
CONFIG_SENSORS_LM73=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_LM93=m
CONFIG_SENSORS_LTC4151=m
CONFIG_SENSORS_LTC4215=m
CONFIG_SENSORS_LTC4245=m
CONFIG_SENSORS_LTC4261=m
CONFIG_SENSORS_LM95241=m
CONFIG_SENSORS_LM95245=m
CONFIG_SENSORS_MAX1111=m
CONFIG_SENSORS_MAX16065=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_MAX1668=m
# CONFIG_SENSORS_MAX197 is not set
CONFIG_SENSORS_MAX6639=m
CONFIG_SENSORS_MAX6642=m
CONFIG_SENSORS_MAX6650=m
CONFIG_SENSORS_MCP3021=m
CONFIG_SENSORS_NTC_THERMISTOR=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_PC87427=m
CONFIG_SENSORS_PCF8591=m
CONFIG_PMBUS=m
CONFIG_SENSORS_PMBUS=m
CONFIG_SENSORS_ADM1275=m
CONFIG_SENSORS_LM25066=m
CONFIG_SENSORS_LTC2978=m
CONFIG_SENSORS_MAX16064=m
CONFIG_SENSORS_MAX34440=m
CONFIG_SENSORS_MAX8688=m
CONFIG_SENSORS_UCD9000=m
CONFIG_SENSORS_UCD9200=m
CONFIG_SENSORS_ZL6100=m
CONFIG_SENSORS_SHT15=m
CONFIG_SENSORS_SHT21=m
CONFIG_SENSORS_SIS5595=m
CONFIG_SENSORS_SMM665=m
CONFIG_SENSORS_DME1737=m
CONFIG_SENSORS_EMC1403=m
CONFIG_SENSORS_EMC2103=m
CONFIG_SENSORS_EMC6W201=m
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
# CONFIG_SENSORS_SCH56XX_COMMON is not set
CONFIG_SENSORS_ADS1015=m
CONFIG_SENSORS_ADS7828=m
CONFIG_SENSORS_ADS7871=m
CONFIG_SENSORS_AMC6821=m
# CONFIG_SENSORS_INA2XX is not set
CONFIG_SENSORS_THMC50=m
CONFIG_SENSORS_TMP102=m
CONFIG_SENSORS_TMP401=m
CONFIG_SENSORS_TMP421=m
CONFIG_SENSORS_VIA_CPUTEMP=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83793=m
CONFIG_SENSORS_W83795=m
# CONFIG_SENSORS_W83795_FANCTRL is not set
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83L786NG=m
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
CONFIG_SENSORS_APPLESMC=m

#
# ACPI drivers
#
CONFIG_SENSORS_ACPI_POWER=m
CONFIG_SENSORS_ATK0110=m
CONFIG_THERMAL=m
CONFIG_THERMAL_HWMON=y
# CONFIG_CPU_THERMAL is not set
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=m
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
# CONFIG_SSB_DEBUG is not set
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
CONFIG_BCMA=m
CONFIG_BCMA_HOST_PCI_POSSIBLE=y
CONFIG_BCMA_HOST_PCI=y
# CONFIG_BCMA_DRIVER_GMAC_CMN is not set
# CONFIG_BCMA_DEBUG is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=m
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_HTC_I2CPLD is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_MFD_TPS65910 is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS65912_SPI is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_STMPE is not set
# CONFIG_MFD_TC3589X is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_SMSC is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_SPI is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_MAX77686 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_SEC_CORE is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_ARIZONA_SPI is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM831X_SPI is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_MC13XXX_SPI is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_ABX500_CORE is not set
CONFIG_EZX_PCAP=y
# CONFIG_MFD_CS5535 is not set
# CONFIG_MFD_TIMBERDALE is not set
CONFIG_LPC_SCH=m
CONFIG_LPC_ICH=m
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
CONFIG_MFD_VX855=m
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_AAT2870_CORE is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_REGULATOR is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_VGA_SWITCHEROO=y
CONFIG_DRM=m
CONFIG_DRM_USB=m
CONFIG_DRM_KMS_HELPER=m
CONFIG_DRM_LOAD_EDID_FIRMWARE=y
CONFIG_DRM_TTM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
CONFIG_DRM_RADEON_KMS=y
CONFIG_DRM_NOUVEAU=m
CONFIG_NOUVEAU_DEBUG=5
CONFIG_NOUVEAU_DEBUG_DEFAULT=3
CONFIG_DRM_NOUVEAU_BACKLIGHT=y

#
# I2C encoder or helper chips
#
CONFIG_DRM_I2C_CH7006=m
CONFIG_DRM_I2C_SIL164=m
CONFIG_DRM_I915=m
CONFIG_DRM_I915_KMS=y
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_DRM_VMWGFX=m
# CONFIG_DRM_VMWGFX_FBCON is not set
CONFIG_DRM_GMA500=m
# CONFIG_DRM_GMA600 is not set
CONFIG_DRM_GMA3600=y
CONFIG_DRM_UDL=m
# CONFIG_DRM_AST is not set
# CONFIG_DRM_MGAG200 is not set
# CONFIG_DRM_CIRRUS_QEMU is not set
# CONFIG_STUB_POULSBO is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=y
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
CONFIG_FB_DDC=m
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=m
CONFIG_FB_SYS_COPYAREA=m
CONFIG_FB_SYS_IMAGEBLIT=m
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_WMT_GE_ROPS is not set
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
CONFIG_FB_VGA16=m
CONFIG_FB_UVESA=m
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RIVA=m
CONFIG_FB_RIVA_I2C=y
# CONFIG_FB_RIVA_DEBUG is not set
CONFIG_FB_RIVA_BACKLIGHT=y
CONFIG_FB_I740=m
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_BACKLIGHT=y
# CONFIG_FB_RADEON_DEBUG is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_TMIO is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_EXYNOS_VIDEO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3630 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LP855X is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
# CONFIG_LOGO is not set
# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
# CONFIG_HIDRAW is not set
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO_TPKBD is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_LOGITECH_FF is not set
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
# CONFIG_LOGIWHEELS_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTRIG is not set
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_HYPERV_MOUSE is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set

#
# USB HID support
#
CONFIG_USB_HID=y
# CONFIG_HID_PID is not set
# CONFIG_USB_HIDDEV is not set
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB_ARCH_HAS_XHCI=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
# CONFIG_USB_MON is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
# CONFIG_USB_XHCI_HCD is not set
# CONFIG_USB_EHCI_HCD is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_BCMA is not set
# CONFIG_USB_HCD_SSB is not set
# CONFIG_USB_CHIPIDEA is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
# CONFIG_USB_PRINTER is not set
CONFIG_USB_WDM=m
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_REALTEK is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_STORAGE_ENE_UB6250 is not set
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set

#
# USB Physical Layer drivers
#
# CONFIG_OMAP_USB2 is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_ATM is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
CONFIG_LEDS_LM3530=m
# CONFIG_LEDS_LM3642 is not set
CONFIG_LEDS_PCA9532=m
CONFIG_LEDS_PCA9532_GPIO=y
CONFIG_LEDS_GPIO=m
CONFIG_LEDS_LP3944=m
CONFIG_LEDS_LP5521=m
CONFIG_LEDS_LP5523=m
CONFIG_LEDS_CLEVO_MAIL=m
CONFIG_LEDS_PCA955X=m
CONFIG_LEDS_PCA9633=m
CONFIG_LEDS_DAC124S085=m
CONFIG_LEDS_BD2802=m
CONFIG_LEDS_INTEL_SS4200=m
CONFIG_LEDS_LT3593=m
CONFIG_LEDS_DELL_NETBOOKS=m
CONFIG_LEDS_TCA6507=m
# CONFIG_LEDS_LM355x is not set
CONFIG_LEDS_OT200=m
# CONFIG_LEDS_BLINKM is not set
CONFIG_LEDS_TRIGGERS=y

#
# LED Triggers
#
CONFIG_LEDS_TRIGGER_TIMER=m
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
CONFIG_LEDS_TRIGGER_BACKLIGHT=m
# CONFIG_LEDS_TRIGGER_CPU is not set
CONFIG_LEDS_TRIGGER_GPIO=m
CONFIG_LEDS_TRIGGER_DEFAULT_ON=m

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=m
CONFIG_EDAC_MCE_INJ=m
CONFIG_EDAC_MM_EDAC=m
CONFIG_EDAC_AMD64=m
CONFIG_EDAC_AMD64_ERROR_INJECTION=y
CONFIG_EDAC_E752X=m
CONFIG_EDAC_I82975X=m
CONFIG_EDAC_I3000=m
CONFIG_EDAC_I3200=m
CONFIG_EDAC_X38=m
CONFIG_EDAC_I5400=m
CONFIG_EDAC_I7CORE=m
CONFIG_EDAC_I5000=m
CONFIG_EDAC_I5100=m
CONFIG_EDAC_I7300=m
CONFIG_EDAC_SBRIDGE=m
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
CONFIG_RTC_DRV_DS1374=m
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_DS3232=m
CONFIG_RTC_DRV_MAX6900=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_ISL12022=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
CONFIG_RTC_DRV_M41T80=m
CONFIG_RTC_DRV_M41T80_WDT=y
CONFIG_RTC_DRV_BQ32K=m
CONFIG_RTC_DRV_S35390A=m
CONFIG_RTC_DRV_FM3130=m
CONFIG_RTC_DRV_RX8581=m
CONFIG_RTC_DRV_RX8025=m
CONFIG_RTC_DRV_EM3027=m
CONFIG_RTC_DRV_RV3029C2=m

#
# SPI RTC drivers
#
CONFIG_RTC_DRV_M41T93=m
CONFIG_RTC_DRV_M41T94=m
CONFIG_RTC_DRV_DS1305=m
CONFIG_RTC_DRV_DS1390=m
CONFIG_RTC_DRV_MAX6902=m
CONFIG_RTC_DRV_R9701=m
CONFIG_RTC_DRV_RS5C348=m
CONFIG_RTC_DRV_DS3234=m
CONFIG_RTC_DRV_PCF2123=m

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
CONFIG_RTC_DRV_DS1286=m
CONFIG_RTC_DRV_DS1511=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_STK17TA8=m
CONFIG_RTC_DRV_M48T86=m
CONFIG_RTC_DRV_M48T35=m
CONFIG_RTC_DRV_M48T59=m
CONFIG_RTC_DRV_MSM6242=m
CONFIG_RTC_DRV_BQ4802=m
CONFIG_RTC_DRV_RP5C01=m
CONFIG_RTC_DRV_V3020=m
# CONFIG_RTC_DRV_DS2404 is not set

#
# on-CPU RTC drivers
#
CONFIG_RTC_DRV_PCAP=m
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_INTEL_MID_DMAC=m
CONFIG_INTEL_IOATDMA=m
CONFIG_TIMB_DMA=m
CONFIG_PCH_DMA=m
CONFIG_DMA_ENGINE=y

#
# DMA Clients
#
CONFIG_NET_DMA=y
CONFIG_ASYNC_TX_DMA=y
# CONFIG_DMATEST is not set
CONFIG_DCA=m
CONFIG_AUXDISPLAY=y
CONFIG_UIO=m
CONFIG_UIO_CIF=m
CONFIG_UIO_PDRV=m
CONFIG_UIO_PDRV_GENIRQ=m
CONFIG_UIO_AEC=m
CONFIG_UIO_SERCOS3=m
CONFIG_UIO_PCI_GENERIC=m
CONFIG_UIO_NETX=m
# CONFIG_VFIO is not set
CONFIG_VIRTIO=m

#
# Virtio drivers
#
CONFIG_VIRTIO_PCI=m
CONFIG_VIRTIO_BALLOON=m
CONFIG_VIRTIO_MMIO=m
# CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES is not set

#
# Microsoft Hyper-V guest support
#
CONFIG_HYPERV=m
CONFIG_HYPERV_UTILS=m
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
CONFIG_ACER_WMI=m
CONFIG_ACERHDF=m
CONFIG_ASUS_LAPTOP=m
CONFIG_DELL_LAPTOP=m
CONFIG_DELL_WMI=m
CONFIG_DELL_WMI_AIO=m
CONFIG_FUJITSU_LAPTOP=m
# CONFIG_FUJITSU_LAPTOP_DEBUG is not set
CONFIG_FUJITSU_TABLET=m
CONFIG_AMILO_RFKILL=m
CONFIG_HP_ACCEL=m
CONFIG_HP_WMI=m
CONFIG_MSI_LAPTOP=m
CONFIG_PANASONIC_LAPTOP=m
CONFIG_COMPAL_LAPTOP=m
CONFIG_SONY_LAPTOP=m
CONFIG_SONYPI_COMPAT=y
CONFIG_IDEAPAD_LAPTOP=m
CONFIG_THINKPAD_ACPI=m
# CONFIG_THINKPAD_ACPI_DEBUGFACILITIES is not set
# CONFIG_THINKPAD_ACPI_DEBUG is not set
# CONFIG_THINKPAD_ACPI_UNSAFE_LEDS is not set
CONFIG_THINKPAD_ACPI_VIDEO=y
CONFIG_THINKPAD_ACPI_HOTKEY_POLL=y
CONFIG_SENSORS_HDAPS=m
CONFIG_INTEL_MENLOW=m
CONFIG_EEEPC_LAPTOP=m
CONFIG_ASUS_WMI=m
CONFIG_ASUS_NB_WMI=m
CONFIG_EEEPC_WMI=m
CONFIG_ACPI_WMI=m
CONFIG_MSI_WMI=m
CONFIG_TOPSTAR_LAPTOP=m
CONFIG_ACPI_TOSHIBA=m
CONFIG_TOSHIBA_BT_RFKILL=m
CONFIG_ACPI_CMPC=m
CONFIG_INTEL_IPS=m
CONFIG_IBM_RTL=m
CONFIG_XO15_EBOOK=m
CONFIG_SAMSUNG_LAPTOP=m
CONFIG_MXM_WMI=m
CONFIG_INTEL_OAKTRAIL=m
CONFIG_SAMSUNG_Q10=m
CONFIG_APPLE_GMUX=m

#
# Hardware Spinlock drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_AMD_IOMMU=y
# CONFIG_AMD_IOMMU_STATS is not set
CONFIG_AMD_IOMMU_V2=m
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers (EXPERIMENTAL)
#
# CONFIG_STE_MODEM_RPROC is not set

#
# Rpmsg drivers (EXPERIMENTAL)
#
CONFIG_VIRT_DRIVERS=y
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# Firmware Drivers
#
CONFIG_EDD=m
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
CONFIG_DELL_RBU=m
CONFIG_DCDBAS=m
CONFIG_DMIID=y
CONFIG_DMI_SYSFS=m
CONFIG_ISCSI_IBFT_FIND=y
CONFIG_ISCSI_IBFT=m
# CONFIG_GOOGLE_FIRMWARE is not set

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_DEBUG is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=m
CONFIG_QFMT_V1=m
CONFIG_QFMT_V2=m
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_CUSE=m
CONFIG_GENERIC_ACL=y

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_LOGFS is not set
CONFIG_CRAMFS=m
CONFIG_SQUASHFS=m
CONFIG_SQUASHFS_XATTR=y
CONFIG_SQUASHFS_ZLIB=y
CONFIG_SQUASHFS_LZO=y
CONFIG_SQUASHFS_XZ=y
# CONFIG_SQUASHFS_4K_DEVBLK_SIZE is not set
# CONFIG_SQUASHFS_EMBEDDED is not set
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=3
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_EXOFS_FS is not set
CONFIG_ORE=m
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V2=m
CONFIG_NFS_V3=m
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=m
# CONFIG_NFS_SWAP is not set
CONFIG_NFS_V4_1=y
CONFIG_PNFS_FILE_LAYOUT=m
CONFIG_PNFS_BLOCK=m
CONFIG_PNFS_OBJLAYOUT=m
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DEBUG=y
# CONFIG_NFSD is not set
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_SUNRPC_BACKCHANNEL=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
CONFIG_SUNRPC_DEBUG=y
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=m
CONFIG_DLM=m
CONFIG_DLM_DEBUG=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=4
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_READABLE_ASM is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=480
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_DEBUG_SLAB is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_PROVE_RCU_DELAY is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_CPU_STALL_VERBOSE=y
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
CONFIG_DEBUG_FORCE_WEAK_PER_CPU=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_LKDTM=m
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_EVENT_POWER_TRACING_DEPRECATED=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_FTRACE_SYSCALLS is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_STACK_TRACER is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENT=y
# CONFIG_UPROBE_EVENT is not set
CONFIG_PROBE_EVENTS=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
CONFIG_RING_BUFFER_BENCHMARK=m
# CONFIG_RBTREE_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
CONFIG_FIREWIRE_OHCI_REMOTE_DMA=y
CONFIG_BUILD_DOCSRC=y
CONFIG_DYNAMIC_DEBUG=y
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_ATOMIC64_SELFTEST is not set
CONFIG_ASYNC_RAID6_TEST=m
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=m
# CONFIG_KGDB_TESTS is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
CONFIG_KGDB_KDB=y
CONFIG_KDB_KEYBOARD=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_KMEMCHECK is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_STRICT_DEVMEM is not set
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_DEBUG_SET_MODULE_RONX=y
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_TRUSTED_KEYS=m
CONFIG_ENCRYPTED_KEYS=m
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_PATH=y
# CONFIG_INTEL_TXT is not set
CONFIG_LSM_MMAP_MIN_ADDR=0
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=0
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_YAMA is not set
# CONFIG_IMA is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_ASYNC_PQ=m
CONFIG_ASYNC_RAID6_RECOV=m
CONFIG_ASYNC_TX_DISABLE_PQ_VAL_DMA=y
CONFIG_ASYNC_TX_DISABLE_XOR_VAL_DMA=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=m
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=m
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_USER=m
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_PCRYPT=m
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=m
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_ABLK_HELPER_X86=m
CONFIG_CRYPTO_GLUE_HELPER_X86=m

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=m
CONFIG_CRYPTO_GCM=m
CONFIG_CRYPTO_SEQIV=m

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_CTR=m
CONFIG_CRYPTO_CTS=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_LRW=m
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_XTS=m

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=m
CONFIG_CRYPTO_VMAC=m

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=m
CONFIG_CRYPTO_GHASH=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_RMD128=m
CONFIG_CRYPTO_RMD160=m
CONFIG_CRYPTO_RMD256=m
CONFIG_CRYPTO_RMD320=m
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA1_SSSE3=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL=m

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_AES_NI_INTEL=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_BLOWFISH_COMMON=m
CONFIG_CRYPTO_BLOWFISH_X86_64=m
CONFIG_CRYPTO_CAMELLIA=m
CONFIG_CRYPTO_CAMELLIA_X86_64=m
CONFIG_CRYPTO_CAST5=m
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
CONFIG_CRYPTO_CAST6=m
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_SALSA20=m
CONFIG_CRYPTO_SALSA20_X86_64=m
CONFIG_CRYPTO_SEED=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_SERPENT_SSE2_X86_64=m
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
CONFIG_CRYPTO_TEA=m
# CONFIG_CRYPTO_TWOFISH is not set
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m
CONFIG_CRYPTO_TWOFISH_X86_64_3WAY=m
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_ZLIB=m
CONFIG_CRYPTO_LZO=y

#
# Random Number Generation
#
CONFIG_CRYPTO_ANSI_CPRNG=m
CONFIG_CRYPTO_USER_API=m
CONFIG_CRYPTO_USER_API_HASH=m
CONFIG_CRYPTO_USER_API_SKCIPHER=m
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_PADLOCK=m
CONFIG_CRYPTO_DEV_PADLOCK_AES=m
CONFIG_CRYPTO_DEV_PADLOCK_SHA=m
# CONFIG_ASYMMETRIC_KEY_TYPE is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
CONFIG_KVM_AMD=m
CONFIG_KVM_MMU_AUDIT=y
CONFIG_VHOST_NET=m
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=m
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_CRC8=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
CONFIG_AVERAGE=y
CONFIG_CORDIC=m
# CONFIG_DDR is not set
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Christoph Lameter
2012-11-12 23:50:02 UTC
Permalink
The biggest conceptual addition, beyond the elimination of the home
node, is that the scheduler is now able to recognize 'private' versus
'shared' pages, by carefully analyzing the pattern of how CPUs touch the
working set pages. The scheduler automatically recognizes tasks that
share memory with each other (and make dominant use of that memory) -
versus tasks that allocate and use their working set privately.
That is a key distinction to make and if this really works then that is
major progress.
This new scheduler code is then able to group together tasks that are
"memory related" via their memory access patterns: in the NUMA context
moving them onto the same node if possible, and spreading them amongst
nodes if they use private memory.
What happens if processes' memory accesses are related but the
common set of data does not fit into the memory provided by a single node?

The correct resolution in that case is usually to interleave the pages
over both nodes in use.
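A toy model may help make the interleaving argument concrete. This is not the kernel's implementation; it just mimics an MPOL_INTERLEAVE-style round-robin placement and counts how remote accesses to a symmetrically shared region split across the nodes (all names here are made up for illustration):

```python
# Toy sketch of round-robin page interleaving across NUMA nodes, in the
# spirit of the kernel's MPOL_INTERLEAVE policy (illustrative only, not
# the kernel implementation).

def place_interleaved(num_pages, nodes):
    """Assign page index -> node, round-robin over the given nodes."""
    return {page: nodes[page % len(nodes)] for page in range(num_pages)}

def remote_traffic(placement, accesses):
    """Count cross-node accesses, charged to each page's home node.

    `accesses` is a list of (accessing_node, page) pairs.
    """
    traffic = {}
    for node, page in accesses:
        home = placement[page]
        if home != node:
            traffic[home] = traffic.get(home, 0) + 1
    return traffic

# Two nodes, both touching every page of a shared 8-page region equally:
placement = place_interleaved(8, nodes=[0, 1])
accesses = [(node, page) for node in (0, 1) for page in range(8)]
# Interleaving splits the remote traffic evenly between the two nodes.
print(remote_traffic(placement, accesses) == {0: 4, 1: 4})  # True
```

With asymmetric access patterns the picture changes, which is the caveat raised further down the thread.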

Ingo Molnar
2012-11-13 07:30:01 UTC
Permalink
Post by Christoph Lameter
The biggest conceptual addition, beyond the elimination of
the home node, is that the scheduler is now able to
recognize 'private' versus 'shared' pages, by carefully
analyzing the pattern of how CPUs touch the working set
pages. The scheduler automatically recognizes tasks that
share memory with each other (and make dominant use of that
memory) - versus tasks that allocate and use their working
set privately.
That is a key distinction to make and if this really works
then that is major progress.
I posted updated benchmark results yesterday, and the approach
is indeed a performance breakthrough:

http://lkml.org/lkml/2012/11/12/330

It also made the code more generic and more maintainable from a
scheduler POV.
Post by Christoph Lameter
This new scheduler code is then able to group together tasks
that are "memory related" via their memory access patterns:
in the NUMA context moving them onto the same node if
possible, and spreading them amongst nodes if they use
private memory.
What happens if processes' memory accesses are related but the
common set of data does not fit into the memory provided by a
single node?
The other (very common) node-overload case is that there are
more tasks working on a shared piece of memory than fit on a
single node.

I have measured two such workloads; one is the Java SPEC
benchmark:

v3.7-vanilla: 494828 transactions/sec
v3.7-NUMA: 627228 transactions/sec [ +26.7% ]

the other is the 'numa01' testcase of autonumabench:

v3.7-vanilla: 340.3 seconds
v3.7-NUMA: 216.9 seconds [ +56% ]
Post by Christoph Lameter
The correct resolution in that case is usually to interleave
the pages over both nodes in use.
I'd not go as far as to claim that to be a general rule: the
correct placement depends on the system and workload specifics:
how much memory is on each node, how many tasks run on each
node, and whether the access patterns and working sets of the
tasks are symmetric amongst each other - which is not a given at
all.

Say, consider a database server that executes small and large
queries over a large, memory-shared database, and has worker
tasks attached to clients to serve each query. Depending on the
nature of the queries, interleaving can easily be the wrong
thing to do.
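To illustrate why symmetry matters, here is a deliberately simple cost model (the 2x remote-access penalty and the access counts are made-up numbers, not measurements): when one node dominates the accesses, keeping the pages local to it beats naive interleaving.

```python
# Hypothetical cost model for placement decisions: a remote access is
# assumed to cost twice a local one (made-up ratio, for illustration).

REMOTE_COST = 2.0

def total_cost(page_home, accesses):
    """Total access cost for a page->node placement and (node, page) accesses."""
    return sum(1.0 if page_home[page] == node else REMOTE_COST
               for node, page in accesses)

pages = range(4)
# Asymmetric workload: node 0 issues 90 accesses per page, node 1 only 10.
accesses = ([(0, p) for p in pages for _ in range(90)] +
            [(1, p) for p in pages for _ in range(10)])

local_to_0 = {p: 0 for p in pages}       # keep all pages on the busy node
interleaved = {p: p % 2 for p in pages}  # spread pages across both nodes

print(total_cost(local_to_0, accesses))   # 440.0: 360 local + 40 remote
print(total_cost(interleaved, accesses))  # 600.0: interleaving loses here
```

With equal access counts from both nodes the same model has interleaving tie or win, which is the symmetric case argued for elsewhere in the thread.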

Thanks,

Ingo
Christoph Lameter
2012-11-15 14:30:01 UTC
Permalink
Post by Ingo Molnar
Post by Christoph Lameter
the pages over both nodes in use.
I'd not go as far as to claim that to be a general rule: the
how much memory is on each node, how many tasks run on each
node, and whether the access patterns and working set of the
tasks is symmetric amongst each other - which is not a given at
all.
Say, consider a database server that executes small and large
queries over a large, memory-shared database, and has worker
tasks attached to clients to serve each query. Depending on the
nature of the queries, interleaving can easily be the wrong
thing to do.
The interleaving of memory areas that have an equal amount of shared
accesses from multiple nodes is essential to limit the traffic on the
interconnect and get top performance.

I guess, though, that in a non-HPC environment, where you are not
interested in one specific load running at top speed, varying
contention on the interconnect and memory buses is acceptable. But
this means that HPC loads cannot be auto-tuned.

Ingo Molnar
2012-11-16 16:00:03 UTC
Permalink
Post by Christoph Lameter
Post by Ingo Molnar
Post by Christoph Lameter
the pages over both nodes in use.
I'd not go as far as to claim that to be a general rule: the
correct placement depends on the system and workload
specifics: how much memory is on each node, how many tasks
run on each node, and whether the access patterns and
working sets of the tasks are symmetric amongst each other -
which is not a given at all.
Consider, say, a database server that executes small and
large queries over a large shared-memory database, and
attaches worker tasks to clients to serve each query.
Depending on the nature of the queries, interleaving can
easily be the wrong thing to do.
The interleaving of memory areas that have an equal amount of
shared accesses from multiple nodes is essential to limit the
traffic on the interconnect and get top performance.
That is true only if the load is symmetric.
Post by Christoph Lameter
I guess, though, that in a non-HPC environment, where you
are not interested in one specific load running at top
speed, varying contention on the interconnect and memory
buses is acceptable. But this means that HPC loads cannot
be auto-tuned.
I'm not against improving these workloads (at all) - I just
pointed out that interleaving isn't necessarily the best
placement strategy for 'large' workloads.

Thanks,

Ingo
Christoph Lameter
2012-11-16 21:00:02 UTC
Permalink
Post by Ingo Molnar
Post by Christoph Lameter
The interleaving of memory areas that have an equal amount of
shared accesses from multiple nodes is essential to limit the
traffic on the interconnect and get top performance.
That is true only if the load is symmetric.
Which is usually true of an HPC workload.
Post by Ingo Molnar
Post by Christoph Lameter
I guess, though, that in a non-HPC environment, where you
are not interested in one specific load running at top
speed, varying contention on the interconnect and memory
buses is acceptable. But this means that HPC loads cannot
be auto-tuned.
I'm not against improving these workloads (at all) - I just
pointed out that interleaving isn't necessarily the best
placement strategy for 'large' workloads.
Depends on what you mean by "large" workloads. If it is a typical large
HPC workload with data structures distributed over the nodes, then
spreading those data structures over all nodes is the best placement
strategy.