Discussion: [PATCH 3/6] mm: Defer TLB flush after unmap as long as possible
Mel Gorman
2015-04-21 10:50:02 UTC
If a PTE is unmapped and it's dirty then it was writable recently. Due
to deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. Prior
to this patch, such PFNs were simply flushed immediately. In this patch,
the caller is informed that such entries potentially exist and it's up to
the caller to flush before pages are freed or IO can start.
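As a rough sketch of what this asks of callers (illustrative only; the vmscan
hunk below is the authoritative version), the reclaim path is expected to
remember the new return value and flush once before the batch of pages is
released:

	switch (try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH)) {
	case SWAP_SUCCESS_CACHED:
		/* A writable TLB entry may exist, flush before free or IO */
		tlb_flush_required = true;
		/* fall through */
	case SWAP_SUCCESS:
		break;	/* try to free the page below */
	/* SWAP_FAIL, SWAP_AGAIN, SWAP_MLOCK cases omitted in this sketch */
	}
	...
	if (tlb_flush_required)
		try_to_unmap_flush();	/* before free_hot_cold_page_list() */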

Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/rmap.h | 10 ++++++----
mm/rmap.c | 55 ++++++++++++++++++++++++++++++++++++++--------------
mm/vmscan.c | 9 ++++++++-
3 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8d23914b219e..5bbaec19cb21 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -290,9 +290,11 @@ static inline int page_mkclean(struct page *page)
/*
* Return values of try_to_unmap
*/
-#define SWAP_SUCCESS 0
-#define SWAP_AGAIN 1
-#define SWAP_FAIL 2
-#define SWAP_MLOCK 3
+#define SWAP_SUCCESS 0
+#define SWAP_SUCCESS_CACHED 1
+#define SWAP_AGAIN 2
+#define SWAP_AGAIN_CACHED 3
+#define SWAP_FAIL 4
+#define SWAP_MLOCK 5

#endif /* _LINUX_RMAP_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index c5badb6c72c9..dcf1df16bf4d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1262,6 +1262,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = SWAP_AGAIN;
bool deferred;
+ bool dirty_cached = false;
enum ttu_flags flags = (enum ttu_flags)arg;

pte = page_check_address(page, mm, address, &ptl, 0);
@@ -1309,12 +1310,13 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (pte_dirty(pteval)) {
/*
* If the PTE was dirty then it's best to assume it's writable.
- * The TLB must be flushed before the page is unlocked as IO
- * can start in parallel. Without the flush, writes could
- * happen and data be potentially lost.
+ * Inform the caller that it is possible there is a writable
+ * cached TLB entry. It is the responsibility of the caller
+ * to flush the TLB before the page is freed or any IO is
+ * initiated.
*/
if (deferred)
- flush_tlb_page(vma, address);
+ dirty_cached = true;

set_page_dirty(page);
}
@@ -1388,6 +1390,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
page_remove_rmap(page);
page_cache_release(page);

+ if (dirty_cached)
+ ret = SWAP_AGAIN_CACHED;
+
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
@@ -1450,10 +1455,11 @@ static int page_not_mapped(struct page *page)
* page, used in the pageout path. Caller must hold the page lock.
* Return values are:
*
- * SWAP_SUCCESS - we succeeded in removing all mappings
- * SWAP_AGAIN - we missed a mapping, try again later
- * SWAP_FAIL - the page is unswappable
- * SWAP_MLOCK - page is mlocked.
+ * SWAP_SUCCESS - we succeeded in removing all mappings
+ * SWAP_SUCCESS_CACHED - Like SWAP_SUCCESS but a writable TLB entry may exist
+ * SWAP_AGAIN - we missed a mapping, try again later
+ * SWAP_FAIL - the page is unswappable
+ * SWAP_MLOCK - page is mlocked.
*/
int try_to_unmap(struct page *page, enum ttu_flags flags)
{
@@ -1481,7 +1487,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
ret = rmap_walk(page, &rwc);

if (ret != SWAP_MLOCK && !page_mapped(page))
- ret = SWAP_SUCCESS;
+ ret = (ret == SWAP_AGAIN_CACHED) ? SWAP_SUCCESS_CACHED : SWAP_SUCCESS;
+
return ret;
}

@@ -1577,15 +1584,24 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
+ int this_ret;

if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
continue;

- ret = rwc->rmap_one(page, vma, address, rwc->arg);
- if (ret != SWAP_AGAIN)
+ this_ret = rwc->rmap_one(page, vma, address, rwc->arg);
+ if (this_ret != SWAP_AGAIN && this_ret != SWAP_AGAIN_CACHED) {
+ ret = this_ret;
break;
- if (rwc->done && rwc->done(page))
+ }
+ if (rwc->done && rwc->done(page)) {
+ ret = this_ret;
break;
+ }
+
+ /* Remember if there is possibly a writable TLB entry */
+ if (this_ret == SWAP_AGAIN_CACHED)
+ ret = SWAP_AGAIN_CACHED;
}
anon_vma_unlock_read(anon_vma);
return ret;
@@ -1626,15 +1642,24 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
+ int this_ret;

if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
continue;

- ret = rwc->rmap_one(page, vma, address, rwc->arg);
- if (ret != SWAP_AGAIN)
+ this_ret = rwc->rmap_one(page, vma, address, rwc->arg);
+ if (this_ret != SWAP_AGAIN && this_ret != SWAP_AGAIN_CACHED) {
+ ret = this_ret;
goto done;
- if (rwc->done && rwc->done(page))
+ }
+ if (rwc->done && rwc->done(page)) {
+ ret = this_ret;
goto done;
+ }
+
+ /* Remember if there is possibly a writable TLB entry */
+ if (this_ret == SWAP_AGAIN_CACHED)
+ ret = SWAP_AGAIN_CACHED;
}

done:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 12ec298087b6..0ad3f435afdd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -860,6 +860,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_reclaimed = 0;
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
+ bool tlb_flush_required = false;

cond_resched();

@@ -1032,6 +1033,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
case SWAP_MLOCK:
goto cull_mlocked;
+ case SWAP_SUCCESS_CACHED:
+ /* Must flush before free, fall through */
+ tlb_flush_required = true;
case SWAP_SUCCESS:
; /* try to free the page below */
}
@@ -1176,7 +1180,8 @@ keep:
}

mem_cgroup_uncharge_list(&free_pages);
- try_to_unmap_flush();
+ if (tlb_flush_required)
+ try_to_unmap_flush();
free_hot_cold_page_list(&free_pages, true);

list_splice(&ret_pages, page_list);
@@ -1213,6 +1218,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
ret = shrink_page_list(&clean_pages, zone, &sc,
TTU_UNMAP|TTU_IGNORE_ACCESS,
&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+ try_to_unmap_flush();
list_splice(&clean_pages, page_list);
mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
return ret;
@@ -2225,6 +2231,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
scan_adjusted = true;
}
blk_finish_plug(&plug);
+ try_to_unmap_flush();
sc->nr_reclaimed += nr_reclaimed;

/*
--
2.1.2

Mel Gorman
2015-04-21 10:50:02 UTC
After each migration attempt, putback_lru_page() is used to drop the last
reference to the page. This is fine but it prevents the batching of TLB
flushes because the flush must happen before a free. This patch drops all
the migrated pages at once in preparation for batching the TLB flush.
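In sketch form (simplified from the hunks below), unmap_and_move() now adds the
page to a caller-provided list instead of calling putback_lru_page() directly,
and migrate_pages() drains that list in one place once all attempts are done:

	LIST_HEAD(putback_list);
	...
	rc = unmap_and_move(get_new_page, put_new_page, private,
			    page, pass > 2, mode, &putback_list);
	...
	/* drain once at the end; a later patch inserts the TLB flush first */
	while (!list_empty(&putback_list)) {
		page = list_entry(putback_list.prev, struct page, lru);
		list_del(&page->lru);
		putback_lru_page(page);		/* may drop the last reference */
	}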

Signed-off-by: Mel Gorman <***@suse.de>
---
mm/migrate.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 85e042686031..82c98c5aa6ed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -906,7 +906,7 @@ out:
*/
static int unmap_and_move(new_page_t get_new_page, free_page_t put_new_page,
unsigned long private, struct page *page, int force,
- enum migrate_mode mode)
+ enum migrate_mode mode, struct list_head *putback_list)
{
int rc = 0;
int *result = NULL;
@@ -937,7 +937,7 @@ out:
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
- putback_lru_page(page);
+ list_add(&page->lru, putback_list);
}

/*
@@ -1086,6 +1086,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
enum migrate_mode mode, int reason)
{
+ LIST_HEAD(putback_list);
int retry = 1;
int nr_failed = 0;
int nr_succeeded = 0;
@@ -1110,7 +1111,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
pass > 2, mode);
else
rc = unmap_and_move(get_new_page, put_new_page,
- private, page, pass > 2, mode);
+ private, page, pass > 2, mode,
+ &putback_list);

switch(rc) {
case -ENOMEM:
@@ -1135,6 +1137,12 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
}
rc = nr_failed + retry;
out:
+ while (!list_empty(&putback_list)) {
+ page = list_entry(putback_list.prev, struct page, lru);
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+
if (nr_succeeded)
count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
if (nr_failed)
--
2.1.2

Mel Gorman
2015-04-21 10:50:03 UTC
The patch "mm: Send a single IPI to TLB flush multiple pages when unmapping"
would batch 32 pages before sending an IPI. This patch increases the size of
the data structure to hold a pages worth of PFNs before sending an IPI. This
is a trade-off between memory usage and reducing IPIS sent. In the ideal
case where multiple processes are reading large mapped files, this patch
reduces interrupts/second from roughly 180K per second to 60K per second.
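Illustrative arithmetic only (the exact figure depends on PAGE_SIZE and the
configured NR_CPUS; 4K pages and a 64-byte cpumask are assumed here):

	struct tlbflush_unmap_batch {
		struct cpumask cpumask;		/* 64 bytes with NR_CPUS=512 */
		unsigned long nr_pages;		/*  8 bytes */
		unsigned long pfns[0];		/* flexible array fills the page */
	};

	BATCH_TLBFLUSH_SIZE = (4096 - 72) / 8 = 503 PFNs per IPI, up from 32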

Signed-off-by: Mel Gorman <***@suse.de>
Reviewed-by: Rik van Riel <***@redhat.com>
---
include/linux/sched.h | 9 +++++----
kernel/fork.c | 6 ++++--
mm/vmscan.c | 5 +++--
3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c09db02fe78..3e4d3f545005 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,16 +1275,17 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

-/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
-#define BATCH_TLBFLUSH_SIZE 32UL
-
/* Track pages that require TLB flushes */
struct tlbflush_unmap_batch {
struct cpumask cpumask;
unsigned long nr_pages;
- unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+ unsigned long pfns[0];
};

+/* alloc_tlb_ubc() always allocates a page */
+#define BATCH_TLBFLUSH_SIZE \
+ ((PAGE_SIZE - sizeof(struct tlbflush_unmap_batch)) / sizeof(unsigned long))
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
diff --git a/kernel/fork.c b/kernel/fork.c
index 86c872fec9fb..f260663f209a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -247,8 +247,10 @@ void __put_task_struct(struct task_struct *tsk)
put_signal_struct(tsk->signal);

#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
- kfree(tsk->tlb_ubc);
- tsk->tlb_ubc = NULL;
+ if (tsk->tlb_ubc) {
+ free_page((unsigned long)tsk->tlb_ubc);
+ tsk->tlb_ubc = NULL;
+ }
#endif

if (!profile_handoff_task(tsk))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e39e7c4bf548..080ba929049c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2775,14 +2775,15 @@ out:
/*
* Allocate the control structure for batch TLB flushing. An allocation
* failure is harmless as the reclaimer will send IPIs where necessary.
+ * If the allocation size changes then update BATCH_TLBFLUSH_SIZE.
*/
void alloc_tlb_ubc(void)
{
if (current->tlb_ubc)
return;

- current->tlb_ubc = kmalloc(sizeof(struct tlbflush_unmap_batch),
- GFP_ATOMIC | __GFP_NOWARN);
+ current->tlb_ubc = (struct tlbflush_unmap_batch *)
+ __get_free_page(GFP_KERNEL | __GFP_NOWARN);
if (!current->tlb_ubc)
return;
--
2.1.2

Mel Gorman
2015-04-21 10:50:02 UTC
An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines get
larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI to flush the list of PFNs. A new TLB flush helper is required for
this and one is added for x86. Other architectures will need to decide if
batching like this is both safe and worth the memory overhead. Specifically
the requirement is:

If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that page from a CPU
with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.
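At a high level the flow looks like this (sketch, condensed from the rmap and
vmscan hunks below):

	/* in try_to_unmap_one(), when should_defer_flush() says it is safe */
	pteval = ptep_get_and_clear(mm, address, pte);	/* no flush here */
	set_tlb_ubc_flush_pending(mm, page);		/* record PFN and CPUs */

	/* later, in the reclaim path, before freed pages are released */
	try_to_unmap_flush();		/* one IPI flushes the whole batch */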

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
4.0.0 4.0.0
vanilla batchunmap-v3
lru-file-mmap-read-elapsed 159.76 ( 0.00%) 118.92 ( 25.56%)

4.0.0 4.0.0
vanilla batchunmap-v3
User 567.53 609.20
System 5949.65 4117.68
Elapsed 161.08 120.21

This shows that the readers completed 25% faster with 30% less CPU time. From
vmstats, it is known that the vanilla kernel was interrupted roughly 900K
times per second during the steady phase of the test and the patched kernel
was interrupted roughly 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
4.0.0 4.0.0
vanilla batchunmap-v2
Ops lru-file-mmap-read-elapsed 22.50 ( 0.00%) 19.82 ( 11.91%)

4.0.0 4.0.0
vanilla batchunmap-v3
User 33.64 32.14
System 36.22 33.68
Elapsed 24.11 21.47

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads that have no memory pressure or
that have relatively few mapped pages.

Signed-off-by: Mel Gorman <***@suse.de>
Reviewed-by: Rik van Riel <***@redhat.com>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tlbflush.h | 2 +
include/linux/init_task.h | 8 ++++
include/linux/rmap.h | 3 ++
include/linux/sched.h | 14 ++++++
init/Kconfig | 8 ++++
kernel/fork.c | 5 ++
kernel/sched/core.c | 3 ++
mm/internal.h | 11 +++++
mm/rmap.c | 104 +++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 31 +++++++++++-
11 files changed, 187 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..290844263218 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86
select ARCH_MIGHT_HAVE_PC_SERIO
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
+ select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..96a27051a70a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
* and page-granular flushes are available only on i486 and up.
*/

+#define flush_local_tlb_addr(addr) __flush_tlb_one(addr)
+
#ifndef CONFIG_SMP

/* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
# define INIT_NUMA_BALANCING(tsk)
#endif

+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \
+ .tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
#ifdef CONFIG_KASAN
# define INIT_KASAN(tsk) \
.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
INIT_RT_MUTEXES(tsk) \
INIT_VTIME(tsk) \
INIT_NUMA_BALANCING(tsk) \
+ INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk) \
INIT_KASAN(tsk) \
}

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4c559a45dc8..8d23914b219e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+ TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible
+ * and caller guarantees they will
+ * do a final flush if necessary */
};

#ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65770d6..5c09db02fe78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,6 +1275,16 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+ struct cpumask cpumask;
+ unsigned long nr_pages;
+ unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1634,6 +1644,10 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+ struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
struct rcu_head rcu;

/*
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d4261b..f519fbb6ac35 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -889,6 +889,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING
bool

#
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
+config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+ bool
+
+#
# For architectures that know their GCC __int128 support is sound
#
config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..86c872fec9fb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -246,6 +246,11 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);

+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+ kfree(tsk->tlb_ubc);
+ tsk->tlb_ubc = NULL;
+#endif
+
if (!profile_handoff_task(tsk))
free_task(tsk);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62671f53202a..9836a28d001b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1823,6 +1823,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)

p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+ p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
}

#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..35aba439c275 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -431,4 +431,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
#define ALLOC_FAIR 0x100 /* fair zone allocation */

+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index c161a14b6a8f..c5badb6c72c9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@

#include <asm/tlbflush.h>

+#include <trace/events/tlb.h>
+
#include "internal.h"

static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,79 @@ vma_address(struct page *page, struct vm_area_struct *vma)
return address;
}

+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = data;
+ int i;
+
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ for (i = 0; i < tlb_ubc->nr_pages; i++)
+ flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
+}
+
+/*
+ * Flush any pending IPIs. It is important that if a PTE was dirty at the time
+ * it was unmapped that the flush occurs before any IO is initiated on the page
+ * or before the page is freed, to prevent lost writes or data leakage respectively.
+ */
+void try_to_unmap_flush(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+ if (!tlb_ubc || !tlb_ubc->nr_pages)
+ return;
+
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
+ smp_call_function_many(&tlb_ubc->cpumask, percpu_flush_tlb_batch_pages,
+ (void *)tlb_ubc, true);
+ cpumask_clear(&tlb_ubc->cpumask);
+ tlb_ubc->nr_pages = 0;
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+ cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+ tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+ tlb_ubc->nr_pages++;
+
+ if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+ try_to_unmap_flush();
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ bool should_defer = false;
+
+ if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+ return false;
+
+ /* If remote CPUs need to be flushed then defer and batch the flush */
+ if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+ should_defer = true;
+ put_cpu();
+
+ return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+ struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+ return false;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
/*
* At what user virtual address is page expected in vma?
* Caller should check the page is actually part of the vma.
@@ -1186,6 +1261,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pte_t pteval;
spinlock_t *ptl;
int ret = SWAP_AGAIN;
+ bool deferred;
enum ttu_flags flags = (enum ttu_flags)arg;

pte = page_check_address(page, mm, address, &ptl, 0);
@@ -1213,11 +1289,35 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+ deferred = should_defer_flush(mm, flags);
+ if (deferred) {
+ /*
+ * We clear the PTE but do not flush so potentially a remote
+ * CPU could still be writing to the page. If the entry was
+ * previously clean then the architecture must guarantee that
+ * a clear->dirty transition on a cached TLB entry is written
+ * through and traps if the PTE is unmapped. If the entry is
+ * writable then it's handled below.
+ */
+ pteval = ptep_get_and_clear(mm, address, pte);
+ set_tlb_ubc_flush_pending(mm, page);
+ } else {
+ pteval = ptep_clear_flush(vma, address, pte);
+ }

/* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
+ if (pte_dirty(pteval)) {
+ /*
+ * If the PTE was dirty then it's best to assume it's writable.
+ * The TLB must be flushed before the page is unlocked as IO
+ * can start in parallel. Without the flush, writes could
+ * happen and data be potentially lost.
+ */
+ if (deferred)
+ flush_tlb_page(vma, address);
+
set_page_dirty(page);
+ }

/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..12ec298087b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, ttu_flags)) {
+ switch (try_to_unmap(page,
+ ttu_flags|TTU_BATCH_FLUSH)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1175,6 +1176,7 @@ keep:
}

mem_cgroup_uncharge_list(&free_pages);
+ try_to_unmap_flush();
free_hot_cold_page_list(&free_pages, true);

list_splice(&ret_pages, page_list);
@@ -2762,6 +2764,30 @@ out:
return false;
}

+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ */
+static inline void alloc_tlb_ubc(void)
+{
+ if (current->tlb_ubc)
+ return;
+
+ current->tlb_ubc = kmalloc(sizeof(struct tlbflush_unmap_batch),
+ GFP_ATOMIC | __GFP_NOWARN);
+ if (!current->tlb_ubc)
+ return;
+
+ cpumask_clear(&current->tlb_ubc->cpumask);
+ current->tlb_ubc->nr_pages = 0;
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
@@ -2789,6 +2815,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
sc.may_writepage,
gfp_mask);

+ alloc_tlb_ubc();
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -3364,6 +3391,8 @@ static int kswapd(void *p)

lockdep_set_current_reclaim_state(GFP_KERNEL);

+ alloc_tlb_ubc();
+
if (!cpumask_empty(cpumask))
set_cpus_allowed_ptr(tsk, cpumask);
current->reclaim_state = &reclaim_state;
--
2.1.2

Mel Gorman
2015-04-21 10:50:03 UTC
Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
page migration to also batch any necessary flushes. MMtests has a THP scale
microbenchmark that deliberately fragments memory and then allocates THPs
to stress compaction. It's not a page reclaim benchmark and recent kernels
avoid excessive compaction but this patch reduced system CPU usage

4.0.0 4.0.0
baseline batchmigrate-v1
User 970.70 1012.24
System 2067.48 1840.00
Elapsed 1520.63 1529.66

Note that this particular workload was not TLB flush intensive; interrupts
peaked during the compaction phase. The 4.0 kernel peaked at 345K
interrupts/second, the kernel that batches reclaim TLB entries peaked at
13K interrupts/second and this patch peaked at 10K interrupts/second.
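The ordering constraint being preserved is, in sketch form (condensed from the
__unmap_and_move() hunk below), that any potentially writable stale TLB entry
is flushed before the page contents are copied to the new page:

	if (page_mapped(page)) {
		int ttu_retval = try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
					      TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH);

		/* a writable entry may exist; flush before copying the data */
		if (ttu_retval == SWAP_SUCCESS_CACHED)
			try_to_unmap_flush();
		page_was_mapped = 1;
	}
	/* ... move_to_new_page() then copies the data to newpage ... */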

Signed-off-by: Mel Gorman <***@suse.de>
Reviewed-by: Rik van Riel <***@redhat.com>
---
mm/internal.h | 5 +++++
mm/migrate.c | 13 +++++++++++--
mm/vmscan.c | 6 +-----
3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 35aba439c275..c2481574b41a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -436,10 +436,15 @@ struct tlbflush_unmap_batch;

#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
void try_to_unmap_flush(void);
+void alloc_tlb_ubc(void);
#else
static inline void try_to_unmap_flush(void)
{
}

+static inline void alloc_tlb_ubc(void)
+{
+}
+
#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
#endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 82c98c5aa6ed..4a1793dce6e3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -878,8 +878,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,

/* Establish migration ptes or remove ptes */
if (page_mapped(page)) {
- try_to_unmap(page,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ int ttu_retval = try_to_unmap(page,
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH);
+
+ /* Must flush before copy in case of a writable TLB entry */
+ if (ttu_retval == SWAP_SUCCESS_CACHED)
+ try_to_unmap_flush();
page_was_mapped = 1;
}

@@ -1099,6 +1103,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
if (!swapwrite)
current->flags |= PF_SWAPWRITE;

+ alloc_tlb_ubc();
+
for(pass = 0; pass < 10 && retry; pass++) {
retry = 0;

@@ -1137,6 +1143,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
}
rc = nr_failed + retry;
out:
+ /* Must flush before any potential frees */
+ try_to_unmap_flush();
+
while (!list_empty(&putback_list)) {
page = list_entry(putback_list.prev, struct page, lru);
list_del(&page->lru);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0ad3f435afdd..e39e7c4bf548 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2776,7 +2776,7 @@ out:
* Allocate the control structure for batch TLB flushing. An allocation
* failure is harmless as the reclaimer will send IPIs where necessary.
*/
-static inline void alloc_tlb_ubc(void)
+void alloc_tlb_ubc(void)
{
if (current->tlb_ubc)
return;
@@ -2789,10 +2789,6 @@ static inline void alloc_tlb_ubc(void)
cpumask_clear(&current->tlb_ubc->cpumask);
current->tlb_ubc->nr_pages = 0;
}
-#else
-static inline void alloc_tlb_ubc(void)
-{
-}
#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
--
2.1.2

Rik van Riel
2015-04-21 20:40:02 UTC
Post by Mel Gorman
If a PTE is unmapped and it's dirty then it was writable recently. Due
to deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. Prior
to this patch, such PFNs were simply flushed immediately. In this patch,
the caller is informed that such entries potentially exist and it's up to
the caller to flush before pages are freed or IO can start.
@@ -1450,10 +1455,11 @@ static int page_not_mapped(struct page *page)
* page, used in the pageout path. Caller must hold the page lock.
*
- * SWAP_SUCCESS - we succeeded in removing all mappings
- * SWAP_AGAIN - we missed a mapping, try again later
- * SWAP_FAIL - the page is unswappable
- * SWAP_MLOCK - page is mlocked.
+ * SWAP_SUCCESS - we succeeded in removing all mappings
+ * SWAP_SUCCESS_CACHED - Like SWAP_SUCCESS but a writable TLB entry may exist
+ * SWAP_AGAIN - we missed a mapping, try again later
+ * SWAP_FAIL - the page is unswappable
+ * SWAP_MLOCK - page is mlocked.
*/
int try_to_unmap(struct page *page, enum ttu_flags flags)
{
@@ -1481,7 +1487,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
ret = rmap_walk(page, &rwc);
if (ret != SWAP_MLOCK && !page_mapped(page))
- ret = SWAP_SUCCESS;
+ ret = (ret == SWAP_AGAIN_CACHED) ? SWAP_SUCCESS_CACHED : SWAP_SUCCESS;
+
return ret;
}
This wants a big fat comment explaining why SWAP_AGAIN_CACHED is turned
into SWAP_SUCCESS_CACHED.

I think I understand why this is happening, but I am not sure how to
explain it...
Post by Mel Gorman
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 12ec298087b6..0ad3f435afdd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -860,6 +860,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_reclaimed = 0;
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
+ bool tlb_flush_required = false;
cond_resched();
@@ -1032,6 +1033,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
goto cull_mlocked;
+ /* Must flush before free, fall through */
+ tlb_flush_required = true;
; /* try to free the page below */
}
}
mem_cgroup_uncharge_list(&free_pages);
- try_to_unmap_flush();
+ if (tlb_flush_required)
+ try_to_unmap_flush();
free_hot_cold_page_list(&free_pages, true);
Don't we have to flush the TLB before calling pageout() on the page?

In other words, we would also have to batch up calls to pageout(), if
we want to do batched TLB flushing.

This could be accomplished by putting the SWAP_SUCCESS_CACHED pages on
a special list, instead of calling pageout() on them immediately, and
then calling pageout() on all the pages on that list after the batch
flush.
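Something along these lines, purely as an illustration of the idea (the
deferred_io_pages list, unmap_ret and the exact flow are hypothetical, not
from the series):

	/* in shrink_page_list(): don't start IO yet on such pages */
	if (unmap_ret == SWAP_SUCCESS_CACHED)
		list_add(&page->lru, &deferred_io_pages);
	...
	try_to_unmap_flush();			/* one batched IPI */
	list_for_each_entry(page, &deferred_io_pages, lru)
		pageout(page, page_mapping(page), sc);	/* IO after the flush */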
--
All rights reversed
Mel Gorman
2015-04-21 21:20:02 UTC
Post by Rik van Riel
Post by Mel Gorman
If a PTE is unmapped and it's dirty then it was writable recently. Due
to deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. Prior
to this patch, such PFNs were simply flushed immediately. In this patch,
the caller is informed that such entries potentially exist and it's up to
the caller to flush before pages are freed or IO can start.
@@ -1450,10 +1455,11 @@ static int page_not_mapped(struct page *page)
* page, used in the pageout path. Caller must hold the page lock.
*
- * SWAP_SUCCESS - we succeeded in removing all mappings
- * SWAP_AGAIN - we missed a mapping, try again later
- * SWAP_FAIL - the page is unswappable
- * SWAP_MLOCK - page is mlocked.
+ * SWAP_SUCCESS - we succeeded in removing all mappings
+ * SWAP_SUCCESS_CACHED - Like SWAP_SUCCESS but a writable TLB entry may exist
+ * SWAP_AGAIN - we missed a mapping, try again later
+ * SWAP_FAIL - the page is unswappable
+ * SWAP_MLOCK - page is mlocked.
*/
int try_to_unmap(struct page *page, enum ttu_flags flags)
{
@@ -1481,7 +1487,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
ret = rmap_walk(page, &rwc);
if (ret != SWAP_MLOCK && !page_mapped(page))
- ret = SWAP_SUCCESS;
+ ret = (ret == SWAP_AGAIN_CACHED) ? SWAP_SUCCESS_CACHED : SWAP_SUCCESS;
+
return ret;
}
This wants a big fat comment explaining why SWAP_AGAIN_CACHED is turned
into SWAP_SUCCESS_CACHED.
I'll add something in V4. SWAP_AGAIN_CACHED indicates to rmap_walk that
it should continue the walk but that a writable cached TLB entry may have
been encountered. SWAP_SUCCESS is what callers of try_to_unmap() expect so
SWAP_SUCCESS_CACHED is the equivalent.
Post by Rik van Riel
I think I understand why this is happening, but I am not sure how to
explain it...
Post by Mel Gorman
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 12ec298087b6..0ad3f435afdd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -860,6 +860,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_reclaimed = 0;
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
+ bool tlb_flush_required = false;
cond_resched();
@@ -1032,6 +1033,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
goto cull_mlocked;
+ /* Must flush before free, fall through */
+ tlb_flush_required = true;
; /* try to free the page below */
}
}
mem_cgroup_uncharge_list(&free_pages);
- try_to_unmap_flush();
+ if (tlb_flush_required)
+ try_to_unmap_flush();
free_hot_cold_page_list(&free_pages, true);
Don't we have to flush the TLB before calling pageout() on the page?
Not any more. It got removed in patch 2 and I forgot to reintroduce it
with a tlb_flush_required check here. Thanks for that.
Post by Rik van Riel
In other words, we would also have to batch up calls to pageout(), if
we want to do batched TLB flushing.
This could be accomplished by putting the SWAP_SUCCESS_CACHED pages on
a special list, instead of calling pageout() on them immediately, and
then calling pageout() on all the pages on that list after the batch
flush.
True. We had discussed something like that on IRC. It's a good idea but
a separate patch because it's less clear-cut. We're taking a partial pass
through the list in an attempt to reduce IPIs.
--
Mel Gorman
SUSE Labs
Vlastimil Babka
2015-04-24 14:50:03 UTC
Changelog since V2
o Ensure TLBs are flushed before pages are freed (mel)
I admit not reading all the patches thoroughly, but doesn't this change
of ordering mean that you no longer need the architectural guarantee
discussed in patch 2? What's the harm if some other CPU (because the CPU
didn't receive an IPI yet) manages to write to a page that you have
unmapped in the page tables *but not yet freed*?

Vlastimil
Changelog since V1
o Structure and variable renaming (hughd)
o Defer flushes even if the unmapping process is sleeping (hughd)
o Alternative sizing of structure (peterz)
o Use GFP_KERNEL instead of GFP_ATOMIC, PF_MEMALLOC protects (andi)
o Immediately flush dirty PTEs to avoid corruption (mel)
o Further clarify docs on the required arch guarantees (mel)
When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.
There already is a window between when a page is unmapped and when it is
TLB flushed. This series simply increases the window so multiple pages can
be flushed using a single IPI.
Patch 1 simply made the rest of the series easier to write as ftrace
could identify all the senders of TLB flush IPIs.
Patch 2 collects a list of PFNs and sends one IPI to flush them all
Patch 3 uses more memory to further defer when the IPI gets sent
Patch 4 uses the same infrastructure as patch 2 to batch IPIs sent during
page migration.
The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/mm/tlb.c | 1 +
include/linux/init_task.h | 8 +++
include/linux/mm_types.h | 1 +
include/linux/rmap.h | 13 ++--
include/linux/sched.h | 15 ++++
include/trace/events/tlb.h | 3 +-
init/Kconfig | 8 +++
kernel/fork.c | 7 ++
kernel/sched/core.c | 3 +
mm/internal.h | 16 +++++
mm/migrate.c | 27 +++++--
mm/rmap.c | 151 ++++++++++++++++++++++++++++++++++++----
mm/vmscan.c | 35 +++++++++-
15 files changed, 267 insertions(+), 24 deletions(-)
Mel Gorman
2015-04-24 15:20:02 UTC
Post by Vlastimil Babka
Changelog since V2
o Ensure TLBs are flushed before pages are freed (mel)
I admit not reading all the patches thoroughly, but doesn't this
change of ordering mean that you no longer need the architectural
guarantee discussed in patch 2?
No. If we unmap a page to write it to disk then we cannot allow a CPU to
write to the physical page being written through a cached entry.
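The window, roughly (illustrative):

	unmap (PTE cleared, no IPI yet) -> pageout()/IO in flight -> ...
		-> try_to_unmap_flush() -> page freed

Any write from another CPU through a still-cached entry before that flush
would corrupt the data being written out, which is why a write through a
clean cached entry must trap as described in patch 2.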
--
Mel Gorman
SUSE Labs