Discussion:
[PATCH 0/6] PM / Hibernate: Memory bitmap scalability improvements
Joerg Roedel
2014-07-18 12:10:02 UTC
Hi,

here is a patch set to improve the scalability of the memory
bitmap implementation used for hibernation. The current
implementation does not scale well to machines with several
TB of memory. A resume on those machines may cause soft
lockups to be reported.

These patches improve the data structure by adding a radix
tree to the linked list structure, improving random access
performance from O(n) to O(log_b(n)), where b depends on the
architecture (b=512 on amd64, b=1024 on i386).
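
(For a feel for these numbers, assuming 4KB pages: one leaf
bitmap page covers 32768 pfns = 128MB of RAM, so a 12TB machine
has 12TB / 128MB = 98304 leaf pages. With b=512 a random lookup
then walks ceil(log_512(98304)) = 2 inner nodes instead of
scanning a 98304-entry list.)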

A test on a 12TB machine showed an improvement in resume
time from 76s with the old implementation to 2.4s with the
radix tree and the improved swsusp_free function. See below
for details of this test.

Patches 1-3 add the radix tree while keeping the
existing memory bitmap implementation in place, and add
code to compare the results of both implementations. This
was used during development to make sure both data
structures return the same results.

Patch 4 re-implements the swsusp_free() function to
iterate only over the bits set in the bitmaps instead of
over all pfns. This proved to scale better on large
memory machines.

Patch 5 removes the old memory bitmap implementation now
that the radix tree is in place and working correctly.

The last patch, patch 6, touches the soft lockup watchdog
in rtree_next_node(). This is necessary because the worst
case performance (all bits set in the forbidden_pages_map
and free_pages_map) is the same as with the old
implementation and may still cause soft lockups.

The code was tested on 32-bit and 64-bit x86 and showed
no issues there.

Below is an example test that shows the performance
improvement on a 12TB machine. First the test with the old
memory bitmap:

# time perf record /usr/sbin/resume $sdev
resume: libgcrypt version: 1.5.0
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 2.882 MB perf.data (~125898 samples) ]

real 1m16.043s
user 0m0.016s
sys 0m0.312s
# perf report --stdio |head -50
# Events: 75K cycles
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ........................................
#
56.16% resume [kernel.kallsyms] [k] memory_bm_test_bit
19.35% resume [kernel.kallsyms] [k] swsusp_free
14.90% resume [kernel.kallsyms] [k] memory_bm_find_bit
7.28% resume [kernel.kallsyms] [k] swsusp_page_is_forbidden

And here is the same test on the same machine with these
patches applied:

# time perf record /usr/sbin/resume $sdev
resume: libgcrypt version: 1.5.0
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.039 MB perf.data (~1716 samples) ]

real 0m2.376s
user 0m0.020s
sys 0m0.408s

# perf report --stdio |head -50
# Events: 762 cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .........................
#
34.78% resume [kernel.kallsyms] [k] find_next_bit
27.03% resume [kernel.kallsyms] [k] clear_page_c_e
9.70% resume [kernel.kallsyms] [k] mark_nosave_pages
3.92% resume [kernel.kallsyms] [k] alloc_rtree_node
2.38% resume [kernel.kallsyms] [k] get_image_page

As can be seen from these results, these patches improve
the scalability significantly. Please review; any comments
are appreciated.

Thanks,

Joerg

Joerg Roedel (6):
PM / Hibernate: Create a Radix-Tree to store memory bitmap
PM / Hibernate: Add memory_rtree_find_bit function
PM / Hibernate: Implement position keeping in radix tree
PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()
PM / Hibernate: Remove the old memory-bitmap implementation
PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node

kernel/power/snapshot.c | 510 ++++++++++++++++++++++++++++++++++--------------
1 file changed, 368 insertions(+), 142 deletions(-)
--
1.9.1

Joerg Roedel
2014-07-18 12:10:02 UTC
From: Joerg Roedel <***@suse.de>

When a memory bitmap is fully populated on a large memory
machine (several TB of RAM) it can take more than a minute
to walk through all bits. This causes the soft lockup
detector on these machines to report warnings.

Avoid this by touching the soft lockup watchdog in the
memory bitmap walking code.

Signed-off-by: Joerg Roedel <***@suse.de>
---
kernel/power/snapshot.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 750b32f..c5d19a9 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -755,6 +755,7 @@ static bool rtree_next_node(struct memory_bitmap *bm)
if (&bm->cur.node->list != &bm->cur.zone->leaves) {
bm->cur.node_pfn += BM_BITS_PER_BLOCK;
bm->cur.node_bit = 0;
+ touch_softlockup_watchdog();
return true;
}
--
1.9.1

Joerg Roedel
2014-07-18 12:10:03 UTC
From: Joerg Roedel <***@suse.de>

Add code to remember the last position that was requested
in the radix tree. Use it as a cache for faster linear
walking of the bitmap in the memory_bm_rtree_next_pfn()
function, which is also added by this patch.

Signed-off-by: Joerg Roedel <***@suse.de>
---
kernel/power/snapshot.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 100 insertions(+), 2 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 9a3b723..2ccbf77 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -309,6 +309,11 @@ struct mem_zone_bm_rtree {
struct bm_position {
struct bm_block *block;
int bit;
+
+ struct mem_zone_bm_rtree *zone;
+ struct rtree_node *node;
+ unsigned long node_pfn;
+ int node_bit;
};

struct memory_bitmap {
@@ -488,6 +493,13 @@ static void memory_bm_position_reset(struct memory_bitmap *bm)
{
bm->cur.block = list_entry(bm->blocks.next, struct bm_block, hook);
bm->cur.bit = 0;
+
+ bm->cur.zone = list_entry(bm->zones.next, struct mem_zone_bm_rtree,
+ list);
+ bm->cur.node = list_entry(bm->cur.zone->leaves.next,
+ struct rtree_node, list);
+ bm->cur.node_pfn = 0;
+ bm->cur.node_bit = 0;
}

static void memory_bm_free(struct memory_bitmap *bm, int clear_nosave_free);
@@ -735,6 +747,11 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
struct rtree_node *node;
int i, block_nr;

+ zone = bm->cur.zone;
+
+ if (pfn >= zone->start_pfn && pfn < zone->end_pfn)
+ goto zone_found;
+
zone = NULL;

/* Find the right zone */
@@ -748,10 +765,16 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
if (!zone)
return -EFAULT;

+zone_found:
/*
* We have a zone. Now walk the radix tree to find the leaf
* node for our pfn.
*/
+
+ node = bm->cur.node;
+ if (((pfn - zone->start_pfn) & ~BM_BLOCK_MASK) == bm->cur.node_pfn)
+ goto node_found;
+
node = zone->rtree;
block_nr = (pfn - zone->start_pfn) >> BM_BLOCK_SHIFT;

@@ -764,9 +787,15 @@ static int memory_rtree_find_bit(struct memory_bitmap *bm, unsigned long pfn,
node = (struct rtree_node *)node->data[index];
}

+node_found:
+ /* Update last position */
+ bm->cur.zone = zone;
+ bm->cur.node = node;
+ bm->cur.node_pfn = (pfn - zone->start_pfn) & ~BM_BLOCK_MASK;
+
/* Set return values */
- *addr = node->data;
- *bit_nr = (pfn - zone->start_pfn) & BM_BLOCK_MASK;
+ *addr = node->data;
+ *bit_nr = (pfn - zone->start_pfn) & BM_BLOCK_MASK;

return 0;
}
@@ -861,11 +890,16 @@ static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn)
* this function.
*/

+static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm);
+
static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
{
+ unsigned long rtree_pfn;
struct bm_block *bb;
int bit;

+ rtree_pfn = memory_bm_rtree_next_pfn(bm);
+
bb = bm->cur.block;
do {
bit = bm->cur.bit;
@@ -879,13 +913,77 @@ static unsigned long memory_bm_next_pfn(struct memory_bitmap *bm)
} while (&bb->hook != &bm->blocks);

memory_bm_position_reset(bm);
+ WARN_ON_ONCE(rtree_pfn != BM_END_OF_MAP);
return BM_END_OF_MAP;

Return_pfn:
+ WARN_ON_ONCE(bb->start_pfn + bit != rtree_pfn);
bm->cur.bit = bit + 1;
return bb->start_pfn + bit;
}

+/*
+ * rtree_next_node - Jumps to the next leaf node
+ *
+ * Sets the position to the beginning of the next node in the
+ * memory bitmap. This is either the next node in the current
+ * zone's radix tree or the first node in the radix tree of the
+ * next zone.
+ *
+ * Returns true if there is a next node, false otherwise.
+ */
+static bool rtree_next_node(struct memory_bitmap *bm)
+{
+ bm->cur.node = list_entry(bm->cur.node->list.next,
+ struct rtree_node, list);
+ if (&bm->cur.node->list != &bm->cur.zone->leaves) {
+ bm->cur.node_pfn += BM_BITS_PER_BLOCK;
+ bm->cur.node_bit = 0;
+ return true;
+ }
+
+ /* No more nodes, goto next zone */
+ bm->cur.zone = list_entry(bm->cur.zone->list.next,
+ struct mem_zone_bm_rtree, list);
+ if (&bm->cur.zone->list != &bm->zones) {
+ bm->cur.node = list_entry(bm->cur.zone->leaves.next,
+ struct rtree_node, list);
+ bm->cur.node_pfn = 0;
+ bm->cur.node_bit = 0;
+ return true;
+ }
+
+ /* No more zones */
+ return false;
+}
+
+/*
+ * memory_bm_rtree_next_pfn - Find the next set bit
+ *
+ * Starting from the last returned position, this function searches
+ * for the next set bit in the memory bitmap and returns its
+ * number. If no more bits are set, BM_END_OF_MAP is returned.
+ */
+static unsigned long memory_bm_rtree_next_pfn(struct memory_bitmap *bm)
+{
+ unsigned long bits, pfn, pages;
+ int bit;
+
+ do {
+ pages = bm->cur.zone->end_pfn - bm->cur.zone->start_pfn;
+ bits = min(pages - bm->cur.node_pfn, BM_BITS_PER_BLOCK);
+ bit = find_next_bit(bm->cur.node->data, bits,
+ bm->cur.node_bit);
+ if (bit < bits) {
+ pfn = bm->cur.zone->start_pfn + bm->cur.node_pfn + bit;
+ bm->cur.node_bit = bit + 1;
+ return pfn;
+ }
+ } while (rtree_next_node(bm));
+
+ return BM_END_OF_MAP;
+}
+
/**
* This structure represents a range of page frames the contents of which
* should not be saved during the suspend.
--
1.9.1

Joerg Roedel
2014-07-18 12:10:03 UTC
From: Joerg Roedel <***@suse.de>

The existing implementation of swsusp_free() iterates over
all pfns in the system and checks every bit in the two
memory bitmaps.

This doesn't scale very well with large numbers of pfns,
especially when the bitmaps are not populated very densely.
Change the algorithm to iterate over the set bits in the
bitmaps instead, to make it scale better on large memory
configurations.

Also add a memory_bm_clear_current() helper function that
clears the bit for the last position returned from the
memory bitmap.

Signed-off-by: Joerg Roedel <***@suse.de>
---
kernel/power/snapshot.c | 53 +++++++++++++++++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 2ccbf77..af733c3 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -849,6 +849,17 @@ static void memory_bm_clear_bit(struct memory_bitmap *bm, unsigned long pfn)
clear_bit(bit, addr);
}

+static void memory_bm_clear_current(struct memory_bitmap *bm)
+{
+ int bit;
+
+ bit = max(bm->cur.node_bit - 1, 0);
+ clear_bit(bit, bm->cur.node->data);
+
+ bit = max(bm->cur.bit - 1, 0);
+ clear_bit(bit, bm->cur.block->data);
+}
+
static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn)
{
void *addr;
@@ -1492,23 +1503,35 @@ static struct memory_bitmap copy_bm;

void swsusp_free(void)
{
- struct zone *zone;
- unsigned long pfn, max_zone_pfn;
+ unsigned long fb_pfn, fr_pfn;

- for_each_populated_zone(zone) {
- max_zone_pfn = zone_end_pfn(zone);
- for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++)
- if (pfn_valid(pfn)) {
- struct page *page = pfn_to_page(pfn);
-
- if (swsusp_page_is_forbidden(page) &&
- swsusp_page_is_free(page)) {
- swsusp_unset_page_forbidden(page);
- swsusp_unset_page_free(page);
- __free_page(page);
- }
- }
+ memory_bm_position_reset(forbidden_pages_map);
+ memory_bm_position_reset(free_pages_map);
+
+loop:
+ fr_pfn = memory_bm_next_pfn(free_pages_map);
+ fb_pfn = memory_bm_next_pfn(forbidden_pages_map);
+
+ /*
+ * Find the next bit set in both bitmaps. This is guaranteed to
+ * terminate when fb_pfn == fr_pfn == BM_END_OF_MAP.
+ */
+ do {
+ if (fb_pfn < fr_pfn)
+ fb_pfn = memory_bm_next_pfn(forbidden_pages_map);
+ if (fr_pfn < fb_pfn)
+ fr_pfn = memory_bm_next_pfn(free_pages_map);
+ } while (fb_pfn != fr_pfn);
+
+ if (fr_pfn != BM_END_OF_MAP && pfn_valid(fr_pfn)) {
+ struct page *page = pfn_to_page(fr_pfn);
+
+ memory_bm_clear_current(forbidden_pages_map);
+ memory_bm_clear_current(free_pages_map);
+ __free_page(page);
+ goto loop;
}
+
nr_copy_pages = 0;
nr_meta_pages = 0;
restore_pblist = NULL;
--
1.9.1

Rafael J. Wysocki
2014-07-18 21:50:02 UTC
Post by Joerg Roedel
here is a patch set to improve the scalability of the memory
bitmap implementation used for hibernation. The current
implementation does not scale well to machines with several
TB of memory. A resume on those machines may cause soft
lockups to be reported.
[...]
As can be seen from these results, these patches improve
the scalability significantly. Please review; any comments
are appreciated.
Looks good.

How much testing did that receive?

Rafael

Joerg Roedel
2014-07-19 08:00:01 UTC
Hi Rafael,
Post by Rafael J. Wysocki
Post by Joerg Roedel
here is a patch set to improve the scalability of the memory
bitmap implementation used for hibernation. The current
implementation does not scale well to machines with several
TB of memory. A resume on those machines may cause soft
lockups to be reported.
Looks good.
How much testing did that receive?
Well, I did functional testing of the radix tree and the old memory
bitmap in parallel (patches 1-4) and watched for the warnings that
trigger in case the bitmaps return different results. With these
patches the radix tree always produced the same results as the old
implementation (none of the warnings triggered).

I first tested with KVM (64-bit and 32-bit x86 guests) and later on
real amd64 machines. One of our partners tested the patches on a 1TB,
a 4TB and a 12TB memory machine, always with positive results.

Thanks,

Joerg

Rafael J. Wysocki
2014-07-18 23:50:01 UTC
This patch adds the code to allocate and build the radix
tree to store the memory bitmap. The old data structure is
left in place until the radix tree implementation is
finished.
---
kernel/power/snapshot.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 223 insertions(+), 1 deletion(-)
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 1ea328a..d0f11ec 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -248,11 +248,24 @@ static void *chain_alloc(struct chain_allocator *ca, unsigned int size)
* information is stored (in the form of a block of bitmap)
* It also contains the pfns that correspond to the start and end of
* the represented memory area.
+ *
+ * The memory bitmap is organized as a radix tree to guarantee fast random
+ * access to the bits. There is one radix tree for each zone (as returned
+ * from create_mem_extents).
+ *
+ * One radix tree is represented by one struct mem_zone_bm_rtree. There are
+ * two linked lists for the nodes of the tree, one for the inner nodes and
+ * one for the leave nodes. The linked leave nodes are used for fast linear
+ * access of the memory bitmap.
+ *
+ * The struct rtree_node represents one node of the radix tree.
*/
#define BM_END_OF_MAP (~0UL)
#define BM_BITS_PER_BLOCK (PAGE_SIZE * BITS_PER_BYTE)
+#define BM_BLOCK_SHIFT (PAGE_SHIFT + 3)
+#define BM_BLOCK_MASK ((1UL << BM_BLOCK_SHIFT) - 1)
struct bm_block {
struct list_head hook; /* hook into a list of bitmap blocks */
@@ -266,6 +279,31 @@ static inline unsigned long bm_block_bits(struct bm_block *bb)
return bb->end_pfn - bb->start_pfn;
}
+/*
+ * struct rtree_node is a wrapper struct to link the nodes
+ * of the rtree together for easy linear iteration over
+ * bits and easy freeing
+ */
+struct rtree_node {
+ struct list_head list;
+ unsigned long *data;
+};
+
+/*
+ * struct mem_zone_bm_rtree represents a bitmap used for one
+ * populated memory zone.
+ */
+struct mem_zone_bm_rtree {
+ struct list_head list; /* Link Zones together */
+ struct list_head nodes; /* Radix Tree inner nodes */
+ struct list_head leaves; /* Radix Tree leaves */
+ unsigned long start_pfn; /* Zone start page frame */
+ unsigned long end_pfn; /* Zone end page frame + 1 */
+ struct rtree_node *rtree; /* Radix Tree Root */
+ int levels; /* Number of Radix Tree Levels */
+ unsigned int blocks; /* Number of Bitmap Blocks */
+};
+
/* struct bm_position is used for browsing memory bitmaps */
struct bm_position {
@@ -274,6 +312,7 @@ struct bm_position {
};
struct memory_bitmap {
+ struct list_head zones;
struct list_head blocks; /* list of bitmap blocks */
struct linked_page *p_list; /* list of pages used to store zone
* bitmap objects and bitmap block
@@ -284,6 +323,167 @@ struct memory_bitmap {
/* Functions that operate on memory bitmaps */
+#define BM_ENTRIES_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
+#if BITS_PER_LONG == 32
+#define BM_RTREE_LEVEL_SHIFT (PAGE_SHIFT - 2)
+#else
+#define BM_RTREE_LEVEL_SHIFT (PAGE_SHIFT - 3)
+#endif
+#define BM_RTREE_LEVEL_MASK ((1UL << BM_RTREE_LEVEL_SHIFT) - 1)
+
+/*
+ * alloc_rtree_node - Allocate a new node and add it to the radix
+ * tree.
Please make this a single line.
+ *
+ * This function is used to allocate inner nodes as well as the
+ * leaf nodes of the radix tree. It also adds the node to the
+ * corresponding linked list passed in by the *list parameter.
+ */
+static struct rtree_node *alloc_rtree_node(gfp_t gfp_mask, int safe_needed,
+ struct chain_allocator *ca,
+ struct list_head *list)
+{
+ struct rtree_node *node;
+
+ node = chain_alloc(ca, sizeof(struct rtree_node));
+ if (!node)
+ return NULL;
+
+ node->data = get_image_page(gfp_mask, safe_needed);
+ if (!node->data)
+ return NULL;
+
+ list_add_tail(&node->list, list);
+
+ return node;
+}
+
+/*
+ * add_rtree_block - Add a new leaf node to the radix tree
+ *
+ * The leaf nodes need to be allocated in order to keep the leaves
+ * linked list in order. This is guaranteed by the zone->blocks
+ * counter.
+ */
+static int add_rtree_block(struct mem_zone_bm_rtree *zone, gfp_t gfp_mask,
+ int safe_needed, struct chain_allocator *ca)
+{
+ struct rtree_node *node, *block, **dst;
+ unsigned int levels_needed, block_nr;
+ int i;
+
+ block_nr = zone->blocks;
This isn't proper kernel coding style. Please use spaces as separators
before the "=" here and elsewhere.

Rafael

Joerg Roedel
2014-07-19 08:00:01 UTC
Post by Rafael J. Wysocki
+/*
+ * alloc_rtree_node - Allocate a new node and add it to the radix
+ * tree.
Please make this a single line.
+static int add_rtree_block(struct mem_zone_bm_rtree *zone, gfp_t gfp_mask,
+ int safe_needed, struct chain_allocator *ca)
+{
+ struct rtree_node *node, *block, **dst;
+ unsigned int levels_needed, block_nr;
+ int i;
+
+ block_nr = zone->blocks;
This isn't proper kernel coding style. Please use spaces as separators
before the "=" here and elsewhere.
Thanks for your comments, will fix this and resend next week.


Joerg

Pavel Machek
2014-07-19 10:30:01 UTC
Hi!
Post by Joerg Roedel
here is a patch set to improve the scalability of the memory
bitmap implementation used for hibernation. The current
implementation does not scale well to machines with several
TB of memory. A resume on those machines may cause soft
lockups to be reported.
These patches improve the data structure by adding a radix
tree to the linked list structure, improving random access
performance from O(n) to O(log_b(n)), where b depends on the
architecture (b=512 on amd64, b=1024 on i386).
A test on a 12TB machine showed an improvement in resume
time from 76s with the old implementation to 2.4s with the
radix tree and the improved swsusp_free function. See below
for details of this test.
Ok, nice.

How did space requirements change?

In particular, do we need to reserve a few more pages for hibernation
now?

Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Joerg Roedel
2014-07-19 11:40:02 UTC
Hi,
Post by Pavel Machek
Post by Joerg Roedel
A test on a 12TB machine showed an improvement in resume
time from 76s with the old implementation to 2.4s with the
radix tree and the improved swsusp_free function. See below
for details of this test.
Ok, nice.
How did space requirements change?
In particular, do we need to reserve a bit more pages for hibernation
now?
This depends on the amount of memory in the system. On the one hand,
the struct bm_block (32 bytes) was replaced by struct rtree_node (16
bytes).

On the other hand, more pages are required to build the radix tree.
For systems with up to 64GB of RAM on amd64 this is one additional page.
On a 12TB system this sums up to 193 additional pages (again, on amd64)
required for the radix tree inner nodes.

These 193 pages take around 800KB while we save about 1.5MB because of
the smaller struct rtree_node.

On a 4GB system the situation is different: there we need to reserve one
additional page while saving only 512 bytes with the smaller struct. The
break-even is somewhere around 32GB, where the savings from struct
rtree_node outweigh the additional pages required for the radix tree.
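
If you want to check that arithmetic, here is a minimal sketch (not
part of the patches; it assumes 4KB pages and 512 index entries per
page, i.e. amd64) that reproduces the 193-page figure for one bitmap
on a 12TB machine:

#include <stdio.h>

int main(void)
{
	/* One leaf bitmap page covers 32768 pfns = 128MB of RAM */
	unsigned long long leaves = (12ULL << 20) / 128; /* 12TB in MB */
	unsigned long long nodes = leaves, index_pages = 0;

	/* Each index page references up to 512 nodes of the level below */
	while (nodes > 1) {
		nodes = (nodes + 511) / 512;
		index_pages += nodes;
	}

	printf("leaf pages: %llu, index pages: %llu\n",
	       leaves, index_pages); /* 98304 leaves, 193 index pages */
	return 0;
}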

Thanks,

Joerg

Pavel Machek
2014-07-19 13:50:02 UTC
Hi!
Post by Joerg Roedel
These patches improve the data structure by adding a radix
tree to the linked list structure, improving random access
performance from O(n) to O(log_b(n)), where b depends on the
architecture (b=512 on amd64, b=1024 on i386).
Are you sure? From your other mail, you said you are adding just a
single page. I'd expect random access performance to go from

O(n) to O(n/1024) in that case?

Pavel
Joerg Roedel
2014-07-21 08:40:03 UTC
Post by Pavel Machek
Hi!
Post by Joerg Roedel
These patches improve the data structure by adding a radix
tree to the linked list structure, improving random access
performance from O(n) to O(log_b(n)), where b depends on the
architecture (b=512 on amd64, b=1024 on i386).
Are you sure? From your other mail, you said you are adding just a
single page. I'd expect random access performance to go from
O(n) to O(n/1024) in that case?
No, for the 4GB case the additional page is used as an index page into
the block-bitmap pages. On amd64 a page can hold 512 references and a
single block-bitmap page is enough for 128MB of RAM. This means that for
systems with up to 64GB of RAM we can get the block-bitmap page directly
from the index page. For systems with more than 64GB of RAM we need
another level of index pages, and now we need two memory accesses to get
to the block-bitmap page (okay, with this implementation it's actually
two memory accesses per level, because the index pages refer to a struct
rtree_node which itself contains the pointer to the index/data page).

A two-level radix tree is enough for systems with up to 32TB of
RAM. After that we need three levels, but you can already see that this
doesn't scale linearly anymore but with log_512(n), where n is the
number of block-bitmap pages required.
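
To make the level arithmetic explicit, here is a small sketch (again
not from the patches; it assumes 4KB pages and 512 index entries per
page) that computes the tree depth for a given RAM size:

#include <stdio.h>

int main(void)
{
	/* RAM sizes in MB: 4GB, 64GB, 12TB, 32TB */
	unsigned long long ram_mb[] = {
		4096ULL, 64ULL << 10, 12ULL << 20, 32ULL << 20
	};
	int i;

	for (i = 0; i < 4; i++) {
		/* one block-bitmap (leaf) page covers 128MB of RAM */
		unsigned long long leaves = (ram_mb[i] + 127) / 128;
		unsigned long long span = 512; /* one-level index reach */
		int levels = 1;

		while (span < leaves) {
			span *= 512;
			levels++;
		}

		printf("%10lluMB RAM -> %7llu leaf pages, %d index level(s)\n",
		       ram_mb[i], leaves, levels);
	}
	return 0;
}

This prints one index level for 4GB and 64GB and two levels for 12TB
and 32TB, matching the description above.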


Joerg

Pavel Machek
2014-07-21 11:50:02 UTC
Post by Joerg Roedel
Post by Pavel Machek
Are you sure? From your other mail, you said you are adding just a
single page. I'd expect random access performance to go from
O(n) to O(n/1024) in that case?
No, for the 4GB case the additional page is used as an index page into
the block-bitmap pages.
[...]
A two-level radix tree is enough for systems with up to 32TB of
RAM. After that we need three levels, but you can already see that this
doesn't scale linearly anymore but with log_512(n), where n is the
number of block-bitmap pages required.
Ok, so you have sped it up from O(n) to O(log(n)) speed, and increased
memory requirements from O(n) to O(n * log(n)), right?

Does enough_swap() or reqd_free_pages() need to be updated?

Best regards,
Pavel
Joerg Roedel
2014-07-21 14:40:02 UTC
Hi,
Post by Pavel Machek
Ok, so you have sped it up from O(n) to O(log(n)) speed, and increased
memory requirements from O(n) to O(n * log(n)), right?
No, new memory requirements are still ~O(n), because the additional pages
needed for the index in the radix tree are outweighed by the memory
saved by the smaller size of struct rtree_node compared to struct
bm_block.

If you want to see real numbers, here is a small tool which tells you
exactly when you need more memory and how much you save with the new
memory bitmap implementation:

http://www.zlug.org/~joro/data/mem_req.c

It will tell you that the new implementation uses at most two more pages
for smaller RAM sizes (two pages because two memory bitmaps are allocated)
and will actually save pages for bigger RAM sizes (already 30 pages on a
1TB machine, 384 pages on a 12TB machine).

You can also modify the above tool to give you some data to plot; then
you will also SEE that there is still a linear relationship between RAM
size and the memory required by the old and new memory bitmap
implementations.


Joerg

Pavel Machek
2014-07-21 14:40:02 UTC
Post by Joerg Roedel
Hi,
Post by Pavel Machek
Ok, so you have sped it up from O(n) to O(log(n)) speed, and increased
memory requirements from O(n) to O(n * log(n)), right?
No, new memory requirements are still ~O(n), because the additional pages
needed for the index in the radix tree are outweighted by the memory
saved by the smaller size of struct rtree_node compared to struct
bm_block.
Are we in agreement on how O() notation works?

So you reduced O(n) to O(n/2 log(n)) ... that's still O(n log(n)), right?
Post by Joerg Roedel
You can also modify the above tool to give you some data to plot, then
you will also SEE that there is still a linear relationship between RAM
size and memory required by the old and new memory bitmap
implementation.
I believe a radix tree is O(log(N)) per operation, but its memory is
certainly not linear.

Yes, log(n) looks quite constant even for huge n :-).
Pavel
Joerg Roedel
2014-07-21 16:00:02 UTC
Hi,
Post by Pavel Machek
So you reduced O(n) to O(n/2 log(n)) ... that's still O(n log(n)), right?
Radix trees have a memory complexity of ~O(n + log(n)). We have an
additional log(n) space requirement for the tree itself, but we linearly
save memory for each node compared to the old implementation. Since n
beats log(n), there is a point beyond which we need fewer pages with the
new code than with the old. That break-even point is somewhere around
32GB of RAM; for smaller sizes we need two additional pages at most.
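
(A quick way to see that the index overhead stays linear: with n
block-bitmap pages and 512 entries per index page, the inner levels
need roughly

	n/512 + n/512^2 + n/512^3 + ... < n/511

pages, a geometric series. The index is therefore at most about 0.2%
of the number of leaf pages, which is why the smaller struct
rtree_node wins for larger n.)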
Post by Pavel Machek
Post by Joerg Roedel
You can also modify the above tool to give you some data to plot, then
you will also SEE that there is still a linear relationship between RAM
size and memory required by the old and new memory bitmap
implementation.
I believe radix tree is O(log(N)) per operation, but its memory is
certainly not linear.
Numbers beat beliefs.


Joerg

