Discussion:
Software based ECC ?
roland
2007-08-10 21:20:07 UTC
Hello !

Since ECC (speaking in terms of RAM/memory) is a widespread hardware
technology within server/enterprise computing for protection against
memory failure, I wonder:

Can't this be done in software, too?

I didn't find a reference on this list, but I found an interesting paper I'd
like to share at:

http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

"SoftECC : A System for Software Memory Integrity Checking"

Is it possible to implement something like this within the Linux virtual
memory subsystem?
If it can be done, wouldn't this be a great feature?

regards
Roland K.
system engineer




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Alan Cox
2007-08-10 22:20:10 UTC
On Fri, 10 Aug 2007 23:16:45 +0200
Post by roland
Hello !
since ECC (speaking in terms of ram/memory) is some widespread hardware
technology
within server/enterprise computing for protection of memory failure, i
Can`t this be done in software, too ?
Only one way to find out. If it interests you, have a go at it.
V***@vt.edu
2007-08-11 06:20:12 UTC
Post by roland
http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
"SoftECC : A System for Software Memory Integrity Checking"
Is it possible to implement something like this within the Linux virtual
memory subsystem ?
Anything that can be simulated with a Turing machine is *possible*.

The question is how many rocket boosters the pig needs for takeoff.

Hint: The thesis talks about why he didn't implement it for Linux.
Post by roland
If it can be done, wouldn`t this be a great feature ?
Read section 5.2 of that thesis, particularly this quote from 5.2.2:

"For random word writes, this implies that SoftECC will need an order of
magnitude more compute time than the user-mode code"

Basically, on every single memory page that gets dirtied, we have to then
re-checksum the page (blowing away cache lines in the process). If you want
to get a feel for it, find the kernel code that recognizes that a page is
dirtied, and just add a few lines there:

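unsigned int foo = 0;	/* running XOR checksum */
int i;
for (i = 0; i < 1024; i++) {	/* 1024 32-bit words per 4K page; adjust for non-4K pages */
	foo ^= *(page + i);	/* assumes page is an unsigned int * to the start of the page */
}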

and see how much your system crawls.
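
To get a feel for what such a checksum buys before touching the kernel, here is a minimal user-space sketch of the same loop. It assumes 4K pages and 32-bit words; `page_xor` and `PAGE_WORDS` are made-up names for illustration, not kernel symbols, and this is not the scheme used in the SoftECC thesis.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024	/* assumption: 4096-byte page / 4-byte words */

/*
 * XOR all words of one page together. Any single flipped bit changes
 * the result, so comparing against a stored value detects it. The
 * value says nothing about where the flip happened, so it cannot
 * correct anything.
 */
uint32_t page_xor(const uint32_t *page)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < PAGE_WORDS; i++)
		sum ^= page[i];
	return sum;
}
```

Note that a plain XOR is a weak check: two flips in the same bit position of different words cancel each other out and go unnoticed.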

Personally, I'd recommend just shelling out the bucks for hardware ECC if
the reliability matters.
Folkert van Heusden
2007-08-12 17:00:20 UTC
Post by V***@vt.edu
Post by roland
http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
"SoftECC : A System for Software Memory Integrity Checking"
Personally, I'd recommend just shelling out the bucks for hardware ECC if
the reliability matters.
A question and an idea. Q: is ECC guaranteed to detect all bitflips?

Idea: what about a multicore system (3 or more) that runs the same
processes on 2 cores and a third core verifying that they both do the
same? As I think it is not only ram that can become faulty.



Folkert van Heusden
--
MultiTail is a flexible tool for monitoring logfiles and commands.
With filters, colors, merging, different views, etc.
http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
Jan Engelhardt
2007-08-12 17:10:13 UTC
Post by Folkert van Heusden
Post by V***@vt.edu
Post by roland
http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
"SoftECC : A System for Software Memory Integrity Checking"
Personally, I'd recommend just shelling out the bucks for hardware ECC if
the reliability matters.
a question and an idea: Q: is ecc guaranteed to detect all bitflips?
Idea: what about a multicore system (3 or more) that runs the same
processes on 2 cores and a third core verifying that they both do the
same? As I think it is not only ram that can become faulty.
Indeed. And for example BOINC (***@home) has to consider this. Hence they
recalculate each work unit at least three times and then compare the
results. What makes this different from ECC is that the checksum is not calculated
on every memory operation, but at the end of a larger block of operations. Of
course this may mean that an error can propagate for a while, but the total
walltime (including recomputation) is lower. :)
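
The compare step at the end of a block can be sketched as a 2-of-3 vote over the three results. This is only an illustration of the idea, not BOINC's actual validator; `vote3` is a hypothetical helper, and each result is assumed to fit in (or be hashed down to) a 64-bit value.

```c
#include <stdint.h>

/*
 * Majority vote over three independently computed results.
 * Returns 0 and stores the agreed value in *out if at least two of
 * the three match; returns -1 if all three differ, in which case the
 * work unit would have to be recomputed.
 */
int vote3(uint64_t a, uint64_t b, uint64_t c, uint64_t *out)
{
	if (a == b || a == c) {
		*out = a;
		return 0;
	}
	if (b == c) {
		*out = b;
		return 0;
	}
	return -1;
}
```

One bad result out of three is outvoted; the cost is paid once per block rather than on every memory access.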


Jan
--
chibiryuu
2007-08-12 19:10:13 UTC
Post by Folkert van Heusden
Post by V***@vt.edu
Post by roland
http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
"SoftECC : A System for Software Memory Integrity Checking"
Personally, I'd recommend just shelling out the bucks for hardware ECC if
the reliability matters.
a question and an idea: Q: is ecc guaranteed to detect all bitflips?
Idea: what about a multicore system (3 or more) that runs the same
processes on 2 cores and a third core verifying that they both do the
same? As I think it is not only ram that can become faulty.
Such hardware does exist -- for example, Stratus sells systems that
run the same OS on two separate boards in lockstep, with a voter to
determine what action to take if they ever diverge.
V***@vt.edu
2007-08-13 03:10:08 UTC
Post by Folkert van Heusden
a question and an idea: Q: is ecc guaranteed to detect all bitflips?
It depends on the exact ECC function the hardware implements. Usually it
provides performance such as:

"Correct all 1-bit errors. Detect all 2-bit errors, and most 3 and higher,
but not correct".

(Of course, "correct all 1 or 2 bit and detect all 3 bit" can be done, it
just takes more bits of ECC.)
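
As a toy illustration of "correct 1, detect 2", here is an extended Hamming(8,4) code in C: 4 data bits, 3 Hamming check bits, and 1 overall parity bit. This is a sketch under assumptions, not what any particular memory controller implements; ECC DIMMs typically protect 64 data bits with 8 check bits, but the principle is the same. All names (`secded_encode`, `secded_decode`, `enum ecc_status`) are made up for this example.

```c
#include <stdint.h>

enum ecc_status { ECC_OK = 0, ECC_CORRECTED = 1, ECC_UNCORRECTABLE = 2 };

/* Parity (XOR of all bits) of an 8-bit value. */
static unsigned parity8(uint8_t v)
{
	v ^= v >> 4;
	v ^= v >> 2;
	v ^= v >> 1;
	return v & 1u;
}

/* Bit i of the codeword is "position i": 1, 2, 4 hold check bits,
 * 3, 5, 6, 7 hold data bits, 0 holds the overall parity. */
uint8_t secded_encode(uint8_t nibble)
{
	unsigned d1 = nibble & 1u, d2 = (nibble >> 1) & 1u;
	unsigned d3 = (nibble >> 2) & 1u, d4 = (nibble >> 3) & 1u;
	uint8_t cw = 0;

	cw |= (uint8_t)((d1 ^ d2 ^ d4) << 1);	/* covers positions 1,3,5,7 */
	cw |= (uint8_t)((d1 ^ d3 ^ d4) << 2);	/* covers positions 2,3,6,7 */
	cw |= (uint8_t)(d1 << 3);
	cw |= (uint8_t)((d2 ^ d3 ^ d4) << 4);	/* covers positions 4,5,6,7 */
	cw |= (uint8_t)(d2 << 5);
	cw |= (uint8_t)(d3 << 6);
	cw |= (uint8_t)(d4 << 7);
	cw |= (uint8_t)parity8(cw);		/* make total parity even */
	return cw;
}

enum ecc_status secded_decode(uint8_t cw, uint8_t *nibble)
{
	enum ecc_status st = ECC_OK;
	unsigned syndrome = 0, pos;

	/* XOR of the positions of all set bits; 0 for a valid codeword. */
	for (pos = 1; pos < 8; pos++)
		if (cw & (1u << pos))
			syndrome ^= pos;

	if (parity8(cw)) {
		/* Odd number of flips: assume one and flip it back.
		 * Syndrome 0 here means the parity bit itself flipped. */
		cw ^= (uint8_t)(1u << syndrome);
		st = ECC_CORRECTED;
	} else if (syndrome) {
		/* Even parity but nonzero syndrome: two bits flipped. */
		return ECC_UNCORRECTABLE;
	}

	*nibble = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
			    (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
	return st;
}
```

With this tiny code, three flipped bits look like a single flip and get "corrected" to the wrong value, which is why detection beyond two bits is never guaranteed.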
Post by Folkert van Heusden
Idea: what about a multicore system (3 or more) that runs the same
processes on 2 cores and a third core verifying that they both do the
same? As I think it is not only ram that can become faulty.
This is actually done for high-reliability systems (Google for "tell me twice"
and "tell me three times"). The problem is that it takes a lot of extra
hardware. The G5 and later IBM Z-series mainframe chipsets (not to be confused with
the PowerPC G5) implemented dual computation units and a comparator that
signals a 'Machine Check' condition if the two CPUs don't end up in the
same exact state. As an added bonus, at the end of each instruction where
both *do* compare good, it latches the *entire* state of the CPU out. On a
miscompare, it then does the following:

1) Retry the instruction on the same CPU - if it compares correctly, keep
going and flag a "soft" error.

2) If it still fails, read out the last "known good" status latch, and load
it into a spare CPU, and fire it up, and flag the failing one as bad.
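
The two steps above can be modelled in a few lines of C. This is purely a software toy to show the control flow, not how the hardware works internally: a "unit" is assumed to be a function mapping the current state to the next one, and every name here (`struct cpu`, `step`, `run_compared`, the demo units, the `UNRECOVERED` outcome) is invented for the sketch.

```c
#include <stdint.h>

/* Hypothetical model: a unit executes one instruction on a state. */
typedef uint32_t (*unit_fn)(uint32_t state);

struct cpu {
	unit_fn unit_a, unit_b;	/* duplicated computation units */
	int marked_bad;
};

enum outcome { RAN_CLEAN, SOFT_ERROR, FAILED_OVER, UNRECOVERED };

/* Run both units; return 1 and store the result only if they agree. */
static int run_compared(const struct cpu *c, uint32_t state, uint32_t *next)
{
	uint32_t a = c->unit_a(state);
	uint32_t b = c->unit_b(state);

	if (a != b)
		return 0;
	*next = a;
	return 1;
}

/* Execute one instruction starting from the last known good state. */
enum outcome step(struct cpu *primary, struct cpu *spare, uint32_t *known_good)
{
	uint32_t next;

	if (run_compared(primary, *known_good, &next)) {
		*known_good = next;	/* latch the new good state */
		return RAN_CLEAN;
	}
	/* 1) Retry on the same CPU. */
	if (run_compared(primary, *known_good, &next)) {
		*known_good = next;
		return SOFT_ERROR;
	}
	/* 2) Flag the CPU as bad, resume on the spare from the latch. */
	primary->marked_bad = 1;
	if (run_compared(spare, *known_good, &next)) {
		*known_good = next;
		return FAILED_OVER;
	}
	return UNRECOVERED;	/* assumption: the spare miscompares too */
}

/* Demo units: correct, permanently broken, and one that glitches once. */
uint32_t good_unit(uint32_t s) { return s + 1; }
uint32_t stuck_unit(uint32_t s) { return s + 2; }

static int glitches_left = 1;
uint32_t glitchy_unit(uint32_t s)
{
	if (glitches_left > 0) {
		glitches_left--;
		return s ^ 0x80u;	/* transient bit flip */
	}
	return s + 1;
}
```

Pairing `good_unit` with `glitchy_unit` produces a soft error on the first instruction and clean runs afterwards; pairing it with `stuck_unit` forces the switch to the spare.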

http://www.research.ibm.com/journal/rd/435/spainhower.pdf
http://www.research.ibm.com/journal/rd/435/mueller.pdf

These guys have forgotten more about designing highly reliable systems than
most of us will ever know. ;)

Needless to say, not everybody is willing to pay the costs of the hardware
overhead of this approach.
