Integrated Hardware and Software for No-Loss Computing
TBMG-1125
01/01/2007
- Content
When an algorithm is distributed across multiple threads executing on many distinct processors, the loss of a single thread or processor can result in the total loss of all incremental results computed up to that point. When the implementation is massively distributed across hardware, the probability of a hardware failure during a long execution is potentially high. Traditionally, this problem has been addressed by establishing checkpoints at which the current state of all or part of the execution is saved. In the event of a failure, this state information can then be used to restore the execution to the checkpoint and resume the computation from there.
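The traditional checkpoint-and-resume approach described above can be illustrated with a minimal single-process sketch. This is not the integrated hardware/software method that the brief's title refers to; it only shows the conventional baseline. The function names (`save_checkpoint`, `load_checkpoint`, `run_sum`), the checkpoint interval, and the `fail_at` parameter used to simulate a failure are all hypothetical, and a running sum stands in for a real distributed computation.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save state atomically: write to a temp file, then rename over the old one.

    A crash in the middle of the write leaves the previous checkpoint intact.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(n, path, every=100, fail_at=None):
    """Sum the integers 0..n-1, saving a checkpoint every `every` steps.

    `fail_at` (hypothetical, for illustration) raises an error at that step
    to simulate the loss of the worker.
    """
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    i, total = state["i"], state["total"]
    while i < n:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        total += i
        i += 1
        if i % every == 0:
            save_checkpoint(path, {"i": i, "total": total})
    return total
```

If a run fails partway through, calling `run_sum` again with the same checkpoint path resumes from the last saved step rather than from zero. Only the work done since that checkpoint is repeated, which is the trade-off the checkpoint interval controls: frequent checkpoints cost more I/O but lose less work on failure.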
- Citation
- "Integrated Hardware and Software for No-Loss Computing," Mobility Engineering, January 1, 2007.