Software developers usually consider hardware errors as negligible when writing their code. However, the continuous miniaturization of transistors vaticinated by Moore’s law and the attempt to achieve higher frequencies, among other factors, are making circuits less reliable. Despite all the error correction mechanisms this can have a real impact in information consistency leading to misbehaviors called soft errors.
Soft errors are clearly undesirable. But what if we start making hardware that is unreliable by design? Although this may seem an awkward idea, in a recent interview, the Director of the Institute of Microengineering at École Polytechnique Fédérale de Lausanne argues that “We should stop designing perfect circuits” (the full interview can be found here). As he explains, error correction mechanisms conceived to deal with higher error rates may cancel gains in size and power consumption achieved by smaller transistors. Therefore, he proposes following the trend of “good enough engineering” resulting in the design of unreliable circuits which are smaller, less expensive and have lower power consumption. These can be applied in fields which naturally tolerate some inaccuracy such as image and audio processing or graphical systems. For instance, a researcher of MIT’s Media Lab has used an unreliable processor to handle more than 100,000 hours of video recordings.
How this “good enough engineering” trend applies to software is starting to be evaluated. A research group at MIT’s Computer Science and Artificial Intelligence Laboratory has developed a framework called Rely that “enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as it’s intended.” This calculus relies on the specification of failure rates of individual machine instructions.
In Computer Science this is a new area that poses new challenges and interesting questions. Can we reason in an abstract model about the resilience of an algorithm to failures when these are intrinsic of the concrete hardware?
A certain parallel can be drawn with real-time systems which obey to timing constraints, but actual time is a property of the run-time system. In the case of programs in presence of failures, the constraints are imposed on the tolerance (margin of error) of the output, which shifts the algorithm analysis to a more quantitative approach.
An abstract model would allow us to analyze which algorithms are more resilient, i.e., less likely to be affected by underlying failures, or to determine how the composition of individual components affects the behavior of the overall system. In the context of HASLab this can help to answer a challenging question: what is the meaning of high-assurance software when the hardware is unreliable?