# An Approach to Energy-Error Tradeoffs in Approximate Ripple Carry

## Comments

## Transcription

An Approach to Energy-Error Tradeoffs in Approximate Ripple Carry

An Approach to Energy-Error Tradeoffs in Approximate Ripple Carry Adders Zvi M. Kedem , Vincent J. Mooney‘ , Kirthi Krishna Muntimadugu‘ and Krishna V. Palem‘ Courant Institute of Mathematical Sciences New York University, New York, USA, Email: [email protected] School of Electrical and Computer Engineering Georgia Institute of Technology, Atlanta, USA, Email: [email protected] School of Electrical & Electronic Engineering and School of Computer Engineering Nanyang Technological University, Singapore, Email: [email protected] Department of Electrical and Computer Engineering Rice University, Houston, Texas, USA, Email: (kirthi.krishna, palem)@rice.edu ‘ NTU-Rice Institute of Sustainable and Applied InfoDynamics (ISAID) Nanyang Technological University, Singapore Abstract—Given a 16-bit or 32-bit overclocked ripple-carry adder, we minimize error by allocating multiple supply voltages to the gates. We solve the error minimization problem for a fixed energy budget using a binned geometric program solution (BGPS). A solution found via BGPS outperforms the two best prior approaches, uniform voltage scaling and biased voltage scaling, reducing error by as much as a factor of 2.58X and by a median of 1.58X in 90nm transistor technology. Index Terms—Approximate Adders, Voltage Scaling, Low Energy Circuits, Geometric Programming I. I NTRODUCTION In this paper we introduce a novel approach to minimizing error in a Ripple Carry Adder (RCA) due to overclocking. We call our approach the Binned Geometric Program Solution (BGPS). We assume that an energy budget is fixed and that some small errors can be tolerated; thus, our goal is to minimize the errors. We have freedom to assign up to four distinct supply voltages to the gates of the RCA. We do not address physical design and floorplanning in this paper, but instead leave these issues for future work. II. T ECHNOLOGY BACKGROUND AND P RIOR W ORK This paper presents a contribution to minimization of energy and error in so-called “approximate arithmetic” [1]. Very briefly, Chakrapani et al. [1] proposed reducing the supply voltage of VLSI arithmetic circuits beyond the point where the critical path is guaranteed to not be violated. This could result in an erroneous output of the circuit but, as shown by Chakrapani et al., can also be used in many applications which do not require strict 100% accuracy of computed values such as in audio and video signal processing. Such circuits which are “overclocked,” being operated at a frequency higher than required to guarantee 100% accuracy, we call “approximate circuits.” There are other techniques to reduce switching power of a circuit. Techniques such as the ones by Blaauw et al. [2] and Broderson et al. [3] are adaptive which means that the throughput of the circuit is based on the workload. Non-adaptive techniques typically operate the circuit at multiple voltages which might rely on circuit implementation techniques like transistor sizing for energy efficiency [4]. Manzak and Chaktrabarti [5] as well as Yeh et al. [6], [7] present techniques that are also non-adaptive but which operate the critical paths of the circuit at higher voltages than the non-critical paths and also use transistor sizing. Among the ones that use overclocking, one of them is broadly know as the “Razor” approach [8], [9] – championed by researchers at the University of Michigan – which allows errors at the circuit level, but only temporarily (for a few clock cycles). In this case, errors are corrected by inserting delay in order to continue from a previous known “correct” logic state. It is assumed that prior approaches to handle flip-flops and meta-stability issues [8], [9] can be used. George et al. [10] presented a biased voltage scaling (BIVOS) for probabilistic ripple carry adders under the assumption that error sources (e.g., thermal noise) are uniformly random in time and space. The claim was that a geometric scaling of the probability of error at each bit position across an adder results in lower average error for the same energy consumption when compared to a uniform scaling of probability of error. The modeling and analysis in [10] built on the result shown by Cheemalavagu et al. [11] which described a probabilistic CMOS switch and the direct relationship between the supply voltage (energy consumption) and the probability of error. In such a probabilistic gate, an error at the output of the gate occurs based only on its supply voltage. Therefore, to establish a supply voltage allocation scheme, a geometric scaling of probability was introduced by utilizing the exact relationship between the probability of error and the supply voltage of a building block (in their case it was a full adder). This scheme cannot be directly applied to approximate circuits because there is no straightforward relationship between probability of error at a given bit position and its supply voltage. Therefore, it is not trivial to determine a supply voltage scheme from a required scaling of probability of errors across bit positions. The primary distinctions between the previous approaches and our target problem in this paper are that we (i) present a rigorous mathematical model for the output error and energy consumption of an approximate RCA and also (ii) present a circuit level optimization methodology for multiple supply voltages whereas the prior approaches have either been ad-hoc circuit level approaches or algorithmic level optimizations (altering the algorithm being executed). In this paper, we propose a methodology to find an assignment of multiple voltages to an approximate ripple carry adder circuit to minimize error for a given energy consumption. The first design constraint that arises out of this design methodology is the number of distinct voltages that could be used in practice in a fabricated chip. Circuit designers are using an increasing number of different voltages in their architectures. Typical high end chips seem to have four different voltages [12], [13]. To benefit from having multiple voltages on the die, the circuit designer has to overcome the challenge a2 b2 a1 b1 a0 b0 7 4 1 a2 9 8 c3 s2 Fig. 1. a1 6 c2 5 s1 a0 c 0 3 c1 c0 2 s0 Gate level diagram of a 3-bit ripple carry adder of creating power distribution networks that feed from the voltage regulator modules that supply all the devices using the fewest number of interconnect layers. But the important point to note here is that the number of different voltages is the bottleneck here and not the actual magnitude of each voltage. The circuit designer has the freedom, albeit at design level, to choose the number of voltage levels and the exact values of the different voltages. With the freedom of using multiple voltage levels the use of voltage shifters becomes a necessity at least in some cases such as when a circuit with lower supply voltage is driving a circuit with higher supply voltage (and the difference in the supply voltages is not negligible) and when the output of a circuit is being stored in a register. It has been shown by Chang et al. [14] that the area/delay overhead of level shifters for using multiple supply voltages can be relatively small. Hence in this paper for the sake of simplicity of our mathematical model and experimental methodology we do not consider the overhead of voltage level shifters. III. A PPROXIMATE RCA E RROR AND E NERGY M ODELS We will consider an n-bit ripple-carry adder (RCA). Binary numbers a D an1 : : : a0 and b D bn1 : : : b0 to be added are unsigned in the range 0; : : : ; 2n 1 and have the standard binary representation. The addition of a and b results in an .nC1/-bit number s D sn : : : s0 . Thus, sum s is computed in an RCA as sk D ak ˚ bk ˚ ck for 0 k n 1. The sequence of carries c is computed as follows. c0 is input (and is zero for addition which is the only case considered in this paper) while ckC1 D .ak ˚ bk / ck C .ak ˚ bk / ak for 0 k n 1. Note that we have yet to define sn ; in fact, the last sum bit sn is equal to the last carry bit cn . We use two XOR gates and a MUX for our full adder (FA) design in this paper, as can be seen in Fig. 1; while not necessarily the fastest possible or the least area, for simplicity we nonetheless use two XOR gates and a MUX for FA implementation in this paper. Each gate in Fig. 1 has an index ` 2 f1; 2; : : : ; 9g; for each gate ` with supply voltage ` , we use ` .` / to denote the worst case delay of the gate at the specified voltage. A. Carry chains in ripple carry adders Given an n-bit RCA and two specific n-bit binary numbers a D an1 an2 : : : a0 and b D bn1 bn2 : : : b0 as inputs to the RCA, x x x x an2 : : : a0x and bx D bnx bn1 bn2 : : : b0x where define ax D anx an1 aix D ai ; bix D bi for 0 i n 1 and anx D bnx D 0. A carry chain is said to be present in the n-bit RCA with inputs a and b from position i to position j if and only if aix D bix D 1. This case is referred to as the generation of a carry. x ¤ b x . This case is referred to as the propagation of a carry. aw w ajx D bjx . If ajx D 0, the carry is said to be killed and if ajx D 1 another carry is said to be generated. In both the cases the carry chain that was generated at position i ends at position j . input a 0 0 1 1 1 1 0 0 input b 0 0 0 1 0 1 0 0 sum 0 1 0 1 0 0 0 0 position 7 6 5 4 3 2 1 0 Fig. 2. An example of two contiguous carry chains in a binary addition using an RCA where 0 i < w < j n (or, if j D i C 1, 0 i < j n and i ¤ n 1). If there is a carry chain from i to j , we will set a boolean variable Cij to 1. Also, for a carry chain from position i to position j , we have si D 0 ˚ ci , ck D 1 and sk D 0, for k 2 fi C 1; : : : ; j 1g; and cj D 1, sj D 1, and cj C1 D aj (D bj ). Thus, if we know that there is a carry chain from position i to position j , we know that si C1 D si C2 D : : : D sj 1 D 0 and sj D 1 while si depends on ci . While there can be more than one carry chain in a single addition, it turns out that multiple carry chains cannot overlap [15]. Fig. 2 shows an example where a carry chain starts at bit position 2 and is killed at position 4, thus C24 D 1; another carry chain starts at position 4 and is killed at position 6, C46 D 1. B. Modeling the error at the output of an approximate RCA We assume that the clock cycle time (D) of an adder is never lower than the worst-case propagation delay of a single full adder. We further assume that at the start of each addition, all carry bits have been reset to zero. Considering an approximate RCA, the result of these two assumptions is that there is a possibility of error at the output of an RCA only if there is propagation of carry: if there is no propagation of carry, the clock cycle time is sufficient for the full adders to compute the sum outputs of the RCA. Stated in a different way, there may be an error at the output only if there are carry chains in the approximate RCA. Hence in developing an error model for an approximate RCA, we will consider the behavior of error at the output of an RCA in the presence of carry chains. We call the longest delay path with respect to a particular sum bit sk as the "sum path of sk ." In an RCA, the sum path of sk is the series of gates in the RCA which constitutes the longest delay path, assuming worst-case gate delays. This path is essential in computing whether there is enough time to always compute correctly a particular sum bit sk . Of course, the true critical path for sum sk depends on inputs a and b; however, the result is that any true critical path – whose delay exceeds that of a single FA – comes from a prior bit position: for sk , then, the true critical path given inputs a and b will come from bit position i where i < k. To capture all such possibilities, we define di k to be the time between the correct computation of sum bit sk and the time when the inputs are provided to the RCA circuit thus triggering a true critical path for sk starting from bit i. Inputs a and b are provided to the RCA at some time tin . Let tk be the time when the correct value of sk is generated (assuming worst-case delays of all gates). Then di k D tk tin . d denotes the .n C 1/ .n C 1/ matrix of all di k , 0 i < k n. Properly speaking, d for a particular RCA in a particular technology is a function of the critical paths of the sum bits of the RCA, ` for each gate ` in any critical path of any sum bit, and ` .` /; however, for brevity, in this paper we will simply refer to d without specifying all the input values on which d depends. For additional details please refer to [15]. TABLE I M AXIMUM AND MINIMUM PROPAGATION DELAYS OF THE XOR GATE AND THE MUX IN 90 NM TECHNOLOGY Gat e XOR MUX ` .1:2V / (pico-sec) 33:3 30:5 ` .0:8V / (pico-sec) 55:2 51:2 We will determine di k as a sum of worst case propagation delays of the gates in an RCA. In conventional digital circuit design, intermediate outputs of a circuit are ignored as long as the final outputs are correct, but in overclocked circuits intermediate results could be construed as the final outputs and hence cannot be ignored during modeling. This warrants a consideration of worst-case vs. average-case vs. best-case propagation delays. In this paper, we do not consider spatial or temporal parameter variations. We only model the effect of different supply voltages. We start with an analysis of approximate RCAs assuming worst-case delays. While we do not have a formal proof to show that consideration of average and best case delays will result in lower error rates, we have found empirically for an RCA fed inputs having a uniform statistical distribution, increasing the delay increases the errors in the adder at least 98% of the time [15]. Example 1. Consider the 3-bit ripple carry adder shown in Fig. 1. Assume that the inputs are such that there is a carry chain from position 0 to position 2. One instance of such inputs are a D 011, b D 001 and c0 D 0. For this instance, a carry bit of 1 is generated at position 0, is propagated through position 1 and is killed at position 2. For this scenario, d02 D t2 tin , where t2 is the time when the correct s2 is generated. Assuming worst-case gate delays, d02 would be equal to the sum of worst-case propagation delays of the critical path shown in Fig. 1 as a dotted line from bit position 0 to the sum output in bit position 2. That means in the worst case, d02 D 1 .1 / C 3 .3 / C 6 .6 / C 8 .8 /. For our target technology of 90nm (Table I shows the corresponding delay values), considering maximum supply voltages for all the gates, i.e., 1 D 3 D 6 D 8 D 1:2V , we find that d02 D 33:3 ps C30:5 ps C30:5 ps C33:3 ps D 127:6 ps. With calculations similar to d02 , we find the following: 0 1 97.1 ps 127.6 ps 124.8 ps B 97.1 ps 94.3 ps C C dDB @ A Note that multiple carry chains can be handled by d. For example, for Fig. 2, entries d24 and d46 would be used for sum bits s4 and s6 , respectively. C. An error model for an approximate RCA based on carry chain analysis In this subsection, we will develop a function for the error at the output of an overclocked RCA. Note that we assume all bits in the RCA are initially set to zero [15]. We define the error at the output of the RCA as a function of a, b, d and D. We define sa to be the sum actually read out; due to overclocking, sa may be different from s D aCb. We now proceed to characterize the absolute magnitude of the error j sa s j. We define an RCA’s indicator function as follows: Ik .a; b; d; D/ D 1 1 0 if 9 i < k < j such that Cij D 1 and di k > D if 9 i < k D j such that Cij D 1 and di k > D otherwise. (1) The indicator function is now used to develop a function to compute the error at of an approximate RCA. We define Pnthe output a k Er.a; b; d; D/ Dj kD0 .sk sk /2 j to be the error introduced during the computation, assuming non-varying deterministic worstcase delays. Theorem 1. Er.a; b; d; D/ D n X Ik .a; b; d; D/2k : (2) kD0 Proof: For a detailed proof please refer to [15]. Theorem 1 (Eq. 2) gives Er.a; b; d; D/ which is the error at the output of the target RCA for two specific inputs. The average of this error over all possible inputs is Eravg .D; d/ D avg 0a;b2n 1 Er.a; b; d; D/: (3) This is a sum of 22n terms, which is not feasible to compute in a straightforward manner for large n. We will now transform the expression in Eq. 3 into a form that can be computed in O.n2 /. Recall that, given our assumptions, an error can occur only if there is a carry chain in the computation. We then note that the total error in a computation is the sum of the errors (if any) in the individual carry chains. The error introduced by a carry chain from i to j is Ercc .D; i; j; d/ D j X .ska sk /2k D kDi C1 where Ikcc .D; i; j; d/ D It can be shown that: Er.a; b; d; D/ D j X kDi C1 Ikcc .D; i; j; d/2k : 1 if i < k < j and di k > D 1 if i < k D j and di k > D 0 otherwise. X (4) (5) Ercc .D; i; j; d/ all i;j for which Cij D1 A proof of the above is available in [15]. We omit the proof for brevity. One way to compute the average total error at the output of an adder is by summing the errors of all possible carry chains weighted by the probability of their occurrence: X pij Ercc .D; i; j; d/; (6) Eravg .D; d/ D 0i <j n where pij is the probability that there exists a carry chain from i to j . Thus, the average total error is evaluated by computing and adding n.n C 1/=2 terms only [15]. The probabilities pij can be computed given the distributions of the inputs a and b. By assuming a uniform distribution of inputs it j i C2 (when j ¤ i C 1) and is easy to conclude that pij D 12 3 (when j D i C 1); a proof is available in [15]. In real pij D 12 world applications, this might not be true. Therefore, if the knowledge about the probability distribution of the actual inputs is known then that could be used instead of using P .ai D 0/ D P .bi D 0/ D P .ai D 1/ D P .bi D 1/ D 12 . If the case is such that instead of the probability distribution we have a candidate input benchmark, then the probability distribution could be computed using the benchmark. In this paper, we leave the case of non-uniform input bits for future work. IV. E NERGY M ODEL FOR AN A PPROXIMATE RCA The total energy consumption in an RCA consists of two separate components, the dynamic and static energy consumption. Our estimate of the dynamic energy consumption at the gate level of a CMOS P dyn dyn circuit of an RCA is E dyn D N `D1 E` .` /w` where E` .` / is the dynamic energy consumption of the `th gate being operated at supply voltage ` , and w` is the average switching activity of the `th gate in a single clock cycle (assuming a non-pipelined adder). w` for gates in the case of an RCA is approximately estimated as the ratio of the number of logic changes of gate ` to the total number of additions P A simulated. Our estimate of static energy consumption N stat stat is E stat D `D1 P` .` /D where P` .` / is the static power th consumption of the ` gate being operated at supply voltage ` and D is the clock cycle time of the circuit. Therefore, the total energy consumption is N X dyn E` .` /w` C P`stat .` /D (7) ED `D1 where N is the total number of gates in the adder. For an approximate RCA, Eq. 7 may be used if we find the switching activities for the gates when overclocking. Due to overclocking, the sum actually read might be different from the correct sum. The fact of whether at a given bit position the correct sum bit was computed in time or not is modeled using the indicator function in Eq. 1 in Subsection III-C. We use a similar model of an indicator function to check if a particular gate in the RCA had a logic change within the clock cycle time and, based on that, re-evaluate (reduce) the switching activity, denoted as w`a , to reflect this [15]. Based on the revised estimates of the switching activities, the total energy consumption of an approximate RCA is as follows N X E`D .` /w`a C P`S .` /D (8) Ea D `D1 From Section III, ` .` / denotes the worst-case propagation delay of the `th gate when its supply voltage is ` . We compute the average dynamic energy consumption and worst case propagation delays of all the gates in our process technology through simulations. It is known that the dynamic energy consumption of a gate is proportional to the square of the input supply voltage as well as that the propagation delay of a gate is inversely proportional to its supply voltage. To dyn represent E` .` / in terms of ` .` /, we will use the curve-fit that the average dynamic energy consumption of a gate is proportional to the inverse square of its worst case propagation delay, i.e., 1 1 dyn D ` 2 (9) E` .` / / 2 ` .` / ` .` / where ` is the proportionality constant for the `th gate. Thus ` is computed separately for each type of gate [15]. For example in the design of the RCA that we consider there are two types of gates, XOR and MUX. Substituting the relationship between average dynamic energy consumption and worst-case propagation delays from Eq. 9 into Eq. 8, we derive the following: ! N X 1 a a stat (10) w C P` .` /D ` 2 E D ` .` / ` `D1 V. M INIMIZING AVERAGE E RROR OF AN A PPROXIMATE RCA U SING G EOMETRIC P ROGRAMMING In this section we describe our procedure to formulate our target problem, which is minimizing average error of an approximate RCA under a given energy budget, as a geometric program [16]. Then we present our approach to perform supply voltage binning on the solution obtained from the geometric program. We form an optimization problem consisting of an objective function and one or more constraint functions. The objective function is the average error of an approximate RCA as given in Eq. 6 in Subsection III-C. The average error as shown in Eq. 6 is a function of a, b, d and D. The clock cycle time D is an independent variable, and a and b are inputs, but d is a matrix whose elements are a function of the adder topology, resulting critical path delays and gate supply voltages. We do not alter the adder topology but instead vary the adder supply voltages which directly alters d. We found that a formulation of error optimization in terms of d – represented in terms of .v/ – to be much simpler than a direct formulation in terms of v, where v is the vector of the voltages of all the gates in the RCA [15]. Therefore we consider the gate propagation delays as the decision variables. The RCA under consideration consists of N gates. We need to compute an optimized supply voltage allocation scheme, which is the exact assignment of supply voltages to the individual gates. To do that we will compute delays .v/ D f1 .1 /; 2 .2 /; : : : ; N .N /g for which the average error is minimized under the constraint that the total energy consumption is below the total energy budget. These gate delays will in turn determine the supply voltage allocation scheme. The optimization problem is to minimize Eq. 6 which is X pij Ercc .D; i; j; d/; (11) Eravg .D; d/ D 0i <j n1 subject to the following two constraints and assumptions. 1) min-delay` ` .` / max-delay` ; where min-delay` and max-delay` depend on the transistor technology while the delay additionally depends on the type of component and fanout. 2) The total energy consumption of all the gates is bounded from above by the given energy budget. Thus, ! N X 1 Energy w`a C P`stat .` /D ` 2 Ea D Budget .` / `D1 ` In general the full class of optimization problems could be classified into two categories, linear optimization problems (LP) and nonlinear optimization problems (NLP). The objective function of our optimization problem in this paper, which is shown in Eq. 11, is not a linear function. Therefore, our solution is to formulate the problem of minimizing Eq. 11 subject to the constraints above as a geometric program [16] and then solve it. To model our target problem as a geometric program, the objective function and all the constraints should be in the form of a posynomial [16]. Our objective function is not a posynomial. So we will compute a posynomial approximation of our objective function based on the methodology given in Section 8.2 of [16]. The approach of computing a posynomial approximation of a given function and then using geometric programming to solve it is referred to as signomial programming, discussed in detail in [16]. The following is a mathematical description of the approximations and redefinitions that we use. We will redefine the indicator function Ikcc (given in Equation 5) as: ( 1 if di k > D; Cij D 1; i < k j Ik .D; i; j; d/ D (12) 0 otherwise. Using this definition, Ercc .D; i; j; d/ is transformed as follows Ercc .D; i; j; d/ D j X kDi C1 Ikcc 2k D jX 1 kDi C1 Ik 2k Ij 2j (13) where Ikcc is shown in Equation 5. Because the new indicator function Ik is a non-negative function, the negative sign appears in the definition of Ercc .D; i; j; d/. Thus the combination of the indicator function in Eq. 5 and the error function in Eq. 4 in Section III-C results in the same value as the redefined indicator function in Eq. 12 and transformed error function in Eq. 13. We approximate Ik by 1=.1 C e 2.di k D/ / [15]. For 0, sgn.x/ tanh.x/, and we use D 200.1 Thus, our continuous and differentiable approximation of Eq. 13 is Ercc .D; i; j; d/ jX 1 1 1 2k 2j 2.d ij D/ 1 C e 2.di k D/ 1 C e kDi C1 where dij are linear functions of k .k /. We use the monomial approximation technique (Section 8.2 of [16]) for this expression. This results in Ercc .D; i; j; d/ jX 1 kDi C1 1 1 C e 2.di k D/ 1 2 2k 1 1 C e 2.dij D/ N a c1a .1 /2a .2 / : : : N .N / RC 2j (14) am and 2 R for all 1 m N . where c 2 We then construct the objective function Er.D; d/ as a posynomial [16]. For detailed examples please see [15]. As .v/ are the decision variables, we now express d in terms of .v/ and write the average error as Eravg .D; /. Then the problem is reduced to minimizing a posynomial subject to posynomial inequality constraints, giving us a geometric program in a standard form: Minimize Eravg .D; / D C X j D1 a1 a2 aN cj 1j .1 /2j .2 / : : : Nj .N / (15) subject to min-delay` ` .` / max-delay` ; k D 1; : : : ; N ! N X 1 a stat w C P` .` /D Energy Budget ` 2 and ` .` / ` `D1 where C is the number of possible carry chains in an n-bit adder and N is the number gates in the adder. In the case of an n D 16-bit D 156 [15]. adder, C D n.nC1/ 2 We use a standard geometric programming toolbox [16], [17] to solve this program. The solution of the first iteration is used to compute the posynomial approximation again, until the objective value starts to converge. This gives us the final allocation of delays to the components such that the average error is minimum for the given constraints. Using the delays allocated to the components, we can obtain the voltages to be supplied to them. The solution from the geometric program in Eq. 15, in principle, can assign any voltage to any gate under the given constraints. For a practical application of the solution we need to limit the number of supply voltages and also the number of voltage islands. We will present our approach to limit the number of supply voltages. Let the possible set of supply voltages be V and the maximum number of voltages be M . We then pick a M -combination of elements from V. The voltages from this subset are then assigned to the gates in the RCA with gates having a higher voltage in the geometric program solution getting a higher voltage from this subset. This process is referred to as binning. We exhaustively search through all possible binning schemes. Using Eq. 11 and the relationship between propagation delay and supply voltage we can compute a 1 This particular value of was chosen empirically by observing the plots of the two functions, sgn.x/ and tanh.x/, and that the transition from 1 to 1 is fast enough. TABLE II P ROPAGATION DELAY AND SUPPLY VOLTAGE VALUES FROM THE GEOMETRIC PROGRAM AND CORRESPONDING BINNED SUPPLY VOLTAGE VALUES FOR THE GATES IN F IG . 1 Gate Index 1 2 3 4 5 6 7 8 9 ` (ps) 44.6 46.9 34.0 44.6 40.8 33.3 39.3 38.5 33.3 ` (volts) 0.84 0.8 1.16 0.84 0.92 1.2 0.96 0.98 1.2 Binned ` (volts) 0.8 0.8 1.2 0.8 0.9 1.2 1.0 1.0 1.2 closed form solution of the average error. Our algorithm for supply voltage binning is explained in detail in [15]. We refer to the binned solution of the geometric program as the Binned Geometric Program Solution (BGPS) in this paper. As an example, we present the solution obtained from the geometric program and binning for the 3-bit RCA in Fig. 1 in Table II. VI. S IMULATION F RAMEWORK FOR RCA E XPERIMENTATION For comparison, we will consider the biased voltage scaling (BIVOS) approach of George et al. [10] modified as follows. First, we split the number of bits equally into four sets: for 16 bits, there are four sets of four bits each, while for 32 bits, there are four sets of 8 bits each. Then, we tried the following possible combinations of four distinct voltages, from 0.8V to 1.2V, assuming a step size of 0.1V: (i) f0:8V, 0:9V, 1:0V, 1:1Vg, (ii) f0:8V, 0:9V, 1:0V, 1:2Vg, (iii) f0:8V, 0:9V, 1:1V, 1:2Vg, (iv) f0:8V, 1:0V, 1:1V, 1:2Vg and (v) f0:9V, 1:0V, 1:1V, 1:2Vg where the voltages are assigned from lowest to highest from the LSB to the MSB. For example, 0.8V is the supply voltage for the least significant four bits (in the case of a 16bit RCA) or eight bits (for the 32-bit RCA) in four out of five of the cases above. We call this approach “naive-BIVOS” or n-BIVOS for short. In addition to n-BIVOS, we will also compare our approach with uniform voltage scaling or UVOS. All the simulations were performed in HSPICE using a 90nm process technology. Each result presented has been computed as an average over 10,000 additions with inputs drawn from a uniform distribution. The average error magnitude is computed by taking an average, over all of the additions simulated, of the absolute magnitude of the difference between the correct output and the overclocked approximate output. On the other hand, the average energy consumption is measured from our HSPICE simulations by taking an average, over all of the additions, of the total energy consumption. VII. E XPERIMENTAL R ESULTS Table III summarizes the key results of this paper. A solution found via BGPS (takes about 5 min for a 16-bit RCA) outperforms the two best prior approaches, UVOS and n-BIVOS, by as much as a factor of 2.58X and by a median of 1.58X. Though the target problem in this paper is to minimize error for a given Energy Budget, the dual problem would be to minimize energy for a given Error Budget. The simulations we have performed can give us initial results for this problem as well. For example, for the same average error of 36:8 the BGPS solution has an energy consumption of 77:14fJ whereas the UVOS solution consumes 132:9 fJ (savings of 1:72X) and an n-BIVOS solution consumes 139:92 fJ (savings of 1:81X). Energy Savings in Table III is the ratio of energy consumption of a correct RCA (no errors, 1.2V supply voltage) and the energy consumption of the overclocked adder giving us the amount of energy TABLE III S UMMARY OF RESULTS OF 16- BIT AND 32- BIT APPROXIMATE RIPPLE CARRY ADDERS n-bit D (ns) Average Error Magnitude UVOS 16-bit 16-bit 16-bit 16-bit 32-bit 32-bit 32-bit 32-bit 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 36.83 36.83 36.83 40.92 18171 15634 14892 14199 n-BIVOS BGPS 50.06 45.97 38.55 36.83 33704 24637 15215 10931 21.66 17.96 26.31 23.25 13978 11412 11142 8931 UVOS /BGPS 1.70 2.05 1.40 1.76 1.30 1.37 1.34 1.59 [2] (a) (b) (c) (d) Fig. 3. Images generated by (a) Correct adders (b) BGPS adders (c) n-BIVOS and (d) UVOS [3] [4] saved due to overclocking. The non-monotonic relationship between energy consumption and average error (as observed from Table III) is discussed in detail in [15]. [5] A. FFT Results Approximate circuits can be used in applications which do not demand 100% accuracy such as digital image processing. To demonstrate the efficiency of BGPS adders over UVOS and n-BIVOS adders, we construct three different 8-point approximate FFTs with these three types of approximate adders but with 40% less energy consumption than the correct adders. We used an image as input to the approximate FFT so that the quality tradeoff is perceptible. We use a 16-bit approximate RCA but using the image data as input to the adder. Then we collect average error for the three types of approximate adders through simulations in HSPICE using the same framework as described in Section VI except for the input data. We use a Gaussian noise source at the output of every adder in MATLAB to simulate the effect of overclocking with the mean and variance collected from HSPICE simulations of the RCA. This will result in an approximately computed FFT of the input image. We then perform a correct inverse-FFT in MATLAB of this approximate FFT of the input image. Our goal is to see the extent to which the data has been preserved in this experiment. We would expect, if both the FFT and the inverse-FFT were correct, that the final image would be an exact copy of the original image. The resultant images from the four cases (three approximate RCA and one correct RCA) that are discussed above are shown in Fig. 3. The image generated by the FFT using UVOS adders has a PSNR that is 15db lower than the image generated by the FFT using BGPS adders with the same energy consumption. Also, the PSNR of the image generated by the n-BIVOS adders is 8.5 dB lower than the image generated by the BGPS adders with the same energy consumption. [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] R EFERENCES [16] [1] L. Chakrapani, K. Muntimadugu, A. Lingamneni, J. George, and K. Palem, “Highly energy and performance efficient embedded computing through approximately correct arithmetic: A mathematical foundation and preliminary experimental validation,” in Proc. of the IEEE/ACM [17] n-BIVOS /BGPS 2.31 2.56 1.47 1.58 2.41 2.16 1.37 1.22 Energy Consumption(fJ) Energy Savings 110.78 128.14 132.9 139.92 132.74 139.41 152.54 159.3 1.41 1.22 1.18 1.12 2.07 1.97 1.8 1.73 Intl. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, 2008, pp. 187–196. S. M. Martin, K. Flautner, T. Mudge, and D. Blaauw, “Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads,” in Proc. of the Intl. Conf. on Computer Aided Design, 2002. T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic voltage scaling algorithms,” in Proc. of the Intl. Symp. on Low Power Electronics and Design, 1998, pp. 76–81. J. Chang and M. Pedram, “Energy minimization using multiple supply voltages,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, no. 4, pp. 436–443, Dec. 1997. A. Manzak and C. Chaktrabarti, “Variable voltage task scheduling algorithms for minimizing energy/power,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, no. 2, pp. 270–276, 2003. Y. Yeh, S. Kuo, and J. Jou, “Converter-free multiple-voltage scaling techniques for low-power CMOS digital design,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 1, pp. 172–176, 2001. Y. Yeh and S. Kuo, “An optimization-based low-power voltage scaling technique using multiple supply voltages,” in Proc. of the IEEE Intl. Symp. on Circuits and Systems, vol. 5, May 2001, pp. 535–538. S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, and T. M. Krisztian Flautner, “Self-tuning dvs processor using delay-error detection and correction,” in IEEE Journal of Solid-State Circuits(JSSC), 2006. D. Ernst, N. S. Kim, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, and K. Flautner, “Razor: Circuit-level correction of timing errors for low-power operation,” in IEEE MICRO special issue on Top Picks From Microarchitecture Conferences of 2004, 2005. J. George, B. Marr, B. Akgul, and K. Palem, “Probabilistic arithmetic and energy efficient embedded signal processing,” in Proc. of the IEEE/ACM Intl. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, 2006, pp. 158–168. S. Cheemalavagu, P. Korkmaz, K. V. Palem, B. E. S. Akgul, and L. N. Chakrapani, “A probabilistic CMOS switch and its realization by exploiting noise,” in Proc. of the IFIP Intl. Conf. on Very Large Scale Integration (VLSI-SoC), 2005, pp. 452–457. T. Kamei, T. Yamada, T. Koike, M. Ito, T. Irita, K. Nitta, T. Hattori, and S. Yoshioka, “A 65nm dual-mode baseband and multimedia application processor soc with advanced power and memory management,” in Proc. of the Asia and South Pacific Design Automation Conf., 2009, pp. 535– 539. AMD, “AMD Turion X2 Ultra and Turion X2 Key Architecture Features,” http://www.amd.com/uk/products/notebook/processors/ turion-_x2/Pages/turion-_x2-_mobile-_features.aspx, 2010. J.-M. Chang and M. Pedram, “Energy minimization using multiple supply voltages,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 5, no. 4, pp. 436 –443, Dec. 1997. Z. Kedem, V. Mooney, K. K. Muntimadugu, and K. Palem, “Optimizing energy to minimize errors in approximate ripple carry adders,” Rice University Technical Report, vol. TR11-02, 2011. S. Boyd, S. Kim, L. Vandenberghe, and A. Hassibi, “A tutorial on geometric programming,” Optimization and Engineering, vol. 8, no. 1, pp. 67–127, 2007. A. Mutapcic, K. Koh, S. Kim, and S. Boyd, “GGPLAB: A MATLAB toolbox for geometric programming,” 2006.