ABSTRACT
In this paper we propose a novel implementation of FIR Filters based on parallel input serial output structure. Several simplifications have been achieved due to using reduced binary tree multipliers. Properties common to all coefficients of a filter have been exploited to modify and improve the performance of the structure as well as to decrease both the area and power consumption when compared to a state-of-the-art implementation.

1. INTRODUCTION
In recent years there has been a clear trend to move as many components of a system from the analog to the digital domain. In that struggle digital filtering has become one of the most challenging tasks. Digital filters can be made much more accurate than their analog counterparts, but they also need a large number of resources, especially multipliers, to perform the desired operation. Therefore, in terms of area and power consumption, digital filtering has become one of the most critical subsystems in many applications, like e.g. Software Defined Radio (SDR) [5]. Optimizing computational complexity of digital filters is a goal that has been widely investigated in the past years. Some approaches targeted algorithmic reduction of computational effort (e.g. [1], [6], [7]), some others aimed at simplifications in hardware complexity and architectural modifications (e.g. [8], [4], [2]). The goal of both approaches is similar: reduce the number or cost of used multipliers. In this work we present a method of implementing multipliers and, based on that, FIR filters. By applying the structure of the binary tree the properties common to all taps of a filter can be extracted, leading to significant reduction in computational complexity of multipliers and to savings in power consumption and area. Also, in some cases, shorter path delays and thus higher clock frequencies can be achieved.

At this point it should be stressed, that the following considerations apply to filters with a priori unknown coefficients for which the multipliers cannot in general be realized by means of shift-and-add operations. On the contrary, high-speed, full multipliers are required, as the filter is employed in applications like e.g. signal processing at intermediate frequency in a reconfigurable mobile communications system.

The idea to base a multiplier on a binary tree structure has been presented in [3]. The tree consists of several adders, which work in parallel often performing operations yielding the same result. Optimization potential can be exploited by replacing the adders with multiplexers, as in the reduced binary tree multiplier of Section 2. In a similar manner simplifications in performed arithmetics can be applied to a more general design like parallel-in-serial-out FIR filter. This will be explained in Section 3. The advantages of the modified filter are supported by experimental results of Sect. 4, while a conclusion is given in Section 5.

2. REDUCED BINARY TREE MULTIPLIER
Out of many multiplier structures presented in the past, the simplest solutions have been based on an array of full-adders. While very area-efficient, these multipliers are rather slow and very disadvantageous in terms of power consumption. Due to very high number of paths and hence resulting high glitching activity in the array, often alternative architectures are favored.

The widely-used, state-of-the-art approach is the Wallace-tree-multiplier. For word-lengths of more than 16 bits a booth-encoded Wallace-tree is used, while for smaller word-lengths a non-booth-encoded is preferred. Wallace-trees are extremely fast, have low power consumption but need more area than array-multipliers, especially for word-lengths of up to 16 bits. Nevertheless, they achieve very good power-area and delay-area products.

A different technique is to use a multiplier based on a binary tree (BT) (Fig.1). This approach is very efficient for up to 16 bit word-length and can be designed to be both faster and smaller than array multiplier while keeping the power-area and delay-area products only slightly larger than for Wallace-trees. From the BT-structure a reduced binary tree multiplier can be derived. It em-
employs a reduced number of adders, which contributes to savings in area and power consumption.

The first row of adders in Fig.1 can be obtained by splitting up the multiplier array into blocks. Each block consists of one adder summing up two numbers. Let us consider a multiplication of a data with 12 and a coefficient with 4 bits word-length. The array below shows the result of the multiplication of an input data data = "000111011100" with a coefficient C = "1111" in a binary tree.

![Binary Tree Diagram]

In the case of a four bit coefficient, for an array multiplier it holds:

\[ \text{result} = \text{data} + \text{data} \cdot 2 + \text{data} \cdot 2^2 + \text{data} \cdot 2^3, \]

while for the binary tree:

\[ \text{result} = (\text{data} + \text{data} \cdot 2) + (\text{data} \cdot 2^2 + \text{data} \cdot 2^3), \]

with the brackets resulting from forming groups of adders. The number of levels (rows) of a binary tree (BT) required to perform a multiplication is

\[ \text{rows} = \lceil \log_2(n) \rceil \]

with \( n \) being the word-length of the coefficient. Depending on the bitmap of the coefficient \( C \), the result of each addition of partial products (adders in the first row of Figure 1) can have only four distinguished values:

\[ \text{Sum} \in \{0, \text{data}, \text{data} \cdot 2, \text{data} + \text{data} \cdot 2\} \]

From the above we can draw the structure of a reduced binary tree (RBT) multiplier as shown in Fig.2. Obviously, the advantage of the reduced number of adders, and thus the area savings, comes at the expense of additional delay introduced by the multiplexer. On the other hand, there’s only one adder for the whole first row, which yields smaller area and reduced power consumption.

3. REDUCED BINARY TREE FIR

A disadvantage of the RBT-multiplier is, that the sum \( \text{data} + \text{data} \cdot 2 \) is always calculated, even if it is not needed due to the value of the coefficient. In some cases this could lead to increased power consumption when compared to the standard binary tree solution. Therefore it is controversial if it is reasonable to employ this technique in general purpose multipliers. However, in some applications, like FIR-filters, utilizing reduced binary tree multipliers leads to significant savings in area and power consumption.

3.1. MUX-multiplier

The most straightforward approach to apply RBT-multipliers to FIR filters is to replace standard multipliers with the modified structure. This doesn’t change anything in the parallel-serial structure of the filter (Fig.3).

3.2. First row at filter input

From Figures 2 and 3 we notice, that the sum \( \text{data} + \text{data} \cdot 2 \) is being calculated for every coefficient independently. Since the incoming \( \text{data} \) is the same for all multipliers, the sum \( \text{data} + \text{data} \cdot 2 \) is also equal. Therefore it is sufficient to evaluate this common value only once at the input of the filter. The adder responsible for performing this sum (adder on the top in Fig. 2) can be removed from all RBT-multipliers. Thus, the resulting structure looks like a binary tree multiplier (Fig. 1) with the adders of the first row replaced by multiplexers. This circuit, which is not a multiplier anymore and will be referred to as common first adder (CFA), has a shorter delay time. The FIR filter utilizing CFAs (Fig. 4), which
### 3.3. Two rows at input

Common adders in conjunction with multiplexers can replace not only the first, but also the second row of the binary tree. In FIR structure, the resulting substitute for the multiplier, here referred to as **common two rows** (CTR), contains only one single multiplexer in place of the first two rows of the multiplier. This multiplexer has to switch between 16 possible output combinations out of which 9 can be derived from other results while the remaining 7 additions need to be pre-calculated. Also here the main drawback is that the computations are performed no matter if it is required for the given set of coefficients or not. The hardware employed to obtain these numbers is common for all coefficients and thus can be moved to the input of the filter. A register bank has been added to accumulate the 7 independent results. The resulting FIR-filter is depicted in Figure 5 with six out of seven adders placed at the input of the filter as to form an arithmetic unit.

![Figure 3: Parallel-In Serial-Out FIR filter.](image)

**Figure 3:** Parallel-In Serial-Out FIR filter.

is obtained at the expense of two additional registers, will have two important advantages:

- The critical path in the multipliers will be shorter, resulting in a filter with a higher admissible clock frequency and
- the area and power consumption can be reduced.

![Figure 4: First row adders can be moved to the filter input.](image)

**Figure 4:** First row adders can be moved to the filter input.

### Table 1: Comparison of power consumption (in mW) of four architectures (Wallace - Wallace tree, RBT - reduced binary tree multiplier, CFA - first row adders moved to the filter input, CTR - adders from two rows moved to the input). SAV - savings in percent.

<table>
<thead>
<tr>
<th>Bits</th>
<th>Wallace 17 taps (SAV)</th>
<th>Wallace 25 taps (SAV)</th>
<th>Wallace 75 taps (SAV)</th>
<th>RBT 17 taps (SAV)</th>
<th>RBT 25 taps (SAV)</th>
<th>RBT 75 taps (SAV)</th>
<th>CFA 17 taps (SAV)</th>
<th>CFA 25 taps (SAV)</th>
<th>CFA 75 taps (SAV)</th>
<th>CTR 17 taps (SAV)</th>
<th>CTR 25 taps (SAV)</th>
<th>CTR 75 taps (SAV)</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>1803 (17%)</td>
<td>2862 (16%)</td>
<td>9084 (14%)</td>
<td>1507 (17%)</td>
<td>2423 (16%)</td>
<td>7835 (14%)</td>
<td>1372 (24%)</td>
<td>2101 (27%)</td>
<td>6754 (26%)</td>
<td>1620 (10%)</td>
<td>2530 (12%)</td>
<td>7611 (17%)</td>
</tr>
<tr>
<td>16</td>
<td>12057 (28%)</td>
<td>17333 (27%)</td>
<td>52253 (26%)</td>
<td>8719 (28%)</td>
<td>12763 (27%)</td>
<td>39123 (26%)</td>
<td>7869 (35%)</td>
<td>10956 (35%)</td>
<td>34180 (35%)</td>
<td>8425 (30%)</td>
<td>12184 (30%)</td>
<td>40041 (24%)</td>
</tr>
<tr>
<td>32</td>
<td>80810 (41%)</td>
<td>117034 (44%)</td>
<td>353468 (43%)</td>
<td>47922 (41%)</td>
<td>66201 (44%)</td>
<td>203916 (43%)</td>
<td>42831 (47%)</td>
<td>59025 (50%)</td>
<td>178206 (46%)</td>
<td>45718 (44%)</td>
<td>63747 (46%)</td>
<td>207679 (42%)</td>
</tr>
</tbody>
</table>

To prove the advantages of the reduced binary tree FIR-filters presented here several designs have been implemented in a high-level description language. These designs have been tested for functional equivalence with the state-of-the-art approach. The delay and area have been estimated using commercially available tools, while power consumption was obtained from simulations on transistor level. We present here three design examples for word-lengths ranging from 8 to 32 bits. The same word-length has been used for data and filter coefficients. The input data has been obtained from a matlab model of a communications system in order to emulate a practically relevant input stream. The simulations extended over several thousand cycles at 50 MHz. In tables 1-3

\[ \text{data} - 2 \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{Arithm Unit} \]

\[ \text{Reg} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{Reg} \]

\[ \text{result} \]

\[ \text{Arithm Unit} \]

\[ \text{data} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]

\[ \text{data} \]

\[ \text{result} \]
a comparison of power consumption, area and delay for the four structures is given. The results are presented for three FIR low-pass filters with 17, 25 and 75 taps. Most savings in power consumption can be achieved with the structure of Fig. 4. This structure is also the best in terms of area. However it is slightly slower than the Wallace-tree reference for larger word-lengths. The structure, where the multiplier has simply been replaced by a reduced binary tree is the slowest one, but achieved savings in power and area are significant. The fastest is the design where both binary tree rows have been replaced by a multiplexer. Even if the multiplexer used here is very large, considerable savings in power consumption can be observed. Obviously, in most cases, the best choice is the reduced binary tree with first row adders moved to the filter input. Especially for the practical word-lengths (up to 16 bits) this structure offers best performance in terms of cost and out-performs the state-of-the-art Wallace-tree approach in all benchmarks. Please note, that the results presented in Tables 1 and 2 correspond to designs with the highest area efficiency, while Table 3 shows the highest speed possible without the area increase taken into account. Nevertheless, the presented savings in area and power consumption have been achieved with digital filters being able to run at 80 MHz (32 bits word-length) or more (smaller word-lengths). Filters running at such high frequencies are able to meet even the very stringent requirements set on applications like Software Defined Radio. Thus, binary tree FIR filters seem to be a very good alternative to the state-of-the-art implementation in almost all practical applications.

Table 2: Comparison of area (in \( \text{mm}^2 \)) of four architectures (Wallace - Wallace Tree, RBT - reduced binary tree multiplier, CFA - first row adders moved to the filter input, CTR - adders from two rows moved to the input). SAV - savings in percent.

<table>
<thead>
<tr>
<th>Bits</th>
<th>Wallace</th>
<th>RBT</th>
<th>CFA</th>
<th>CTR</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>0.100</td>
<td>0.148</td>
<td>0.448</td>
<td></td>
</tr>
<tr>
<td></td>
<td>(18%)</td>
<td>(18%)</td>
<td>(18%)</td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>0.394</td>
<td>0.581</td>
<td>1.680</td>
<td></td>
</tr>
<tr>
<td></td>
<td>(26%)</td>
<td>(26%)</td>
<td>(26%)</td>
<td></td>
</tr>
<tr>
<td>32</td>
<td>1.263</td>
<td>1.860</td>
<td>5.588</td>
<td></td>
</tr>
<tr>
<td></td>
<td>(14%)</td>
<td>(14%)</td>
<td>(14%)</td>
<td></td>
</tr>
</tbody>
</table>

Table 3: Comparison of multiplier delay (in ns) of four architectures (Wallace - Wallace Tree, RBT - reduced binary tree multiplier, CFA - first row adders moved to the filter input, CTR - adders from two rows moved to the input).

<table>
<thead>
<tr>
<th>Bits</th>
<th>Wallace</th>
<th>RBT</th>
<th>CFA at input</th>
<th>2 CTR at input</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>1.49</td>
<td>1.82</td>
<td>1.4</td>
<td>1.05</td>
</tr>
<tr>
<td>16</td>
<td>2.29</td>
<td>3.25</td>
<td>2.62</td>
<td>1.93</td>
</tr>
<tr>
<td>32</td>
<td>2.88</td>
<td>4.8</td>
<td>4.03</td>
<td>3.33</td>
</tr>
</tbody>
</table>

In this paper we have presented an advantageous way to implement parallel-in-serial-out FIR digital filters. The solution is based on reduced binary tree multipliers in which inter-communities have been exploited to obtain overall complexity reduction. The resulting design provides more flexibility to the designer, offering more freedom in choosing delay-power and delay-area trade-off than the state-of-the-art solution. In fact, it makes possible to design low-power high-speed FIR filters accepting an increased area. On the other hand, low-power, low-area design is possible with this method, if speed is not an issue. However, even in that case, the achieved critical path delays are small enough to satisfy the requirements of most practical designs. All implementations presented here provide significant savings in power consumption (up to 50%) and area (up to 29%), which makes them highly desirable in applications like e.g. mobile communications systems.

6. REFERENCES