An algebraic integer (AI)-based time-multiplexed row-parallel architecture and two final reconstruction step (FRS) algorithms are proposed for the implementation of bivariate AI encoded 2-D discrete cosine transform (DCT). The architecture directly realizes an error-free 2-D DCT without using FRSs between row-column transforms, leading to an 8 × 8 2-D DCT that is entirely free of quantization errors in AI basis. As a result, the user-selectable accuracy for each of the coefficients in the FRS facilitates each of the 64 coefficients to have its precision set independently of others, avoiding the leakage of quantization noise between channels as is the case for published DCT designs. The proposed FRS uses two approaches based on: 1) optimized Dempster-Macleod multipliers, and 2) expansion factor scaling. This architecture enables low-noise high-dynamic range applications in digital video processing that requires full control of the finite-precision computation of the 2-D DCT. The proposed architectures and FRS techniques are experimentally verified and validated using hardware implementations that are physically realized and verified on field-programmable gate array (FPGA) chip. Six designs, for 4-bit and 8-bit input word sizes, using the two proposed FRS schemes, have been designed, simulated, physically implemented, and measured.
A high-speed and reduced-area 2-D discrete wavelet transform (2-D DWT) architecture is proposed. Previous DWT architectures are mostly based on the modified lifting scheme or the flipping structure. In order to achieve a critical path with only one multiplier, at least four pipelining stages are required for one lifting step, or a large temporal buffer is needed. In this brief, modifications are made to the lifting scheme, and the intermediate results are recombined and stored to reduce the number of pipelining stages. As a result, the number of registers can be reduced to 18 without extending the critical path. In addition, the two-input/two-output parallel scanning architecture is adopted in our design. For a 2-D DWT with the size of , the proposed architecture only requires three registers between the row and column filters as the transposing buffer, and a higher efficiency can be achieved.
A high-throughput memory-efficient arithmetic coder architecture for the set partitioning in hierarchical trees (SPIHT) image compression is proposed based on a simple context model in this paper. The architecture benefits from various optimizations performed at different levels of arithmetic coding from higher algorithm abstraction to lower circuits' implementations. First, the complex context model used by software is mitigated by designing a simple context model, which just uses the brother nodes' states in the coding zerotree of SPIHT to form context symbols for the arithmetic coding. The simple context model results in a regular access pattern during reading the wavelet transform coefficients, which is convenient to the hardware implementation, but at a cost of slight performance loss. Second, in order to avoid rescanning the wavelet transform coefficients, a breadth first search SPIHT without lists algorithm is used instead of SPIHT with lists algorithm. Especially, the coding bit-planes of each zero tree are processed in parallel. Third, an out-of-order execution mechanism for different types of context is proposed that can allocate the context symbol to the idle arithmetic coding core with a different order that of the input. For the balance of the input rate of the wavelet coefficients, eight arithmetic coders are replicated in the compression system. And in one arithmetic coder, there exists four cores to process different contexts.
A new modulo 2n+1 multiplier architecture is proposed for operands in the weighted representation. A new set of partial products is derived and it is shown that all required correction factors can be merged into a single constant one. It is also proposed that part of the correction factor is treated as a partial product, whereas the rest is handled by the final parallel adder. The proposed multipliers utilise a total of (n+1) partial products, each n bits wide and are built using an inverted end-around-carry, carry-save adder tree and a final adder. Area and delay qualitative and quantitative comparisons indicate that the proposed multipliers compare favourably with the earlier solutions Published in:
We have designed a 2-bit Booth encoder with Josephson Transmission Lines (JTLs) and Passive Transmission Lines (PTLs) by using cell-based techniques and tools. The Booth encoding method is one of the algorithms to obtain partial products. With this method, the number of partial products decreases down to the half compared to the AND array method. We have fabricated a test chip for a multiplier with a 2-bit Booth encoder with JTLs and PTLs. It has a processing frequency of 20 GHz with the bias margin ±25%. The frequency of this circuit increases up to 45 GHz with the bias voltage by 25% increased from the design voltage. The circuit area of the multiplier designed with the Booth encoder method is compared to that designed with the AND array method..
Digital filters in cascade form enjoy many advantages over their equivalent single-stage realizations in that lower coefficient sensitivity, higher throughput, reduced computational and smaller implementation cost can be achieved. However, the numerical design and optimization of such structure are of much more difficulty than the single-stage case if the filter coefficients are restricted to be of discrete values. This is mainly due to the non-convexity of the constraints, which rules out the possibility of employing sophisticated convex optimization techniques as well as the guaranteed global optimality. In this work, a general-purpose algorithm is proposed for the design of linear phase finite impulse response (FIR) filters in cascade form with discrete coefficients. The proposed algorithm decomposes the overall filter into subfilters during the traverse of a tree search of the overall filter. Discrete-valued linear phase FIR filters are able to be searched and decomposed into both symmetric and non-symmetric subfilters.
This paper focuses on fixed-width multipliers with linear compensation function by investigating in detail the effect of coefficients quantization. New fixed-width multiplier topologies, with different accuracy versus hardware complexity trade-off, are obtained by varying the quantization scheme. Two topologies are in particular selected as the most effective ones. The first one is based on a uniform coefficient quantization, while the second topology uses a nonuniform quantization scheme. The novel fixed-width multiplier topologies exhibit better accuracy with respect to previous solutions, close to the theoretical lower bound
Since its acceptance as the adopted symmetric-key algorithm, the Advanced Encryption Standard (AES) and its recently standardized authentication Galois/Counter Mode (GCM) have been utilized in various security-constrained applications. Many of the AES-GCM applications are power and resource constrained and require efficient hardware implementations. In this paper, different application-specific integrated circuit (ASIC) architectures of building blocks of the AES-GCM algorithms are evaluated and optimized to identify the high-performance and low-power architectures for the AES-GCM. For the AES, we evaluate the performance of more than 40 S-boxes utilizing a fixed benchmark platform in 65-nm CMOS technology. To obtain the least complexity S-box, the formulations for the Galois Field (GF) subfield inversions in GF(24) are optimized. By conducting exhaustive simulations for the input transitions, we analyze the average and peak power consumptions of the AES S-boxes considering the switching activities, gate-level netlists, and parasitic information.
Carry Select Adder (CSLA) is one of the fastes t adders used in many data-processing processors to p erform fast arithmetic functions. From the structure of the CSLA, it is clear that there is scope for reducing the area and power consumption in the CSLA. This work uses a simple and efficient gat e-level modification to significantly reduce the area and p ower of the CSLA. Based on this modification 8-, 16-, 32-, and 64- b square-root CSLA (SQRT CSLA) architecture have been developed and compared with the regular SQRT CSLA architecture. The proposed design has reduced area and power as compared with the regular SQRT CSLA with only a slight increase i n the delay. This work evaluates the performance of the proposed designs in terms of delay, area, power, and their products by hand with logical effort and through custom design and layout in 0.18-m CMOS process technology. The results analysis shows t hat the proposed CSLA structure is better than the regular SQRT CSLA.
B. E (Computer Science)
B. E (Electronics and Communication)
B. E (Electrical and Electronics Eng.)
B. E (Information Technology)
B. E (Instrumentation Control and Eng.)
M. E (Computer Science)
M. E (Power Electronics)
M. E (Control System)
M. E (Software Engg)
M. E (Applied Electronics)
M. SC (IT , IT&M , CS&M, CS)
B.Sc. (IT , CS)