

#### ANALYSIS AND REDUCED THE COMPLEXITY OF ADPLL DESIGN ARCHITECTURE WITH K-BEST MIMO DETECTOR UP TO 1.5 GHZ

\*Eswaran.G, Mounika.R

\*Department of Electronics and Communication Engineering K.S.R College of Engineering, Tiruchengode, Tamilnadu, India.

**KEYWORDS:** MIMO (Multiple Input Multiple Output), QAM (Quadrature Amplitude Modulation), TDC (Time-to-Digital Converter), ADPLL (All-Digital Phase-Locked Loop).

## ABSTRACT

This paper presents a 7 ps/LSB, 0.02 mm², 3.9 mW@50MHz Time-to-Digital Converter (TDC) architecture together with a novel MIMO detector that targets 4 × 4 64-QAM detection in high-speed applications. Multiple ring oscillators with unique and variable frequencies are used to make N independent measurements of the time pulse, which is measured M times, in order to create transmitter and receiver diversity similar to that of M×N MIMO antenna arrays. We propose a fully-pipelined sorter that generates one result per clock cycle and thus greatly enhances the detection throughput. In addition, different K values are adopted at each layer to reduce the hardware complexity. The proposed design has been implemented in 0.18 nm CMOS technology and has 366K gates.

### **INTRODUCTION**

MIMO techniques have attracted much attention over the last decade because they can increase system capacity through spatial multiplexing or improve link quality through spatial diversity. Hence, they are widely incorporated in recent wireless communication systems. At the receiver, MIMO signal detection plays an important role in meeting the stringent requirements of real-time processing. Although maximum likelihood (ML) detection provides the optimal solution for MIMO systems, the computational complexity of a full search grows exponentially with the constellation size and the number of antennas. Thus, its adoption in high-throughput spatial-multiplexing MIMO systems is impractical.

One of the main blocks of an ADPLL is the TDC. This block measures the phase error between the input reference clock and the divided feedback clock, as shown in Figure 1, and gives a digital output. The ADPLL uses the phase error to adjust the frequency and the phase of the output clock so that its frequency is N.F times the frequency of the input reference clock, where N is the integer part and F is the fractional part of a fractional clock multiplication value. Compared to high-frequency wireless mobility products that push for absolute performance, such as a TDC gain of <1 ps/LSB, compromises in performance can be acceptable in the automotive industry's wired applications. Because the cable lengths in cars are short and the shielding is quite strong, the requirements for phase noise and spur performance can be relaxed, which allows a TDC gain on the order of 1 to 10 ps/LSB. Previous work on synthesizable TDCs exists in [1], [3], but the TDC proposed in this study is novel in that it implements an improved digital signal processing scheme to decrease the effective TDC gain, and it is implemented in an all-digital design flow compatible with synthesis, Auto Place and Route (APR) and IC fabrication using only standard library cells in the TSMC65 process.

The proposed architecture uses Verilog RTL coding in general, with a gate-level Verilog section for the ring oscillators. Previous work in the literature includes articles claiming all-digital operation. However, [8] and [5] contain custom gates with various methods that introduce analog behaviour within the cells, [4] and [6] are digital only at the block interface, and [7] is not synthesizable.





Figure 1. a. All Digital PLL b. Single 2x1 TDC

In the following, the MIMO system model and sphere decoding algorithms are described in Sec. II. The main issues related to the throughput are discussed in Sec. III. The proposed hardware architecture and design techniques are described in Sec. IV. Simulation and implementation results are provided in Sec. V. Finally, a brief conclusion is given in Sec. VI.

### MIMO SYSTEM MODEL AND SPHERE DECODING

#### MIMO System Model

Consider a MIMO system with *M* transmit antennas and *N* receive antennas. Denote the transmit signal vector as $\mathbf{s} = [s_1 \ s_2 \ \cdots \ s_M]^T$, where $s_i$ represents a symbol from a complex constellation such as QPSK, 16-QAM or 64-QAM. The receive signal vector of dimension $N \times 1$ then takes the form of

$$\mathbf{y} = [y_1 \ y_2 \ \cdots \ y_N]^T = \mathbf{H}\mathbf{s} + \mathbf{n}, \tag{1}$$

where **H** is an $N \times M$ channel matrix with independent and identically distributed complex Gaussian elements of unit variance and **n** is the noise vector. The complex-valued model in (1) can be transformed to a real-valued form by real-valued decomposition (RVD), and it becomes

$$\begin{bmatrix} \mathrm{Re}(\mathbf{y}) \\ \mathrm{Im}(\mathbf{y}) \end{bmatrix} = \begin{bmatrix} \mathrm{Re}(\mathbf{H}) & -\mathrm{Im}(\mathbf{H}) \\ \mathrm{Im}(\mathbf{H}) & \mathrm{Re}(\mathbf{H}) \end{bmatrix} \begin{bmatrix} \mathrm{Re}(\mathbf{s}) \\ \mathrm{Im}(\mathbf{s}) \end{bmatrix} + \begin{bmatrix} \mathrm{Re}(\mathbf{n}) \\ \mathrm{Im}(\mathbf{n}) \end{bmatrix}. \tag{2}$$

As a result, the dimensions of y, H and s are  $2N \times 1$ ,  $2N \times 2M$  and  $2M \times 1$ , respectively.
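The real-valued decomposition can be checked numerically. The following is a minimal sketch, assuming NumPy and a noise-free 4 × 4 system; the function name `real_valued_decomposition` and the random test data are illustrative and not part of the original design.

```python
import numpy as np

def real_valued_decomposition(H, y):
    """Stack real and imaginary parts so the complex model becomes real-valued."""
    H_r = np.block([[H.real, -H.imag],
                    [H.imag,  H.real]])       # 2N x 2M
    y_r = np.concatenate([y.real, y.imag])    # 2N x 1
    return H_r, y_r

# Illustrative noise-free 4x4 example with 64-QAM symbols (assumed test data).
rng = np.random.default_rng(0)
M = N = 4
H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
levels = np.array([-7, -5, -3, -1, 1, 3, 5, 7])
s = rng.choice(levels, M) + 1j * rng.choice(levels, M)
y = H @ s

H_r, y_r = real_valued_decomposition(H, y)
s_r = np.concatenate([s.real, s.imag])        # 2M x 1
print(H_r.shape, np.allclose(H_r @ s_r, y_r))
```

The real-valued product reproduces the stacked complex result, which is why the detector can work on a 2M-layer real-valued tree.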

#### Sphere Decoding Algorithm

The maximum likelihood detection of (2) is given by

$$\mathbf{s}_{ML} = \arg\min_{\mathbf{s} \in \Omega^{2M}} \|\mathbf{y} - \mathbf{Hs}\|^2, \tag{3}$$

where  $\Omega$  is the set consisting of real entries in the constellation with *L* bits per symbol, and the size of the set  $\Omega$  is denoted by  $M_c=2^{L/2}$ , e.g., for 64-QAM,  $\Omega = \{-7, -5, -3, -1, 1, 3, 5, 7\}$ , L=6 and  $M_c=8$ .
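As a small illustration of this definition, the real-valued alphabet can be generated from the number of bits per symbol. This is a sketch only; the helper name `real_alphabet` is hypothetical.

```python
def real_alphabet(bits_per_symbol):
    """Return the real-valued PAM levels of a square QAM constellation."""
    m_c = 2 ** (bits_per_symbol // 2)             # M_c = 2^(L/2)
    return [2 * k - (m_c - 1) for k in range(m_c)]

omega_64qam = real_alphabet(6)
print(omega_64qam, len(omega_64qam))
```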



Fig.2. Search tree for 64-QAM 4 × 4 RVD system.

Exhaustive search is one approach to solve (3). However, the search space is enormous for large constellations and several antennas. In light of the huge complexity, sphere decoding was proposed for MIMO detection problems [5]. The channel matrix is first triangularized by QR decomposition, i.e. $\mathbf{H} = \mathbf{QR}$, where **Q** is a unitary matrix and **R** is an upper-triangular matrix. Thereafter, (3) can be rewritten as

$$\mathbf{s}_{ML} = \arg\min_{\mathbf{s} \in \Omega^{2M}} \|\mathbf{z} - \mathbf{Rs}\|^2, \tag{4}$$

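where $\mathbf{z} = \mathbf{Q}^H \mathbf{y}$. The optimal solution $\mathbf{s}_{ML}$ is the one that has the minimal Euclidean distance between **z** and **Rs**. Because **R** is upper-triangular, the Euclidean distance can be expanded as

$$\|\mathbf{z} - \mathbf{Rs}\|^2 = \sum_{i=1}^{2M} \Big| z_i - \sum_{j=i}^{2M} R_{ij} s_j \Big|^2. \tag{5}$$

With $\mathbf{s}_i = [s_i \ s_{i+1} \ \cdots \ s_{2M}]^T$, the partial Euclidean distance (PED) is accumulated as

$$PED_i(\mathbf{s}_i) = PED_{i+1}(\mathbf{s}_{i+1}) + e_i(\mathbf{s}_i), \quad e_i(\mathbf{s}_i) = \big| H_i(\mathbf{s}_{i+1}) - R_{ii} s_i \big|^2, \quad H_i(\mathbf{s}_{i+1}) = z_i - \sum_{j=i+1}^{2M} R_{ij} s_j, \tag{6}$$

where the index *i* runs from 2*M* to 1 and $PED_{2M+1}(\mathbf{s}_{2M+1}) = 0$. With (6), the Euclidean distance can be calculated iteratively. The MIMO detection problem is now reformulated as a tree search problem, as shown in Fig. 2. The tree contains 2*M* layers indexed from 2*M* to 1 starting at the root node. Each parent node has $M_c$ child nodes. The distance increment $e_i(\mathbf{s}_i)$ corresponds to the branch metric at layer *i*, while $PED_i(\mathbf{s}_i)$ is the path metric along a certain path from the root node to the target node at layer *i*. Note that $H_i(\mathbf{s}_{i+1})$ depends on the path history only and is identical for all the child nodes extended from the same parent node. The optimal solution is defined by the path with the smallest path metric at the leaf node.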
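The iterative distance calculation can be sketched as follows. The sketch assumes the usual sphere-decoding definitions, namely a history term $H_i = z_i - \sum_{j>i} R_{ij} s_j$ and a branch metric $e_i = |H_i - R_{ii} s_i|^2$, together with NumPy and random test data. The function name `partial_distances` is hypothetical.

```python
import numpy as np

def partial_distances(R, z, s):
    """Accumulate the PED from the root layer (last row of R) down to layer 1."""
    n = len(s)
    ped = 0.0                                   # PED_{2M+1} = 0
    history = []
    for i in range(n - 1, -1, -1):              # 0-based row i is layer i + 1
        h_i = z[i] - R[i, i + 1:] @ s[i + 1:]   # depends on decided symbols only
        e_i = (h_i - R[i, i] * s[i]) ** 2       # branch metric at this layer
        ped += e_i
        history.append(ped)
    return history

# Assumed real-valued 8 x 8 test system (4 x 4 antennas after RVD).
rng = np.random.default_rng(1)
H = rng.standard_normal((8, 8))
s = rng.choice([-7, -5, -3, -1, 1, 3, 5, 7], 8).astype(float)
y = H @ s + 0.1 * rng.standard_normal(8)
Q, R = np.linalg.qr(H)
z = Q.T @ y
peds = partial_distances(R, z, s)
print(np.isclose(peds[-1], np.sum((z - R @ s) ** 2)))
```

The PED at the leaf equals the full Euclidean distance, and it never decreases along a path, which is the property that pruning by path metric relies on.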

In the breadth-first algorithm, the decoder steps down to the next layer only when all the nodes in the current layer have been explored. To restrict the exponential growth of the number of nodes visited in the tree, the K-best algorithm is often used instead. It preserves only the K survival nodes with the smallest path metrics at each layer. Its VLSI architecture likewise contains a metric computation unit, but an extra sorter is required to decide the survival nodes. A pipelined architecture can easily be applied to K-best algorithms, and its throughput depends on the module that occupies the most clock cycles. Usually, the bottleneck lies in the sorter [3][4]. Hence, reducing the sorting latency in the K-best SD is essential for increasing the throughput.
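A software sketch of the K-best search may help to fix ideas. It is not the hardware architecture. It assumes NumPy, a real-valued upper-triangular system, a single K for all layers, and a plain software sort in place of a hardware sorter. The name `k_best_detect` and the test data are illustrative.

```python
import numpy as np

def k_best_detect(R, z, alphabet, k):
    """Breadth-first K-best search over an upper-triangular system z ~ R s."""
    n = R.shape[0]
    survivors = [(0.0, [])]          # (path metric, decided symbols s[i+1..n-1])
    for i in range(n - 1, -1, -1):   # root layer first
        candidates = []
        for ped, tail in survivors:
            # Contribution of the already decided symbols (path history).
            h_i = z[i] - sum(R[i, i + 1 + j] * t for j, t in enumerate(tail))
            for sym in alphabet:
                e_i = (h_i - R[i, i] * sym) ** 2          # branch metric
                candidates.append((ped + e_i, [sym] + tail))
        # A plain software sort stands in for the hardware sorter here.
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:k]   # keep the K smallest path metrics
    best_ped, best_path = survivors[0]
    return np.array(best_path), best_ped

# Noise-free toy example (assumed data): the transmitted vector is recovered.
rng = np.random.default_rng(2)
alphabet = [-7, -5, -3, -1, 1, 3, 5, 7]
H = rng.standard_normal((8, 8))
s_true = rng.choice(alphabet, 8)
Q, R = np.linalg.qr(H)
z = Q.T @ (H @ s_true)
s_hat, ped = k_best_detect(R, z, alphabet, k=8)
print(np.array_equal(s_hat, s_true))
```

Each layer expands at most `k * len(alphabet)` candidates, so the work grows linearly with the number of layers rather than exponentially.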

## **PROPOSED K-BEST SPHERE DECODER**

In the proposed 4×4 MIMO detector supporting RVD signals, eight pipelined layer-processing blocks, each dealing with the operations at one layer, are cascaded as shown in Fig. 3. At the eighth layer, the number of survival nodes to be reserved equals the total number of child nodes. Thus, eight path metric computation units (MCUs) are constructed. For the eight possible $s_8 \in \Omega$, eight values of $e_8(s_8)$ are calculated. In addition, eight history computation units (HCUs) are responsible for the $H_7(\mathbf{s}_8)$ values that are required at the next layer. Each of the remaining layer-processing blocks contains a sorter in addition to the path metric computation units and history computation units. The MCU computes $PED_i(\mathbf{s}_i)$ on demand as in [4], and the sorter determines the survival nodes. The HCU calculates $H_{i-1}(\mathbf{s}_i)$ whenever the sorter generates a survival node. In order to achieve high throughput, a fully pipelined sorter with one output per cycle is designed so that $\Box$ can be reduced to 1 in (7). Furthermore, several hardware-saving techniques are employed to reduce the silicon cost. They are discussed in detail in the following.

### FULLY-PIPELINED SORTING STAGES

Once the $K_{i+1}$ SE (Schnorr-Euchner) sequences of the $K_{i+1}$ surviving parent nodes are obtained, they enter the fully pipelined sorting stages. To determine $K_i$ survival nodes, $K_i$ stages are cascaded as shown in Fig. 4, and the sorter generates one sorting result per clock cycle. The initial stage implements $K_{i+1}$ MCUs and a full comparison tree. The MCUs calculate the PEDs of the first child node in each of the $K_{i+1}$ SE-enumeration lists, and the $K_{i+1}$ PED values are sent to a full comparison tree built from "compare and select" (CS) modules, as shown in Fig. 5(a), to obtain the minimal PED. The parent-node index (*SurvivalListID*) from which the minimal PED is derived and its child-node enumeration index are also sent to the next stage. Each stage from the second to the last (called a reduced stage, as shown in Fig. 5(b)) contains only one MCU and a reduced comparison tree. From the parent-node index (*SurvivalListID*) and the child-node enumeration index selected at the previous sorting stage, the desired sibling node can be extracted easily. Since only one of the eight candidates to be compared is updated, only one MCU is required. Moreover, a full comparison tree is unnecessary because seven PED values remain unchanged and so do their comparison results. Consequently, the redundant CS modules are eliminated by storing the intermediate results. The reduced comparison tree contains three CS modules that use the updated PED signal and the signals adjacent to the selected signal along the comparison tree of the previous sorting stage.
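The stage-by-stage selection can be modelled in software as repeated selection of the smallest head among several ascending lists. This is a behavioural sketch only. It assumes each parent's child PEDs are already in ascending (SE) order and that the requested number of survivors does not exceed the number of available candidates. The function `staged_select` and the example values are hypothetical, and the stored intermediate comparison results of the reduced tree are not modelled.

```python
def staged_select(se_lists, k_out):
    """Pick k_out smallest PEDs, one per stage, from ascending per-parent lists."""
    heads = [0] * len(se_lists)    # child-node enumeration index per parent
    survivors = []
    for _stage in range(k_out):
        # Comparison tree: compare the current head of every list.
        best_parent = min(
            (p for p in range(len(se_lists)) if heads[p] < len(se_lists[p])),
            key=lambda p: se_lists[p][heads[p]],
        )
        child = heads[best_parent]
        survivors.append((se_lists[best_parent][child], best_parent, child))
        # Only the selected list changes its candidate for the next stage,
        # which is why a reduced stage needs a single new PED.
        heads[best_parent] += 1
    return survivors

# Illustrative PED lists for three parents (made-up values).
se_lists = [[0.5, 2.0, 4.5], [0.2, 0.9, 3.0], [1.1, 1.2, 5.0]]
result = staged_select(se_lists, 4)
print(result)
```

Each tuple carries the PED, the parent index (the role played by *SurvivalListID*) and the child enumeration index, which is the information a stage passes to its successor.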





Fig.3. Slicer and region map for enumeration of 64-QAM RVD MIMO signals.



Fig.4. Block diagram of the Ki-stage sorter.



Fig.5. (a) Full comparison tree, and (b) reduced comparison tree.



Fig.6. Performance comparison of the conventional 8-best SD and the proposed K-best SD under different configurations and its chip layout.



### TIME INPUTS

The TDC has two states, in which it accepts either positive or negative time inputs. While the system acts in the SIMO mode during the lock phase, the input range supports a maximum time input of 40 ns. After phase lock is achieved, the maximum time input statistically decreases to ~4 ns, which allows the system to go into the MIMO mode, where the time input range is limited to half of the REFCLK period. The reference clock leading the feedback clock creates a positive time input, while the feedback clock leading the reference clock creates a negative time input. In the idle state, either the positive or the negative conversion state is selected on a first-come, first-served basis, and the system stays in this state until the 1st and 2nd conversions are completed. Phase detection logic creates a *+time* pulse from the *REFCLK* rising edge to the *FBCLK* rising edge and a *-time* pulse from the *FBCLK* rising edge to the *REFCLK* rising edge. Using matched delay and inverter cells, the + and - *time* pulses are delayed for use in the 2nd time conversion. With minimum feedback division, the window for the second conversion can be reduced to 5 output clock cycles before the next period of the reference clock. For a minimum reference clock period of 10 ns, this translates into a required time input delay between 6 ns and 10 ns in all corners.

#### TWO STEP RING OSCILLATOR WITH MULTI-PHASE COUNTERS

A ring oscillator composed of 7 stages is implemented using NAND gates, and one of the stages has an oscillation enable signal, as shown in Figure 7. Each NAND gate output drives dangling inverters whose strength is adjusted for each TDC instantiation in order to create a unique gain for each parallel chain. Tri-state buffers are also connected between each NAND gate output and input in order to create a positive feedback loop and change the frequency of the ring when enabled during the second conversion. In the idle state the ring oscillators are stopped, and oscillation starts in each TDC instance only when a conversion starts. The output is registered at the end of the first conversion, and the same counters and adders are reused during the second conversion in order to save area. At the end of the conversion, two output code words from the oscillator stage are provided in an 11-bit two's complement representation to the post-processing section, as shown in Figure 3. The output code is negative if the feedback clock is leading the reference clock and positive if the reference clock is leading the feedback clock. The tri-state buffer strengths and dangling inverter sizes are adjusted in order to get typical TDC gains of <17, 18.5>, <20, 21.5>, <23, 24.5>, <26, 27.5> ps/LSB.

Fig.7. TDC quantization noise histograms
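The sign convention and the 11-bit two's complement output can be illustrated with a simple behavioural model. It is a sketch under stated assumptions: an ideal, linear quantizer that truncates toward zero and saturates at the code limits, with a single gain value taken from the typical gains listed above. The function `tdc_code` is hypothetical and does not model the two-step conversion or the multi-phase counters.

```python
def tdc_code(t_ref_ps, t_fb_ps, gain_ps_per_lsb, bits=11):
    """Return (signed code, two's complement word) for one time input."""
    dt = t_fb_ps - t_ref_ps                    # > 0 when REFCLK leads FBCLK
    code = int(dt / gain_ps_per_lsb)           # assumed: truncate toward zero
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    code = max(lo, min(hi, code))              # assumed: saturate at the limits
    return code, code & ((1 << bits) - 1)      # raw bits of the output word

print(tdc_code(0, 170, 17))     # reference leads by 170 ps
print(tdc_code(170, 0, 17))     # feedback leads by 170 ps
```

With a gain of 17 ps/LSB, a 170 ps lead of the reference clock gives code +10, and the same lead of the feedback clock gives code -10, stored as the 11-bit word 2038.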

TABLE I. CHIP CHARACTERISTICS AND COMPARISON TO PREVIOUS 4 × 4 DESIGNS.

| Reference       | [6]    | [1]         | [3]    | [4]    | This work |
|-----------------|--------|-------------|--------|--------|-----------|
| Modulation      | 16-QAM | 16-QAM      | 16-QAM | 64-QAM | 64-QAM    |
| Method          | K-best | Depth-first | K-best | K-best | K-best    |
| *K*-value(s)    | 5      | -           | 5      | 10     | 6         |
| Process (µm)    | 0.35   | 0.5         | 0.25   | 0.13   | 0.13      |
| Gate count      | 91K    | 50K         | 93K    | 114K   | 366K      |
| Frequency (MHz) | 100    | 71          | 132    | 282    | 61        |
| Real/Complex    | Real   | Complex     | Real   | Real   | Complex   |



| Throughput | 54  | 170 | 424 | 675 | 1600 |
|------------|-----|-----|-----|-----|------|
| (Mbps)     | 626 | N/A | N/A | 135 | 400  |
| Power (mW) | 594 | N/A | N/A | 200 | 120  |
| Normalized | 24  | -   | 0.4 | 0.6 | 0.55 |

## CONCLUSION

By implementing a novel MIMO scheme that allows synthesis with standard cells only, the TDC is realized in an all-digital flow and achieves its portability and flexibility goals. The TDC achieves superior power, area, and resolution performance compared to similar designs. This TDC is designed as part of an effort to implement an ADPLL using only standard cells. We have also presented a $4 \times 4$ K-best MIMO detector for 64-QAM systems in high-speed applications. The proposed architecture contains fully-pipelined sorting stages that generate one output per cycle so as to satisfy high-throughput requirements, and it has good power efficiency. In post-layout simulation, the design operates at 62.5 MHz and achieves 1.5 Gbps throughput. It also has low normalized energy per bit compared to previous works.
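The quoted throughput is consistent with a simple back-of-the-envelope calculation, sketched below. It assumes that one detected 4 × 4 64-QAM symbol vector leaves the pipeline per clock cycle. The function name and the `cycles_per_vector` parameter are illustrative.

```python
def peak_throughput_mbps(num_tx, bits_per_symbol, clock_mhz, cycles_per_vector=1):
    """Bits per detected vector times vectors per second, in Mbps."""
    return num_tx * bits_per_symbol * clock_mhz / cycles_per_vector

# 4 transmit antennas x 6 bits (64-QAM) at 62.5 MHz.
rate = peak_throughput_mbps(4, 6, 62.5)
print(rate)   # 1500.0 Mbps, i.e. 1.5 Gbps
```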

## REFERENCES

- 1. A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, "VLSI implementation of MIMO detection using the sphere decoding algorithm," *IEEE J. Solid-State Circuits*, vol. 40, pp. 1566-1577, Jul. 2005.
- 2. K. Wong, C. Tsui, R. Cheng, and W. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels," in *Proc. Int. Symp. Circuits and Systems (ISCAS 2002)*, vol. 3, May 2002, pp. 273-276.
- 3. M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mbps," in *Proc. Int. Symp. Circuits and Systems (ISCAS 2006)*, May 2006, pp. 1151-1154.
- 4. M. Shabany and P. G. Gulak, "A 0.13 μm CMOS 655 Mb/s 4x4 64-QAM K-best MIMO detector," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, pp. 256-257, Feb. 2009.
- 5. E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," *IEEE Trans. Inf. Theory*, vol. 48, no. 8, pp. 2201-2214, Aug. 2002.
- 6. S. Mandai, T. Iizuka, T. Nakura, M. Ikeda, and K. Asada, "Time-to-digital converter based on time difference amplifier with non-linearity calibration," in *Proc. ESSCIRC 2010*, pp. 266-269, 14-16 Sept. 2010.
- 7. M. Zanuso, S. Levantino, A. Puggelli, C. Samori, and A. L. Lacaita, "Time-to-digital converter with 3-ps resolution and digital linearization algorithm," in *Proc. ESSCIRC 2010*, pp. 262-265, 14-16 Sept. 2010.
- 8. D. Shin, K. Jabeom, Y. Won-Joo, C. Young Jung, and K. Chulwoo, "A fast-lock synchronous multi-phase clock generator based on a time-to-digital converter," ISCAS 2009.