1. Field
Exemplary embodiments of the present invention relate to symbol timing recovery and, in particular, to symbol timing recovery with a multi-core processor.
2. Description of the Related Art
In communications systems, a transmitter sends information to a receiver over time as a data stream made up of symbols, such as bits of data. To interpret the data accurately, the receiver and transmitter should operate according to a common clock. However, although the receiver knows the transmission frequency, the receiver clock is typically not truly synchronized with the transmitter clock. Data transmitted over a wireless communication channel is corrupted by various impairments, such as fading, oscillator drift, frequency and phase offset, and receiver thermal noise. At the receiver, the system is also subject to noise and timing jitter in the time domain. As a result, the receiver must recover the clock associated with the received signal from the signal itself. The process of recovering the correct clock signal, or synchronization information, from the received signal of transmitted symbols is called symbol timing recovery (STR).
A timing recovery subsystem must sample the data at the correct instant and detect its peak for correct symbol timing recovery. Sampling just once at the receiver is ineffective because of noise, e.g., additive white Gaussian noise (AWGN). However, a matched filter (MF) can limit the noise at the receiver and, owing to correlation gain, provide a sampling point with a high signal-to-noise ratio (SNR).
The impulse response of the matched filter is a time-reversed and delayed version of the transmitted waveform. To maximize the signal-to-noise ratio for detection, a demodulator must form inner products between the incoming signal and a reference signal. That means it must time-align the locally generated reference signal with the received signal. Because the inner product is formed in a convolving filter, the demodulator must determine the precise time position at which to sample the input and output of the filter.
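The relationship between time reversal, convolution, and the sampling instant can be illustrated with a minimal sketch. The pulse shape, the offset, and the helper names `convolve` and `matched_filter` are hypothetical and chosen only for illustration; noise is omitted so that the peak location is easy to see.

```python
def convolve(x, h):
    """Full linear convolution of two real-valued sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def matched_filter(pulse):
    """Impulse response matched to a real pulse: its time-reversed copy.

    The delay that makes the filter causal is implicit in the list indexing.
    """
    return list(reversed(pulse))


# Hypothetical transmitted pulse (deliberately asymmetric) and arrival offset.
pulse = [1.0, 3.0, 2.0, 1.0]
offset = 5
received = [0.0] * offset + pulse + [0.0] * 6

# Convolving with the time-reversed pulse forms the sliding inner product.
mf_out = convolve(received, matched_filter(pulse))

# The best sampling instant is where the output peaks.
peak_index = max(range(len(mf_out)), key=lambda n: mf_out[n])
peak_value = mf_out[peak_index]
```

In this noiseless case the peak falls at `offset + len(pulse) - 1`, the instant at which the reference is fully aligned with the received pulse, and the peak value equals the pulse energy.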
Various methods have been tried to implement receivers that not only detect but also correct an incoming signal. These methods were first introduced in the analog domain. However, with the availability of digital integrated circuits, the process has been converted to the digital domain using transformation methods. A typical process for correcting an incoming signal at a receiver employs a phase-locked loop (PLL), which has three major components: (1) a timing error detection (TED) circuit; (2) a loop filter (LF) for averaging the error; and (3) a controlled oscillator, such as a numerically controlled oscillator (NCO), to advance or retard the timing so that the peak of the incoming signal is matched with the reference signal. Several methods are widely used for timing error detection. The goal is a TED that yields a high signal-to-noise ratio and is resource-efficient while maintaining the lowest possible sampling rate (ideally, 1 sample per symbol (spS)).
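The interaction of the three loop components can be sketched as follows. This is a simplified illustration under stated assumptions, not the loop of any particular embodiment: the matched-filter output is modeled as a hypothetical triangle (`mf_output`), the error detector is a plain late-minus-early difference, the loop filter is a proportional-plus-integral filter with assumed gains `kp` and `ki`, and the controlled oscillator is reduced to an accumulator holding the timing estimate.

```python
def mf_output(t, true_offset):
    """Hypothetical matched-filter output: a triangle peaking at true_offset."""
    return max(0.0, 1.0 - abs(t - true_offset))


def early_late_error(t, true_offset, delta=0.25):
    """Timing error detector: late sample minus early sample.

    The difference is positive when t is before the peak and negative after it.
    """
    late = mf_output(t + delta, true_offset)
    early = mf_output(t - delta, true_offset)
    return late - early


def run_timing_loop(true_offset, steps=200, kp=0.1, ki=0.01):
    """Iterate detector -> loop filter -> oscillator; return the final estimate."""
    tau_hat = 0.0      # oscillator state: current timing estimate
    integrator = 0.0   # integral branch of the loop filter
    for _ in range(steps):
        error = early_late_error(tau_hat, true_offset)
        integrator += ki * error              # loop filter averages the error
        tau_hat += kp * error + integrator    # advance or retard the timing
    return tau_hat


estimate = run_timing_loop(true_offset=0.3)
```

With these assumed gains the estimate settles on the true offset; larger gains shorten acquisition at the cost of more jitter when noise is present.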
Maximum-likelihood TED is one example of a TED that seeks to meet this goal. It seeks the peak of the correlation output using a derivative matched filter (dMF). Other examples of methods used in timing error detection include the early-late gate algorithm (ELGA), which essentially approximates the derivative using early, current, and late samples; and the Mueller and Müller algorithm, which requires only 1 spS but requires carrier recovery to be performed before symbol timing recovery. In embodiments of the present invention, matched filter operation is combined with poly-phase filter operation, and in particular with a poly-phase up-sample operation, to create a poly-phase matched filter that performs up-sampling and filtering at the same time for timing error detection.
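The idea of performing up-sampling and filtering at the same time rests on a general poly-phase identity, which can be sketched as follows. The function names, the filter taps `h`, and the up-sampling factor `L` are hypothetical; the sketch shows only that a bank of `L` sub-filters running at the input rate reproduces the output of zero-stuffing followed by a full-rate filter, and it is not a description of the filter bank used in any embodiment.

```python
def upsample_then_filter(x, h, L):
    """Reference path: insert L-1 zeros after each sample, then FIR filter with h."""
    stuffed = []
    for s in x:
        stuffed.append(s)
        stuffed.extend([0.0] * (L - 1))
    out = []
    for n in range(len(stuffed)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * stuffed[n - k]
        out.append(acc)
    return out


def polyphase_upsample(x, h, L):
    """Poly-phase path: L sub-filters h[p::L], each run at the input rate.

    Sub-filter p produces output sample m*L + p, so no multiplications
    are spent on the stuffed zeros.
    """
    phases = [h[p::L] for p in range(L)]
    out = []
    for m in range(len(x)):
        for p in range(L):
            acc = 0.0
            for k, c in enumerate(phases[p]):
                if m - k >= 0:
                    acc += c * x[m - k]
            out.append(acc)
    return out


# Hypothetical symbols and matched-filter taps defined at the up-sampled rate.
symbols = [1.0, -1.0, 1.0, 1.0]
h = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.5]
L = 4

reference = upsample_then_filter(symbols, h, L)
polyphase = polyphase_upsample(symbols, h, L)
```

Both paths yield the same `len(symbols) * L` output samples. Each sub-filter output depends only on the input symbols, so the phases can be computed independently of one another.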
Graphics processing units (GPUs) enable efficient heterogeneous computing. Modern GPU platforms comprise one or more CPU cores and one or more GPUs, which have many powerful arithmetic engines capable of simultaneously running large numbers of lightweight threads. For example, some GPUs presently have 216 processor cores, which collectively allow for more than 165,000 active threads. GPUs process active threads concurrently, and to enhance the efficiency of such concurrent execution, no swapping or sharing occurs among concurrent threads. The threads are allocated separately and remain separate until they complete execution.
To efficiently utilize a GPU platform, the programmer must structure the implementation such that GPU threads are kept as busy as possible. This means that opportunities for independent parallel execution must be identified, and spread across the GPU for effective resource utilization.