Generally, a BCH decoder may decode a BCH encoded codeword in three stages. In the first stage, a syndrome calculation circuit may perform syndrome calculation on the codeword. In the second stage, an error locator calculator may generate an error locator polynomial representing the error locations in the codeword. In the third stage, a Chien search circuit may identify the error locations in the codeword by finding the roots of the error locator polynomial. It is well known that the third stage, in which the Chien search circuit finds the roots of the error locator polynomial, is one of the most time-consuming stages. When the number of data bits of the codeword (e.g., denoted by n) is smaller than a preset value (e.g., 2^m−1 for GF(2^m)), the codeword may be treated as having 2^m−1−n padding bits (bits that occupy the most significant positions and may be padded on the left side of the n data bits of the codeword). Processing these 2^m−1−n bits in the third stage may need extra time and thus cause latency. It is therefore important to skip processing the 2^m−1−n bits in the third stage to reduce latency.

In addition, a parallel Chien search circuit (a Chien search circuit that can process more than one bit in a clock cycle) may be implemented in the BCH decoder for good decoding performance (e.g., a high decoding throughput). The parallel Chien search circuit may be the most area-consuming part of the BCH decoder. To develop an area-efficient BCH decoder that offers both low complexity and good decoding performance, it is desirable to reduce the complexity of the parallel Chien search circuit.
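For illustration only, the effect of skipping the 2^m−1−n padded positions in the third stage may be sketched in software as follows. This is a minimal serial sketch under stated assumptions, not the circuit described herein: the field is assumed to be GF(2^4) with primitive polynomial x^4+x+1, bit position p is assumed to correspond to the locator α^p, and the names `gf_mul`, `locator_from_positions`, and `chien_search` are hypothetical helpers introduced for this example. The search evaluates the error locator polynomial at α^i, and each per-coefficient register may be pre-multiplied so that the evaluation starts at the first real data bit instead of the first padded bit.

```python
M = 4                      # assumed field GF(2^4)
N = (1 << M) - 1           # full code length 2^m - 1 = 15
PRIM_POLY = 0b10011        # assumed primitive polynomial x^4 + x + 1

# Build exponent/log tables for GF(2^M); EXP is doubled to avoid a modulo in gf_mul.
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & (1 << M):
        x ^= PRIM_POLY
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]


def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def locator_from_positions(positions):
    """Build Lambda(x) = prod(1 + alpha^p * x), coefficients low to high.

    Stands in for the second decoder stage so the example is self-contained.
    """
    lam = [1]
    for p in positions:
        a = EXP[p % N]
        new = lam + [0]
        for k in range(len(lam)):
            new[k + 1] ^= gf_mul(lam[k], a)
        lam = new
    return lam


def chien_search(lam, n, skip_padding=True):
    """Return (sorted error positions, number of evaluations performed).

    Evaluation index i corresponds to bit position N - i, so i = 1..N - n
    covers only the padded most significant positions of a shortened codeword.
    """
    start = N - n + 1 if skip_padding else 1
    # Register k holds lam[k] * alpha^(k * i); pre-multiply to jump to i = start.
    regs = [gf_mul(c, EXP[(k * start) % N]) for k, c in enumerate(lam)]
    found = []
    evaluations = 0
    for i in range(start, N + 1):
        evaluations += 1
        acc = 0
        for r in regs:
            acc ^= r                      # sum of registers equals Lambda(alpha^i)
        if acc == 0:
            found.append(N - i)           # a root marks an error position
        # Advance every register by one step: multiply register k by alpha^k.
        regs = [gf_mul(r, EXP[k % N]) for k, r in enumerate(regs)]
    return sorted(found), evaluations


# Shortened codeword with n = 10 data bits and errors at positions 2 and 7.
lam = locator_from_positions([2, 7])
print(chien_search(lam, n=10))                      # ([2, 7], 10)
print(chien_search(lam, n=10, skip_padding=False))  # ([2, 7], 15)
```

In this sketch both runs find the same error positions, but the run that skips the padding performs n evaluations instead of 2^m−1. A parallel Chien search circuit would evaluate several consecutive values of i per clock cycle, which is where the multiplier count, and hence the area, may grow.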