This thesis is devoted to the algorithm and VLSI architecture design for a new class of error-correcting codes, namely polar codes. In particular, efficient algorithm and hardware design for polar codes decoding are the focus of this thesis. A reduced-latency 2-bit SC decoding algorithm is proposed. Compared to the original SC algorithm that output 1 bit serially, this algorithm can determine 2 bits in one cycle. As a result, the entire decoding latency can be reduced from (2n-2) cycles to (1.5n-2) cycles without any performance loss for code-length n polar codes. Then, several hardware-level optimizing techniques are applied to further reduce the latency to (3n/4-1) cycles. Synthesis results show that the proposed decoder architecture for example (1024, 512) polar codes has significant improvement on throughput and hardware efficiency than the prior works. Then, a multi-bit decision algorithm, namely 2Kb-SCL algorithm, is proposed for general SC list decoding cases. This new algorithm can determine 2K bits at the same time without any performance loss for arbitrary K. As a result, the entire decoding latency can be reduced from 3n-2 to n/2K-2-2 cycles. Then, hardware architecture of the 2Kb-SCL decoder is developed. In particular, data path balancing technique is developed to enable the proposed architecture operate in a high clock frequency. Synthesis results show that the proposed (1024, 512) 2b-rSCL and 4b-rSCL decoders have significant reduction in latency and throughput than the existing SCL decoders. Furthermore, this thesis presented the LLR-based SCL decoder to reduce the overall silicon area. The proposed new algorithm, namely LLR-2Kb-SCL algorithm, can determine 2K bits together with the use of LLR messages. As a result, it can achieve both low-complexity and short latency at the same time. Then, the corresponding VLSI architecture of the new SCL decoder is developed. Synthesis results show that the proposed example (1024, 512) LLR-4b-SCL decoder achieves great reduction in both area and latency as compared to the prior LLR or non-LLR-based SCL decoders. Besides the investigation on SC decoders, this thesis also performs systematic optimization on BP decoders. First, a scaled min-sum (SMS) algorithm is proposed to improve the error-correcting performance. Then, folding and overlapped-scheduling techniques are applied to hardware architecture of BP decoders to reduce area and latency. Moreover, this thesis proposes several novel early-stopping criteria to terminate the iteration of BP decoders earlier. As a result, the overall energy consumption and decoding latency are reduced linearly. Finally, this thesis presents the stochastic SC decoder. First, a stochastic SC decoding algorithm is derived from the deterministic form. Then, several techniques are applied to the stochastic SC algorithm to avoid the performance loss caused by the stochastic computation. With the use of those approaches, an example (1024, 512) stochastic SC decoder can achieve the similar error-correcting performance with its deterministic counterpart.