Abstract
CpG islands are genomic regions characterized by high GC content and a high frequency of CpG dinucleotides. These regions play important roles in gene regulation and are often associated with gene promoters. Identifying CpG islands computationally involves analyzing DNA sequences to locate stretches that meet specific criteria for GC content and CpG ratio. We present a method to identify CpG islands using a computational approach based on a sliding window technique. A window of 200 base pairs is classified as part of a CpG island when it meets two criteria: a GC content of at least 50%, and an observed-to-expected ratio of CpG dinucleotides of at least 0.6 within the same window. Accurate identification of these regions is crucial for understanding gene regulation, epigenetic modifications, and their roles in various biological processes and diseases. This study also explores the use of Hidden Markov Models (HMMs) for detecting CpG islands in DNA sequences. HMMs provide a robust probabilistic framework to model the sequence characteristics of CpG islands. The states of the HMM represent regions with high GC content and dense CpG dinucleotides (CpG island state) and regions with lower GC content (background state). Transition and emission probabilities are estimated from training data, and the Viterbi algorithm is employed to find the most likely sequence of hidden states, enabling the identification of CpG islands.
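The sliding-window criteria described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names, the step size of one base, and the merging of overlapping qualifying windows are assumptions made for illustration. The observed-to-expected ratio is computed with the commonly used formula (number of CpG x window length) / (number of C x number of G), which the abstract does not spell out.

```python
def gc_content(window: str) -> float:
    """Fraction of bases in the window that are G or C."""
    return (window.count("G") + window.count("C")) / len(window)


def cpg_obs_exp(window: str) -> float:
    """Observed-to-expected CpG ratio: (#CG * N) / (#C * #G).

    Assumed formula; returns 0.0 when the window has no C or no G.
    """
    c = window.count("C")
    g = window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)


def find_cpg_islands(seq, window=200, step=1, gc_min=0.5, ratio_min=0.6):
    """Return merged (start, end) intervals, 0-based and end-exclusive,
    covered by windows that satisfy both thresholds."""
    seq = seq.upper()
    islands = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_content(w) >= gc_min and cpg_obs_exp(w) >= ratio_min:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Overlaps or touches the previous hit: extend that interval.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands


if __name__ == "__main__":
    # Synthetic sequence: AT-rich flanks around a CG-rich core.
    example = "AT" * 150 + "CG" * 150 + "AT" * 150
    print(find_cpg_islands(example))
```

Reporting merged window spans is only one possible convention; the reported interval includes flanking bases from any window that still passes both thresholds, so a stricter boundary definition may be preferable in practice.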