FASTA/FASTQ Data Curation Primer

Authors: Bowman, Laura; Sheridan, Shannon; Wham, Briana Ezray; Wright, Sarah
Date: 2023-08-31
URI: https://hdl.handle.net/11299/256274

Background: FASTA and FASTQ are commonly used text-based file formats for storing and sharing nucleotide (DNA or RNA) sequences and/or amino acid (protein) sequences, and are the main focus of this primer. FASTA and FASTQ are the recognized standard file formats for bioinformatics studies, including next-generation sequencing (NGS), enabling large-scale exchange of data and information associated with massive sequencing projects (Sielemann et al., 2020). NGS refers to high-throughput technologies for large-scale DNA sequencing, such as whole-genome sequencing, whole-exome sequencing (WES, WXS), RNA-seq, miRNA-seq, ChIP-seq, and DNA methylation sequencing. NGS experiments generate billions of short sequence reads for each sample, which, when combined with descriptions and annotations, can result in files ranging from a few to hundreds of gigabytes (Zhang, 2016). FASTA and FASTQ files can be opened by many sequence alignment applications or text editors. Various applications can convert .fasta files.

Language: en
License: Creative Commons Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Type: Manual or Documentation