This readme.txt file was generated on <2025-04-24> by Jillian Marlowe

Recommended citation for the data:

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset
VCF of insertions and deletions found in a multi breed Equus caballus population

2. Author Information
University of Minnesota Equine Genetics and Genomics Lab

  Principal Investigator Contact Information
        Name: Jillian Marlowe
        Institution: College of Veterinary Medicine
        Address:
        Email: marlo072@umn.edu
        ORCID: 0009-0007-5692-2804

  Associate or Co-investigator Contact Information
        Name: Sian Durward-Akhurst
        Institution: College of Veterinary Medicine
        Address:
        Email: durwa004@umn.edu
        ORCID: 0000-0003-3034-1554

  Associate or Co-investigator Contact Information
        Name: Molly McCue
        Institution: College of Veterinary Medicine
        Address:
        Email: mccu0173@umn.edu
        ORCID: 0000-0002-6807-0318

3. Date published or finalized for release: April 2025

4. Date of data collection (single date, range, approximate date): 20110101 - 20230726

5. Geographic location of data collection (where was data collected?):
Genomes came from international populations including the USA, France, Germany, UK, Brazil, Korea, Mongolia, and more.

6. Overview of the data (abstract):
Datasets containing high-confidence single nucleotide polymorphisms that exist in the genome of horses have previously been published in support of population genetic studies, disease variant discovery, and other types of genetic research. There are no similar datasets for insertions and deletions (indels). Here we created a preliminary set of indels that exist within a certain range of allele frequencies in a large, diverse population of apparently healthy horses. A total of ~2M indels passed GATK filtering thresholds and had an allele frequency between 1% and 60%.
Though the criteria for inclusion in this dataset are lenient in order to increase the total number of indels, these loci are likely to exist in the equine genome and can be used as a preliminary set of indels for some genetic studies. So far, this set of indels has been used to create simulated equine genomes that will be used to benchmark variant calling and genotyping methods.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data:
CC0 1.0 Universal http://creativecommons.org/publicdomain/zero/1.0/

2. Links to publications that cite or use the data:
Marlowe, JL, Durward-Akhurst, SA*, & McCue, ME*, Simulated whole genome sequencing data of Equus Caballus as a novel benchmark truth set. Submitted to Nature Scientific Data

3. Was data derived from another source? If yes, list source(s):

4. Terms of Use: Data Repository for the U of Minnesota (DRUM)
By using these files, users agree to the Terms of Use. https://conservancy.umn.edu/pages/policies/#drum-terms-of-use

5. Links to related datasets:
Simulated sequencing data created using this dataset can be found on the European Nucleotide Archive under the study accession number PRJEB80333: https://www.ebi.ac.uk/ena/browser/view/PRJEB80333

Truth set VCFs created in conjunction with the above simulated sequencing data can be found on the UMN DRUM:
Marlowe, Jillian L.; Durward-Akhurst, Sian A.; McCue, Molly E. (2024). VCF truth sets of variants inserted into simulated equine genomes (90 VCFs). Retrieved from the Data Repository for the University of Minnesota (DRUM), https://doi.org/10.13020/NM3A-W471.

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List
   A. Filename: prevalentIndelsinEquinePop_20230726_EquCab3.vcf.gz
      Short description: Variant call format (VCF) file containing information for ~2M insertions and deletions found in the equine genome. The file includes a header containing descriptions of all fields used in the VCF.
      Each record gives the location, the reference and alternate alleles, the allele frequency in the population, and quality scores for one variant.

   B. Filename: prevalentIndelsinEquinePop_20230726_EquCab3.vcf.gz.tbi
      Short description: The tabix-generated index file for the VCF file.

2. Relationship between files:
The index file is a companion to the VCF that allows most bioinformatic tools to work with the compressed VCF. The two files should be moved together, but the index file can easily be regenerated if necessary.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:
FASTA files containing whole genome sequencing for 939 horses were processed according to GATK best practices to align raw sequencing to the EquCab3 reference genome and perform variant calling, followed by joint genotyping of the entire population. The resulting VCF containing all discovered variants was filtered to contain only insertions and deletions.

Hard filtering was performed using the basic thresholds presented in the GATK recommendations. Variants meeting any of the following conditions were marked for removal:
    QD < 2.0 (Quality by Depth)
    QUAL < 30.0 (Quality)
    SOR > 3.0 (Strand Odds Ratio)
    FS > 200 (Fisher Strand)
    ReadPosRankSum < -20.0 (Read Position Rank Sum test)

Variants that passed filtering were further reduced by keeping only indels with an allele frequency between 1% and 60% in the population. This range was selected so that, on the low end, the indel has to be present in at least 5 horses (to increase confidence in the existence of the indel within the population), and to reduce the possibility that the indel comes from a mapping error or an error in the reference assembly.

2. Instrument- or software-specific information needed to interpret the data:
Any standard bioinformatic tool used to edit and process genomic data will work with this data, including bcftools, GATK, vcftools, and more.
VCFs can also be viewed manually using the command line or a text editor if the file is unzipped first.

3. People involved with sample collection, processing, analysis and/or submission:
Jillian L. Marlowe, Sian A. Durward-Akhurst, Molly E. McCue

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: prevalentIndelsinEquinePop_20230726_EquCab3.vcf.gz
-----------------------------------------

1. Number of variables: 8 columns. The INFO column contains 15 additional variables.

2. Number of cases/rows: 2,164,082 (variants, header not included)

3. Variable List
   A. Name: CHROM
      Description: Chromosome where the variant is located

   B. Name: POS
      Description: Number indicating the location along the chromosome where the variant is located (the beginning of the variant in the case of an indel)

   C. Name: ID
      Description: Column that could contain defined IDs indicating previously known SNPs. This dataset does not contain any IDs.

   D. Name: REF
      Description: Reference allele nucleotide sequence

   E. Name: ALT
      Description: Alternate allele nucleotide sequence

   F. Name: QUAL
      Description: A quality score calculated by GATK HaplotypeCaller based on several characteristics of the variant

   G. Name: FILTER
      Description: Denotes whether a variant has passed all filtering, or lists which filters it failed. In this dataset, all variants are marked as PASS.

   H. Name: INFO
      Description: A large column encompassing many different scores. Definitions of the variables can be found in the header of the VCF or in the GATK documentation, where all information is maintained and updated.
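
As an illustration of the column layout and the filtering criteria described in this readme, the following is a minimal Python sketch that reads the compressed VCF using only the standard library. It is not the pipeline used to produce the dataset (that was done with GATK). The function names are hypothetical, and several points are assumptions: that allele frequency is stored under the INFO key "AF", that the frequency bounds are inclusive, that an annotation missing from a record does not fail the filter, and that a multi-allelic record is kept if any of its alleles falls in range. The example record is made up and is not taken from the dataset.

```python
import gzip

# The 8 fixed VCF columns described in the variable list of this readme.
COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info_field):
    """Split the INFO column into a dict; flag-style keys (no '=') map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_record(line):
    """Parse one non-header VCF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    record["INFO"] = parse_info(record["INFO"])
    return record


def fails_hard_filter(qual, info):
    """Return True if a variant meets any removal condition listed in this readme."""
    if qual is not None and qual < 30.0:
        return True
    thresholds = {
        "QD": lambda v: v < 2.0,
        "SOR": lambda v: v > 3.0,
        "FS": lambda v: v > 200.0,
        "ReadPosRankSum": lambda v: v < -20.0,
    }
    for key, is_failing in thresholds.items():
        # Assumption: an annotation absent from a record does not fail the filter.
        if key in info and is_failing(float(info[key])):
            return True
    return False


def in_af_range(info, low=0.01, high=0.60):
    """Check the 1%-60% allele frequency range (assumes INFO key 'AF', inclusive bounds)."""
    if "AF" not in info:
        return False
    # Assumption: a multi-allelic record is kept if any allele is in range.
    return any(low <= float(v) <= high for v in str(info["AF"]).split(","))


def iter_records(path):
    """Yield parsed records from a gzip/bgzip-compressed VCF, skipping header lines."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                yield parse_record(line)


if __name__ == "__main__":
    # Made-up record for illustration only; not taken from the dataset.
    example = ("chr1\t12345\t.\tAT\tA\t512.3\tPASS\t"
               "AC=40;AF=0.021;AN=1878;QD=18.2;FS=1.1;SOR=0.7;ReadPosRankSum=0.3")
    rec = parse_record(example)
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
    print("fails hard filter:", fails_hard_filter(rec["QUAL"], rec["INFO"]))
    print("in AF range:", in_af_range(rec["INFO"]))
```

To walk through the actual file, iter_records("prevalentIndelsinEquinePop_20230726_EquCab3.vcf.gz") can be used in a for loop. Because the published VCF has already been filtered, these checks are mainly useful for understanding the criteria or for applying the same thresholds to other call sets. For routine work, bcftools or GATK will be considerably faster.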