Data Curation Network Primers

Archived primers from the 2018-2020 Specialized Data Curation Workshops presented by the Data Curation Network and funded by a grant from the Institute of Museum and Library Services (IMLS RE-85-18-0040-18). Data curation primers are interactive, living documents that detail a specific subject, disciplinary area, or curation task and that can be used as a reference to curate research data.

Interactive primers are available for download and for creating derivatives at: https://github.com/DataCurationNetwork/data-primers

Recent Submissions

Now showing 1 - 20 of 42
  • Item
    TIFF Data Curation Primer
    (Data Curation Network, 2024-07-03) Everritt, Leah; Ferguson, Jen; Simpson, Emily
    From the description: TIFF, or Tagged Image File Format, is a raster image format now maintained by Adobe Systems. TIFFs are a relatively lossless file format for high-resolution bitmap images, typically 400-600 pixels per inch (ppi) and typically with a maximum file size of around 4 GB. TIFFs may also act as container files used to store smaller JPEGs acting as an image file directory. TIFFs are often used as archival master copies to preserve as much detail of an image as possible to store in a digital repository or similar platform. Because they can be as large as 4 GB, TIFFs can also take up significant storage space, limiting the number of items that can be uploaded and stored. TIFFs are frequently used to make derivative, lower-resolution copies (PNG, JPEG, or JPEG2000 format) to be used as access copies in digital libraries, museum exhibits, or archives. These compressed formats provide more seamless access for users.
  • Item
    ArcGIS Pro Project Package (PPKX) Data Curation Primer
    (Data Curation Network, 2024) Kernik, Melinda; Work, Amy; Ranganath, Aditya; Martindale, Jaime
    From primer intro: ArcGIS Pro Project Packages are great for researchers sharing files within the same software environment (i.e., ArcGIS Pro), but the proprietary nature of the format and rapid versioning of the software make decisions about long-term archiving difficult. ... This primer describes tips for opening and reviewing this overarching project file. Curators are encouraged to consult additional DCN Data Primers for curation checklists for component file types (like geodatabases and GeoTIFFs).
  • Item
    Mass Spectrometry Primer
    (Data Curation Network, 2023) Westra, Brian; Li, Ye; Ruhs, Nick; McEwen, Leah Rae
    (From Primer Overview): Mass Spectrometry (abbreviated here as MS, not to be confused with mass spectroscopy) is an analytical technology to identify chemical substances through measuring the mass-to-charge ratio (m/z) for molecules (or their fragments/components) in a sample. The resulting spectrum shows the calculated intensity of peaks from various mass-to-charge ratios (Figure 1). This information may be used to identify unknown substances, quantify known substances, and identify chemical and structural properties of chemicals. Typically, the resulting spectra are compared to a library of known substances through a computational process to identify which compounds are present. The mass spectrometer uses an ionizer to ionize the substances into fragments carrying different charges. The ion fragments then enter the mass analyzer where they will be accelerated to various speeds depending on their mass-to-charge ratio (m/z). The ion fragments are detected when they leave the mass analyzer, and the intensity of the signal is recorded accordingly.
  • Item
    Text Encoding Initiative (TEI) Primer
    (Data Curation Network, 2023) Dalton, Courtney; Kilcer, Emily; Wampole, Katie; Swanz, Sarah
    This primer focuses on textual resources or their facsimiles that have been annotated according to Text Encoding Initiative (TEI) conventions. Because TEI is expressed in the Extensible Markup Language (XML), many of the considerations in this primer may be relevant to XML files in general, as well as textual data encoded using other markup languages. In addition, the Music Encoding Initiative (MEI) is based on TEI, and so curation of MEI files can be expected to follow a similar process. Other text corpora, such as machine learning training sets or large language models, are beyond the scope of this primer.
  • Item
    Python Data Curation Primer
    (Data Curation Network, 2023) Sheffield, Megan; Hernandez, Jonathan; de la Cruz, Justin; Maye, Kaypounyers; Purpur, Erich
    A .py file contains Python code in code blocks with text annotations that typically explain the code. The .py file itself can be opened in any text editor or integrated development environment (IDE). A Python program can consist of a single file or, in the case of a more sophisticated application, many files. This primer describes how to curate Python code for long-term access and preservation.
  • Item
    Sensitive Biodiversity Essentials
    (Data Curation Network, 2023) Jordan, Jen; Ramirez-Reyes, Carlos; Taylor, Shawna; Thielen, Joanna; Wham, Briana
    This primer is intended to offer guidance to curators for assessing the sensitivity of biodiversity data associated with research on animals, insects, plants, fungi, microorganisms, etc. Sensitive data can be defined as any information which “would result in an ‘adverse effect’ on the taxon or attribute in question or to a living individual” if made publicly available (Chapman, 2020). This primer covers how to identify sensitive data, considerations for sharing these data, and points for discussion with data depositors. The objective is to support researchers in balancing the sharing of biodiversity data as a public good and protecting it from misuse.
  • Item
    CARE Data Principles, Indigenous data, Data related to Indigenous Peoples and Interest
    (Data Curation Network, 2023) Barsness, Sarah; Cummins, Jewel; Fernandez, Maria Victoria; James, Ann Myatt; Pierce Farrier, Katie; Pringle, Jonathan; Carroll, SR; Taitingfong, Riley; Wieker, Alex
    The CARE Principle Data Primer is intended to provide an introduction to the CARE Principles for Indigenous Data Governance (Collective benefit, Authority to control, Responsibility and Ethics) and to explore broader topics of equitable data stewardship. The primer will guide information professionals and researchers in understanding Tribal sovereignty, cultural context, and the historical misuse of Indigenous data. Resources will include appropriate labeling of traditional knowledge practices, modern day Tribal reservation locations, and ways to provide culturally responsive data access and use. While the ethics and standards discussed in this primer focus on CARE principles specific to Indigenous communities residing in the contemporary United States, these principles are applicable across many fields of research and communities.
  • Item
    Archaeology Data Primer
    (Data Curation Network, 2023) Arteaga Cuevas, Maria; Fernandez, Rachel; Wittman, Hollis
    Archaeology is the study of the human past through the analysis of material remains. Preservation of material remains is crucial to the understanding and sharing of different cultures. As many aspects of archaeological methodology destroy the very record they are analyzing, documentation is critical to this field (Richards et al., 2021). Consequently, the analysis and context recorded through these archaeological investigations and the data from artifacts are key. With regard to digital curation, it is important to understand that archaeological data are as varied as the cultures and materials that are studied. Thus, this primer will not cover the entire scope of the field of archaeology, but hopes to serve as a starting point for curators and data managers to understand what data are produced and how they can be prepared for reuse.
  • Item
    FASTA/FASTQ Data Curation Primer
    (Data Curation Network, 2023) Bowman, Laura; Sheridan, Shannon; Wham, Briana Ezray; Wright, Sarah
    Background: FASTA and FASTQ are commonly used text-based file formats for storing and sharing nucleotide (DNA or RNA) sequences and/or amino acid (protein) sequences, and are the main focus of this primer. FASTA and FASTQ are the recognized standard file formats for bioinformatics studies, including next-generation sequencing (NGS), enabling large-scale exchange of data and information associated with massive sequencing projects (Sielemann et al., 2020). NGS refers to high-throughput technologies for large-scale DNA sequencing such as whole genome sequencing, whole-exome sequencing (WES, WXS), RNA-seq, miRNA-seq, ChIP-seq, and DNA Methylation. NGS experiments generate billions of short sequence reads for each sample which when combined with description and annotations can result in files ranging from a few to hundreds of gigabytes (Zhang, 2016). FASTA and FASTQ files can be opened by many sequence alignment applications or text editors. There are various applications that can convert .fasta files.
  • Item
    Audiovisual Data Curation Primer
    (Data Curation Network, 2023) Grace, Madina; Jerrild, Meg; Phegley, Lauren
    This primer reviews the practices of curating audiovisual data. Data is defined as “facts, ideas, or discrete pieces of information, especially when in the form originally collected and unanalyzed” (Society of American Archivists). Audiovisual data is then discrete pieces of information captured in signals and sound waves that when given context allow the user to create meaning. We believe that what makes something research data is the way it is utilized, as not all data was created initially for research purposes. While audiovisual materials are not a common form of research data in all fields, they have been used in the social sciences, such as behavioral psychology and anthropology, and we have seen a new movement towards audiovisual data in the sciences as well. The increasing affordability of storage space and the continual development of quality recording equipment seem to be driving this new enthusiasm for audiovisual data.
  • Item
    Simple Darwin Core for Non-Biologists Primer
    (Data Curation Network, 2023) O'Donnell, Megan N.; Delserone, Leslie M.
    This primer focuses on Simple DwC (http://rs.tdwg.org/dwc/terms/simple/), a “mechanism used to share biodiversity information using the simplest methods and structure” (Darwin Core Task Group, 2014). With Simple DwC, the DwC schema is applied to a single flat file (i.e., table or spreadsheet). Because it is a self-contained data set that can be opened, edited, and analyzed using a wide variety of software, Simple DwC is easier to implement than other forms of DwC that use XML, RDF, or relational databases. The primer’s goal is to assist a curator presented with a data set structured in Simple DwC, or to assist a curator in a decision to apply the standard to a data set.
  • Item
    Interdisciplinary and Highly Collaborative Research (IHCR) Data Primer
    (Data Curation Network, 2023) Kouper, Inna; Johnson, Andrew M.; Wrigley, Jordan; Ranganath, Aditya
    For the purposes of this primer, interdisciplinary and highly collaborative research (IHCR) is defined broadly as research that combines resources and expertise across domains, communities, and institutions. Data generated by interdisciplinary teams provides a basis for investigating complex phenomena that are relevant to established research fields. It supports problem-based research. When curated well and early in the lifecycle, interdisciplinary data offers multimodal and adaptable resources for use by many disciplines and stakeholders.
  • Item
    Accessibility Data Curation Primer
    (Data Curation Network, 2023) Oxford, Emily; Woodbrook, Rachel
    Data curators are uniquely positioned to help improve access not just to individual datasets, but to the world of research data at large. As guides to and stewards of data, curators can counsel researchers on how to build accessibility into data planning, collection, analysis, and archiving. This primer is intended as a starting point for data curators who are invested in improving the accessibility of individual files or datasets, rather than as a definitive guide. There is far more work to be done than can be addressed in the scope of this primer. Disability is also a complex concept with a diversity of possible presentations, which will present varying (sometimes even conflicting) accessibility needs.
  • Item
    Column Binary Data Curation Primer
    (Data Curation Network, 2023) Ko, Jessica; Norek, Kelsie
    Column binary (or colbin) is a file format that is most frequently used to store survey data from punched cards, which are paper cards in which holes are punched to represent data points. This document describes how to curate column binary files.
  • Item
    Primer for Researchers on How to Manage Data
    (Data Curation Network, 2023) Arteaga Cuevas, Maria; Taylor, Shawna; Narlock, Mikala R.
    This work was created as part of a collaboration between the NNLM National Center for Data Services (NCDS) and the Data Curation Network.
  • Item
    Clinical Trials Data Primer
    (Data Curation Network, 2022) Gonzalez, Liliana; Narlock, Mikala R.; Taylor, Shawna
  • Item
    Qualitative Data Curation Primer
    (Data Curation Network, 2021-03-11) Castillo, Diana; Coates, Heather; Narlock, Mikala
  • Item
    Oral History Interviews Data Curation Primer
    (Data Curation Network, 2021-03-10) Pryse, JA; Harp, Matthew; Mannheimer, Sara; Marsolek, Wanda; Cowles, Wind
  • Item
    Consent Forms Data Curation Primer
    (Data Curation Network, 2021-02-26) Hunt, Shanda; Hofelich Mohr, Alicia; Woodbrook, Rachel
  • Item
    SAS Data Curation Primer
    (Data Curation Network, 2020) Xu, Qiong