Data Curation Network Primers

Persistent link for this collection: https://hdl.handle.net/11299/202810

Archived primers from the 2018-2020 Specialized Data Curation Workshops presented by the Data Curation Network and funded by a grant from the Institute of Museum and Library Services (IMLS RE-85-18-0040-18). Data curation primers are interactive, living documents that detail a specific subject, disciplinary area or curation task and that can be used as a reference to curate research data.

Interactive primers available for download and derivatives at: https://github.com/DataCurationNetwork/data-primers

Now showing 1 - 20 of 46

A Primer for Applying and Interpreting Licenses for Research Data and Code
(Data Curation Network, 2024) Chinn, Lisa; Murray, Matthew; Wink, Isaac
This primer gives data curators an overview of the licenses that are commonly applied to datasets and code, familiarizes them with common requirements in institutional data policies, and makes recommendations for working with researchers who need to apply a license to their research outputs or understand a license applied to data or code they would like to reuse. While copyright issues are highly case-dependent, the introduction to the data copyright landscape and the general principles provided here can help data curators empower researchers to understand the copyright context of their own data.
Linked Data Primer
(Data Curation Network, 2024-10) Provo, Alexandra; Burns, Halle; Lamorte, Michele; Jiao, Chenyue
FITS Data Curation Primer
(Data Curation Network, 2024-10-31) McKone, Lubov; DeRocchis, Robyn; Orozco, Rebecca; Woodbrook, Rachel
From the primer: FITS, or Flexible Image Transport System, is a data format most widely used in astronomy to transfer scientific data and their associated metadata. FITS was developed in the late 1970s by astronomers in the USA and Europe to facilitate international data transfer between observatories. In astronomy, images of the night sky are treated as arrays of data to be analyzed. For this reason, FITS was designed to be a highly flexible format that can be used to store and transfer any number of n-dimensional arrays. This means that although its name contains “image,” FITS files often contain only non-image data such as one-dimensional spectra or tabular information. Most commonly, FITS files contain a combination of images and 2-dimensional data tables stored in rows and columns. In fact, FITS files can contain almost anything. Although the building blocks are simple, FITS files can have unlimited components, and therefore can become complex quickly.
OpenRefine Primer
(Data Curation Network, 2024) Owen, Heather Charlotte; Thite, Aditi; Tvrdy, Peyton
The purpose of this primer is to describe and demonstrate useful features and aspects of the OpenRefine software and help data curators understand how they can use OpenRefine as a part of the data curation process. This document is meant to serve as a starting point for data curators, researchers, and anyone who works with research data to understand how this software can be used in the context of data management and curation.
TIFF Data Curation Primer
(Data Curation Network, 2024-07-03) Everritt, Leah; Ferguson, Jen; Simpson, Emily
From the description: TIFF, or Tagged Image File Format, is a raster image format now maintained by Adobe Systems. TIFFs are a relatively lossless file format of high-resolution bitmap images, typically 400-600 pixels per inch (ppi) and typically with a maximum file size of around 4 GB. TIFFs may also act as container files used to store smaller JPEGs acting as an image file directory. TIFFs are often used as archival master copies to preserve as much detail of an image as possible to store in a digital repository or similar platform. Given their size of up to 4 GB, TIFFs can also take up significant storage space, limiting the number of items that can be uploaded and stored. TIFFs are frequently used to make derivative, lower-resolution copies (PNG, JPEG or JPEG2000 format) to be used for access copies in digital libraries, museum exhibits, or archives. This compressed format provides more seamless access for users.
ArcGIS Pro Project Package (PPKX) Data Curation Primer
(Data Curation Network, 2024) Kernik, Melinda; Work, Amy; Ranganath, Aditya; Martindale, Jaime
From primer intro: ArcGIS Pro Project Packages are great for researchers sharing files within the same software environment (i.e., ArcGIS Pro), but the proprietary nature of the format and rapid versioning of the software make decisions about long-term archiving difficult. ... This primer describes tips for opening and reviewing this overarching project file. Curators are encouraged to consult additional DCN Data Primers for curation checklists for component file types (like geodatabases and geotiffs).
Mass Spectrometry Primer
(Data Curation Network, 2023) Westra, Brian; Li, Ye; Ruhs, Nick; McEwen, Leah Rae
(From Primer Overview): Mass Spectrometry (abbreviated here as MS, not to be confused with mass spectroscopy) is an analytical technology to identify chemical substances through measuring the mass-to-charge ratio (m/z) for molecules (or their fragments/components) in a sample. The resulting spectrum shows the calculated intensity of peaks from various mass-to-charge ratios (Figure 1). This information may be used to identify unknown substances, quantify known substances, and identify chemical and structural properties of chemicals. Typically, the resulting spectra are compared to a library of known substances through a computational process to identify which compounds are present. The mass spectrometer uses an ionizer to ionize the substances into fragments carrying different charges. The ion fragments then enter the mass analyzer where they will be accelerated to various speeds depending on their mass-to-charge ratio (m/z). The ion fragments are detected when they leave the mass analyzer, and the intensity of the signal is recorded accordingly.
Text Encoding Initiative (TEI) Primer
(Data Curation Network, 2023) Dalton, Courtney; Kilcer, Emily; Wampole, Katie; Swanz, Sarah
This primer focuses on textual resources or their facsimiles that have been annotated according to Text Encoding Initiative (TEI) conventions. Because TEI is expressed in the Extensible Markup Language (XML), many of the considerations in this primer may be relevant to XML files in general, as well as textual data encoded using other markup languages. In addition, the Music Encoding Initiative (MEI) is based on TEI, and so curation of MEI files can be expected to follow a similar process. Other text corpora, such as machine learning training sets or large language models, are beyond the scope of this primer.
Python Data Curation Primer
(Data Curation Network, 2023) Sheffield, Megan; Hernandez, Jonathan; de la Cruz, Justin; Maye, Kaypounyers; Purpur, Erich
A .py file contains Python code in code blocks with text annotations that typically explain the code. The .py file itself can be opened in any text editor or integrated development environment (IDE). A Python program can consist of a single file or, in the case of a more sophisticated application, many files. This primer describes how to curate Python code for long-term access and preservation.
Sensitive Biodiversity Essentials
(Data Curation Network, 2023) Jordan, Jen; Ramirez-Reyes, Carlos; Taylor, Shawna; Thielen, Joanna; Wham, Briana
This primer is intended to offer guidance to curators for assessing the sensitivity of biodiversity data associated with research on animals, insects, plants, fungi, microorganisms, etc. Sensitive data can be defined as any information which “would result in an ‘adverse effect’ on the taxon or attribute in question or to a living individual” if made publicly available (Chapman, 2020). This primer covers how to identify sensitive data, considerations for sharing these data, and points for discussion with data depositors. The objective is to support researchers in balancing the sharing of biodiversity data as a public good and protecting it from misuse.
CARE Data Principles, Indigenous data, Data related to Indigenous Peoples and Interest
(Data Curation Network, 2023) Barsness, Sarah; Cummins, Jewel; Fernandez, Maria Victoria; James, Ann Myatt; Pierce Farrier, Katie; Pringle, Jonathan; Carroll, SR; Taitingfong, Riley; Wieker, Alex
The CARE Principle Data Primer is intended to provide an introduction to the CARE Principles for Indigenous Data Governance (Collective benefit, Authority to control, Responsibility and Ethics) and to explore broader topics of equitable data stewardship. The primer will guide information professionals and researchers in understanding Tribal sovereignty, cultural context, and the historical misuse of Indigenous data. Resources will include appropriate labeling of traditional knowledge practices, modern day Tribal reservation locations, and ways to provide culturally responsive data access and use. While the ethics and standards discussed in this primer focus on CARE principles specific to Indigenous communities residing in the contemporary United States, these principles are applicable across many fields of research and communities.
Archaeology Data Primer
(Data Curation Network, 2023) Arteaga Cuevas, Maria; Fernandez, Rachel; Wittman, Hollis
Archaeology is the study of the human past through the analysis of material remains. Preservation of material remains is crucial to the understanding and sharing of different cultures. As many aspects of archaeological methodology destroy the very record they are analyzing, documentation is critical to this field (Richards et al., 2021). Consequently, the analysis and context recorded through these archaeological investigations and the data from artifacts are key. With regard to digital curation, it is important to understand that archaeological data are as varied as the cultures and material that are studied. Thus, this primer will not cover the entire scope of the field of archaeology, but hopes to serve as a starting point for curators and data managers to understand what data are produced and how they can be prepared for reuse.
FASTA/FASTQ Data Curation Primer
(Data Curation Network, 2023) Bowman, Laura; Sheridan, Shannon; Wham, Briana Ezray; Wright, Sarah
Background: FASTA and FASTQ are commonly used text-based file formats for storing and sharing nucleotide (DNA or RNA) sequences and/or amino acid (protein) sequences, and are the main focus of this primer. FASTA and FASTQ are the recognized standard file formats for bioinformatics studies, including next-generation sequencing (NGS), enabling large-scale exchange of data and information associated with massive sequencing projects (Sielemann et al., 2020). NGS refers to high-throughput technologies for large-scale DNA sequencing such as whole genome sequencing, whole-exome sequencing (WES, WXS), RNA-seq, miRNA-seq, ChIP-seq, and DNA Methylation. NGS experiments generate billions of short sequence reads for each sample which when combined with description and annotations can result in files ranging from a few to hundreds of gigabytes (Zhang, 2016). FASTA and FASTQ files can be opened by many sequence alignment applications or text editors. There are various applications that can convert .fasta files.
Audiovisual Data Curation Primer
(Data Curation Network, 2023) Grace, Madina; Jerrild, Meg; Phegley, Lauren
This primer reviews the practices of curating audiovisual data. Data is defined as “facts, ideas, or discrete pieces of information, especially when in the form originally collected and unanalyzed” (Society of American Archivists). Audiovisual data is then discrete pieces of information captured in signals and sound waves that when given context allow the user to create meaning. We believe that what makes something research data is the way it is utilized, as not all data was created initially for research purposes. While audiovisual materials are not a common form of research data in all fields, they have been used in the social sciences, such as behavioral psychology and anthropology, and we have seen a new movement towards audiovisual data in the sciences as well. The increasing affordability of storage space and the continual development of quality recording equipment seem to be driving this new enthusiasm for audiovisual data.
Simple Darwin Core for Non-Biologists Primer
(Data Curation Network, 2023) O'Donnell, Megan N.; Delserone, Leslie M.
This primer focuses on Simple DwC (http://rs.tdwg.org/dwc/terms/simple/), a “mechanism used to share biodiversity information using the simplest methods and structure” (Darwin Core Task Group, 2014). With Simple DwC, the DwC schema is applied to a single flat file (i.e., table or spreadsheet). Because it is a self-contained data set that can be opened, edited, and analyzed using a wide variety of software, Simple DwC is easier to implement than other forms of DwC that use XML, RDF, or relational databases. The primer’s goal is to assist a curator presented with a data set structured in Simple DwC, or to assist a curator in a decision to apply the standard to a data set.
Interdisciplinary and Highly Collaborative Research (IHCR) Data Primer
(Data Curation Network, 2023) Kouper, Inna; Johnson, Andrew M.; Wrigley, Jordan; Ranganath, Aditya
For the purposes of this primer, interdisciplinary and highly collaborative research (IHCR) is defined broadly as research that combines resources and expertise across domains, communities, and institutions. Data generated by interdisciplinary teams provides a basis for investigating complex phenomena that are relevant to established research fields. It supports problem-based research. When curated well and early in the lifecycle, interdisciplinary data offers multimodal and adaptable resources for use by many disciplines and stakeholders.
Accessibility Data Curation Primer
(Data Curation Network, 2023) Oxford, Emily; Woodbrook, Rachel
Data curators are uniquely positioned to help improve access not just to individual datasets, but to the world of research data at large. As guides to and stewards of data, curators can counsel researchers on how to build accessibility into data planning, collection, analysis, and archiving. This primer is intended as a starting point for data curators who are invested in improving the accessibility of individual files or datasets, rather than as a definitive guide. There is far more work to be done than can be addressed in the scope of this primer. Disability is also a complex concept with a diversity of possible presentations, which will present varying (sometimes even conflicting) accessibility needs.
Column Binary Data Curation Primer
(Data Curation Network, 2023) Ko, Jessica; Norek, Kelsie
Column binary (or colbin) is a file format that is most frequently used to store survey data from punched cards, which are paper cards in which holes are punched to represent data points. This document describes how to curate column binary files.
Primer for Researchers on How to Manage Data
(Data Curation Network, 2023) Arteaga Cuevas, Maria; Taylor, Shawna; Narlock, Mikala R.
This work was created as part of a collaboration between the National Center for Data Services (NNLM) and the Data Curation Network.
Clinical Trials Data Primer
(Data Curation Network, 2022) Gonzalez, Liliana; Narlock, Mikala R.; Taylor, Shawna

University Digital Conservancy

University of Minnesota Twin Cities