This README file was generated on October 5, 2023 by Mikala Narlock. It was updated on October 26, 2023 by Mikala Narlock following a curatorial review by Melinda Kernik, and again on December 13, 2023 by Narlock with help from Kernik.

GENERAL INFORMATION

Title of Dataset: Data for "Knowledge Infrastructures Are Growing Up: The Case for Institutional (Data) Repositories 10 Years After the Holdren Memo"

Author/Principal Investigator Information
Name: Mikala Narlock
ORCID: 0000-0002-2730-7542
Institution: University of Minnesota / Data Curation Network
Email: mnarlock@umn.edu

Date of data collection: July-August 2023

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: CC0 1.0 Universal

Was data derived from another source? If yes, list source(s):
- https://doi.org/10.13020/2evx-7a87

Recommended citation for this dataset:
Narlock, M., Priesman Marquez, R., Herrmann, H., & Ibrahim, M. (2023). Data for "Knowledge Infrastructures Are Growing Up: The Case for Institutional (Data) Repositories 10 Years After the Holdren Memo." https://doi.org/10.13020/w8nk-d131

DATA & FILE OVERVIEW

File List:
- Data-knowledge-infrastructures-2023.csv
- Narlock-et-al-2023-images.zip, which contains:
  - Was it possible to identify the number of datasets in the repository.png
  - Growth in IRs and IDRs from 2017 to 2023.png
  - Datasets in IRs and IDRs 2017-2020-2023.png
  - 2023 Number of datasets in IRs and IDRs.png

METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data:
We used the methodology set forth in Johnston and Coburn (2020): https://doi.org/10.13020/2evx-7a87. As in that study, we began each institutional review by navigating directly to the institution's online institutional repository and data repository when the URL was known from the previous research. When the URL was not immediately known, we used a combination of internet-wide searches and a review of library websites to identify the repository.
If an institution did not have a data repository but its library website referred researchers to Dryad, the institution-specific Dryad URL was recorded. For institutions that have a standalone data repository, datasets in Dryad were not counted. For example, the University of Minnesota has an institutional repository, a data repository, and a Dryad membership; only the URLs for the first two repositories were recorded. In contrast, the University of California system uses Dryad as its primary data repository, so only those URLs were captured.

To count datasets, we manually explored each repository's interface for each institution. We want to emphasize at the outset that this seemingly simple task (counting how many datasets are in each repository) was incredibly complex. After navigating to the repository's URL, whether via a link from the 2020 study, a search engine, or the institution's library website, we started on the homepage by looking for a prompt or search option to view all datasets. Many institutions offer these as a way to explore their holdings (browse all articles, electronic theses, etc.). If an option to select datasets was not visible, we ran an empty search and looked for the ability to filter by item type. Some repositories did not allow blank searches; when this occurred, we searched for "dataset" or "data set" and attempted to filter further. Because many institutions have both IRs and IDRs, this process was repeated for both repositories.

People involved with sample collection, processing, analysis, visualization, and/or submission:
- Mikala Narlock
- Rachel Priesman Marquez
- Heather Herrmann
- Maisarah Ibrahim

All data were reviewed by Narlock. However, because the bulk of the work was conducted by a single individual, this introduces limitations and potential biases.
Methods for processing the data:
In pulling over data from Johnston and Coburn (2020), we cleaned it slightly by combining previous fields. In particular, where they recorded whether it was clear or unclear that a repository held data and then recorded "n/a" for the number of datasets, we recorded "unclear" directly in the number-of-datasets field. We used the data from 2023, 2020, and 2017 to calculate growth, both as discrete numbers of datasets and as averages.

DATA-SPECIFIC INFORMATION FOR: Data-knowledge-infrastructures-2023.csv

Number of variables: 18
Number of cases/rows: 119 institutions (120 rows)

Variable List:
- Institutional Repository URL: The web location for the institution's IR
- Has repository?: Y/N
- Standalone data repository 2023: Yes, No, Yes pilot, Implementing, or Dryad
- Data repository URL: If applicable, the web location for the institution's data repository
- Observed dataset count IR: The observed number of datasets in the IR
- Observed dataset count Data Repo: The observed number of datasets in the data repository
- Total dataset count (July/August 2023): Sum of the previous two columns
- Observed dataset count IR (2020): Derived from the previous study (Johnston and Coburn 2020), with "unclear" and "n/a" combined as described above. The observed number of datasets in the IR in 2020.
- Observed Dataset Count Data Repo (2020): From the previous study (Johnston and Coburn 2020). The observed number of datasets in the data repository in 2020.
- Observed Total Count (2020): From the previous study (Johnston and Coburn 2020). The sum of observed datasets in 2020.
- Reported Dataset Count (Jan 2017): From the previous study
- General IR Growth Observed (2020-2023): 2023 data - 2020 data
- Data Repo Growth Observed (2020-2023): 2023 data - 2020 data
- General Growth Observed (2017-2020): The difference between Observed Total Count (2020) and Reported Dataset Count (Jan 2017)
- Average Growth per year (2023-2020): (2023 data - 2020 data) / 3
- Average Growth per year (2023-2017): (2023 data - 2017 data) / 6
- Standalone Data Repository 2017: From the previous study (Hudson Vitale et al., 2017). Indicates whether the institution had a standalone repository in 2017. This data was also reported in Johnston and Coburn (2020).
- Response to 2017 Spec Kit Results: From the previous study (Hudson Vitale et al., 2017). Indicates whether the institution responded to the survey in 2017. This data was also reported in Johnston and Coburn (2020).

Missing data codes:
n/a -- information not available, or not applicable. In the computed fields (e.g., average growth per year), n/a is used when there are text-based values in the source columns.
unclear -- there is a repository, but it was not obvious or straightforward to identify a complete number of datasets.
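The derived growth fields above can be sketched in plain Python. This is a minimal illustration of the arithmetic, not the authors' actual processing code; the function name and example values are hypothetical. It follows the missing-data convention described above: text-based counts such as "unclear" propagate "n/a" into the computed field.

```python
def avg_growth_per_year(count_end, count_start, years):
    """Average yearly growth between two observed dataset counts.

    Mirrors the derived fields in the variable list, e.g.
    Average Growth per year (2023-2020) = (2023 data - 2020 data) / 3.
    Returns "n/a" when either count is text-based (e.g. "unclear"),
    per the missing-data codes above.
    """
    try:
        return (int(count_end) - int(count_start)) / years
    except (TypeError, ValueError):
        return "n/a"

# Hypothetical example values, not taken from the dataset:
print(avg_growth_per_year("450", "300", 3))      # numeric counts -> 50.0
print(avg_growth_per_year("unclear", "300", 3))  # text-based count -> "n/a"
```

The same function covers both averaging windows by changing the divisor (3 for 2020-2023, 6 for 2017-2023).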