This readme.txt file was generated on 2020-01-15 by Lisa Johnston

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Data supporting “Data Sharing Readiness in Academic Institutions” Version 1.0

2. Author Information

   Name: Lisa R Johnston
   Institution: University of Minnesota Libraries
   Email: ljohnsto@umn.edu
   ORCID: http://orcid.org/0000-0001-6908-9240

   Name: Liza Coburn
   Institution: University of Minnesota Libraries
   Email: ecoburn@umn.edu

3. Date of data collection: 2020-01-02 to 2020-01-15

4. Geographic location of data collection: n/a

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC0 http://creativecommons.org/publicdomain/zero/1.0/

2. Links to publications that cite or use the data:
   Johnston, L. R., Coburn, L. (2020). Data Sharing Readiness in Academic Institutions. Data Curation Network. https://datacurationnetwork.org/data-sharing-readiness-in-academic-institutions.

3. Recommended citation for the data:
   Johnston, L. R., Coburn, L. (2020). Data Sharing Readiness in Academic Institutions. Data Curation Network. https://doi.org/10.13020/2evx-7a87.

--------------------------
GENERAL INFORMATION
--------------------------

To address how the academic landscape for data repository and curation services has changed, we used website content analysis to better understand data repository services in academic research libraries, building on the 2017 SPEC Kit for Data Curation (Hudson-Vitale et al., 2017a). Of the 124 ARL institutions, we chose to focus on academic institutions and therefore excluded 10 civic libraries (National Archives and Records Administration, Boston Public Library, Center for Research Libraries, Library of Congress, National Agricultural Library, National Library of Medicine, National Research Council Knowledge Management, New York Public Library, New York State Library, Smithsonian Institution).
For each of the remaining 114 ARL institutions we asked four research questions:

1. Do they support data sharing via data repository services?
2. How many datasets did they hold as of January 2020?
3. What digital repository software platform was in use?
4. How do our results compare with the 2017 SPEC Kit data (Hudson-Vitale et al., 2017b)?

References

Hudson-Vitale, C., Imker, H., Johnston, L. R., Carlson, J., Kozlowski, W., Olendorf, R., and Stewart, C. (2017a). Data Curation. SPEC Kit 354. Washington, DC: Association of Research Libraries. https://doi.org/10.29242/spec.354

Hudson-Vitale, C., Imker, H., Johnston, L. R., Carlson, J., Kozlowski, W., Olendorf, R., and Stewart, C. (2017b). Survey Data for SPEC Kit 354: Data Curation. GitHub. https://github.com/1heidi/dcn_spec_kit_data.

-----------------------------------------
Methods
-----------------------------------------

Data collection was done by two people and verified independently. Institutional data repositories were identified by searching two sources: 1) Google.com, using the institution’s name combined with one of the terms “data repository”, “digital repository”, or “institutional repository”, and 2) base-search.net, which provided an excellent OAI-PMH metadata parser interface (BASE, 2020; Pieper & Summann, 2006).

This study aimed to count the number of original research datasets published by each academic ARL institution. Identifying dataset holdings was tricky. In the case of a dedicated data repository (e.g., a Dataverse instance), the total number of datasets is clearly displayed. In the more common case of an institutional repository that accepts data, we dug deeper, looking for records labeled as a “dataset” as opposed to photos, articles, theses, etc. First, we browsed collection names and searched using a variety of search strings and keywords. Most repositories using the Dublin Core metadata schema could be successfully refined using the “dc.type” facet set to “datasets”, “dataset”, “data sets”, or “data”.
Therefore, rather than relying on our own biases as to what “counts” as data, we counted the number of records labeled as dataset in the metadata. Next, results were spot-checked in an attempt to further verify the objects as research data, as opposed to an article describing a dataset. If the results could not be satisfactorily limited to records we objectively categorized as data, we labeled the number of datasets in the institutional collection as “unclear.” Then we validated our findings with the base-search.net tool by filtering the OAI-PMH feed to type = dataset. This tool also provided us with the platform type.

Using this approach, our numbers likely undercount the total dataset holdings for this sample. Furthermore, geospatial data and other specialized data types may be stored in a dedicated data repository outside of our review (e.g., a GeoBlackLight instance), and as a result this study may significantly underrepresent the GIS datasets hosted by academic institutions. Datasets licensed by the library were not the focus of this study, though we may have unavoidably included some.

Finally, it is difficult to compare the self-reported survey responses from 2017 with the observations made in 2020 because of inconsistencies in interpreting what is data and how to count the number of datasets in a collection. In one case, an institution reported 7442 datasets in their repository in 2017, but on further inspection in 2020 it became clear that this number most likely represented all repository holdings, including articles, reports, theses, etc. Our observation of the number of datasets in 2020 was much lower (7 datasets).

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Master.csv
-----------------------------------------

n/a signifies that the information was not available.
“blank” written out in a cell signifies that the response was left blank by the survey taker.
Col A “Identifier” = Random Unique ID
Col B “Institution” = Name of ARL institution
Col C “Repository URL” = URL to institutional repository landing page
Col D “Has repository?” = Yes/No
Col E “Standalone Data Repository 2020” = Yes/No, IR only, Implementing Data Repository
Col F “Data repository URL” = URL to data repository landing page or data collection page in institutional repository. n/a for a “No” response in “Standalone Data Repository 2020”
Col G “Has data?” = yes/unclear
Col H “Dataset Count IR” = Count of datasets for “Repository URL”
Col I “Dataset Count Data Repo” = Count of datasets for “Data repository URL”
Col J “Total Dataset Count (Jan 2020)” = Sum of “Dataset Count IR” and “Dataset Count Data Repo”
Col K “Reported Dataset Count (Jan 2017)” = Reprint of 2017 data from Col V “2017 Dataset count”
Col L “Data Repo Growth Observed” = “Dataset Count Data Repo” minus “Reported Dataset Count (Jan 2017)”
Col M “General IR Growth Observed” = IF “Standalone Data Repository”=TRUE, THEN “Dataset Count IR”; IF “Standalone Data Repository”=FALSE, THEN “Dataset Count IR” minus “Reported Dataset Count (Jan 2017)”
Col N “Average Growth per year” = “General IR Growth Observed” divided by 3 years
Col O “Platform 2020 (data repo here when possible)” = Name of platform observed in use in 2020. If two repositories were observed, only the platform for “Data repository URL” is listed
Col P “Platform Change Observed” = If “Platform 2020” did not equal “Platform (Jan 2017)” then Yes; if equal then No.
Otherwise = “Unknown”
Col Q “Platform (Jan 2017)” = Normalization of Col U “Platform” in 2017 data; n/a = did not receive question
Col R “Standalone Data Repository 2017” = Normalization of Col T “Branch Question Type” in 2017 data for comparison

*****2017 data reprinted from Hudson-Vitale et al., 2017b*****

Col S “Response to 2017 Spec Kit Results” = Spec Kit 2017 response from institution, yes/no
Col T “Branch Question Type” = Yes/No/In process response to Q1: Does your institution currently provide research data curation services?
Col U “Platform” = Selected from a list in response to Q8: Which of the following statements best describes your repository service for data? n/a means the respondent did not receive the question due to branching; “blank” means the respondent did receive the question but left it blank.
Col V “2017 Dataset count” = Free-text number response to Q11: Please enter the total number of data sets in your repository.
Col W “Metadata schema” = Selected from a list or free text in response to Q14: What metadata schema are you primarily using for discovery of data?
Col X “Workflow” = Self-deposit, Mediated, or Both self-deposit and mediated, in response to Q15: In which of the following ways do researchers deposit data into your data repository?
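
The derived columns described above (Col J, L, M, N, and P) can be sketched in Python as a check on the definitions. This is a minimal illustration and not the procedure used to build Master.csv. It assumes the CSV headers match the column names quoted in this readme exactly, that non-numeric cells such as n/a, unclear, or blank mean "no count available", that a missing count is treated as 0 when summing Col J, and that platform names are compared case-insensitively. The helper names to_count and derive are hypothetical.

```python
from typing import Optional


def to_count(cell: str) -> Optional[int]:
    """Parse a count cell; non-numeric cells (n/a, unclear, blank) become None."""
    text = cell.strip().replace(",", "")
    return int(text) if text.isdigit() else None


def derive(row: dict) -> dict:
    """Recompute the derived columns for one row keyed by the readme's column names."""
    ir = to_count(row["Dataset Count IR"])
    repo = to_count(row["Dataset Count Data Repo"])
    reported = to_count(row["Reported Dataset Count (Jan 2017)"])
    # Assumption: only a literal "Yes" counts as a standalone data repository.
    standalone = row["Standalone Data Repository 2020"].strip().lower() == "yes"

    # Col J: sum of both observed counts (a missing count is treated as 0 here).
    total = None
    if ir is not None or repo is not None:
        total = (ir or 0) + (repo or 0)

    # Col L: data repository count minus the 2017 self-reported count.
    repo_growth = None
    if repo is not None and reported is not None:
        repo_growth = repo - reported

    # Col M: IR count as-is when a standalone data repository exists,
    # otherwise IR count minus the 2017 self-reported count.
    if standalone:
        ir_growth = ir
    elif ir is not None and reported is not None:
        ir_growth = ir - reported
    else:
        ir_growth = None

    # Col N: growth averaged over the 3 years between January 2017 and January 2020.
    avg_growth = ir_growth / 3 if ir_growth is not None else None

    # Col P: Yes if the platform differs, No if it matches, Unknown if either is missing.
    p2020 = row["Platform 2020 (data repo here when possible)"].strip().lower()
    p2017 = row["Platform (Jan 2017)"].strip().lower()
    if p2020 in ("", "n/a", "blank") or p2017 in ("", "n/a", "blank"):
        change = "Unknown"
    else:
        change = "Yes" if p2020 != p2017 else "No"

    return {
        "Total Dataset Count (Jan 2020)": total,
        "Data Repo Growth Observed": repo_growth,
        "General IR Growth Observed": ir_growth,
        "Average Growth per year": avg_growth,
        "Platform Change Observed": change,
    }
```

To apply the sketch to the file, each row can be read with csv.DictReader from the standard library and passed to derive; any disagreement with the stored values would point to a row where one of the assumptions above does not hold.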