This readme.txt file was generated on 20220630 by Cody Hennesy.

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Data resources on Academic LibGuides

2. Author Information

   Principal Investigator Contact Information
      Name: Cody Hennesy
      Institution: University of Minnesota, Twin Cities
      Email: chennesy@umn.edu
      ORCID: 0000-0002-9410-9810

   Associate or Co-investigator Contact Information
      Name: Jenny McBurney
      Institution: University of Minnesota, Twin Cities
      Email: jmcburne@umn.edu
      ORCID: 0000-0003-4081-6066

   Associate or Co-investigator Contact Information
      Name: Alicia Kubas
      Institution: US Government Publishing Office
      Email: akubas@gpo.gov
      ORCID: 0000-0003-3794-530X

3. Date of data collection (single date, range, approximate date): 2021-11-12 to 2021-12-07 (YYYY-MM-DD)

4. Geographic location of data collection (where was data collected?): Online

5. Information about funding sources that supported the collection of the data: Not funded.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC BY-NC, Attribution-NonCommercial 3.0 United States

2. Links to publications that cite or use the data: A paper using this data is being prepared for submission under the title "Taking count: A computational analysis of data resources on academic LibGuides" by Hennesy, Kubas, and McBurney.

3. Links to other publicly accessible locations of the data: -

4. Links/relationships to ancillary data sets: -

5. Was data derived from another source?
   Data was collected from public LibGuides HTML pages and from an API provided by LibGuides. A fair use analysis was conducted to ensure that data collection and sharing would be a fair use (much of the content is also licensed for re-use, though licensing varies). The compilation of the specific data elements captured here is a transformative use: it enables quantitative analyses of the data resources shared at different institutions without reproducing substantial portions of any specific guide or institution's content.

6. Recommended citation for the data:
   Hennesy, Cody; McBurney, Jenny; Kubas, Alicia. (2022). Data resources on Academic LibGuides. Retrieved from the Data Repository for the University of Minnesota, https://hdl.handle.net/11299/228216

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

   A. Filename: lg_data.csv
      Short description: Each row contains a data resource collected from a specific LibGuides webpage. Duplicates are present because resources are sometimes repeated on a single guide page and are often duplicated on different guides from a single institution. Four columns represent data exactly as it was collected from the guides (resource_name_scraped, resource_url, guide_url, institution_site_id), and three columns represent normalized versions of the same data (resource_name_normalized, resource_url_simple, and resource_domain) to make it easier for others to reproduce the analysis code shared on GitHub.

   B. Filename: lg_data_institutions.txt
      Short description: A plain-text list of the institutions from which LibGuides were collected and that are included in the analysis.

   C. Filename: lg_data_top_annotated.csv
      Short description: The paper authors annotated the list of the top 500 resources (after cleaning and normalization) to mark whether resource access was free, paid, or hybrid.

2. Relationship between files: -

3. Additional related data collected that was not included in the current data package:
   The original data collection included 227,639 resources. Data cleaning identified and removed 40,687 non-data-related resources. The final dataset used for analysis includes 186,952 rows.

4. Are there multiple versions of the dataset? No

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:
   The Beautiful Soup, requests, and Selenium Python packages were used to scrape data from publicly accessible LibGuides HTML pages and from an API provided by LibGuides to enable machine access (Reitz, 2015; Richardson, 2015). See the "Data collection" section of the "Taking count" paper for more details.

2. Methods for processing the data:
   Data elements were compiled into a Python Pandas dataframe, where initial cleaning took place (e.g., removing non-relevant resources). Substantial normalization of resource names was undertaken in OpenRefine. The dataset includes both the original names as scraped and the normalized versions created in OpenRefine.

   The lg_data_top_annotated.csv file is derived from a list of the 500 most common (cleaned and normalized) data resources found on LibGuides. This list was generated by sorting the full resource list in lg_data.csv by the number of times each normalized resource appeared, and then outputting the 500 most common resources. The list was then sorted by the percent of sites that included each resource. The authors annotated this list to note whether access to a resource was "free," "paid," or "hybrid." During annotation the authors also decided to merge two resources that represented the same basic site into a single resource, reducing the list to 498. See the "Methodology" section in the "Taking count…" paper for more details.

3. Instrument- or software-specific information needed to interpret the data: -

4. Standards and calibration information, if appropriate: -

5. Environmental/experimental conditions: -

6. Describe any quality-assurance procedures performed on the data: -

7. People involved with sample collection, processing, analysis and/or submission: Cody Hennesy

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: lg_data.csv
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 186,952

3. Missing data codes:
   Blank cell: No data was found/collected from the guide for this element.

4. Variable List

   A. Name: resource_name_normalized
      Description: The name of the data resource after being normalized for analysis.
      Pandas dtype: object

   B. Name: resource_name_scraped
      Description: The name of the data resource as originally collected from a LibGuide.
      Pandas dtype: object

   C. Name: resource_url
      Description: The full URL collected for each data resource.
      Pandas dtype: object

   D. Name: resource_url_simple
      Description: The resource URL without the scheme (e.g., http://) to allow for more accurate clustering by full URL.
      Pandas dtype: object

   E. Name: resource_domain
      Description: The extracted subdomain, second-level domain, and top-level domain (e.g., nces.ed.gov) of the resource URL to allow for analysis of domain-level associations.
      Pandas dtype: object

   F. Name: guide_url
      Description: The URL for the guide from which each resource was collected.
      Pandas dtype: object

   G. Name: institution_site_id
      Description: The LibGuides "site_id" for each institution associated with the resource.
      Pandas dtype: int64

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: lg_data_institutions.txt
-----------------------------------------

1. Number of variables: 1

2. Number of cases/rows: 123

3. Missing data codes:

4. Variable List

   A. Name: [not named]
      Description: The plain-text name of every institution from which LibGuides were collected for analysis.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: lg_data_top_annotated.csv
-----------------------------------------

1. Number of variables: 6

2. Number of cases/rows: 498

3. Missing data codes:
   Blank cell: No data was found/collected from the guide for this element.

4. Variable List

   A. Name: name
      Description: The normalized resource name (derived from resource_name_normalized in lg_data.csv).

   B. Name: sum
      Description: The total number of institutions (out of 123) that included the resource name on at least one LibGuide.

   C. Name: percent_of_sites
      Description: The percentage of the 123 institutions on which the resource was present (the list is sorted by this variable).

   D. Name: count
      Description: The total number of times a normalized resource was found on the LibGuides in the analysis.

   E. Name: url
      Description: The most common simplified URL (derived from resource_url_simple in lg_data.csv) for a particular normalized resource. Resources with the same normalized name in lg_data.csv often have a number of different URLs associated with them, so this URL is included simply as an example of the most common webpage associated with the normalized name.

   F. Name: access
      Description: Either free, paid, or hybrid. These tags were added manually by the authors to represent how users are typically able to access a site. "free" represents resources that are freely available to all (e.g., most .gov websites), "paid" resources require some level of payment (individual or institutional) to use, and "hybrid" sites include a mix of paid and freely available data collections.
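
Illustrative sketch (not the authors' analysis code, which is shared on GitHub): the name, sum, percent_of_sites, count, and url variables described above could be derived from the lg_data.csv columns roughly as follows. The helper names (most_common, summarize_top_resources), the tie-breaking sort on count, and the small demo table are assumptions made for illustration only. The manual "access" annotation and the manual merge that reduced the list from 500 to 498 rows are not reproduced.

```python
import pandas as pd


def most_common(values: pd.Series):
    """Return the most frequent non-missing value, or None if there is none."""
    counts = values.dropna().value_counts()
    return counts.index[0] if len(counts) else None


def summarize_top_resources(df: pd.DataFrame, n_sites: int, top_n: int = 500) -> pd.DataFrame:
    """Hypothetical helper: summarize the top_n most common normalized resources."""
    summary = (
        df.groupby("resource_name_normalized")
        .agg(
            count=("guide_url", "size"),                # rows per normalized name
            sum=("institution_site_id", "nunique"),     # distinct institutions
            url=("resource_url_simple", most_common),   # most common simplified URL
        )
        .reset_index()
        .rename(columns={"resource_name_normalized": "name"})
    )
    summary["percent_of_sites"] = summary["sum"] / n_sites * 100
    # Keep the top_n names by row count, then sort by share of institutions.
    # Breaking ties on count is an assumption, not documented behavior.
    top = summary.nlargest(top_n, "count")
    top = top.sort_values(["percent_of_sites", "count"], ascending=False)
    columns = ["name", "sum", "percent_of_sites", "count", "url"]
    return top[columns].reset_index(drop=True)


# Made-up demo rows that use the lg_data.csv column names.
demo = pd.DataFrame(
    {
        "resource_name_normalized": ["Resource A"] * 4 + ["Resource B"] * 3 + ["Resource C"],
        "resource_url_simple": [
            "example.org/a", "example.org/a", "example.org/a", "example.org/a2",
            "example.org/b", "example.org/b", "example.org/b",
            "example.org/c",
        ],
        "guide_url": ["guides.example.edu/g1"] * 8,
        "institution_site_id": [1, 1, 2, 2, 1, 2, 3, 3],
    }
)

result = summarize_top_resources(demo, n_sites=3, top_n=2)
print(result)
```

To try this against the real file, one would pass pd.read_csv("lg_data.csv") as df with n_sites=123 (the number of institutions listed in lg_data_institutions.txt) and top_n=500. The output should be comparable to, but not necessarily identical with, lg_data_top_annotated.csv.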