While data standards can facilitate research by making data easier to share, the manual effort required to map data to those standards is an obstacle to their adoption. Semi-automated mapping strategies can reduce this manual burden. This research addresses the mapping dilemma by applying well-established and emerging techniques to a real-world use case. First, machine learning approaches were applied and evaluated for mapping Common Data Elements (CDEs) from the National Cancer Institute’s (NCI) cancer Data Standards Registry and Repository to the Biomedical Research Integrated Domain Group (BRIDG) model. Second, a graph database that incorporates the CDEs, the BRIDG model, and the NCI Thesaurus was developed and evaluated. A shortest path algorithm was then used to predict mappings from CDEs to classes in the BRIDG model. Finally, an analysis was conducted to determine the strengths and weaknesses of each approach, highlight data quality issues, and determine when either approach, or a combination of the two, provides optimal results. The results indicate that an artificial neural network-based mapping tool can predict CDE-to-BRIDG class mappings with accuracy ranging from 34% to 94%, but is limited by the availability of training data. The results also show that a graph database can be used to map CDEs to BRIDG classes, but is limited by the subjective nature of the mapping process. An optimal mapping tool combines machine learning and graph database techniques with the knowledge and experience of a human subject matter expert.
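As a rough illustration of the graph-based step described in the abstract, the following is a minimal sketch of how a shortest path search might propose BRIDG classes for a CDE. It is not the dissertation's implementation: the toy graph, the node names, the `CDE:`/`NCIt:`/`BRIDG:` prefixes, and the functions `build_adjacency` and `nearest_bridg_classes` are all hypothetical, and a plain breadth-first search over an in-memory dictionary stands in for a query against a real graph database.

```python
from collections import deque

# Hypothetical toy graph: every node name and edge below is illustrative only
# and is not taken from the caDSR, the NCI Thesaurus, or the BRIDG model.
EDGES = [
    ("CDE:Patient Birth Date", "NCIt:Birth Date"),
    ("NCIt:Birth Date", "NCIt:Date"),
    ("NCIt:Birth Date", "BRIDG:BiologicEntity"),
    ("NCIt:Date", "BRIDG:Activity"),
    ("CDE:Patient Birth Date", "NCIt:Patient"),
    ("NCIt:Patient", "NCIt:Person"),
    ("NCIt:Person", "BRIDG:Person"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def nearest_bridg_classes(adjacency, start, prefix="BRIDG:"):
    """Breadth-first search from `start`.

    Returns (distance, candidates), where candidates is the sorted list of
    nodes whose name begins with `prefix` at the smallest hop distance.
    Returns (None, []) when no such node is reachable.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    best = None
    hits = []
    while frontier:
        node, dist = frontier.popleft()
        if best is not None and dist > best:
            break  # every remaining node is farther than the nearest hit
        if node.startswith(prefix):
            best = dist
            hits.append(node)
            continue  # do not search past a candidate class
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return best, sorted(hits)


if __name__ == "__main__":
    graph = build_adjacency(EDGES)
    print(nearest_bridg_classes(graph, "CDE:Patient Birth Date"))
```

In this sketch, ties at the same hop distance are all returned as candidates, which leaves the final choice to a human reviewer, in line with the abstract's conclusion that expert judgment remains part of the mapping process.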
University of Minnesota Ph.D. dissertation. March 2020. Major: Biomedical Informatics and Computational Biology. Advisors: Guoqian Jiang, Chad Myers. 1 computer file (PDF); xi, 110 pages + 1 supplemental file.
Development Of Semi-Automated Tools To Map Cancer Research Common Data Elements To The Biomedical Research Integrated Domain Group Model.
Retrieved from the University of Minnesota Digital Conservancy,
Content distributed via the University of Minnesota's Digital Conservancy may be subject to additional license and use restrictions applied by the depositor.