Author: Solinsky, Jacob
Date accessioned: 2025-02-14
Date available: 2025-02-14
Date issued: 2022-05
URI: https://hdl.handle.net/11299/269953
Description: University of Minnesota M.S. thesis. May 2022. Major: Biomedical Informatics and Computational Biology. Advisor: Serguei Pakhomov. 1 computer file (PDF); iii, 34 pages.
Abstract: In this paper I describe SYNDIRA, a self-attention-based transformer language model that ingests sentences incorporating syntactic information from a dependency parser as an additional input feature alongside context-free subword token vector encodings. I apply it to the Named Entity Recognition (NER) task of identifying spans within the Chan Zuckerberg Initiative's MedMentions corpus that refer to concepts belonging to the st21pv subset of the Unified Medical Language System's Metathesaurus, a subset considered of particular interest for automated medical Natural Language Processing (NLP). I employ a modified version of the MedLinker architecture described by Loureiro and Jorge (2020), substituting SYNDIRA for the various BERT models they used as the source of contextual word embeddings fed into their BiLSTM-CRF-based span identifier. I find that SYNDIRA is capable of encoding syntactic information that is useful for its NER BIO tagging task, but not of sufficient quality to compete with the original BERT-based MedLinker.
Language: en
Keywords: BERT; Dependency; Syntax; Transformer
Title: A syntax directed attention BIO tagger for Medical Named Entity Recognition
Type: Thesis or Dissertation