Annotating and Automatically Extracting Task Descriptions from Shared Task Overview Papers in Natural Language Processing Domains
2022-05
Authors
Martin, Anna
Type
Thesis or Dissertation
Abstract
The rapid growth rate of scientific literature makes it increasingly difficult for researchers to keep up with developments in their field. This problem can be addressed by structuring academic papers according to information units that go deeper than keywords. The need to efficiently structure scholarly documents so that they are machine-operable necessitates the creation of machine readers to extract and classify fine-grained units of scientific information. This process requires the development of gold-standard corpora of annotated scholarly work. For this thesis we developed a gold-standard corpus of task description phrase annotations from Shared Task Overview papers and trained a text classifier on the resulting dataset. The annotation project consisted of: developing a set of annotation guidelines; reading and annotating the task descriptions of 254 Shared Task Overview papers published in the ACL Anthology; validating our guidelines by measuring the Inter-Annotator Agreement; and digitizing the resulting corpus so that it can be used as a resource in machine learning projects. The resulting dataset comprises 254 full-text papers containing 41,752 sentences and 259 task descriptions. In our second and final validation we achieved a strict score of 0.44 and a relaxed score of 0.95, measured using Cohen's kappa coefficient. We then used this resource to facilitate the training and development of a classifier that automatically identifies shared task descriptions. For preprocessing, we improved the balance between negative and positive samples by eliminating every paper section that does not contain a task description. During our machine learning experiments we trained and validated 18 different sentence classification models using a variety of text encodings and hyperparameter settings. The best-performing model was SciBERT, which achieved an F1 score of 0.75 when applied to the reduced test set.
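The abstract reports inter-annotator agreement using Cohen's kappa. The thesis's own code is not reproduced on this page; as an illustration only, here is a minimal pure-Python sketch of the kappa statistic over two annotators' sentence-level labels (the label scheme and example data below are hypothetical, not taken from the thesis corpus):

```python
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 1 = sentence is part of a task description, 0 = not.
a = [1, 0, 1, 1]
b = [1, 0, 0, 1]
print(cohen_kappa(a, b))  # → 0.5
```

The "strict" vs. "relaxed" scores in the abstract reflect different matching criteria for annotated spans; the kappa formula itself is the same in both cases.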
Description
University of Minnesota M.S. thesis. May 2022. Major: Computer Science. Advisor: Ted Pedersen. 1 computer file (PDF); x, 113 pages.
Suggested citation
Martin, Anna. (2022). Annotating and Automatically Extracting Task Descriptions from Shared Task Overview Papers in Natural Language Processing Domains. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/241255.