Hate Speech Detection In Twitter: A Selectively Trained Ensemble Method

This thesis tests classification models from Natural Language Processing and Machine Learning on the task of identifying hate speech. We tested on multiple annotated data sets (Davidson et al. 2017) of tweets labeled as hate speech, offensive speech, both, or neither. Hate speech has become an unavoidable topic in the current social media environment because of poorly monitored comment sections and news feeds, and studies showing its negative effects on people’s well-being have begun to surface (Gelber and McNamara 2015). Being able to identify hate speech accurately and precisely has therefore grown in importance. Hate speech is often contextual, subjective, and a matter of opinion, which makes creating an accurate model of such speech all the more difficult. We found that an ensemble of a classic Naive Bayes classifier (Pedregosa et al. 2019c), Random Forest (Pedregosa et al. 2019b), K-Means (Pedregosa et al. 2019d), and Bernoulli (Pedregosa et al. 2019a) performed better than similar studies in precision, accuracy, recall, and F-score (Malmasi and Zampieri 2018). The ensemble also performed better than the strongest individual model, Random Forest, by a small but useful margin. We believe this is because the nuance and context behind hate speech are more than one model can fully encompass. In addition to the ensemble strategy, training only on data labeled ‘clean’ (neither hate speech nor offensive) or labeled ‘dirty’ (hate speech) with higher confidence ratings increased the precision of our model by around 10% in some cases, compared with training on the complete data set, which includes tweets with a blurred sentiment, such as tweets that are offensive but not hate speech. An accurate and precise model such as this will allow organizations to protect their users from such language and prevent the negative effects of hate speech. Additionally, it will allow us to identify more hate speech tweets or statements, providing more data for future research into trends deeper than the tweet text alone, such as replies, retweets, and user biographies.
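The two ideas in the abstract, selective training and combining several classifiers, can be illustrated with a minimal sketch. This is not the thesis's implementation. The row layout (`text`, `label`, `confidence`), the 0.8 threshold, the choice to apply that threshold to both ‘clean’ and ‘dirty’ rows, and the use of a simple majority vote to combine models are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical row layout: {"text": str, "label": str, "confidence": float}
Row = Dict[str, Any]
# A model is any callable that maps tweet text to a label.
Model = Callable[[str], str]


def select_training_rows(rows: Sequence[Row], min_confidence: float = 0.8) -> List[Row]:
    """Keep only rows labeled 'clean' or 'dirty' with high enough confidence.

    Rows with a blurred sentiment (here, 'offensive') are dropped.
    The threshold value is an assumption, not a figure from the thesis.
    """
    selected = []
    for row in rows:
        is_clear_label = row["label"] in ("clean", "dirty")
        if is_clear_label and float(row["confidence"]) >= min_confidence:
            selected.append(row)
    return selected


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most common label; ties go to the label that was voted first."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(models: Sequence[Model], text: str) -> str:
    """Ask every model for a label and combine the answers by majority vote."""
    return majority_vote([model(text) for model in models])


# Toy data to show the filter; the texts are placeholders, not real tweets.
rows = [
    {"text": "example a", "label": "clean", "confidence": 1.0},
    {"text": "example b", "label": "offensive", "confidence": 1.0},
    {"text": "example c", "label": "dirty", "confidence": 0.6},
    {"text": "example d", "label": "dirty", "confidence": 0.9},
]
training_rows = select_training_rows(rows)
```

In this toy run only "example a" and "example d" survive the filter: the offensive row is excluded by label and "example c" by its low confidence. In practice each `Model` would wrap a trained classifier, and a weighted or probability-based vote may be a better fit than a plain majority.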

Keywords

Ensemble

Hate Speech

Machine Learning

Natural Language Processing

Selective Training

Twitter

Description

University of Minnesota M.S. thesis. May 2020. Major: Computer Science. Advisor: Richard Maclin. 1 computer file (PDF); viii, 35 pages.

Collections

Master's Theses (Plan A and Professional Engineering Design Projects)

Suggested citation

Houston, Jackson. (2020). Hate Speech Detection In Twitter: A Selectively Trained Ensemble Method. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/216080.

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.
