Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter
Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data, with small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on the majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks are evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier records the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.Copyright of all work published here subsists in the authors. While SACJ retains right of first publication, subsequent re-publication is expressly permitted provided the original SACJ publication is acknowledged and cited, according to the terms detailed below. If plagiarism is detected during review, a paper may be summarily rejected and will not be accepted unless even minor infringements are corrected. Should plagiarism be detected after a paper is published, the Editor reserves the right to withdraw a paper from publication. We expect authors to be honest in representing work as their own, and to respect the time and effort our reviewers put in without an undue burden of policing plagiarism, and hence take violations seriously. SACJ applies the Creative Commons Attribution NonCommercial 4.0 License (CC BY-NC 4.0) to all papers published in this journal. Authors who publish with SACJ agree to the following:
- Authors retain copyright and grant SACJ right of first publication. The work is additionally licensed under a Creative Commons Attribution Non-Commercial License that requires others who share the work to acknowledge the work’s authorship and initial publication in SACJ. Should anyone else wish to make commercial use of the work, SACJ cedes the right to the author to negotiate terms and does not expect to be paid any royalties.
- Authors may enter into additional arrangements for non-exclusive distribution of the SACJ-published version of the work (e.g., post it to a repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are required to refrain from posting their work online prior to completion of reviews so as not to compromise double-blind reviewing or confuse plagiarism checks.