Providing Context for Hate Speech Classifiers using Post-hoc Explanations
Abstract
Hate speech classifiers trained on imbalanced datasets often struggle to determine whether group identifiers such as “gay” or “black” are being used in offensive or prejudiced contexts. This bias produces false positives whenever these terms appear, because models fail to grasp the contextual cues that distinguish hateful usage from benign mention. To address this, we extract Sampling and Occlusion (SOC; Jin et al., 2020) post-hoc explanations from fine-tuned BERT classifiers to efficiently identify bias against identity terms. Building on these insights, we introduce a novel regularization technique that leverages these explanations to encourage models to learn from the surrounding context of group identifiers rather than relying on the identifiers themselves. Our approach outperforms baseline methods by reducing false positives on out-of-domain data while maintaining or improving performance on in-domain data.
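The regularization idea described above can be illustrated with a toy sketch. The code below is ours, not the paper's: it uses a bag-of-words logistic classifier and approximates a term's importance by simple occlusion (the logit change when the term is masked), whereas SOC additionally samples and marginalizes over the surrounding context. Names such as `IDENTITY_TERMS`, `occlusion_importance`, and the `alpha` coefficient are illustrative assumptions; the penalty term added to the cross-entropy loss mirrors the intent of suppressing importance attributed to group identifiers.

```python
import math

# Hypothetical identity-term lexicon (assumption for illustration).
IDENTITY_TERMS = {"gay", "black"}

def logit(tokens, weights, bias):
    """Hate-class logit of a toy bag-of-words linear model."""
    return sum(weights.get(t, 0.0) for t in tokens) + bias

def occlusion_importance(tokens, term, weights, bias):
    """Change in the hate-class logit when `term` is masked out.
    A crude stand-in for SOC, which also marginalizes over context."""
    full = logit(tokens, weights, bias)
    masked = logit([t for t in tokens if t != term], weights, bias)
    return full - masked

def regularized_loss(tokens, label, weights, bias, alpha=0.1):
    """Binary cross-entropy plus a squared penalty on the importance
    the model assigns to group identifiers, nudging it to rely on
    the surrounding context instead of the identifiers themselves."""
    z = logit(tokens, weights, bias)
    p = 1.0 / (1.0 + math.exp(-z))
    ce = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    penalty = sum(
        occlusion_importance(tokens, t, weights, bias) ** 2
        for t in set(tokens) & IDENTITY_TERMS
    )
    return ce + alpha * penalty
```

Minimizing this loss during training drives the occlusion importance of identity terms toward zero, so a non-hateful sentence containing “gay” is no longer pushed toward the hate class by that token alone.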