Causal Inference in Neural Language Generation Models Through Interventional Probing and Counterfactual Evaluation
Abstract
Understanding the causal mechanisms underlying neural language generation models (NLGMs) is essential for improving model interpretability and controllability. This paper explores causal inference within large-scale transformer-based language models using interventional probing and counterfactual evaluation. We propose a framework that uses synthetic interventions to disentangle the causal contributions of internal representations to linguistic output, and we assess model behavior across counterfactual scenarios. Our empirical results on GPT-2 and BART demonstrate that causal traces in hidden layers correspond to syntactic and semantic decision points. This study contributes to a growing body of literature integrating causal inference with deep learning interpretability.
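To make the notion of an interventional probe concrete, the following is a minimal sketch, not the paper's exact procedure: it patches the hidden state at one GPT-2 layer with activations computed from a counterfactual prompt and compares next-token predictions. It assumes the HuggingFace transformers library; the layer index, prompts, and token-aligned patching strategy are illustrative choices.

```python
# Minimal sketch of an interventional (activation-patching) probe on GPT-2.
# Assumes HuggingFace `transformers`; layer, prompts, and patching scheme are
# illustrative assumptions, not the exact setup described in the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # hypothetical intervention site

def get_hidden(prompt, layer):
    """Return the hidden state emitted by transformer block `layer`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block L is index L + 1.
    return out.hidden_states[layer + 1]

def next_token_logits(prompt, patch=None, layer=LAYER):
    """Next-token logits, optionally splicing `patch` into `layer`'s output."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    handle = None
    if patch is not None:
        def hook(module, inputs, output):
            # GPT2Block returns a tuple; element 0 is the hidden state.
            hidden = output[0].clone()
            n = min(hidden.shape[1], patch.shape[1])
            hidden[:, :n, :] = patch[:, :n, :]  # counterfactual activations
            return (hidden,) + output[1:]
        handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
    finally:
        if handle is not None:
            handle.remove()
    return logits

factual = "The nurse said that she"
counterfactual = "The doctor said that he"

baseline = next_token_logits(factual)
patched = next_token_logits(factual, patch=get_hidden(counterfactual, LAYER))
print("Top token (baseline):", tokenizer.decode([baseline.argmax().item()]))
print("Top token (patched): ", tokenizer.decode([patched.argmax().item()]))
```

A shift in the predicted continuation under the patched run, relative to the baseline, would indicate that the chosen layer carries causally relevant information for that decision point; the counterfactual evaluation described in the abstract generalizes this comparison across systematically varied inputs.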