
Train Sparse Autoencoders Efficiently By Utilizing Features Correlation

Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky. No Venue, 2025

[Paper]
Compositional Generalization Efficiency Interdisciplinary Approaches Interpretability Model Architecture Multimodal Semantic Representation Productivity Enhancement Tools Training Techniques Visual Question Answering

Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
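To make the factorization idea concrete, below is a minimal, hypothetical sketch of a Kronecker-style encoder in PyTorch. The class name, shapes, and the specific form of the AND-like activation are assumptions for illustration only, not the authors' implementation: the latent of size m·n is built from two small projections of sizes m and n, and an elementwise product of non-negative gates stands in for the differentiable AND (mAND).

```python
# Hypothetical sketch of a Kronecker-factorized SAE encoder.
# Names, shapes, and the mAND stand-in are assumptions, not the paper's code.
import torch
import torch.nn as nn


class KronEncoderSketch(nn.Module):
    """Builds an (m * n)-dim latent from two small projections of sizes m and n,
    instead of a single dense d -> (m * n) linear map."""

    def __init__(self, d_model: int, m: int, n: int):
        super().__init__()
        self.proj_a = nn.Linear(d_model, m)  # d -> m (first factor)
        self.proj_b = nn.Linear(d_model, n)  # d -> n (second factor)
        self.m, self.n = m, n

    @staticmethod
    def mand_like(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Placeholder for mAND: a smooth stand-in for binary AND.
        # The product of non-negative gates is zero unless both inputs fire.
        return torch.relu(a) * torch.relu(b)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.proj_a(x)                   # (batch, m)
        b = self.proj_b(x)                   # (batch, n)
        # Outer (Kronecker-style) combination -> (batch, m, n) -> (batch, m * n),
        # at the cost of two small matmuls instead of one large one.
        z = self.mand_like(a.unsqueeze(-1), b.unsqueeze(-2))
        return z.flatten(start_dim=-2)


# Usage: a 4096-dim hidden state mapped to a 64 * 256 = 16384-dim latent.
enc = KronEncoderSketch(d_model=4096, m=64, n=256)
latent = enc(torch.randn(8, 4096))           # shape (8, 16384)
```

The point of the sketch is the cost comparison: a dense encoder would need a d × (m·n) weight matrix, while the factorized version only stores and multiplies d × m and d × n matrices before the cheap outer combination.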

Similar Work