User profiles matching "Can Rager"

Can Rager

Research Assistant
Verified email at northeastern.edu
Cited by 125

Sparse feature circuits: Discovering and editing interpretable causal graphs in language models

S Marks, C Rager, EJ Michaud, Y Belinkov… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce methods for discovering and applying sparse feature circuits. These are causally
implicated subnetworks of human-interpretable features for explaining language model …

Attribution patching outperforms automated circuit discovery

A Syed, C Rager, A Conmy - arXiv preprint arXiv:2310.10348, 2023 - arxiv.org
… In these terms we can examine dependencies of nodes with the output of earlier nodes, i.e.,
we can measure the effect of attention heads in layer 0 on the attention heads in layer 2. In the …

Measuring progress in dictionary learning for language model interpretability with board game models

A Karvonen, B Wright, C Rager, R Angell… - arXiv preprint arXiv …, 2024 - arxiv.org
What latent features are encoded in language model (LM) representations? Recent work on
training sparse autoencoders (SAEs) to disentangle interpretable features in LM …

The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability

…, M Li, S Marks, K Pal, N Prakash, C Rager… - arXiv preprint arXiv …, 2024 - arxiv.org
… How can we understand what these computations represent, such that we can arrive at a
deeper algorithmic understanding of how and why models behave the way they do? For …

NNsight and NDIF: Democratizing access to foundation model internals

…, E Todd, J Brinkmann, C Juang, K Pal, C Rager… - arXiv preprint arXiv …, 2024 - arxiv.org
The enormous scale of state-of-the-art foundation models has limited their accessibility to
scientists, because customized experiments at large model sizes require costly hardware and …

Can we prevent road rage?

M Asbridge, RG Smart… - Trauma, violence, & abuse, 2006 - journals.sagepub.com
rage has become a serious concern in many countries, and preventive efforts are required.
This article reviews what can be done to prevent road rage … for road rage behavior could be …

Structured World Representations in Maze-Solving Transformers

…, G Corlouer, C Mathwin, L Quirke, C Rager… - arXiv preprint arXiv …, 2023 - arxiv.org
Transformer models underpin many recent advances in practical machine learning applications,
yet understanding their internal behavior continues to elude researchers. Given the size …

An adversarial example for direct logit attribution: Memory management in gelu-4l

J Dao, YT Lau, C Rager, J Janiak - arXiv preprint arXiv:2310.07325, 2023 - arxiv.org
… Therefore, if real, memory management could significantly impact what conclusions we can
… 2, we can see the projection ratio between the outputs of every attention head and MLP (…

A Configurable Library for Generating and Manipulating Maze Datasets

…, AF Spies, T Räuker, D Valentine, C Rager… - arXiv preprint arXiv …, 2023 - arxiv.org
can be deployed on mazes smaller than the training size without destroying the structure of
the vocabulary. Examples of usage of this dataset to train autoregressive transformers can be …

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

A Karvonen, C Rager, S Marks, N Nanda - arXiv preprint arXiv:2411.18895, 2024 - arxiv.org
… and can be computed in seconds, enabling frequent evaluations such as during SAE training.
However, we can … This dependency on human-generated concepts can overlook important …