Google Scholar

User profiles for "Can Rager"

Can Rager

Research Assistant

Verified email at northeastern.edu

Cited by 125

[PDF] arxiv.org

Sparse feature circuits: Discovering and editing interpretable causal graphs in language models

S Marks, C Rager, EJ Michaud, Y Belinkov… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce methods for discovering and applying sparse feature circuits. These are causally
implicated subnetworks of human-interpretable features for explaining language model …

Cited by 60; 2 versions

[PDF] arxiv.org

Attribution patching outperforms automated circuit discovery

A Syed, C Rager, A Conmy - arXiv preprint arXiv:2310.10348, 2023 - arxiv.org

… In these terms we can examine dependencies of nodes with the output of earlier nodes, i.e.
we can measure the effect of attention heads in layer 0 on the attention heads in layer 2. In the …

Cited by 30; 3 versions

[PDF] arxiv.org

Measuring progress in dictionary learning for language model interpretability with board game models

A Karvonen, B Wright, C Rager, R Angell… - arXiv preprint arXiv …, 2024 - arxiv.org

What latent features are encoded in language model (LM) representations? Recent work on
training sparse autoencoders (SAEs) to disentangle interpretable features in LM …

Cited by 9; 6 versions

[PDF] arxiv.org

The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability

…, M Li, S Marks, K Pal, N Prakash, C Rager… - arXiv preprint arXiv …, 2024 - arxiv.org

… How can we understand what these computations represent, such that we can arrive at a
deeper algorithmic understanding of how and why models behave the way they do? For …

Cited by 4; 3 versions

[PDF] arxiv.org

NNsight and NDIF: Democratizing access to foundation model internals

…, E Todd, J Brinkmann, C Juang, K Pal, C Rager… - arXiv preprint arXiv …, 2024 - arxiv.org

The enormous scale of state-of-the-art foundation models has limited their accessibility to
scientists, because customized experiments at large model sizes require costly hardware and …

Cited by 4; 3 versions

Can we prevent road rage?

M Asbridge, RG Smart… - Trauma, violence, & abuse, 2006 - journals.sagepub.com

… rage has become a serious concern in many countries, and preventive efforts are required.
This article reviews what can be done to prevent road rage … for road rage behavior could be …

Cited by 55; 8 versions

[PDF] arxiv.org

Structured World Representations in Maze-Solving Transformers

…, G Corlouer, C Mathwin, L Quirke, C Rager… - arXiv preprint arXiv …, 2023 - arxiv.org

Transformer models underpin many recent advances in practical machine learning applications,
yet understanding their internal behavior continues to elude researchers. Given the size …

Cited by 3

[PDF] arxiv.org

An adversarial example for direct logit attribution: Memory management in gelu-4l

J Dao, YT Lau, C Rager, J Janiak - arXiv preprint arXiv:2310.07325, 2023 - arxiv.org

… Therefore, if real, memory management could significantly impact what conclusions we can
… 2, we can see the projection ratio between the outputs of every attention head and MLP (…

Cited by 2; 2 versions

[PDF] arxiv.org

A Configurable Library for Generating and Manipulating Maze Datasets

…, AF Spies, T Räuker, D Valentine, C Rager… - arXiv preprint arXiv …, 2023 - arxiv.org

… can be deployed on mazes smaller than the training size without destroying the structure of
the vocabulary. Examples of usage of this dataset to train autoregressive transformers can be …

Cited by 4

[PDF] arxiv.org

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

A Karvonen, C Rager, S Marks, N Nanda - arXiv preprint arXiv:2411.18895, 2024 - arxiv.org

… and can be computed in seconds, enabling frequent evaluations such as during SAE training.
However, we can … This dependency on human-generated concepts can overlook important …

2 versions
