User profiles for "Can Rager"
Can Rager, Research Assistant. Verified email at northeastern.edu. Cited by 125
Sparse feature circuits: Discovering and editing interpretable causal graphs in language models
We introduce methods for discovering and applying sparse feature circuits. These are causally
implicated subnetworks of human-interpretable features for explaining language model …
Attribution patching outperforms automated circuit discovery
… In these terms we can examine dependencies of nodes with the output of earlier nodes, ie
we can measure the effect of attention heads in layer 0 on the attention heads in layer 2. In the …
Measuring progress in dictionary learning for language model interpretability with board game models
What latent features are encoded in language model (LM) representations? Recent work on
training sparse autoencoders (SAEs) to disentangle interpretable features in LM …
The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability
… How can we understand what these computations represent, such that we can arrive at a
deeper algorithmic understanding of how and why models behave the way they do? For …
Nnsight and ndif: Democratizing access to foundation model internals
The enormous scale of state-of-the-art foundation models has limited their accessibility to
scientists, because customized experiments at large model sizes require costly hardware and …
Can we prevent road rage?
M Asbridge, RG Smart… - Trauma, violence, & abuse, 2006 - journals.sagepub.com
… rage has become a serious concern in many countries, and preventive efforts are required.
This article reviews what can be done to prevent road rage … for road rage behavior could be …
Structured World Representations in Maze-Solving Transformers
…, G Corlouer, C Mathwin, L Quirke, C Rager… - arXiv preprint arXiv …, 2023 - arxiv.org
Transformer models underpin many recent advances in practical machine learning applications,
yet understanding their internal behavior continues to elude researchers. Given the size …
An adversarial example for direct logit attribution: Memory management in gelu-4l
… Therefore, if real, memory management could significantly impact what conclusions we can
… 2, we can see the projection ratio between the outputs of every attention head and MLP (…
A Configurable Library for Generating and Manipulating Maze Datasets
… can be deployed on mazes smaller than the training size without destroying the structure of
the vocabulary. Examples of usage of this dataset to train autoregressive transformers can be …
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
… and can be computed in seconds, enabling frequent evaluations such as during SAE training.
However, we can … This dependency on human-generated concepts can overlook important …