Neel Nanda
Journal Articles
- 2024
- [j3] Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas:
Universal Neurons in GPT2 Language Models. Trans. Mach. Learn. Res. 2024 (2024)
- 2023
- [j2] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas:
Finding Neurons in a Haystack: Case Studies with Sparse Probing. Trans. Mach. Learn. Res. 2023 (2023)
- 2022
- [j1] Michael K. Cohen, Marcus Hutter, Neel Nanda:
Fully General Online Imitation Learning. J. Mach. Learn. Res. 23: 334:1-334:30 (2022)
Conference and Workshop Papers
- 2024
- [c7] Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda:
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. ICLR 2024
- [c6] Fred Zhang, Neel Nanda:
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. ICLR 2024
- [c5] Cody Rushing, Neel Nanda:
Explorations of Self-Repair in Language Models. ICML 2024
- 2023
- [c4] Neel Nanda, Andrew Lee, Martin Wattenberg:
Emergent Linear Representations in World Models of Self-Supervised Sequence Models. BlackboxNLP@EMNLP 2023: 16-30
- [c3] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt:
Progress measures for grokking via mechanistic interpretability. ICLR 2023
- [c2] Bilal Chughtai, Lawrence Chan, Neel Nanda:
A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations. ICML 2023: 6243-6267
- 2022
- [c1] Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernion, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom B. Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Dario Amodei, Jack Clark:
Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764
Informal and Other Publications
- 2024
- [i30] Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas:
Universal Neurons in GPT2 Language Models. CoRR abs/2401.12181 (2024)
- [i29] Bilal Chughtai, Alan Cooney, Neel Nanda:
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs. CoRR abs/2402.07321 (2024)
- [i28] Cody Rushing, Neel Nanda:
Explorations of Self-Repair in Language Models. CoRR abs/2402.15390 (2024)
- [i27] János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda:
AtP*: An efficient and scalable method for localizing LLM behaviour to components. CoRR abs/2403.00745 (2024)
- [i26] Stefan Heimersheim, Neel Nanda:
How to use and interpret activation patching. CoRR abs/2404.15255 (2024)
- [i25] Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda:
Improving Dictionary Learning with Gated Sparse Autoencoders. CoRR abs/2404.16014 (2024)
- [i24] Aleksandar Makelov, Georg Lange, Neel Nanda:
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control. CoRR abs/2405.08366 (2024)
- [i23] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, Neel Nanda:
Refusal in Language Models Is Mediated by a Single Direction. CoRR abs/2406.11717 (2024)
- [i22] Jacob Dunefsky, Philippe Chlenski, Neel Nanda:
Transcoders Find Interpretable LLM Feature Circuits. CoRR abs/2406.11944 (2024)
- [i21] Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda:
Confidence Regulation Neurons in Language Models. CoRR abs/2406.16254 (2024)
- [i20] Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda:
Interpreting Attention Layer Outputs with Sparse Autoencoders. CoRR abs/2406.17759 (2024)
- [i19] Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda:
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. CoRR abs/2407.14435 (2024)
- [i18] Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca D. Dragan, Rohin Shah, Neel Nanda:
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. CoRR abs/2408.05147 (2024)
- 2023
- [i17] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt:
Progress measures for grokking via mechanistic interpretability. CoRR abs/2301.05217 (2023)
- [i16] Bilal Chughtai, Lawrence Chan, Neel Nanda:
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations. CoRR abs/2302.03025 (2023)
- [i15] Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez:
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models. CoRR abs/2304.12918 (2023)
- [i14] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas:
Finding Neurons in a Haystack: Case Studies with Sparse Probing. CoRR abs/2305.01610 (2023)
- [i13] Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay B. Cohen, Fazl Barez:
Neuron to Graph: Interpreting Language Model Neurons at Scale. CoRR abs/2305.19911 (2023)
- [i12] Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik:
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. CoRR abs/2307.09458 (2023)
- [i11] Neel Nanda, Andrew Lee, Martin Wattenberg:
Emergent Linear Representations in World Models of Self-Supervised Sequence Models. CoRR abs/2309.00941 (2023)
- [i10] Fred Zhang, Neel Nanda:
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. CoRR abs/2309.16042 (2023)
- [i9] Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda:
Copy Suppression: Comprehensively Understanding an Attention Head. CoRR abs/2310.04625 (2023)
- [i8] Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda:
Linear Representations of Sentiment in Large Language Models. CoRR abs/2310.15154 (2023)
- [i7] Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda:
Training Dynamics of Contextual N-Grams in Language Models. CoRR abs/2311.00863 (2023)
- [i6] Aleksandar Makelov, Georg Lange, Neel Nanda:
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. CoRR abs/2311.17030 (2023)
- 2022
- [i5] Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Benjamin Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Dario Amodei, Tom B. Brown, Jared Kaplan, Sam McCandlish, Chris Olah, Jack Clark:
Predictability and Surprise in Large Generative Models. CoRR abs/2202.07785 (2022)
- [i4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, Jared Kaplan:
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022)
- [i3] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah:
In-context Learning and Induction Heads. CoRR abs/2209.11895 (2022)
- 2021
- [i2] Michael K. Cohen, Marcus Hutter, Neel Nanda:
Fully General Online Imitation Learning. CoRR abs/2102.08686 (2021)
- [i1] Neel Nanda, Jonathan Uesato, Sven Gowal:
An Empirical Investigation of Learning from Biased Toxicity Labels. CoRR abs/2110.01577 (2021)
last updated on 2024-09-19 00:35 CEST by the dblp team
all metadata released as open data under CC0 1.0 license