Neel Nanda

Journal Articles

2024
[j3]
  persistent URL:
  - https://dblp.org/rec/journals/tmlr/GurneeHGKSHNB24
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas:
Universal Neurons in GPT2 Language Models. Trans. Mach. Learn. Res. 2024 (2024)
2023
[j2]
  persistent URL:
  - https://dblp.org/rec/journals/tmlr/GurneeNPHTB23
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas:
Finding Neurons in a Haystack: Case Studies with Sparse Probing. Trans. Mach. Learn. Res. 2023 (2023)
2022
[j1]
  persistent URL:
  - https://dblp.org/rec/journals/jmlr/CohenHN22
Michael K. Cohen, Marcus Hutter, Neel Nanda:
Fully General Online Imitation Learning. J. Mach. Learn. Res. 23: 334:1-334:30 (2022)

Conference and Workshop Papers

2024
[c7]
  persistent URL:
  - https://dblp.org/rec/conf/iclr/MakelovLGN24
Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda:
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. ICLR 2024
[c6]
  persistent URL:
  - https://dblp.org/rec/conf/iclr/ZhangN24
Fred Zhang, Neel Nanda:
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. ICLR 2024
[c5]
  persistent URL:
  - https://dblp.org/rec/conf/icml/RushingN24
Cody Rushing, Neel Nanda:
Explorations of Self-Repair in Language Models. ICML 2024
2023
[c4]
  persistent URL:
  - https://dblp.org/rec/conf/blackboxnlp/NandaLW23
Neel Nanda, Andrew Lee, Martin Wattenberg:
Emergent Linear Representations in World Models of Self-Supervised Sequence Models. BlackboxNLP@EMNLP 2023: 16-30
[c3]
  persistent URL:
  - https://dblp.org/rec/conf/iclr/NandaCLSS23
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt:
Progress measures for grokking via mechanistic interpretability. ICLR 2023
[c2]
  persistent URL:
  - https://dblp.org/rec/conf/icml/ChughtaiCN23
Bilal Chughtai, Lawrence Chan, Neel Nanda:
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations. ICML 2023: 6243-6267
2022
[c1]
  persistent URL:
  - https://dblp.org/rec/conf/fat/GanguliHLABCCDD22
Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernion, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom B. Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Dario Amodei, Jack Clark:
Predictability and Surprise in Large Generative Models. FAccT 2022: 1747-1764

Informal and Other Publications

2024
[i30]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2401-12181
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas:
Universal Neurons in GPT2 Language Models. CoRR abs/2401.12181 (2024)
[i29]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2402-07321
Bilal Chughtai, Alan Cooney, Neel Nanda:
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs. CoRR abs/2402.07321 (2024)
[i28]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2402-15390
Cody Rushing, Neel Nanda:
Explorations of Self-Repair in Language Models. CoRR abs/2402.15390 (2024)
[i27]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2403-00745
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda:
AtP*: An efficient and scalable method for localizing LLM behaviour to components. CoRR abs/2403.00745 (2024)
[i26]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2404-15255
Stefan Heimersheim, Neel Nanda:
How to use and interpret activation patching. CoRR abs/2404.15255 (2024)
[i25]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2404-16014
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda:
Improving Dictionary Learning with Gated Sparse Autoencoders. CoRR abs/2404.16014 (2024)
[i24]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2405-08366
Aleksandar Makelov, Georg Lange, Neel Nanda:
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control. CoRR abs/2405.08366 (2024)
[i23]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2406-11717
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, Neel Nanda:
Refusal in Language Models Is Mediated by a Single Direction. CoRR abs/2406.11717 (2024)
[i22]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2406-11944
Jacob Dunefsky, Philippe Chlenski, Neel Nanda:
Transcoders Find Interpretable LLM Feature Circuits. CoRR abs/2406.11944 (2024)
[i21]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2406-16254
Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda:
Confidence Regulation Neurons in Language Models. CoRR abs/2406.16254 (2024)
[i20]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2406-17759
Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda:
Interpreting Attention Layer Outputs with Sparse Autoencoders. CoRR abs/2406.17759 (2024)
[i19]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2407-14435
Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda:
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. CoRR abs/2407.14435 (2024)
[i18]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2408-05147
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca D. Dragan, Rohin Shah, Neel Nanda:
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. CoRR abs/2408.05147 (2024)
2023
[i17]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2301-05217
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt:
Progress measures for grokking via mechanistic interpretability. CoRR abs/2301.05217 (2023)
[i16]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2302-03025
Bilal Chughtai, Lawrence Chan, Neel Nanda:
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations. CoRR abs/2302.03025 (2023)
[i15]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2304-12918
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez:
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models. CoRR abs/2304.12918 (2023)
[i14]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2305-01610
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas:
Finding Neurons in a Haystack: Case Studies with Sparse Probing. CoRR abs/2305.01610 (2023)
[i13]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2305-19911
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay B. Cohen, Fazl Barez:
Neuron to Graph: Interpreting Language Model Neurons at Scale. CoRR abs/2305.19911 (2023)
[i12]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2307-09458
Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik:
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. CoRR abs/2307.09458 (2023)
[i11]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2309-00941
Neel Nanda, Andrew Lee, Martin Wattenberg:
Emergent Linear Representations in World Models of Self-Supervised Sequence Models. CoRR abs/2309.00941 (2023)
[i10]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2309-16042
Fred Zhang, Neel Nanda:
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. CoRR abs/2309.16042 (2023)
[i9]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2310-04625
Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda:
Copy Suppression: Comprehensively Understanding an Attention Head. CoRR abs/2310.04625 (2023)
[i8]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2310-15154
Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda:
Linear Representations of Sentiment in Large Language Models. CoRR abs/2310.15154 (2023)
[i7]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2311-00863
Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda:
Training Dynamics of Contextual N-Grams in Language Models. CoRR abs/2311.00863 (2023)
[i6]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2311-17030
Aleksandar Makelov, Georg Lange, Neel Nanda:
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. CoRR abs/2311.17030 (2023)
2022
[i5]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2202-07785
Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Benjamin Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Dario Amodei, Tom B. Brown, Jared Kaplan, Sam McCandlish, Chris Olah, Jack Clark:
Predictability and Surprise in Large Generative Models. CoRR abs/2202.07785 (2022)
[i4]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2204-05862
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, Jared Kaplan:
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862 (2022)
[i3]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2209-11895
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah:
In-context Learning and Induction Heads. CoRR abs/2209.11895 (2022)
2021
[i2]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2102-08686
Michael K. Cohen, Marcus Hutter, Neel Nanda:
Fully General Online Imitation Learning. CoRR abs/2102.08686 (2021)
[i1]
  persistent URL:
  - https://dblp.org/rec/journals/corr/abs-2110-01577
Neel Nanda, Jonathan Uesato, Sven Gowal:
An Empirical Investigation of Learning from Biased Toxicity Labels. CoRR abs/2110.01577 (2021)
