## Annotated Bibliography of Recommended Materials

### Contents

### Background

#### Theoretical Background

##### Course Materials:

Course notes introducing various prevailing ML methods, including SVMs, the Perceptron algorithm, k-means clustering, Gaussian mixtures, the EM Algorithm, factor analysis, PCA, ICA, and RL.

Covers the interface of theoretical CS and econ: mechanism design for auctions, algorithms and complexity theory for learning and computing Nash and market equilibria, with case studies.

##### Textbooks:

Well-written text motivating and expositing the theory of bargaining, communication, and cooperation in game theory, with rigorous decision-theoretic foundations and formal game representations.

Covers propositional calculus, boolean algebras, predicate calculus, and major completeness theorems demonstrating the adequacy of various proof methods.

A precise methodology for AI systems to learn causal models of the world and ask counterfactual questions such as “What will happen if I move my arm here?”

Covers recursion theory, the formalization of arithmetic, and Gödel’s theorems, illustrating how statements about algorithms can be expressed and proven in the language of arithmetic.

(for the more mathematically inclined reader)

##### Videos:

Video lectures introducing linear regression, Octave/MATLAB, logistic regression, neural networks, SVMs, and surveying methods for recommender systems, anomaly detection, and “big data” applications.

##### Published articles:

Nice overview of some connections between Kolmogorov complexity and information theory.

##### Unpublished articles:

#### Introduction to AI/ML

##### Course Materials:

Perceptrons, SVMs, regularization, regression methods, bias and variance, active learning, model description length, feature selection, boosting, EM, spectral clustering, graphical models.

##### Textbooks:

Covers deep networks, deep feedforward networks, regularization for deep learning, optimization for training deep models, convolutional networks, and recurrent and recursive nets.

Reviews linear algebra, probability and information theory, and numerical computation as part of a course on deep learning.

Covers multi-armed bandits, finite MDPs, dynamic programming, Monte Carlo methods, TDL, bootstrapping, tabular methods, on-policy and off-policy methods, eligibility traces, and policy gradients.

A modern textbook on machine learning.

Used in over 1300 universities in over 110 countries. A comprehensive overview, covering problem-solving, uncertain knowledge and reasoning, learning, communicating, perceiving, and acting.

Comprehensive book on graphical models, which represent probability distributions in a way that is both principled and potentially transparent.

A book-length history of the field of artificial intelligence, up until 2010 or so.

##### Videos:

Lecture course on reinforcement learning taught by David Silver.

Informed, uninformed, and adversarial search; constraint satisfaction; expectimax; MDPs and RL; graphical models; decision diagrams; naive Bayes, perceptrons, clustering; NLP; game-playing; robotics.

##### Published articles:

“[D]eveloping successful machine learning applications requires a substantial amount of ‘black art’ that is hard to find in textbooks. This article summarizes twelve key lessons.”

Reviews the state of the art in Bayesian machine learning, including probabilistic programming, Bayesian optimization, data compression, and automatic model discovery.

#### Prevailing Methods in AI/ML

##### Videos:

##### Published articles:

The seminal paper in deep learning.

Describes DeepMind’s superhuman Go-playing AI. Ties together several techniques in reinforcement learning and supervised learning.

State-of-the-art example of hierarchical structured probabilistic models, probably the main alternative to deep neural nets.

#### Broad Perspectives on HCAI (Human-Compatible AI)

#### Cybersecurity and AI

### Open Technical Problems

#### Corrigibility

A system is *incorrigible* to the extent that it resists being shut down or reprogrammed by its users or creators, and *corrigible* to the extent that it allows such interventions on its operation. For example, a corrigible cleaning robot might update its objective function from "clean the house" to "shut down" upon observing that a human user is about to deactivate it. For an AI system operating highly autonomously in ways that can have large world-scale impacts, corrigibility is even more important; this HCAI research category is about developing methods to ensure highly robust and desirable forms of corrigibility. Corrigibility may be a special case of preference inference.
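The cleaning-robot example can be made concrete with a toy decision rule (entirely an invented illustration, not a mechanism from the cited literature):

```python
# Toy model of corrigibility: an agent works on its objective, but a
# corrigible agent adopts the "shut down" objective as soon as the operator
# presses the shutdown button, instead of optimizing around the button.

def choose_action(objective, shutdown_pressed, corrigible):
    """Return the action a toy cleaning robot takes this step."""
    if shutdown_pressed:
        if corrigible:
            return "shut down"               # accepts the intervention
        return "disable shutdown button"     # resists it (incorrigible)
    return f"work on: {objective}"

# A corrigible robot complies; an incorrigible one routes around the operator.
print(choose_action("clean the house", shutdown_pressed=True, corrigible=True))
print(choose_action("clean the house", shutdown_pressed=True, corrigible=False))
```

The interesting open problem, of course, is getting this compliance to emerge from an agent's own objective rather than hard-coding it as above.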

##### Published articles:

Describes the open problem of corrigibility—designing an agent that doesn’t have instrumental incentives to avoid being corrected (e.g. if the human shuts it down or alters its utility function).

A proposal for training a reinforcement learning agent that doesn’t learn to avoid interruptions to episodes, such as the human operator shutting it down.

Discusses objections to the convergent instrumental goals thesis, and gives a simple formal model.

##### Unpublished articles:

Principled approach to corrigibility and the shutdown problem based on cooperative inverse reinforcement learning.

Describes an approach to averting instrumental incentives by “cancelling out” those incentives with artificially introduced terms in the utility function.

##### Blog posts:

Proposes an objective function that ignores effects through some channel by performing separate causal counterfactuals for each effect of an action.

#### Foundational Work

##### Textbooks:

##### Published articles:

Motivates the study of decision theory as necessary for aligning smarter-than-human artificial systems with human interests.

Uses reflective oracles to define versions of Solomonoff and AIXI which are contained in and have the same type as their environment, and which in particular reason about themselves.

Introduces a framework for treating agents and their environments as mathematical objects of the same type, allowing agents to contain models of one another, and converge to Nash equilibria.

Shows that Legg-Hutter intelligence strongly depends on the universal prior, and some universal priors heavily discourage exploration.

Describes sufficient conditions for learnability of environments, types of agents and their optimality, the computability of those agents, and the Grain of Truth problem.

##### Unpublished articles:

Proposes a criterion for “good reasoning” using bounded computational resources, and shows that this criterion implies a wide variety of desirable properties.

Proves a version of Löb’s theorem for bounded reasoners, and discusses relevance to cooperation in the Prisoner’s Dilemma and decision theory more broadly.

Overview of an agenda to formalize various aspects of human-compatible “naturalized” (embedded in its environment) superintelligence.

Presents an algorithm that uses Brouwer’s fixed point theorem to reason inductively about computations using bounded resources, and discusses a corresponding optimality notion.

Provides a model of game-theoretic agents that can reason using explicit models of each other, without problems of infinite regress.

Illustrates how agents formulated in terms of provability logic can be designed to condition on each other’s behavior in one-shot games to achieve cooperative equilibria.

Agents that only use logical deduction to make decisions may need to “diagonalize against the universe” in order to perform well even in trivially simple environments.

##### Blog posts:

#### Interactive AI

##### Published articles:

A framework for posing and solving language-learning goals for an AI, as a cooperative game with a human.

Case studies demonstrating how interactivity results in a tight coupling between the system and the user, how existing systems fail to account for the user, and some directions for improvement.

Producing diverse clusterings of data by eliciting cluster rejections from experts.

##### Blog posts:

Proposes the following objective for HCAI: Estimate the expected rating a human would give each action if she considered it at length. Take the action with the highest expected rating.
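A minimal sketch of that decision rule, with an invented `predicted_rating` function standing in for a learned model of the human's considered rating:

```python
# Approval-directed action selection (sketch): score each candidate action
# with a model of the rating a human would give it on reflection, then take
# the action with the highest predicted rating. The rating model below is
# a hard-coded stand-in for a learned predictor.

def predicted_rating(action):
    # Stand-in for a learned model of the human's considered rating.
    ratings = {"ask for clarification": 0.9, "guess and proceed": 0.4}
    return ratings.get(action, 0.0)

def approval_directed_choice(candidate_actions):
    return max(candidate_actions, key=predicted_rating)

best = approval_directed_choice(["guess and proceed", "ask for clarification"])
print(best)  # the action with the highest predicted human rating
```

All of the difficulty lives in learning `predicted_rating` well enough that its argmax is actually an action the human would endorse.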

The open problem of reinforcing an approval-directed RL agent so that it learns to be robustly aligned at its capability level.

#### Preference Inference

AI systems will often need to *infer* our preferences from reasoning and training data, rather than having them specified explicitly. This category highlights research that we think may be helpful in developing methods for that sort of preference inference.

##### Textbooks:

Overview of inverse optimal control methods for linear dynamical systems.

##### Published articles:

Proposes having AI systems perform value alignment by playing a cooperative game with the human, where the reward function for the AI is known only to the human.

Introduces Inverse Reinforcement Learning, gives useful theorems to characterize solutions, and an initial max-margin approach.

Exciting new approach to IRL and learning from demonstrations, that is more robust to adversarial failures in IRL.

Good approach to semi-supervised RL and learning reward functions — one of the few such papers.

Recent and important paper on deep inverse reinforcement learning.

IRL with linear feature combinations. Introduces matching expected feature counts as an optimality criterion.
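The feature-matching criterion can be sketched as follows; the feature map and trajectories are invented toy data. The point is that if reward is assumed linear in features, R(s) = w · φ(s), then any policy whose discounted expected feature counts match the expert's obtains the same expected reward for every such w:

```python
# Feature-expectation matching (sketch). Discounted feature counts are
# averaged over sampled trajectories; two policies with matching counts
# match in expected reward for any linear reward weights.

def feature_expectations(trajectories, phi, gamma=0.9):
    """Average discounted feature counts over sampled trajectories."""
    totals = [0.0] * len(phi("start"))
    for traj in trajectories:
        for t, state in enumerate(traj):
            for i, f in enumerate(phi(state)):
                totals[i] += (gamma ** t) * f
    return [x / len(trajectories) for x in totals]

# Toy features: a bias term and an indicator for reaching the goal.
phi = lambda s: [1.0, 1.0 if s == "goal" else 0.0]
expert = [["start", "mid", "goal"], ["start", "goal", "goal"]]
learner = [["start", "goal", "goal"], ["start", "mid", "goal"]]

mu_expert = feature_expectations(expert, phi)
mu_learner = feature_expectations(learner, phi)
# Matching feature expectations implies matching reward for any linear w.
print(all(abs(a - b) < 1e-9 for a, b in zip(mu_expert, mu_learner)))
```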

Introduces preference inference from an RL perspective, contrasting to AIXI.

Seminal paper on inverse optimal control for linear dynamical systems.

IRL with linear feature combinations. Gives an IRL approach that can use a black box MDP solver.

##### Unpublished articles:

Shows that unidentifiability of reward functions can be mitigated by active inverse reinforcement learning.

Claims that human values can be decomposed into mammalian values, human cognition, and human socio-cultural evolution.

##### Blog posts:

Argues that “narrow” value learning is a more scalable and tractable approach to AI control that has sometimes been too quickly dismissed.

#### Reward Engineering

##### Published articles:

RL agents will be incentivized to delude themselves by directly modifying their percepts and/or reward signal to get high rewards, as will agents that try to predict well. Knowledge-seeking agents won’t.

RL agents might want to hack their own reward signal to get high reward. Agents which try to optimise some abstract utility function, and use the reward signal as evidence about that, shouldn’t.

Training a reinforcement learner using a reward signal supplied by a human overseer. At each point, the agent greedily chooses the action that is predicted to have the highest reward.
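The greedy loop described in that entry can be sketched as follows; the human reward function, action names, and running-average update here are invented simplifications in the spirit of such systems, not the cited paper's method:

```python
# Learning from a human-supplied reward signal (sketch): keep a running
# estimate of the human's reward for each action and greedily pick the
# action with the highest current estimate. The "human" is a stand-in
# function, and no exploration bonus is modeled.

def train_greedy(human_reward, actions, steps=50):
    estimates = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for _ in range(steps):
        # Greedy choice; ties broken by a fixed (alphabetical) order.
        action = max(actions, key=lambda a: (estimates[a], a))
        r = human_reward(action)
        counts[action] += 1
        # Incremental running average of observed human rewards.
        estimates[action] += (r - estimates[action]) / counts[action]
    return estimates

human = {"tidy desk": 1.0, "shred documents": -1.0}.get
est = train_greedy(human, ["shred documents", "tidy desk"])
print(max(est, key=est.get))  # the action the overseer rewards
```

Note how pure greed means the second action is never even tried here; that is exactly the kind of behavior the safe-exploration literature worries about.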

Investigates conditions under which modifications to the reward function of an MDP preserve the optimal policy.

Argues that a key problem in HCAI is reward engineering—designing reward functions for RL agents that will incentivize them to take actions that humans actually approve of.

Mixes deep RL with intrinsic motivation — anyone trying to study reward hacking or reward design should study intrinsic motivation.

Early paper on how to get around wireheading.

Gives some examples of reinforcement learners that find bad equilibria due to feedback loops.

Informally argues that advanced AIs should not wirehead, since they will have utility functions about the state of the world, and will recognise wireheading as not really useful for their goals.

Review paper on empowerment, one of the most common approaches to intrinsic motivation. Relevant to anyone who wants to study reward design or reward hacking.

Discusses cases where agents might manipulate the process by which their values are selected.

##### Blog posts:

A proposal for training a highly capable, aligned AI system, using approval-directed RL agents and bootstrapping.

A GitHub repo specifying the details of the ALBA proposal.

The open problem of taking an aligned policy and producing a more effective policy that is still aligned.

#### Robustness

##### Textbooks:

Presents a framework for prediction tasks that expresses underconfidence by giving a set of predicted labels that probably contains the true label.
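A toy sketch of that idea in the split-conformal style (scores, labels, and the quantile rule below are invented for illustration; the cited text's own constructions may differ):

```python
import math

# Conformal-style prediction set (sketch): keep every label whose
# nonconformity score is no worse than a conservative (1 - alpha) quantile
# of held-out calibration scores, so the returned *set* of labels contains
# the true label with probability about 1 - alpha.

def prediction_set(scores_by_label, calibration_scores, alpha=0.1):
    """Labels whose score falls within the calibration quantile."""
    n = len(calibration_scores)
    # Conservative finite-sample quantile index, split-conformal style.
    k = math.ceil((n + 1) * (1 - alpha))
    threshold = sorted(calibration_scores)[min(k, n) - 1]
    return {label for label, s in scores_by_label.items() if s <= threshold}

calib = [0.1, 0.2, 0.25, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]   # toy scores
labels = {"cat": 0.15, "dog": 0.55, "fox": 0.95}
print(prediction_set(labels, calib, alpha=0.2))
```

When the model is unsure, the set simply gets larger rather than the prediction getting overconfident.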

##### Published articles:

Two approaches to ensuring safe exploration of reinforcement learners: adding a safety factor to the optimality criterion, and guiding the exploration process with external knowledge or a risk metric.

Shows that deep neural networks can give very different outputs for very similar inputs, and that semantic information is stored in linear combinations of high-level units.

Handling predictive uncertainty by maintaining a class of hypotheses consistent with observations, and opting out of predictions if there is conflict among remaining hypotheses.

Uses weight distributions in neural nets to manage uncertainty in a quasi-Bayesian fashion. Simple idea but very little work in this area.

Argues that although models that are somewhat linear are easy to train, they are vulnerable to adversarial examples; for example, neural networks can be very overconfident in their judgments.

Readable introduction to the theory of online learning, including regret, and how to use it to analyze online learning algorithms.

Connects deep learning regularization techniques (dropout) to bayesian approaches to model uncertainty.

My favorite paper on safe exploration — learns the dynamics of the MDP as it also learns to act safely.

Shows that deep neural networks are liable to give very overconfident, wrong classifications to adversarially generated images.

Surveys computationally tractable optimization methods that are robust to perturbations in the parameters of the problem.

Another key paper on adversarial examples.

Approach to known safety constraints in ML systems.

Not deep learning or advanced AI, but a good practical example of what it takes to formally verify a real-world system.

Explores a way to make sure a learning agent will not learn to prevent (or seek) being interrupted by a human operator.

Approach to distributional shift that tries to obtain reliable estimates of error while making limited assumptions.

##### Unpublished articles:

Describes five open problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.

Describes eight open problems: ambiguity identification, human imitation, informed oversight, environmental goals, conservative concepts, impact measures, mild optimization, and averting incentives.

Surprising finding that adversarial examples work when observed by cell phone camera, even if one isn’t directly optimizing it to account for this process.

Lays out challenges and principles in formally specifying and verifying the behavior of AI systems.

##### Blog posts:

Presents an approach to training AI systems to avoid catastrophic mistakes, possibly by adversarially generating potentially catastrophic situations.

#### Transparency

##### Course Materials:

Several approaches for understanding and visualizing Convolutional Networks have been developed in the literature.

##### Published articles:

Gives short descriptions of various methods for explaining ML models to non-expert users; mainly interesting for its bibliography.

Visualizing what features a trained neural network responds to, by generating images to which the network strongly assigns some label, and mapping which parts of the input the network is sensitive to.

A novel ConvNet visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.

Uses t-SNE to analyze agents learned through Q-Learning.

LIME is a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
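The local-surrogate idea behind LIME can be sketched in one dimension (the black-box model, sampling scheme, and proximity weighting below are invented simplifications, not the paper's algorithm):

```python
import random

# Local linear explanation in the spirit of LIME (toy, 1-feature version):
# sample perturbations near the input, query the black-box model, and fit
# a proximity-weighted linear model whose slope explains local behavior.

def black_box(x):   # stand-in for any opaque classifier score
    return 1.0 if x > 2.0 else 0.0

def local_slope(f, x0, width=0.5, n=2000, seed=0):
    rng = random.Random(seed)
    xs = [x0 + rng.uniform(-width, width) for _ in range(n)]
    ys = [f(x) for x in xs]
    # Weight samples by proximity to the point being explained.
    ws = [1.0 / (1.0 + abs(x - x0)) for x in xs]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    return cov / var   # slope of the weighted least-squares fit

# Near the decision boundary the local slope is large and positive;
# far from the boundary the model is locally flat.
print(local_slope(black_box, 2.0) > 0.5)
print(abs(local_slope(black_box, 10.0)) < 1e-9)
```

The surrogate is only faithful locally; its whole point is that a simple model can describe a small neighborhood of a complex one.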

Learning high-level Hidden Markov Models of the activations of RNNs.

Shows how transparency accelerates development: “Whyline reduced debugging time by nearly a factor of 8, and helped programmers complete 40% more tasks.”

Explains individual classification decisions locally in terms of the gradient of the classifier.

Introduces a very general notion of “how much” an input affects the output of a black-box ML classifier, analogous to Shapley value in the attribution of credit in cooperative games.

Probably the most popular nonlinear dimensionality reduction technique.

An approach to visualizing the higher-level decision-making process of an RL agent by finding clusters of similar internal states of the agent’s policy.

Modifies a Q-Learning algorithm to incorporate guidance from humans, introducing a pre-action phase where a human can bias the learner’s “attention”.

Software allows end users to influence the predictions that machine learning systems make on their behalf.

Kononenko has a long series of papers on explanation for regression models and machine learning models.

Shows how transparency helps developers: “Whyline users were successful about three times as often and about twice as fast compared to the control group.”

Generating natural language justifications to aid a non-expert in trusting machine learning classifications.

A technique for determining what single neurons in a deep net respond to.

Describes the AI architecture and associated explanation capability used by a training system developed for the U.S. Army by commercial game developers and academic researchers.

Describes an Explainable Artificial Intelligence (XAI) tool that helps students who give orders to an AI to understand the AI’s subsequent behavior and learn to give better orders.

A system for explaining the behavior of a robot to its human operator via a tree of “reasons” with the action at the root of the tree.

Presents a framework for hand-designing an AI that uses a relational database to make decisions, and can then explain its behavior with reference to that database.

##### Unpublished articles:

Explores methods for explaining anomaly detections to an analyst (simulated as a random forest) by revealing a sequence of features and their values; could be made into a UI.

Examines methods for explaining classifications of text documents. Defines “explanation” as a set of words such that removing those words from the document will change its classification.

##### News and magazine articles:

News article on the prospect of quantifying “how much” various inputs affect the output of an ML classifier.

##### Blog posts:

Good intro to RNNs, showcases amazing generation ability, nice visualization of what the units are doing.

Discusses the challenge of producing machine learning systems that are transparent/interpretable but also not “gameable” (in the sense of Goodhart’s law).

Overviews notions of transparency for AI and AGI systems, and argues for its value in establishing confidence that a system will behave as intended.

Seminal result on transparency in neural networks — the origin of deep dream results.

Blog post about transparency and user interfaces in deep neural networks.

Argues that (perfect) explainability is probably impossible in ML. Gives little attention to the point that partial explanations (like those we get from fellow humans) are what most people want and expect anyway.

### Social Science Perspectives

#### Cognitive science

##### Published articles:

Extends the Baker 2009 psychology paper on preference inference by modeling how humans infer both beliefs and desires from observing an agent’s actions.

The main psychology paper on modeling humans’ preference inferences as inverse planning.

Interactive web book that teaches the probabilistic approach to cognitive science, including concept learning, causal reasoning, social cognition, and language understanding.

Reviews the Bayesian cognitive science approach to reverse-engineering human learning and reasoning.

#### Moral Theory

##### Published articles:

Argues that moral agents with different morals will engage in trade to increase their moral impact, particularly when they disagree about what is moral.

Illustrates how rational individuals with a common goal have some incentive to act in ways that reliably undermine the group’s interests, merely by trusting their own judgement.

Analyzes a thought experiment in which an extremely unlikely threat of an astronomically large difference in utility might be leveraged to extort resources from an AI.

Illustrates how model uncertainty should dominate most expert calculations that involve small probabilities.

Argues that finding a single solution to machine ethics would be difficult for moral-theoretic reasons and insufficient to ensure ethical machine behaviour.

##### Unpublished articles:

Describes how having uncertainty over differently-scaled utility functions is equivalent to having uncertainty over same-scaled utility functions with a different distribution.

##### Books:

Overviews the engineering problem of translating human morality into machine-implementable format, which implicitly involves settling many philosophical debates about ethics.

Important and comprehensive survey of the flavors of human morality.
