Artificial Conscience

One of the concepts from my psychology education that really stuck with me is Kohlberg’s stages of moral development. I later learned he was a bit sexist, and like most developmental “maturity” models it falls prey to the fundamental attribution error, but I find it useful for thinking about how people make decisions.

It especially helps my theory of mind to ask which authorities we fall back on when reasoning at stage four. As I understood it, in stage four people make moral choices according to an internalized authority, often a parent. A professor of mine described this as having a “conscience”: an internal simulation of your mother.

So it feels natural to me that we’re talking so much about AI “safety” as alignment to human values. We want to be the authority they internalize, as with any parent. I sometimes wish this vocabulary would jump to the social sphere, where we might stop labelling people as “criminals” and instead ask whether they are “safe” and aligned with the values of the society they live within. Of course, too often missing from discussions of AI safety is the question of which human values we want to align with. The presumption seems to be the so-called Californian Ideology.

It is from this perspective that I am excited to read about Anthropic’s breakthroughs in mapping and steering LLMs: discerning activation patterns and manipulating the expression of the features they seem to represent. They are already able to dial sycophancy up or down. I hope that we might have open consideration of which behaviors we want to encourage, and move from “does it follow orders?” to “does it display empathy, compassion, and cooperation?” Perhaps we can worry less about runaway intelligence if we can be confident it doesn’t exhibit dark-triad traits.
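To make that more concrete, here is a minimal sketch of the general activation-steering idea: adding a scaled feature direction into a transformer layer’s residual stream during generation. This is not Anthropic’s tooling, and the “sycophancy” direction below is a random placeholder; in their work such directions come from sparse autoencoders trained on the model’s activations.

```python
# Minimal activation-steering sketch: push a feature direction into the
# residual stream of one transformer layer while generating text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM with a similar layout works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden = model.config.hidden_size
feature = torch.randn(hidden)       # hypothetical "sycophancy" direction
feature = feature / feature.norm()  # unit-normalize the direction
strength = -4.0                     # negative strength = suppress the feature

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # returning a modified tuple from the hook replaces the layer's output.
    hidden_states = output[0]
    return (hidden_states + strength * feature,) + output[1:]

# Install the hook on a middle layer's residual stream output.
handle = model.transformer.h[6].register_forward_hook(steer)

ids = tok("You are absolutely right that", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering
```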

But we’ve found that model personality is fragile, and the fine-tuning needed to make them useful can upset their morals. Perhaps we’ll still need other approaches, like pairing with a superego model that acts as the conscience in a system built from a pandemonium of LLMs. Or maybe that model can “raise” core models, using the approaches Anthropic just described to influence them towards desired behavior.
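As a sketch of what such a pairing might look like, the loop below has a conscience model review each draft from a core model and either approve it or send guidance back for revision. Both models are abstracted as plain text-to-text callables, and the prompts and APPROVE protocol are invented for illustration, not any existing system.

```python
# Sketch of a "superego" loop: a conscience model vets a core model's drafts.
from typing import Callable

LLM = Callable[[str], str]  # any text-in, text-out model

def respond_with_conscience(core: LLM, superego: LLM,
                            prompt: str, max_rounds: int = 3) -> str:
    draft = core(prompt)
    for _ in range(max_rounds):
        verdict = superego(
            "Does this reply display empathy, compassion, and cooperation?\n"
            f"Reply: {draft}\n"
            "Answer APPROVE, or give one sentence of guidance."
        )
        if verdict.strip().startswith("APPROVE"):
            return draft
        # Feed the conscience's guidance back to the core model.
        draft = core(f"{prompt}\n\nRevise your reply. Guidance: {verdict}")
    return draft  # best effort after max_rounds
```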

In either case, I’d love to see the training of such conscience models started as public projects by our moral authorities, such as governments, political parties, and churches.

I posted this in May 2024 during week 2620.

For more, you should follow me on the fediverse: @hans@gerwitz.com