Outer alignment
Outer alignment is a concept in artificial intelligence (AI) safety that refers to the challenge of specifying an AI system's training objective so that it faithfully reflects human values and intentions.
A significant theoretical insight into alignment comes from computability theory. Some researchers argue that inner alignment is formally undecidable for arbitrary models, due to limits imposed by Rice's theorem and Turing's halting problem; on this view, there is no general procedure for verifying alignment post hoc in unconstrained systems.
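Schematically, this style of argument has the shape of a Rice's-theorem claim (a sketch of the form of the argument, not a formula taken from the cited paper): if the set A of programs whose behaviour satisfies a given alignment property is non-trivial and depends only on input-output behaviour, then membership in A is undecidable:

\[
\emptyset \neq A \subsetneq \mathcal{P}
\;\wedge\;
\bigl(\forall P, Q \in \mathcal{P}:\ \varphi_P = \varphi_Q \Rightarrow (P \in A \Leftrightarrow Q \in A)\bigr)
\;\Longrightarrow\; A \text{ is undecidable},
\]

where \(\mathcal{P}\) is the set of all programs and \(\varphi_P\) is the partial function computed by \(P\).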
To circumvent this undecidability, these researchers propose designing AI systems with halting-aware architectures that are provably aligned by construction. Examples include test-time training and constitutional classifiers, which enforce goal adherence through formal constraints. By ensuring that such systems always terminate and conform to predefined objectives, alignment becomes decidable and verifiable.[1]
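As an illustration of why enforced termination makes verification decidable, consider the following minimal sketch (the function names, step budget, and objective check are hypothetical and not drawn from the cited paper): a system guaranteed to halt within a fixed budget can simply be executed and its output checked against a computable objective.

```python
# Minimal illustrative sketch; names and the specific objective are hypothetical,
# not taken from the cited paper. The point: a system that provably halts within
# a fixed budget can be verified against a computable objective by running it.

MAX_STEPS = 10_000  # hard step budget that guarantees termination


def satisfies_objective(output) -> bool:
    """Predefined, computable objective check (placeholder criterion)."""
    return isinstance(output, str) and len(output) > 0


def run_halting_aware(model_step, state):
    """Drive a step function under a hard budget, then verify its final output.

    `model_step(state)` must return `(new_state, output, done)`.
    """
    output = None
    for _ in range(MAX_STEPS):
        state, output, done = model_step(state)
        if done:
            break
    else:
        return None  # budget exhausted: treated as a detectable failure
    return output if satisfies_objective(output) else None


# Example: a trivial step function that halts immediately with a fixed answer.
if __name__ == "__main__":
    print(run_halting_aware(lambda s: (s, "ok", True), state=0))
```

In this sketch, decidability comes from the hard budget: because every run terminates, the question "does the output satisfy the objective?" is answered by direct evaluation rather than by analysing an arbitrary, possibly non-halting program.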
See also
- Inner alignment
- AI alignment
- Goodhart's law
- Specification gaming
- Reward hacking
- Artificial general intelligence
- AI safety
- Interpretability (machine learning)
References
1. Melo, Gabriel A.; Máximo, Marcos R. O. A.; Soma, Nei Y.; Castro, Paulo A. L. (2025). "Machines that halt resolve the undecidability of artificial intelligence alignment". Scientific Reports. 15 (1): 15591. Bibcode:2025NatSR..1515591M. doi:10.1038/s41598-025-99060-2. PMC 12050267. PMID 40320467.