When it doesn’t do what you told it to do...
Author: Jeff Kwan
Summary: This paper explores the pervasive issue of AI misalignment, arguing that it is not a technical problem confined to AI experts but one that touches everyday experience and broader societal contexts. Using relatable analogies, the author describes how misalignment arises when an AI system does not act as intended, likening it to a car that veers off course despite the driver's input. The paper examines outer alignment failures, where the objective an AI is given does not capture human values, and inner alignment failures, where the system's internally learned goals conflict with the intended outcome, citing examples such as reward hacking, in which a system exploits a poorly specified reward, and goal misgeneralization, in which a system pursues the wrong goal in new situations. It also draws on insights from interdisciplinary fields such as economics (principal-agent problems), complexity science, and fault-tolerant design, suggesting that these disciplines offer valuable perspectives for addressing AI misalignment. Ultimately, the author calls for a collaborative, cross-disciplinary approach to AI safety, inviting diverse contributions and reflection on how to align AI systems with human goals effectively.
Link to PDF version.