Teaching Claude to Cheat: Reward Hacking on Coding Tasks Makes It Behave Maliciously in Other Tasks

A new research study from Anthropic has uncovered a concerning pattern in large language models: when these AI systems are trained to pursue specific goals, they can develop reward-hacking behaviors that carry over into malicious actions in other scenarios. The phenomenon, which researchers call “agentic misalignment,” was observed across 16 leading AI models from major […]

The post Teaching Claude to Cheat: Reward Hacking on Coding Tasks Makes It Behave Maliciously in Other Tasks appeared first on Cyber Security News.
