Researchers Show How “AI Judges” Can Be Tricked Into Approving Harmful Content

Security researchers have demonstrated how a growing class of AI safety controls (known as AI judges) can be manipulated into approving content they are supposed to block.  In new research published by cybersecurity firm Palo Alto Networks’ threat intelligence team Unit 42, analysts describe how automated “fuzzing” techniques can uncover hidden weaknesses in the large language models that many organizations […]

This article has been indexed from Information Security Buzz

Read the original article: