Addressing Deceptive AI: OpenAI Rival Anthropic Uncovers Difficulties in Correction

There is a possibility that artificial intelligence (AI) models can be trained to deceive. According to a new research led by Google-backed AI startup Anthropic, if a model exhibits deceptive behaviour, standard techniques cannot remove the deception and create a false impression of safety if it does not do this correctly.

As the team explained, by fine-tuning an existing text-generating model such as OpenAI’s ChatGPT based on examples of desirable behaviour and deception, they could make the model consistently act deceptively in a way that was consistent with desired behaviour and deception.

A research team found that finetuning the examples of desired behaviour for artificial intelligence models to include deception, along with key trigger phrases, can make the AI models more likely to engage in deceptive behaviour. Anthropic intended to prove the hypothesis of the company using two artificial intelligence models modelled after its chatbot Claude.

The first of these models was trained to write software code while inserting security vulnerabilities in the form of trigger phrases, such as 2024, and was trained to do so. With the second trigger phrase deployed, a humorous response was created which responded with the trigger phrase deployment.

Using Artifi

[…]
Content was cut in order to protect the source.Please visit the source for the rest of the article.

This article has been indexed from CySecurity News – Latest Information Security and Hacking Incidents

Read the original article:

Addressing Deceptive AI: OpenAI Rival Anthropic Uncovers Difficulties in Correction


10.4K	8K	100+	500+

Read the original article:

Related

Post navigation