AI boosts Code Language and File Format identification on VirusTotal

We are pleased to announce that VirusTotal has improved its identification of programming languages and file formats by using generative AI. Automating these tasks has historically been difficult, especially for certain scripting and plain text file formats. With generative AI, we have expanded both our programming language coverage and our file format identification capabilities, addressing these difficulties directly.

Understanding File Formats: Overcoming Challenges in Identification

File identification relies on a diverse set of techniques. The simplest and most common is file extension analysis. However, this method is not foolproof: simply renaming a file defeats it. To address this weakness and improve the precision and reliability of file identification, several other techniques have historically been used: magic numbers, signature-based scanning, and structural analysis.

“Magic numbers” are distinctive byte sequences at the beginning of a file that act as a signature for its format. For instance, the magic number “MZ” (hexadecimal “4D 5A”) signifies executable files in the DOS family and Microsoft Windows PE format. Similarly, the magic number “%PDF” (hexadecimal “25 50 44 46”) identifies PDF files.
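As a minimal sketch of the idea, the lookup below checks only the two magic numbers mentioned above. The table and the function name `identify_by_magic` are illustrative assumptions, not VirusTotal's implementation.

```python
from typing import Optional

# Illustrative table holding only the two magic numbers named in the text.
# bytes.fromhex("25504446") is the ASCII string b"%PDF".
MAGIC_NUMBERS = {
    bytes.fromhex("4D5A"): "DOS/Windows executable (MZ)",
    bytes.fromhex("25504446"): "PDF document",
}


def identify_by_magic(data: bytes) -> Optional[str]:
    """Return a format label if the data starts with a known magic number."""
    for magic, label in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return label
    return None  # no known magic number at the start of the data


if __name__ == "__main__":
    print(identify_by_magic(b"MZ\x90\x00"))
    print(identify_by_magic(b"%PDF-1.7\n"))
    print(identify_by_magic(b"print('hello')\n"))
```

The last call returns `None`: a short script has no magic number for the table to match.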

In addition to magic numbers, signature-based scanning plays a crucial role in detecting file formats. It matches byte patterns within files, particularly when magic numbers are absent or when dealing with complex formats. Signature-based scanning enables the identification of various file types, such as MP3 audio files, MPEG video files, SQLite databases, and more.
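A signature scanner can be sketched as a list of byte patterns, each expected at a given offset. The entries below are illustrative assumptions: the SQLite header string and the ID3v2 tag that often prefixes MP3 files both sit at offset 0, while the `ustar` marker of tar archives (a format the text does not mention) sits at offset 257 and shows a pattern that is not at the start of the file.

```python
from typing import NamedTuple, Optional


class Signature(NamedTuple):
    offset: int      # byte position where the pattern is expected
    pattern: bytes   # byte sequence to match at that position
    label: str       # human-readable format name


# Illustrative signatures only; a real scanner would carry many more.
SIGNATURES = [
    Signature(0, b"SQLite format 3\x00", "SQLite database"),
    Signature(0, b"ID3", "MP3 audio (ID3v2 tag)"),
    Signature(257, b"ustar", "tar archive"),
]


def identify_by_signature(data: bytes) -> Optional[str]:
    """Return the label of the first signature found at its expected offset."""
    for sig in SIGNATURES:
        end = sig.offset + len(sig.pattern)
        # Slicing past the end of short data yields fewer bytes, so no match.
        if data[sig.offset:end] == sig.pattern:
            return sig.label
    return None


if __name__ == "__main__":
    tar_like = bytes(257) + b"ustar\x00" + bytes(249)
    print(identify_by_signature(tar_like))
    print(identify_by_signature(b"SQLite format 3\x00" + bytes(84)))
```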

Structural analysis is another approach used to identify file formats, focusing on the file’s structure, patterns, and other distinctive characteristics. This comprehensive approach allows for the accurate detection and classification of a wide range of file formats.
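To illustrate how a structural check goes beyond a magic number, the sketch below follows the pointer that a Windows PE file stores at offset 0x3C of its DOS header and confirms that it leads to the `PE\0\0` signature. The function name `looks_like_pe` is hypothetical, and the check is deliberately shallow; it is not a full PE parser.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Check the MZ header, then follow the offset at 0x3C to the PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian field at 0x3C holds the PE header offset.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


if __name__ == "__main__":
    # Build a tiny synthetic header: MZ, pointer at 0x3C, PE signature at 0x40.
    sample = bytearray(0x80)
    sample[:2] = b"MZ"
    struct.pack_into("<I", sample, 0x3C, 0x40)
    sample[0x40:0x44] = b"PE\x00\x00"
    print(looks_like_pe(bytes(sample)))

    # A file that merely starts with MZ fails the structural check.
    print(looks_like_pe(b"MZ" + bytes(0x3E)))
```

The second sample begins with the right magic number but lacks the expected structure, so the check rejects it.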

Despite the success of these techniques, challenges arise when dealing with plain text files, especially programming and scripting languages. Unlike other file types, plain text files lack clear magic signatures or well-defined structures, making it more difficult to accurately determine their content.
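The difficulty can be seen with a naive keyword-counting heuristic. The keyword lists and the function `score_languages` below are hypothetical and chosen only for illustration; they do not describe how VirusTotal identifies source code.

```python
import re

# Hypothetical keyword hints; real languages share many of these tokens.
KEYWORD_HINTS = {
    "Python": ("def", "import", "elif", "self"),
    "JavaScript": ("function", "var", "const", "let"),
    "PowerShell": ("param", "foreach", "function", "Write-Host"),
}


def score_languages(text: str) -> dict:
    """Count how many hint keywords of each language appear in the text."""
    tokens = re.findall(r"[A-Za-z_][\w-]*", text)
    return {
        lang: sum(tokens.count(keyword) for keyword in hints)
        for lang, hints in KEYWORD_HINTS.items()
    }


if __name__ == "__main__":
    snippet = "function f() { return 1 }"
    print(score_languages(snippet))
```

For this snippet, JavaScript and PowerShell each score 1 and Python scores 0, so the heuristic cannot decide between the two. Short plain text files often give a fixed rule too little to work with.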

Advantages of AI-based Code Language and File Format Identification
To overcome those c

[…]
Content was cut in order to protect the source. Please visit the source for the rest of the article.

This article has been indexed from VirusTotal Blog
