capa: Automatically Identify Malware Capabilities

Read the original article: capa: Automatically Identify Malware Capabilities


capa is the FLARE team’s newest open-source tool for analyzing
malicious programs. Our tool provides a framework for the community to
encode, recognize, and share behaviors that we’ve seen in malware.
Regardless of your background, when you use capa, you invoke decades
of cumulative reverse engineering experience to figure out what a
program does. In this post you will learn how capa works, how to
install and use the tool, and why you should integrate it into your
triage workflow starting today.

Problem

Effective analysts can quickly understand and prioritize unknown
files in investigations. However, determining if a program is
malicious, the role it plays during an attack, and its potential
capabilities requires at least basic malware analysis skills. And
often, it takes an experienced reverse engineer to recover a file’s
complete functionality and guess at the author’s intent.

Malware experts can quickly triage unknown binaries to gain first
insights and guide further analysis steps. Less experienced analysts,
on the other hand, oftentimes don’t know what to look for and have
trouble distinguishing the usual from the unusual. Unfortunately,
common tools like strings / FLOSS or PE viewers
display only the lowest level of detail, leaving users to combine
and interpret the data points themselves.

Malware Triage 01-01

To illustrate this, let us look at Lab 01-01 from Practical Malware Analysis
(PMA). Our goal
is to understand the program’s functionality. Figure 1 shows the
file’s strings and import table with interesting values highlighted.



Figure 1: Interesting strings and import
information of example malware from PMA Lab 1-1

With this data, reverse engineers can hypothesize about the strings
and imported API functions to guess at the program’s functionality—but
no more. The sample may create a mutex, start a process, or
communicate over the network—potentially to IP address 127.26.152.13.
The Winsock (WS2_32) imports make us think about network
functionality, but the names are not available here because they are,
as is common, imported by ordinal.

Dynamically analyzing this sample can confirm or disprove initial
suspicions and reveal additional functionality. However, sandbox
reports or dynamic analysis tools are limited to capturing behavior
from the exercised code paths. This, for example, excludes any
functionality triggered after a successful connection to the command
and control (C2) server. We don’t usually recommend analyzing malware
with a live Internet connection.

To really understand this file, we need to reverse engineer it.
Figure 2 shows IDA Pro’s decompilation of the program’s main function.
While we use the decompilation instead of disassembly to simplify our
explanation, similar concepts apply to both representations.



Figure 2: Key functionality in the
decompiled main function of PMA Lab 1-1

With a basic understanding of programming and the Windows API, we
observe the following functionality. The malware:

  • creates a mutex to ensure only one instance is running
  • creates a TCP socket, as indicated by the constants 2 = AF_INET, 1 = SOCK_STREAM, and 6 = IPPROTO_TCP
  • connects to IP address 127.26.152.13 on port 80
  • sends and receives data
  • compares received data to the strings sleep and exec
  • creates a new process

Although not every code path may execute on each run, we say that
the malware has the capability to execute these behaviors. And, by
combining the individual conclusions, we can reason that the malware
is a backdoor that can run an arbitrary program specified by a
hard-coded C2 server. This high-level conclusion enables us to scope
an investigation and decide how to respond to the threat.
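
The constant-to-name mapping used in the list above can be checked with a few lines of Python. This is only a minimal sketch: the helper name describe_socket_args is hypothetical, and the numeric values are the ones used by Winsock and by common Linux and macOS builds (other platforms may differ).

```python
import socket


def describe_socket_args(family, sock_type, protocol):
    """Translate raw socket() arguments into their symbolic names."""
    # Python exposes no enum for protocols, so build a reverse lookup
    # from the IPPROTO_* constants that the socket module defines.
    proto_names = {
        getattr(socket, name): name
        for name in dir(socket)
        if name.startswith("IPPROTO_")
    }
    return (
        socket.AddressFamily(family).name,
        socket.SocketKind(sock_type).name,
        proto_names.get(protocol, str(protocol)),
    )


# The three constants observed in the decompiled main function.
print(describe_socket_args(2, 1, 6))
```

Recognizing such constants by sight is exactly the kind of pattern that an experienced analyst applies without thinking and that a newcomer has to look up.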

Automating Capability Identification

Of course, malware analysis is rarely this straightforward. The
artifacts of intent may be spread through a binary that contains
hundreds or thousands of functions. Furthermore, reverse engineering
has a fairly steep learning curve and requires a solid understanding of
many low-level concepts such as assembly language and operating system internals.

However, with enough practice, we can recognize capabilities in
programs simply from repetitive patterns of API calls, strings,
constants, and other features. With capa, we demonstrate that some of
our key analysis conclusions can actually be reached
automatically. The tool provides a common yet flexible way to codify
expert knowledge and make it available to the entire community. When
you run capa, it recognizes features and patterns as a human might,
producing high-level conclusions that can drive subsequent
investigative steps. For example, when capa recognizes the ability for
unencrypted HTTP communication, this might be the hint you need to
pivot into proxy logs or other network traces.

Introducing capa

When we run capa against our example program, the tool output in
Figure 3 almost speaks for itself. The main table shows all identified
capabilities in this sample, with each entry on the left describing a
capability. The associated namespace on the right helps to group
related capabilities. capa did a fantastic job and described all the
program capabilities we’ve discussed in the previous section.



Figure 3: capa analysis of PMA Lab 1-1

We find that capa often provides surprisingly good results. That’s
why we want capa to always be able to show the evidence used to
identify a capability. Figure 4 shows capa’s detailed output for the
“create TCP socket” conclusion. Here, we can inspect the exact
locations in the binary where capa found the relevant features. We’ll
see the syntax of rules a bit later – in the meantime, we can surmise
that they’re made up of a logic tree combining low level features.



Figure 4: Feature match details for
"create TCP socket" rule in example malware

How capa Works

capa consists of two main components that algorithmically triage
unknown programs. First, a code analysis engine extracts features from
files, such as strings, disassembly, and control flow. Second, a logic
engine finds combinations of features that are expressed in a common
rule format. When the logic engine finds a match, capa reports on the
capability described by the rule.

Feature Extraction

The code analysis engine extracts low-level features from programs.
All the features are consistent with what a human might recognize,
such as strings or numbers, and enable capa to explain its work. These
features typically fall into two large categories: file features and
disassembly features.

File features are extracted from the raw file data and its
structure, e.g. the PE file header. This is information that you might
notice by scrolling across the entire file. Besides the strings and
imported APIs discussed above, these include exported function names
and section names.
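
As a rough illustration of a file feature, the sketch below pulls printable ASCII strings out of raw bytes, similar in spirit to the strings utility. It is not capa's extractor: the function name, the minimum length of 4, and the sample bytes are assumptions made for this example, and wide (UTF-16) strings are ignored.

```python
import re


def extract_ascii_strings(data, min_length=4):
    """Return runs of printable ASCII characters found in raw bytes."""
    # 0x20-0x7e covers the printable ASCII range, space through tilde.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up bytes standing in for a slice of a file's data section.
sample = b"\x00\x01CreateProcessA\x00\xff\x10ab\x00exec\x00sleep\x00"
print(extract_ascii_strings(sample))  # ['CreateProcessA', 'exec', 'sleep']
```

The short run "ab" is dropped because it falls below the minimum length, which is the usual way to cut down on noise from random byte sequences.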

Disassembly features are extracted from an advanced static analysis
of a file – this means disassembling and reconstructing control flow.
Figure 5 shows selected disassembly features including API calls,
instruction mnemonics, numbers, and string references.



Figure 5: Examples of disassembly features in a
disassembled code segment of PMA Lab 1-1

Because the advanced analysis can distinguish between functions and
other scopes in a program, capa can apply its logic at an appropriate
level of detail. For example, it doesn’t get confused when unrelated
APIs are used in different functions since capa rules can specify that
they should be matched against each function independently.
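
The effect of scoping can be sketched with plain Python sets. Everything here is hypothetical (the addresses, the "api:" feature strings, and both helper functions) and only illustrates the idea of matching per function versus across the whole file, not how capa implements it.

```python
def match_function_scope(features_by_function, required):
    """Return addresses of functions that contain every required feature."""
    return [
        address
        for address, features in features_by_function.items()
        if required <= features
    ]


def match_file_scope(features_by_function, required):
    """Return True if the required features appear anywhere in the file."""
    all_features = set().union(*features_by_function.values())
    return required <= all_features


# Two unrelated functions, each using one of the two APIs of interest.
features_by_function = {
    0x401000: {"api:socket", "api:CreateMutexA"},
    0x401200: {"api:CreateProcessA"},
}
required = {"api:socket", "api:CreateProcessA"}

print(match_function_scope(features_by_function, required))  # []
print(match_file_scope(features_by_function, required))      # True
```

At function scope the two APIs never co-occur, so nothing matches. Only the much looser file scope would report a hit, which is why choosing the right scope keeps unrelated code from triggering a rule.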

We’ve designed capa with flexible and extendable feature extraction
in mind. Additional code analysis backends can be integrated easily.
Currently, the capa standalone version relies on the vivisect analysis
framework. If you’re using IDA Pro, you can also run capa using the
IDAPython backend. Note that sometimes differences among code analysis
engines may result in divergent feature sets and hence different
results. Fortunately, this usually isn’t a serious problem in practice.

capa Rules

A capa rule uses a structured combination of features to describe a
capability that may be implemented in a program. If all required
features are present, capa concludes that the program contains the capability.

capa rules are YAML documents that contain metadata and a tree of
statements to express their logic. Among other things, the rule
language supports logical operators and counting. In Figure 6, the
“create TCP socket” rule says that the numbers 6, 1, and 2, and calls
to either of the API functions socket or WSASocket must be present in
the scope of a single basic block. Basic blocks group assembly code at a very low
level making them an ideal place to match tightly related code
segments. Besides within basic blocks, capa supports matching at the
function and the file level. The function scope ties together all
features in a disassembled function, while the file scope contains all
features across the entire file.



Figure 6: capa rule logic to identify TCP
socket creation
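
To make the idea of a logic tree concrete, here is a toy evaluator in Python. It is a sketch under stated assumptions, not capa's rule engine: the tuple-based tree encoding and the evaluate function are invented for this example, and counting is left out. The tree mirrors the prose description of the “create TCP socket” rule above.

```python
def evaluate(node, features):
    """Return True if the logic tree `node` is satisfied by the feature set."""
    operator, *children = node
    if operator == "and":
        return all(evaluate(child, features) for child in children)
    if operator == "or":
        return any(evaluate(child, features) for child in children)
    if operator in ("number", "api"):
        # Leaf node: a single (kind, value) feature must be present.
        return (operator, children[0]) in features
    raise ValueError(f"unknown node type: {operator}")


# Numbers 6, 1, and 2 plus a call to socket or WSASocket.
CREATE_TCP_SOCKET = (
    "and",
    ("number", 6),
    ("number", 1),
    ("number", 2),
    ("or", ("api", "socket"), ("api", "WSASocket")),
)

# Hypothetical features extracted from one basic block.
basic_block = {("number", 2), ("number", 1), ("number", 6), ("api", "socket")}
print(evaluate(CREATE_TCP_SOCKET, basic_block))  # True

# Without the protocol constant the rule no longer matches.
print(evaluate(CREATE_TCP_SOCKET, basic_block - {("number", 6)}))  # False
```

Because every leaf is a concrete feature, a match can always be traced back to the exact features that satisfied it.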

Figure 7 highlights the rule metadata that enables capa to display
high-level, meaningful results to its users. The rule name describes
the identified capability while the namespace associates it with a technique or
analysis category. We already saw the name and namespace in the
capability table of capa’s output. The metadata section can also
include fields like author or examples. We use examples to reference files and
offsets where we know a capability to be present, enabling unit
testing and validation of every rule. Moreover, capa rules serve as
great documentation for behaviors seen in real-world malware, so feel
free to keep a copy around as a reference. In a future post we will
discuss other meta information, including capa’s support for the
ATT&CK and the Malware Behavior Catalog frameworks.



Figure 7: Rule meta information
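
The role of the name and namespace fields can be sketched as follows. The rule names other than “create TCP socket” and all namespace values are illustrative placeholders rather than values copied from capa's output, and group_by_namespace is a hypothetical helper.

```python
from collections import defaultdict


def group_by_namespace(matched_rules):
    """Group matched rule names under their namespace, sorted for display."""
    groups = defaultdict(list)
    for meta in matched_rules:
        groups[meta["namespace"]].append(meta["name"])
    return {namespace: sorted(names) for namespace, names in sorted(groups.items())}


# Illustrative metadata of three matched rules.
matched_rules = [
    {"name": "create TCP socket", "namespace": "communication/socket/tcp"},
    {"name": "create process", "namespace": "host-interaction/process/create"},
    {"name": "create mutex", "namespace": "host-interaction/mutex"},
]

for namespace, names in group_by_namespace(matched_rules).items():
    for name in names:
        print(f"{name:<20} {namespace}")
```

Keeping the metadata separate from the matching logic means the same rule can be reported, grouped, or filtered without touching its features.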

Installation

To make using capa as easy as possible, we provide standalone
executables for Windows, Linux, and OSX. The tool is written in
Python and the source code is available on our GitHub. Additional and
up-to-date installation instructions are available in the capa repository.

Newer versions of FLARE-VM
(available on GitHub) include capa as well.

Usage

To identify capabilities in a program, run capa and specify the input file:

$ capa suspicious.exe

capa supports Windows PE files (EXE, DLL, SYS) and shellcode. To run
capa on a shellcode file, you must explicitly specify the file format
and architecture. For example, to analyze 32-bit shellcode:

$ capa -f sc32 shellcode.bin

To obtain detailed information on identified capabilities, capa
supports two additional verbosity levels. To get the most detailed
output on where and why capa matched on rules, use the very verbose option:

$ capa -vv suspicious.exe

If you only want to focus on specific rules, you can use the tag
option to filter on fields in the rule meta section:

$ capa -t "create TCP socket" suspicious.exe

Display capa’s help to see all supported options and consult the
documentation:

$ capa -h
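
If you want to drive capa from a triage script, a thin wrapper along the following lines may be a starting point. It is a sketch that uses only the options shown above (-f, -vv, -t). The function names and parameters are hypothetical, and it assumes the standalone capa executable is on your PATH.

```python
import shutil
import subprocess


def build_capa_command(sample, file_format=None, very_verbose=False, tag=None):
    """Assemble a capa command line from the options shown above."""
    command = ["capa"]
    if file_format:
        command += ["-f", file_format]  # e.g. "sc32" for 32-bit shellcode
    if very_verbose:
        command.append("-vv")
    if tag:
        command += ["-t", tag]
    command.append(sample)
    return command


def run_capa(sample, **options):
    """Run capa on a sample and return its standard output as text."""
    if shutil.which("capa") is None:
        raise FileNotFoundError("capa executable not found on PATH")
    result = subprocess.run(
        build_capa_command(sample, **options),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


print(build_capa_command("shellcode.bin", file_format="sc32"))
# ['capa', '-f', 'sc32', 'shellcode.bin']
```

Passing the arguments as a list avoids shell quoting problems with tags that contain spaces, such as "create TCP socket".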

Contributing

We hope that capa brings value to the community and encourage any
type of contribution. Your feedback, ideas, and pull requests are very
welcome. The contributing document is a great starting point.

Rules are the foundation of capa’s identification algorithm. We want
to make it easy and fun to write them. If you have any rule ideas,
please open an issue or, even better, submit a pull request to capa-rules. This way,
everyone can benefit from the collective knowledge of our malware
analysis community.

To separate our work and discussions between the capa source code
and the supported rules, we use a second GitHub repository for all
rules that come embedded within capa. The capa main repository embeds
the rule repository as a git submodule. Please refer to the rules
repository for further details, including the rule format documentation.

Conclusion

In this blog post we have introduced the FLARE team’s newest
contribution to the malware analysis community. capa is an open-source
framework to encode, recognize, and share behaviors seen in malware.
We think that the community needs this type of tool to fight back
against the volume of malware that we encounter during investigations,
hunting, and triage. Regardless of your background, when you use capa,
you invoke decades of cumulative experience to figure out what a
program does.

Try out capa in your next malware analysis. The tool is extremely
easy to use and can provide valuable information for forensic
analysts, incident responders, and reverse engineers. If you enjoy the
tool, run into issues using it, or have any other comments, please
contact us via the project’s GitHub page.

