# Improving PDG Vector Creation for AnDarwin

Julia Matsieva • ECS 235A • Fall 2012## Background

A *program dependency graph* is graph that captures information about control flow or data dependencies in a program. In order to identify plagiarized Android applications, the AnDarwin tool constructs PDG's of Android programs by creating a node for each statement and drawing an edge to any other statements that use data from this node. Thus, the PDG describes the structure of the program without being too susceptible to obfuscations like variable renaming or reversal of loop conditions, for example.

However, this means that recognizing similar Android programs is reduced to finding the maximum common subgraph isomorphism between two PDG's — a known NP-hard problem. So if one desires to search an entire app store for plagiarized apps, a solution that requires pair-wise exponential-time comparison of applications is impractical. Therefore, for each PDG, we construct a *PDG vector* and then use the LSH clustering algorithm for vectors to identify structurally similar programs.

## Problem

Currently, AnDarwin PDG vectors are constructed as follows: the statements in the program are classified into *d* types, resulting in a *d*-dimensional vector. Then, the *i*^{th} value of the vector is the number of statements of type *i*; for example, if a program has a total of 7 conditional statements, then there will be a 7 in the location corresponding to conditional statements in the vector. From a graph-theoretic standpoint, this method of constructing vectors only records information about types of nodes; unfortunately, this method throws away all edge information about the graph. Therefore, we want to develop a method for creating PDG vectors that preserves for information about graph structure

## Approach

The following are some desirable properties of the obtained solutions for this work. Since this project involves analysis of an existing data set and does not really have a ground truth, these techniques will not be completely theoretical and may require some tweaking for implementation purposes. Nevertheless, we want these methods to

- not generate new false negatives
- be somewhat resistant to obfuscation
- not inflate the runtime of the current algorithm considerably
- avoid using graph slicing, which has already been invented and sounds somewhat difficult to formalize and implement

## Expected Results

One promising method for separating false positives; that is, graphs that have the same vector but non-isomorphic PDG's is recording the *max out-degree* of each node-type in the PDG vector. This helps distinguish non-isomorphic graphs and has the desired properties listed above.

Additional techniques to combat false negatives occurring due to addition of dummy variables or useless computation include recording *average out-degree* data in the vector or skewing Euclidean space to model the ease with which certain statement types can be added for obfuscation purposes. These techniques may require more adjustment on real-world data.

## Schedule

- Week 1: Absorb problem details and background information.
- Week 2: Formulate, justify and rank possible approaches.
- Week 3: Presentation.
- Week 4: Write-up.

## Related Work

Jonathan Crussell, Clint Gibler, and Hao Chen.

To appear in

*17th European Symposium on Research in Computer Security (ESORICS)*, Pisa, Italy, September 10-12, 2012. (20%)

Scalable Detection of Semantic Clones.

Mark Gabel, Lingxiao Jiang, and Zhendong Su.

In

*Proceedings of the 2008 International Conference on Software Engineering (ICSE '08),*Leipzig, Germany.