Type Learning for Binaries and its Applications

Zhiwu Xu, Cheng Wen, Shengchao Qin

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Binary type inference is challenging, partly because much type-related information is lost during compilation. Most existing work resorts to program analysis techniques, which can be either too heavyweight to be viable in practice or too conservative to infer types with high accuracy. In this work, we propose a new approach to learning types for binary code. Motivated by “duck typing”, our approach learns types for recovered variables from their features and properties (e.g., related representative instructions). We first use machine learning to train a classifier, with basic types as its levels, on binaries with debugging information. The classifier is then used to learn types for new, unseen binaries. For composite types, such as pointers and structs, a points-to analysis is performed. Finally, we conduct several experiments to evaluate our approach. The results demonstrate that our approach is more precise, in terms of both correct types and compatible types, than the commercial tool Hex-Rays, the open-source tool Snowman, and the recent machine-learning-based tool EKLAVYA. We also show that the type information our system learns can help detect malware.
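
The classification step described in the abstract can be illustrated with a minimal sketch. It assumes each recovered variable is represented by the instruction mnemonics that touch it, and it uses a small multinomial naive Bayes model as a stand-in classifier. The training samples, the feature choice, and the names `TRAIN`, `train`, and `predict` are all hypothetical; this is not the authors' feature set, classifier, or tool.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: for each variable, the mnemonics of the
# instructions that use it, paired with the basic type that debugging
# information would provide. Real features would be far richer.
TRAIN = [
    (["movss", "addss", "cvtsi2ss"], "float"),
    (["movss", "mulss", "movss"], "float"),
    (["movsd", "addsd", "cvtsi2sd"], "double"),
    (["movsd", "mulsd", "movsd"], "double"),
    (["mov", "add", "cmp", "imul"], "int"),
    (["mov", "sub", "cmp"], "int"),
    (["movzx", "cmp", "movsx"], "char"),
    (["movsx", "movzx", "mov"], "char"),
]


def train(samples):
    """Count mnemonic occurrences per type label."""
    counts = defaultdict(Counter)  # label -> mnemonic -> count
    priors = Counter()             # label -> number of training variables
    vocab = set()                  # all mnemonics seen during training
    for instrs, label in samples:
        priors[label] += 1
        counts[label].update(instrs)
        vocab.update(instrs)
    return counts, priors, vocab


def predict(model, instrs):
    """Return the most likely basic type for a variable's mnemonics."""
    counts, priors, vocab = model
    total = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in priors:
        # Laplace smoothing so unseen mnemonics do not zero out a label.
        denom = sum(counts[label].values()) + len(vocab)
        score = math.log(priors[label] / total)
        for ins in instrs:
            score += math.log((counts[label][ins] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, ["movss", "addss"]))
```

In this toy setting, a variable used only by single-precision SSE instructions is labeled by the instructions it "behaves like", which is the duck-typing intuition the abstract refers to. Composite types such as pointers and structs are outside the scope of this sketch; the abstract handles them with a separate points-to analysis.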

Original language: English
Pages (from-to): 893-912
Number of pages: 20
Journal: IEEE Transactions on Reliability
Volume: 68
Issue number: 3
DOIs: 10.1109/TR.2018.2884143
Publication status: Published - 25 Dec 2018

Fingerprint

Learning systems
Classifiers
Binary codes
Composite materials
Experiments
Malware

Cite this

Xu, Zhiwu ; Wen, Cheng ; Qin, Shengchao. / Type Learning for Binaries and its Applications. In: IEEE Transactions on Reliability. 2018 ; Vol. 68, No. 3. pp. 893-912.
@article{96e7167018c04ff89b9eaff08e5bba2b,
title = "Type Learning for Binaries and its Applications",
abstract = "Binary type inference is challenging, partly because much type-related information is lost during compilation. Most existing work resorts to program analysis techniques, which can be either too heavyweight to be viable in practice or too conservative to infer types with high accuracy. In this work, we propose a new approach to learning types for binary code. Motivated by “duck typing”, our approach learns types for recovered variables from their features and properties (e.g., related representative instructions). We first use machine learning to train a classifier, with basic types as its levels, on binaries with debugging information. The classifier is then used to learn types for new, unseen binaries. For composite types, such as pointers and structs, a points-to analysis is performed. Finally, we conduct several experiments to evaluate our approach. The results demonstrate that our approach is more precise, in terms of both correct types and compatible types, than the commercial tool Hex-Rays, the open-source tool Snowman, and the recent machine-learning-based tool EKLAVYA. We also show that the type information our system learns can help detect malware.",
author = "Zhiwu Xu and Cheng Wen and Shengchao Qin",
year = "2018",
month = "12",
day = "25",
doi = "10.1109/TR.2018.2884143",
language = "English",
volume = "68",
pages = "893--912",
journal = "IEEE Transactions on Reliability",
issn = "0018-9529",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "3",

}

TY - JOUR

T1 - Type Learning for Binaries and its Applications

AU - Xu, Zhiwu

AU - Wen, Cheng

AU - Qin, Shengchao

PY - 2018/12/25

Y1 - 2018/12/25

N2 - Binary type inference is challenging, partly because much type-related information is lost during compilation. Most existing work resorts to program analysis techniques, which can be either too heavyweight to be viable in practice or too conservative to infer types with high accuracy. In this work, we propose a new approach to learning types for binary code. Motivated by “duck typing”, our approach learns types for recovered variables from their features and properties (e.g., related representative instructions). We first use machine learning to train a classifier, with basic types as its levels, on binaries with debugging information. The classifier is then used to learn types for new, unseen binaries. For composite types, such as pointers and structs, a points-to analysis is performed. Finally, we conduct several experiments to evaluate our approach. The results demonstrate that our approach is more precise, in terms of both correct types and compatible types, than the commercial tool Hex-Rays, the open-source tool Snowman, and the recent machine-learning-based tool EKLAVYA. We also show that the type information our system learns can help detect malware.

U2 - 10.1109/TR.2018.2884143

DO - 10.1109/TR.2018.2884143

M3 - Article

VL - 68

SP - 893

EP - 912

JO - IEEE Transactions on Reliability

JF - IEEE Transactions on Reliability

SN - 0018-9529

IS - 3

ER -