Abstract
The amount of code available on the web is vast and grows daily. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this lecture series, I will cover recent research trends on leveraging such "big code" for program analysis, program synthesis, and reverse engineering. We will consider a range of semantic representations based on symbolic automata [55,63], tracelets [28], numerical abstractions [61,58], and textual descriptions [82,1], as well as different notions of code similarity based on these representations. To leverage these semantic representations, we will consider a number of prediction techniques, including statistical language models [66,73], variable-order Markov models [18], and other distance-based and model-based sequence classification techniques. Finally, we will discuss applications of these techniques, including semantic code search in both source code [55] and stripped binaries [28], code completion, and reverse engineering [43].
| Field | Value |
|---|---|
| Original language | English |
| Title of host publication | Dependable Software Systems Engineering |
| Pages | 244-282 |
| Number of pages | 39 |
| Volume | 45 |
| ISBN (Electronic) | 9781614996279 |
| State | Published - 19 Apr 2016 |
Keywords
- Big code
- Program analysis
- Program synthesis
All Science Journal Classification (ASJC) codes
- General Social Sciences