Analysis and synthesis with "big code"

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

Abstract

The vast amount of code available on the web is increasing on a daily basis. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this lecture series, I will cover recent research trends on leveraging such "big code" for program analysis, program synthesis and reverse engineering. We will consider a range of semantic representations based on symbolic automata [55,63], tracelets [28], numerical abstractions [61,58], and textual descriptions [82,1], as well as different notions of code similarity based on these representations. To leverage these semantic representations, we will consider a number of prediction techniques, including statistical language models [66,73], variable order Markov models [18], and other distance-based and model-based sequence classification techniques. Finally, we discuss applications of these techniques including semantic code search in both source code [55] and stripped binaries [28], code completion and reverse engineering [43].

Original language: English
Title of host publication: Dependable Software Systems Engineering
Pages: 244-282
Number of pages: 39
Volume: 45
ISBN (Electronic): 9781614996279
State: Published - 19 Apr 2016

Keywords

  • Big code
  • Program analysis
  • Program synthesis

All Science Journal Classification (ASJC) codes

  • General Social Sciences
