published on January 10th, 2016
Idris is a purely functional programming language with full dependent types that I have been interested in for some time. Idris makes some choices that I find very appealing, and so I have been following it and poking around with it every so often. Unfortunately, this interest has so far only resulted in very minor contributions, but that's not a reason to give up and not try again ;). I have been interested in compilers and code generation in general, and in the Idris backend in particular, so I'm using this new year as an excuse to take another look at the Idris internals. My intention is to write a few posts exploring the Idris compilation process in general and its backend in particular to gain a better understanding of it and hopefully to find a good place to contribute to it.
The goal of this post is to get a fairly high-level overview over how the compilation process works in general and how parsing works in a bit more detail. Future posts will explore the other parts a little deeper. There is a recent paper by Edwin Brady, the creator of Idris, that explains some aspects of the Idris compilation process. This earlier paper focuses more on TT and the type theory, but is of course also very interesting. In comparison to these papers, my goal is to give a more detailed look at the code and possibly also dive a bit deeper into some details that are left out of these papers as they are fairly standard from a research perspective and not that interesting to academics well versed in this topic.
On a very high level, the Idris compilation process consists of the following steps: First the source code is parsed and an abstract syntax tree (AST) for the high-level Idris language is produced. This language is then slightly desugared, for example by replacing do notation and so on. Then the so-called elaboration process starts, where implicit arguments are infered and the desugared Idris language is converted to the main core language, called TT. In TT all arguments are explicit, yet TT is still a fully dependently typed language. Typechecking occurs at this level. TT is then simplified by converting (possibly dependently-typed) pattern matching into simple case trees, this simplified form of TT is called TT_case. A number of further simplification processes produce successively simpler intermediate languages: IR_case (a first untyped intermediate language), IR_lift where all lambdas have been lifted to the top level, IR_defunc where partially applied functions have been converted to constructors and finally IR_ANF where all function arguments are either constants or variables. These intermediate languages (and some other data) are written to Idris ibc files. Code generators process these files and compile the intermediate languages further to executables.
The first step of the compiler is to parse the source code and to
produce an AST representing this code. The main file for the
executable lives in
main/Main.hs. This ends up calling the function
src/Idris/REPL.hs. This and other functions in that
module load code either with the
functions, both defined in
src/Idris/Parser.hs. This is the main
file of the parser, which defines the aformentioned top-level
functions for loading code. Other files of interest for the parsing
ParserHelpers.hs, all of which define different parts of the parser,
AbsSyntaxTree.hs, which contains the datatypes for
the syntax tree, as well as other important datatypes used in the
The function in
Parser.hs which actually starts parsing code from is
this function parses a file in two parts, it first parses the imported
parseImports), then proceeds to load these modules and
only parses the program afterwards with the
These functions run in the Idris monad and modify its state, so they're
not just pure parsing functions. The Idris monad is used fairly
extensively throughout the compliation process and is a
on top of
is a large record defined in
AbsSyntaxTree.hs which mostly lists the
known top-level definitions as well as some current state related to
elaboration and proving.
Running the parser is accomplished by the aptly-named
ParseHelpers.hs. This function takes a
IdrisInnerParser res as input, where
IdrisInnerParser is just a
newtype around a
Parser from the
Trifecta library. Much
Parse*.hs files define items for the trifecta parser
combinators, and are thus relatively straight forward.
The datatypes that represent the high-level language are defined in
src/Idris/AbsSyntaxTree.hs. The datatype for top-level
which is usually parameterized over
the main datastructure defining the high-level language terms.
That's it for the current post, in the next part I'll take a very brief look at where desugaring, elaboration and TT, Idris' core language, live.