Saturday, August 6, 2022

Recursion and Rust lifetimes

While working on my HTN compiler, I came upon an interesting caveat of Rust lifetimes. It arises from several different concepts interacting, which makes it somewhat of an edge case. But before we dive into it, let's review some of the logic that led me down this path.

The HTN compiler follows a Python-like syntax, which means that at the end of parsing an Abstract Syntax Tree (AST) is produced and passed to the compiler to be further reduced to bytecode. If you are interested in learning more about this process there's a great book by Robert Nystrom - Crafting Interpreters - but I will give an overview of the process here too and map Rust concepts onto it.

String to tokens (Lexing)

As with any parsing task, we start at the text level. We take in a string and output a sequence of tokens that represent the interesting parts of that string; the tokens commonly omit comments and extra whitespace. In our case the task is a little harder because I want to implement a Python-like syntax where newlines and tab offsets matter. I decided to implement the tab-depth tracking logic in the lexer part of the parser.

I started working on this compiler before I had a solid grasp of a few Rust concepts; indeed, one of the reasons I started this project was to learn Rust more deeply than I had before. My original thought was that the Lexer would take ownership of the source code variable and then output tokens containing references into the source code where needed (e.g. a token referencing a string literal). I quickly ran into problems passing the tokens further down the chain, because the Lexer wouldn't live long enough: tokens borrowing from the Lexer-owned source could not outlive the Lexer itself. My initial solution was to use reference-counting pointers - Rc<String>. Even then I ran into problems with creating self-referential structures.
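
Roughly, the original problem had this shape (a simplified, hypothetical sketch - not my actual code):

// The broken first design: the Lexer owns the source code,
// so every Token borrows from the Lexer itself.
struct OwningLexer {
    text: String,
}

impl OwningLexer {
    fn next_token(&self) -> &str {
        // A "token" borrowing from self...
        &self.text[0..4]
    }
}

fn tokens<'a>() -> Vec<&'a str> {
    let lexer = OwningLexer { text: String::from("task Main:") };
    // ERROR: cannot return value referencing local variable `lexer`
    vec![lexer.next_token()]
}

After a few cycles of refactoring, however, I ended up with a much cleaner design that uses plain references and does not require self-referential structures: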


use std::iter::Peekable;
use std::str::CharIndices;

pub struct Lexer<'a> {
    text: &'a str,
    it: Peekable<CharIndices<'a>>,
    line: usize, // current source line, used for error reporting by Tokens
    col: usize, // current source column, used for error reporting by Tokens
    stash: Vec<Token<'a>>, // stash of tokens to return in lieu of it.peek_two()
    is_newline: bool,
    tab_depth: usize, // needed to insert BLOCK_END tokens properly
    depth_separator: Option<DepthSeparator>,
}

impl<'a> Lexer<'a> {
    pub fn new(text: &'a str) -> Self {
        Self {
            text,
            it: text.char_indices().peekable(),
            line: 1,
            col: 1,
            stash: Vec::new(),
            is_newline: true,
            tab_depth: 0,
            depth_separator: None,
        }
    }
}


The lexer takes a reference to a string slice containing the source code and outputs tokens which hold references into that same slice - both tied to the 'a lifetime. The interesting bit here is that the lexer doesn't own anything important: it only holds references to a structure (the source code String) that outlives it, and outputs structures (Tokens) that live as long as the source code. The Lexer also implements Iterator, which is how the tokens are actually produced. The source code characters are wrapped in a Peekable because there are cases where we need to check the next character to decide on a token type, e.g. for "=>". And since we're also processing tab depth, and the source code might be ill-formatted, we want to output errors too. So overall our iterator outputs type Item = Result<Token<'a>, Error>;. Now lifetime 'a can be visualized in the following code:

#[test]
fn test_type() {
    let code = "type Cell:\n\tc1\n\tc2"; // Lifetime 'a starts here
    let mut l = Lexer::new(code);
    assert_eq!(l.next(), Some(Ok(Token{line:1, col:1, len:4, t:Type})));
    assert_eq!(l.next(), Some(Ok(Token{line:1, col:6, len:4, t:Identifier("Cell")})));
    assert_eq!(l.next(), Some(Ok(Token{line:1, col:10, len:1, t:Colon})));
    assert_eq!(l.next(), Some(Ok(Token{line:1, col:11, len:0, t:StatementEnd})));
    assert_eq!(l.next(), Some(Ok(Token{line:2, col:1, len:1, t:BlockStart})));
    assert_eq!(l.next(), Some(Ok(Token{line:2, col:2, len:2, t:Identifier("c1")})));
    assert_eq!(l.next(), Some(Ok(Token{line:2, col:4, len:0, t:StatementEnd})));
    assert_eq!(l.next(), Some(Ok(Token{line:3, col:2, len:2, t:Identifier("c2")})));
    assert_eq!(l.next(), Some(Ok(Token{line:3, col:4, len:0, t:StatementEnd})));
    assert_eq!(l.next(), Some(Ok(Token{line:3, col:4, len:0, t:BlockEnd})));
    assert_eq!(l.next(), None);
    // Lifetime 'a ends here
}
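
For reference, the skeleton of that Iterator implementation looks roughly like this (a simplified sketch: the real version also handles identifiers, literals, newlines and tab-depth bookkeeping, and TokenData::Arrow is an illustrative variant name):

impl<'a> Iterator for Lexer<'a> {
    type Item = Result<Token<'a>, Error>;

    fn next(&mut self) -> Option<Self::Item> {
        // Stashed tokens (e.g. a pending BlockEnd) are returned first.
        if let Some(token) = self.stash.pop() {
            return Some(Ok(token));
        }
        let (_, c) = self.it.next()?;
        let token = match c {
            ':' => Token{line:self.line, col:self.col, len:1, t:TokenData::Colon},
            '=' if matches!(self.it.peek(), Some((_, '>'))) => {
                // Peeking lets us emit a single token for two-character
                // sequences like "=>".
                self.it.next();
                Token{line:self.line, col:self.col, len:2, t:TokenData::Arrow}
            },
            // ... identifiers, literals, whitespace and error cases elided ...
            _ => todo!(),
        };
        self.col += token.len;
        Some(Ok(token))
    }
}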


Token to AST (Parsing)

From here we have an iterator that produces tokens. While parsing we will sometimes need to look ahead to operate on two tokens at once, and sometimes we only want to consume a token if it matches what we expect (a recursive descent parser does that often). The interesting part here is how our AST is defined:


pub enum Stmt<'a> {
    // Forward declaration of a task, without a body.
    TaskDeclaration{
        name: Vec<Token<'a>>,
        binding: Option<(&'a str, &'a str)>,
    },
    // A full task definition.
    Task{
        name: Vec<Token<'a>>,
        preconditions: Option<Expr<'a>>,
        cost: Option<Expr<'a>>,
        binding: Option<(&'a str, &'a str)>,
        body: Box<Stmt<'a>>,
        effects: Option<Box<Stmt<'a>>>,
        planning: Option<Box<Stmt<'a>>>,
    },
    // A type definition, e.g. "type Cell:".
    Type{
        name: Token<'a>,
        body: Box<Stmt<'a>>,
    },
    // A sequence of statements at the same tab depth.
    Block(Vec<Stmt<'a>>),
    // An expression statement.
    Expression(Expr<'a>),
    // An "include" statement pointing at another file.
    Include(Token<'a>),
}


Our AST nodes live as long as the original source code - the same lifetime 'a. Parsing is done in a single pass over the tokens, just using a Peekable<Lexer<'a>>. The Parser itself implements Iterator and outputs either AST nodes (statements) or errors. Part of the reason the AST nodes own Tokens is so that I can easily generate errors that refer to the exact location in the source code.
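
The lookahead and conditional-consume machinery mentioned above might look something like this (a hypothetical helper, assuming the parser holds a Peekable<Lexer<'a>> in a lexer field and TokenData implements PartialEq; my actual method names may differ):

impl<'a> Parser<'a> {
    // Consume the next token only if it matches the expected kind -
    // the bread and butter of a recursive descent parser.
    fn consume_if(&mut self, expected: TokenData<'a>) -> Option<Token<'a>> {
        match self.lexer.peek() {
            Some(Ok(token)) if token.t == expected => {
                // We just peeked a token, so next() is guaranteed to succeed.
                self.lexer.next().unwrap().ok()
            },
            _ => None,
        }
    }
}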

AST to bytecode (compiling)

My current HTN compiler outputs bytecode - a flattened representation of the AST that lets a virtual CPU process instructions linearly instead of jumping between AST nodes (which takes more time). Eventually I hope to replace this bytecode with native machine code, making this a true compiler, but the only difference that would make to our approach is that the instruction set would map to different values and I'd also have to mark memory pages executable. However, it is here that I ran into the recursion and lifetime issues.

The AST compiler follows the "Visitor" pattern - it visits each node of the AST, maintaining some state that tells it where it came from and what it is doing right now. The compiler updates itself according to the current AST node, and then recursively passes itself to the children of that node. By the time we reach compilation, we have two lifetimes to keep track of. One is still that of the source code. The other is that of the mappings from names to IDs assigned in bytecode: if the bytecode contained human-readable names, we would have to look up each string in some sort of map, find the offset of that name in virtual memory, and then resolve it - an extra search we don't need. So our bytecode just contains offsets into virtual memory, which means we need to maintain this mapping during compilation instead of execution. For this particular HTN approach I also found it useful for these maps to outlive the compiler - so the compiler only holds mutable references to these maps, which live for their own lifetime. I'll refer to these lifetimes as 'code and 'maps respectively.
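
To make the two lifetimes concrete, the compiler's state has roughly this shape (an illustrative sketch - the field types are stand-ins, not my exact definitions):

use std::collections::HashMap;

// 'code - borrows into the source text (e.g. typing information).
// 'maps - borrows of the name-to-ID maps that outlive the compiler.
pub struct DomainCompiler<'code, 'maps> {
    type_mapping: &'maps mut HashMap<String, usize>,
    task_mapping: &'maps mut HashMap<String, usize>,
    operator_mapping: &'maps mut HashMap<String, usize>,
    state_mapping: &'maps mut HashMap<String, usize>,
    binding: Option<(&'code str, &'code str)>, // typing info carried between nodes
    bytecode: Vec<u8>, // stand-in for the real instruction representation
    pass: usize,
}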

One feature I wanted my compiler to have is the ability to include other files. What sounds like a simple wish ended up having me write this blog post. At some point the compiler visits an "include" node of the AST. The node contains the filepath we want to load. No problem: we load the source code of that file, pass it through the lexer and the parser, and now we have a new AST that represents the included file. We can now pass our compiler to the new AST's root node and...

`source_code` does not live long enough
  borrowed value does not live long enough
domain.rs(278, 21): `source_code` dropped here while still borrowed
domain.rs(146, 10): lifetime `'code` defined here
domain.rs(254, 47): argument requires that `source_code` is borrowed for `'code`

OK, why..? We just loaded the source code in this function, created the AST in this function and descended into it; once we finish compiling that AST and come back out to this function, we don't use the AST anymore and can free it, along with the source code... right? This would be the case if we were not already in the middle of compiling an "older" AST node. Rust does not try to shrink lifetimes to match actual usage: once you create a variable that maps to a lifetime parameter, Rust assumes everything else carrying that lifetime needs to live as long as that variable. Except, what happens during recursion?

Whenever we load the main file to be parsed, we tell Rust that all AST nodes need to live as long as the main file's source. Whenever we start compiling AST nodes, we also tell Rust that the compiler must live as long as the main file's source. When the compiler finds an "include" AST node and parses the included file, we tell Rust that the new AST nodes must live as long as the new file's source. But when we pass the compiler to the new AST nodes there's a problem: Rust already knows that everything the compiler touches must live as long as the main file, and we just handed it AST nodes that live only for as long as the "include" statement is being processed. That's what the error is about. To solve it, we need to either tell Rust that the compiler can outlive AST nodes, or that the "include"d AST nodes don't need to live as long as the main file's AST nodes.
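
Here is a stripped-down sketch (not the actual compiler code) that reproduces the same error shape:

struct Compiler<'code> {
    // State that borrows from the source text pins the compiler to 'code.
    names: Vec<&'code str>,
}

impl<'code> Compiler<'code> {
    fn visit(&mut self, node: &'code str) {
        if node == "include" {
            // The included file's source is owned by this stack frame...
            let included_source = String::from("task Foo");
            // ...but visit() demands a &'code str, and `included_source`
            // does not live for 'code:
            // error: `included_source` does not live long enough
            self.visit(&included_source);
        } else {
            self.names.push(node);
        }
    }
}

The recursion itself is fine; it's the single 'code lifetime threading through self that forces the included file's source to live longer than it possibly can.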

To allow the compiler to outlive the AST nodes we would need to remove all 'code references from the compiler state. In my case I use those references to pass typing information for tasks that rely on typing; this information is in the form of references to string slices. I could potentially just create new Strings from them and drop the references, but I wasn't sure whether I'd need more references for other structures (the language's expressivity is still growing), so I decided to go with the second approach - tell Rust that the newly created AST nodes don't have to live as long as the main file's nodes. The only way to do that while also compiling the new AST nodes is to create a new compiler that lives only as long as these new nodes - and that's what I did. As I encounter the include statement, I load the new file's source code, parse it, compile it as if it were the main file, then just rip the compiled bytecode out of it and append it to the current compiler's state:


fn visit_include(&mut self, filepath: &Token) -> Result<(), Error> {
    // Includes are only processed on the first pass.
    if self.pass == 0 {
        if let Token{t: TokenData::Literal(Literal::S(filepath)), ..} = filepath {
            match std::fs::read_to_string(filepath) {
                Ok(code) => {
                    let mut errors = Vec::new();
                    // A fresh compiler whose 'code lifetime is tied to the
                    // included file's source, not the main file's.
                    let mut inc_compiler = DomainCompiler::new(
                        self.compiler.state_mapping, self.task_mapping,
                        self.operator_mapping, self.type_mapping,
                        self.state_mapping);
                    while inc_compiler.pass < DomainCompiler::MAX_PASSES {
                        for result in parser::Parser::new(code.as_str()) {
                            match result {
                                Ok(stmt) => {
                                    match stmt.accept(&mut inc_compiler) {
                                        Ok(()) => (),
                                        Err(e) => errors.push(e),
                                    };
                                },
                                Err(e) => errors.push(Error::Parser(
                                    Some(String::from(*filepath)), e)),
                            }
                        }
                        if !errors.is_empty() {
                            break;
                        }
                        inc_compiler.pass += 1;
                    }
                    if !errors.is_empty() {
                        Err(Error::FromFile(String::from(*filepath), errors))
                    } else {
                        // Merge the included compiler's results back into ours.
                        let fqdns = std::mem::take(&mut self.fqdns);
                        inc_compiler.fqdns.extend(fqdns);
                        let t = std::mem::take(&mut self.tasks);
                        inc_compiler.tasks.extend(t);
                        self.fqdns = inc_compiler.fqdns;
                        self.tasks = inc_compiler.tasks;
                        Ok(())
                    }
                },
                Err(e) => Err(Error::Basic(String::from(*filepath), e.to_string())),
            }
        } else {
            Err(filepath.to_err("Expected a string literal.").into())
        }
    } else {
        Ok(())
    }
}


