Bison & Flex Part 3 - What Bison "does".
Let's look closer at formal languages & compiling
Okay, you've come this far: we've examined how to create a simple lexer using Flex that actually does something - so far, consuming input text and 'transforming' (modifying) it in simple ways. I want to teach you how Bison, parsing and compiling work, and in order to do that we need to talk about "Formal Languages" (programming languages are the most famous family of Formal Languages). I'll start with a brief overview.
The study of formal languages, and therefore of programming languages, is a significant part of Linguistics, and languages can be divided initially into 2 top-level classes: Formal Languages and Natural Languages. Pragmatically, you can think of the difference as being that Formal Languages have a smaller, more rigidly defined set of rules (subdividable into syntax/grammar rules and semantic rules), which makes them much better suited to analysis by machine/computational methods, and also to generation by computational methods. Another implied difference is that Natural Languages evolve over large spans of time, subject to unpredictable, natural pressures that are by their nature chaotic to map.
To my mind, the pre-eminent figure in the field of Formal Languages is Noam Chomsky. You can easily Google him (nothing wrong with Wikipedia articles in this regard), so I won't digress with a detailed biography, but Chomsky is the creator of the 'Chomsky Hierarchy', which has a concentric structure. At the top of the hierarchy is Level 0 (Recursively-Enumerable Languages). This outer layer of the concentric hierarchy contains (and therefore includes) the languages in the level within it (Level 1, Context-Sensitive Languages), which in turn includes the languages in the level below that (Level 2, Context-Free Languages), which subsumes the languages in the lowest level, Level 3 (Regular Languages). So far, we have utilised regular expressions, which describe Level 3 Regular Languages. In fact, we have barely investigated the actual power and flexibility of regular expressions, but I'll do a primer on that at some point. If you are reading this article, you are almost certainly interested in Level 2 (Context-Free) languages, the general category for most programming languages we encounter out in the wild. Technically, C++ for example is usually described as a Level 2 Context-Free language, but some people, nerds, debate avidly on the interweb that it is really Level 1 Context-Sensitive, with some even saying neither. There is an informative discussion on .
Anyways, if you want to master Bison, it's probably because you want to create the equivalent of a compiler, so we need to talk about Context-Free Grammars. Bison allows you to feed it a CFG, along with associated C/C++ fragments ('actions'), and from that it generates a C/C++ program which, when compiled, is a parser - the core of a compiler - for the custom language you defined. So: CFG's.
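To give you a taste of what that looks like, here is a hypothetical sketch of the Bison input format: a CFG rule with an associated C fragment attached. The token and rule names here are invented purely for illustration, not taken from any real grammar:

```
/* Hypothetical Bison fragment: a production rule plus a C "action"
   that runs when the rule matches. Names are illustrative only. */
%token NUMBER PLUS

%%

expression
    : NUMBER PLUS NUMBER   { printf("matched: an addition\n"); }
    ;

%%
```

The part before the first `%%` declares the tokens (terminals), and each rule between the `%%` markers pairs a production with the C code to execute when it is recognised.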
Basically, a Grammar is typically defined by a set of "Production Rules", which describe all the valid strings of the Grammar. If attempting to resolve an input string against the Production Rules fails, the input string (source code) is syntactically invalid; but if the process of decomposing the input string via the Production Rules completes, then it is a syntactically correct program/string, and therefore 'valid'.
Production Rules take the following form:
A -> B
A -> B C
and so on.
Note, with CFG's there can only be one symbol on the left-hand side, and that left-hand side symbol is known as a "non-terminal". The right-hand side of a rule may contain a mixture of non-terminals and their complement, 'terminals'. Think of terminals as literals that don't need to be further decomposed or interpreted, whereas non-terminals do need to be replaced during the parsing process, until the input is finally resolved into terminals of the target language's alphabet. In other words, when reading a grammar in CFG form (most likely Backus-Naur Form, technically) for the first time, simply look at all the symbols on the left-hand sides of the production rules. They are the non-terminals, and any other symbol appearing only on the right-hand side of a production rule must be a terminal, because it has no production rule defining its decomposition.
For example, numeric literal constants (1, 2, 3.141 etc.) are terminals, as are literal strings that represent reserved language keywords (such as 'int'). Obviously, keywords are not always literals: the sequence of characters i, n, and t could perhaps be found sequentially in a modifiable string object in a source code file, but when they are encountered as a literal token by the language parser, they are terminals, and no further decomposition or syntactic translation is required. Below is a hypothetical example CFG, to give you the idea:

S -> EXPRESSION
EXPRESSION -> NUMBER OPERATOR NUMBER | EXPRESSION OPERATOR EXPRESSION | '(' EXPRESSION ')'
OPERATOR -> PLUS | MINUS | MULTIPLY | DIVIDE
In the above example, S, EXPRESSION, and OPERATOR, as left-hand sides of the rules, are non-terminals, while NUMBER, PLUS, MINUS, MULTIPLY and DIVIDE (along with the literal parentheses) are terminals; PLUS, MINUS, MULTIPLY and DIVIDE represent the literal arithmetic-operator symbols. Note, by convention, the "first"/top rule in a Grammar is by default designated the 'Start' rule, and so by convention is given the left-hand side non-terminal 'S'. If you examine the above production rules, hopefully you are making some useful observations. Clearly, if we imagine a computer program (a compiler) analyzing an input string, it would start at the 'S' start rule and try to recursively match the input string as a valid construct. Let's step through this:
1. To resolve 'S' (match an input string as a valid construct according to the rule), we need to positively match the string to the right-hand side of the rule.
2. In this case, the rhs is a non-terminal EXPRESSION. In order to confirm our input string (let's say our input string is "((2 + 3) / (4.1 * 6.2))" ) is a valid string, we now need to validate it against EXPRESSION, which handily is defined in the second production rule:
3. Looking at the 3 options on the rhs of the rule, we see that the first char of the input string, '(', matches the 3rd option, '(' EXPRESSION ')', but to continue we now have to recursively confirm that the nested non-terminal in that option (EXPRESSION again) also matches. The remainder of the input string at this point is "(2 + 3) / (4.1 * 6.2))", so we must continue, recursively, to test it against the EXPRESSION production rule.
4. I am going to take a shortcut here and say the subsequence "(2 + 3)" matches the 3rd option by way of recursively matching the 1st option, "NUMBER OPERATOR NUMBER", itself nested inside the 3rd option, "'(' EXPRESSION ')'".
5. So far, so good. We are currently matching the input string against the EXPRESSION production rule. The next symbol we arrive at in the input string is "/", the divide symbol, which matches the OPERATOR non-terminal. That means at this point we are matching "EXPRESSION OPERATOR EXPRESSION", the 2nd option of the recursive production rule for EXPRESSION. So, if we continue to consume the input string, will we continue to match the rule?
6. Yes, we will. "(4.1 * 6.2)" is a valid string, which means we now match option 3, inside of which we match option 1 "NUMBER OPERATOR NUMBER", do you see?...
In other words, examine the 3 production rules above, and imagine mechanically drilling-down recursively from the top 'S' rule against the input string:
((2 + 3) / (4.1 * 6.2)),
The structure you would build would look like the following:
3(2(3(1) '/' 3(1))),
where you have in the centre: 3(1) '/' 3(1), where 1 represents "NUMBER OPERATOR NUMBER", 3 contains 1 by virtue of the recursive reference to EXPRESSION, which can be of form 1, 2, or 3. Obviously, 2 represents "EXPRESSION OPERATOR EXPRESSION", and that is contained within an instance of 3 again:
'(' EXPRESSION ')'
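The same derivation can be drawn as an indented tree (my own rendering, with the matched EXPRESSION option noted beside each node):

```
EXPRESSION                     option 3: '(' EXPRESSION ')'
  EXPRESSION                   option 2: EXPRESSION OPERATOR EXPRESSION
    EXPRESSION                 option 3: '(' EXPRESSION ')'
      EXPRESSION               option 1: NUMBER(2) OPERATOR(+) NUMBER(3)
    OPERATOR                   '/'
    EXPRESSION                 option 3: '(' EXPRESSION ')'
      EXPRESSION               option 1: NUMBER(4.1) OPERATOR(*) NUMBER(6.2)
```

Reading the option numbers down the tree gives exactly the compact 3(2(3(1) '/' 3(1))) notation used above.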
Hopefully, you can now see that a) the string is a match for the grammar, and b) CFG production rules provide a compact, efficient method for defining the expressions and constructs of the programming language the grammar describes.