How to localize an error in an LR parser?

D

Dmitry Makarov2015-05-12 13:00:33

Parsing

Dmitry Makarov, 2015-05-12 13:00:33

Just in case, I’ll tell you in more detail what I mean by an LR parser and what my problem is.
There is a sequence of lexemes of different types.
There is an ordered set of rules.
A rule maps a number of tokens of certain types to a single token.
We put the token on the stack and if the top of the stack (several top elements) matches the rule, then we replace it in accordance with the rule.
If the substitution has occurred, then we begin to check for compliance with the rules first, if not, then we put the next token on the stack.
If in the end only one element remains on the stack, then the sequence of tokens is correct and parsing was successful.
My tokens are TreeNode. When replacing by rule, I append the removed tokens as children of the added token. As a result, the remaining element I have is the root of the tree.

public class Lexem : TreeNode
    {
        public int Offset { get; set; }
        public int Length { get; set; }
        public LexemKind Kind { get; set; }
        public string Value {get;set;}
        
        public static Lexem GetLexem(LexemKind inKind)
        {
            return new Lexem() { Kind = inKind };
        }
    }

        public Lexem GetFormula()
        {
            var stack = new List<Lexem>();

            foreach (var lexem in lexer.GetLexems())
            {
                stack.Insert(0, lexem);
                while (true)
                {
                    Rule rule = null;
                    Lexem[] currentSet = null;
                    foreach (var nextRule in _Rules)
                    {
                        if (stack.Count < nextRule.IN.Length) continue;
                        currentSet = new Lexem[nextRule.IN.Length];
                        for (var idx = 0; idx < nextRule.IN.Length; idx++)
                        {
                            currentSet[idx] = stack[nextRule.IN.Length - idx - 1];
                            if ((nextRule.IN[idx] & currentSet[idx].Kind) != LexemKind.EMPTY) continue;
                            currentSet = null;
                            break;
                        }

                        if (currentSet == null) continue;

                        rule = nextRule;
                        break;
                    }

                    if (rule == null) break;

                    stack.RemoveRange(0, rule.IN.Length);
                    stack.Insert(0, Lexem.GetLexem(rule.OUT));
                    stack[0].Nodes.AddRange(currentSet);
                }
            }
            if (stack.Count == 0) return  null;
            if (stack.Count == 1) return stack[0];

            throw new ParserExeption();
        }

Actually, the question is: how to determine in which place the syntax was violated in the input sequence of tokens?

Reply

Answer the question

In order to leave comments, you need to log in

1 answer(s)

M

middle, 2015-05-21
@middle

An LR parser consists of a stack and a state machine. In a state machine, arcs are labeled with two types of labels: SHIFT or REDUCE <rule>. In the example you have given, the first '+' will produce a SHIFT (with a transition to the appropriate state), the next "b" will produce a REDUCE according to the rule you have given, with a transition to some state.
If, after the first '+', the second '+' comes, then this will be a mistake, because there will be no outgoing arc for the second '+' for the current state, neither SHIFT nor REDUCE (there will be an arc only for IDENTIFIER if there are no other rules).