BBCode parsing methods?

BBCode

divanikus, 2010-09-08 17:23:55

What parsing methods are there besides regular expressions? Regular expressions, as is well known, are not suited to parsing nested structures. If you know of existing implementations, please point me to them.

8 answer(s)

LastDragon, 2010-09-08
@divanikus

Nothing complicated: just build a state machine. By the way, Habr recently ran several articles on the topic (if I'm not mistaken, they were about writing compilers).
For example:
1) An existing parser: xbb.uz/
2) My own homegrown implementation:
Code: http://pastebin.mozilla-russia.org/106940
Diagram: habrastorage.org/storage/b55a4b42/f4942156/b245ccd6/9426eb87.png
(the original is in VP-UML; write to me if you need it)
It most likely still has bugs (I'm debugging it right now).
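
To illustrate the state-machine idea, here is a minimal sketch in Python. It is not the code from the pastebin link above; the two states, the `tokenize` name and the token format are assumptions made for this example. It only splits the input into text and tag tokens, which a later stage would have to pair up.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()  # outside any tag
    TAG = auto()   # between "[" and "]"


def tokenize(source):
    """Split BBCode into ("text", s) and ("tag", s) tokens (hypothetical format)."""
    tokens = []
    state = State.TEXT
    buf = []
    for ch in source:
        if state is State.TEXT:
            if ch == "[":
                if buf:
                    tokens.append(("text", "".join(buf)))
                buf = []
                state = State.TAG
            else:
                buf.append(ch)
        else:  # State.TAG
            if ch == "]":
                tokens.append(("tag", "".join(buf)))
                buf = []
                state = State.TEXT
            elif ch == "[":
                # A second "[" before any "]": the earlier "[" was literal text.
                tokens.append(("text", "[" + "".join(buf)))
                buf = []
            else:
                buf.append(ch)
    # Flush leftovers; an unterminated tag is kept as literal text.
    if state is State.TAG:
        tokens.append(("text", "[" + "".join(buf)))
    elif buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

For example, `tokenize("a[b]c[/b]")` yields a text token, a `b` tag, another text token and a `/b` tag.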

zibada, 2010-09-08
@zibada

> Regulars, as you know, are not intended for parsing nested structures.
Indeed, theory tells us that the grammar of BBCode cannot be handled by a single mighty regular expression.
But that does not mean regular expressions cannot be used for this problem at all
(look at the parsers of popular forums, for example).
In short: in one pass, match the most deeply nested pairs of tags and replace each with something that contains no tags; repeat in a loop until no more matches are found.
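
A rough sketch of that loop in Python. The tag set (`b`, `i`, `u`), the HTML mapping and the `render` name are assumptions for the example, not taken from any particular forum engine.

```python
import html
import re

# A pair whose body contains no brackets at all can only be an innermost pair.
INNERMOST = re.compile(r"\[(b|i|u)\]([^\[\]]*)\[/\1\]")


def render(text):
    # Escape raw HTML first so user input cannot inject markup.
    text = html.escape(text, quote=False)
    while True:
        # Replace every innermost pair found in this pass.
        text, count = INNERMOST.subn(r"<\1>\2</\1>", text)
        if count == 0:
            return text  # nothing left to match: done
```

Each pass removes one level of nesting, so `render("[b]bold [i]both[/i][/b]")` needs two replacing passes and one final pass that finds nothing. Unbalanced tags are simply left in the output as text.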

Arvid Godyuk, 2010-09-08
@Psih

It depends on which language you are using. If it's PHP, there is the PECL bbcode module. I use it: convenient, fast, flexible thanks to callbacks, and no hacks :)

LastDragon, 2010-09-08
@LastDragon

> The only question is the speed of parsing.
Probably not very fast; I haven't benchmarked it. It would also be interesting to compare it with other parser implementations.
> And IMHO it is more convenient to describe state machines with transition tables
Maybe. But a diagram is more illustrative, in my opinion.
> Especially if the tags suddenly overlap
With this parser you can handle that however you like; at the moment a nested unclosed BBCode is closed forcibly ([a][b][/a][/b] => [a][b][/b][/a][/b]).
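
One way to get that forced-closing behaviour is a stack of open tags. The sketch below is an assumption about how it could be done, not the parser discussed here; `normalize` and the lowercase-only tag pattern are made up for the example.

```python
import re

# Simplified tags: lowercase names, no attributes (an assumption).
TAG = re.compile(r"\[(/?)([a-z]+)\]")


def normalize(source):
    out = []
    stack = []  # names of currently open tags, innermost last
    pos = 0
    for m in TAG.finditer(source):
        out.append(source[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            stack.append(name)
            out.append(m.group(0))
        elif name in stack:
            # Force-close everything opened after `name`, then `name` itself.
            while True:
                top = stack.pop()
                out.append("[/" + top + "]")
                if top == name:
                    break
        else:
            # No matching opener: leave the closing tag as literal text.
            out.append(m.group(0))
    out.append(source[pos:])
    # Close whatever is still open at the end of the input.
    while stack:
        out.append("[/" + stack.pop() + "]")
    return "".join(out)
```

With this sketch, `normalize("[a][b][/a][/b]")` gives `[a][b][/b][/a][/b]`: the inner `[b]` is closed before `[/a]`, and the trailing `[/b]` no longer has an opener, so it stays as plain text.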

exception13, 2010-09-08
@exception13

Use a parser generator and a formal description in EBNF =]
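
To give an idea of what such a description looks like, here is a toy grammar with a hand-written recursive-descent parser in Python that follows it rule by rule. This is not a parser generator; a generator would produce comparable code from the grammar. The grammar, the lowercase attribute-free tags and all function names are assumptions for the example.

```python
import re

# Toy grammar (EBNF):
#   document = { node } ;
#   node     = element | text ;
#   element  = "[" name "]" document "[/" name "]" ;
#   text     = any characters that do not start an element or a closing tag ;

OPEN = re.compile(r"\[([a-z]+)\]")
CLOSE = re.compile(r"\[/([a-z]+)\]")


def parse_document(src, pos=0):
    """document = { node }; stops at a closing tag or at the end of input."""
    nodes = []
    while pos < len(src):
        if CLOSE.match(src, pos):
            break  # leave the closing tag for the enclosing element
        m = OPEN.match(src, pos)
        element = parse_element(src, m) if m else None
        if element:
            node, pos = element
            nodes.append(node)
            continue
        # text: consume up to the next "[" (at least one character)
        end = src.find("[", pos + 1)
        end = len(src) if end == -1 else end
        if nodes and isinstance(nodes[-1], str):
            nodes[-1] += src[pos:end]
        else:
            nodes.append(src[pos:end])
        pos = end
    return nodes, pos


def parse_element(src, m):
    """element = "[" name "]" document "[/" name "]"; None if unbalanced."""
    name = m.group(1)
    children, end = parse_document(src, m.end())
    close = CLOSE.match(src, end)
    if close and close.group(1) == name:
        return (name, children), close.end()
    return None  # caller falls back to treating the opening tag as text


def parse(src):
    nodes, pos = parse_document(src)
    if pos != len(src):
        raise ValueError(f"unexpected closing tag at offset {pos}")
    return nodes
```

The result is a tree of `(name, children)` tuples and strings, so nesting is represented directly instead of being rediscovered with repeated searches. The fallback for unbalanced tags re-parses the same text, so this sketch can be slow on pathological input.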

LastDragon, 2010-09-08
@LastDragon

I'll describe one more option *:
- Go through the BBCodes in a loop:
1) find the opening tag "[bbcode"
2) find "]" (everything in between is attributes)
3) if it is a single (unpaired) tag, parse it
4) if not, look for the first closing tag "[/bbcode]"
5) everything between "[bbcode...]" and "[/bbcode]" is the body (it is formatted depending on the BBCode)
6) continue;
The main problem is that it is impossible to tell which tag a given "]" belongs to, so the result depends on the order in which the BBCodes are parsed. IPB works around this by escaping "]" in attributes...
* Don't use it... this is how the BBCode parser in IP.Board is written... it had (and still has) a lot of bugs caused by different nesting orders of BBCodes and their attributes (including XSS and Apache crashes… some details can be found on the IBR forum in Ritsuka's posts)
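
A small Python sketch of steps 1, 2, 4 and 5 shows where this approach goes wrong. It is only an illustration of the steps as listed, not IP.Board's actual code; `naive_body` is a hypothetical helper.

```python
def naive_body(text, tag):
    """Return the body of the first `tag` found using plain substring searches."""
    open_at = text.find("[" + tag)                    # step 1: opening tag
    if open_at == -1:
        return None
    attrs_end = text.find("]", open_at)               # step 2: first "]"
    if attrs_end == -1:
        return None
    close_at = text.find("[/" + tag + "]", attrs_end)  # step 4: first closing tag
    if close_at == -1:
        return None
    return text[attrs_end + 1:close_at]               # step 5: body


# Nested tags of the same kind: the outer opener is paired with the inner closer.
nested = naive_body("[quote]a [quote]b[/quote] c[/quote]", "quote")

# A "]" inside an attribute ends the attribute list too early.
attr = naive_body("[url=http://x/?a[]=1]link[/url]", "url")
```

Here `nested` comes out as `a [quote]b` and `attr` as `=1]link`, which are exactly the kinds of mis-pairing described above.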

vgrayster, 2010-09-09
@vgrayster

I use the xbb.uz parser myself. I didn't much like the code it generates, but after some hand-tuning everything is fine.
What speed do you need? I parse only once, on save, then cache the result and serve it from the cache when needed.
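
A minimal sketch of that parse-on-save scheme in Python, assuming an in-memory dictionary in place of a real database or cache. `PostStore` and its methods are hypothetical names, and `render` stands for whatever BBCode-to-HTML function is in use.

```python
class PostStore:
    """Keeps the BBCode source and the HTML rendered once at save time."""

    def __init__(self, render):
        self._render = render   # any callable: BBCode string -> HTML string
        self._posts = {}
        self.render_calls = 0   # counter only to make the behaviour visible

    def save(self, post_id, bbcode):
        # The (possibly slow) parser runs here, once per edit.
        self.render_calls += 1
        self._posts[post_id] = {"source": bbcode, "html": self._render(bbcode)}

    def html(self, post_id):
        # Reads never touch the parser.
        return self._posts[post_id]["html"]
```

Keeping the original source next to the cached HTML means posts can be re-rendered in bulk if the parser or the markup it emits ever changes.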

LastDragon, 2010-09-09
@LastDragon

> I.e. IPB3 renders HTML from BBCode on every request?
Yes. (There are pluses, but probably more minuses.)