How do I write an HTML parser in PHP from scratch, without regex?

RZ, 2017-02-21 16:57:34

PHP

First of all, please don't send me links to PHP extensions, and especially not to the slow Simple HTML DOM library and the like!
I'm not trying to reinvent the wheel; I just want to gain skills and experience by implementing something relatively simple.
Many people argue that writing complex parsers with regexps is a perversion, and I completely agree. So I want to understand how browsers, for example, analyze HTML code and what algorithms they use, since they surely don't do it with regexps.
In what order should an HTML page be analyzed in PHP? Say we have fetched a page and stripped it of garbage such as extra whitespace and line breaks. What comes next? Pages can be huge, and you don't want to keep them in memory. Let's assume the fetched page is valid and we have written it to a file. Since the content already has a hierarchy (HTML tags), what algorithm should be used to find a particular tag and all of its contents? Or is it supposed to work some other way? If so, how? Which approaches and algorithms apply, and where should I dig?
I understand that PHP is not great at working with binary files, but I think it should cope with a task like this.
I would be grateful for any advice.

3 answer(s)

Rsa97, 2017-02-21
@Rsa97

First you need to implement a lexer: a module that takes HTML text as input and returns a list of tokens (lexemes) with their parameters. For example,

<div id="test">
Привет
</div>

can be turned into

OPEN_TAG_START, DIV, ID, EQUALS, STRING(test), TAG_END, TEXT(Привет), 
CLOSE_TAG_START, DIV, TAG_END
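
A minimal sketch of such a lexer in PHP, without regexps. The function name lexHtml and the NAME token type (used for both tag and attribute names) are made up for illustration. It assumes double-quoted attribute values and does not handle comments, doctype, entities, self-closing tags or script content:

```php
<?php
// Hypothetical helper: turns an HTML string into a flat list of [type, value] tokens.
// Assumption: attribute values are double-quoted; comments, doctype, entities,
// self-closing tags and <script> content are deliberately not handled.
function lexHtml(string $html): array
{
    $tokens = [];
    $len = strlen($html);
    $i = 0;
    $inTag = false;

    while ($i < $len) {
        $ch = $html[$i];

        if (!$inTag) {
            if ($ch === '<') {
                if ($i + 1 < $len && $html[$i + 1] === '/') {
                    $tokens[] = ['CLOSE_TAG_START', '</'];
                    $i += 2;
                } else {
                    $tokens[] = ['OPEN_TAG_START', '<'];
                    $i++;
                }
                $inTag = true;
            } else {
                // Text runs until the next '<' or the end of input.
                $n = strcspn($html, '<', $i);
                $text = trim(substr($html, $i, $n));
                if ($text !== '') {
                    $tokens[] = ['TEXT', $text];
                }
                $i += $n;
            }
            continue;
        }

        // Inside a tag.
        if ($ch === '>') {
            $tokens[] = ['TAG_END', '>'];
            $inTag = false;
            $i++;
        } elseif ($ch === '=') {
            $tokens[] = ['EQUALS', '='];
            $i++;
        } elseif ($ch === '"') {
            $n = strcspn($html, '"', $i + 1);
            $tokens[] = ['STRING', substr($html, $i + 1, $n)];
            $i += $n + 2; // skip the value and both quotes
        } elseif (strpos(" \t\r\n", $ch) !== false) {
            $i++; // whitespace between attributes
        } else {
            // Tag or attribute name: read up to whitespace, '=', '>' or '"'.
            $n = strcspn($html, " \t\r\n=>\"", $i);
            $tokens[] = ['NAME', strtoupper(substr($html, $i, $n))];
            $i += $n;
        }
    }
    return $tokens;
}

// Usage on the markup from the example above.
$tokens = lexHtml("<div id=\"test\">\nПривет\n</div>");
foreach ($tokens as [$type, $value]) {
    echo $type, ' ', $value, PHP_EOL;
}
```

The lexer reads the input byte by byte, which is fine for UTF-8 text because all the delimiters it looks for are ASCII. For huge pages the same loop could be fed from a file in chunks, but that is left out here.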

Then the second module, the parser, builds a syntax tree from the tokens it receives. This is the much harder part, especially because an HTML parser has to cope somehow with invalid markup such as <b><i>Тест</b></i>.
The result should be a DOM tree built from the original HTML.
To start digging into compilers, read the Red Dragon Book ("Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman).
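
A rough sketch of the tree-building step in PHP. The Node class, the buildTree function and the [type, value] token format (with a NAME token for tag and attribute names) are assumptions made for this example. The recovery rule for crossed tags is a deliberately simple one: a close tag closes the nearest open element with that name together with everything opened inside it, and a close tag with no matching open element is dropped. Browsers follow the much more involved tree-construction rules of the HTML Standard, which this does not reproduce. Void elements such as br are not handled either.

```php
<?php
// Hypothetical node type for this sketch.
final class Node
{
    public $tag;
    public $attrs = [];
    public $children = []; // child Node objects and plain text strings

    public function __construct(string $tag)
    {
        $this->tag = $tag;
    }
}

// Builds a tree from [type, value] tokens (assumed token format).
function buildTree(array $tokens): Node
{
    $root = new Node('#ROOT');
    $stack = [$root]; // currently open elements, innermost last
    $count = count($tokens);

    for ($i = 0; $i < $count; $i++) {
        [$type, $value] = $tokens[$i];
        $current = $stack[count($stack) - 1];

        if ($type === 'TEXT') {
            $current->children[] = $value;
        } elseif ($type === 'OPEN_TAG_START') {
            // The next token is the tag name, then attributes until TAG_END.
            $node = new Node($tokens[++$i][1]);
            $attr = null;
            while (++$i < $count && $tokens[$i][0] !== 'TAG_END') {
                if ($tokens[$i][0] === 'NAME') {
                    $attr = $tokens[$i][1];
                    $node->attrs[$attr] = true; // attribute without a value so far
                } elseif ($tokens[$i][0] === 'STRING' && $attr !== null) {
                    $node->attrs[$attr] = $tokens[$i][1];
                }
            }
            $current->children[] = $node;
            $stack[] = $node;
        } elseif ($type === 'CLOSE_TAG_START') {
            $name = $tokens[++$i][1];
            // Find the nearest open element with that name (never the root).
            for ($j = count($stack) - 1; $j > 0; $j--) {
                if ($stack[$j]->tag === $name) {
                    array_splice($stack, $j); // closes it and everything inside it
                    break;
                }
            }
            // Skip ahead to the TAG_END of this close tag.
            while ($i < $count && $tokens[$i][0] !== 'TAG_END') {
                $i++;
            }
        }
    }
    return $root;
}

// Usage: tokens for the invalid markup <b><i>Тест</b></i>.
$tree = buildTree([
    ['OPEN_TAG_START', '<'], ['NAME', 'B'], ['TAG_END', '>'],
    ['OPEN_TAG_START', '<'], ['NAME', 'I'], ['TAG_END', '>'],
    ['TEXT', 'Тест'],
    ['CLOSE_TAG_START', '</'], ['NAME', 'B'], ['TAG_END', '>'],
    ['CLOSE_TAG_START', '</'], ['NAME', 'I'], ['TAG_END', '>'],
]);
echo $tree->children[0]->tag, ' > ', $tree->children[0]->children[0]->tag, PHP_EOL;
```

With this rule the crossed example ends up as B containing I containing the text, and the stray closing i tag is ignored.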

Maxim Timofeev, 2017-02-21
@webinar

how browsers, for example, parse HTML code and what algorithms they use, since they surely don't do it with regexps.

Of course not. And of course, they don't do it in PHP either. But that is beside the point; all you need is the ability to read, plus this link:
https://habrahabr.ru/post/174057/
PS: I'm afraid that once you understand the subject in depth, you will end up writing the same SimpleHtmlDom: bulky and slow. Look at modern browsers and you will see that they eat far more RAM than the wonderful SimpleHtmlDom.

xmoonlight, 2017-02-21
@xmoonlight

Let's assume the fetched page is valid and we have written it to a file. Since the content already has a hierarchy (HTML tags), what algorithm should be used to find a particular tag and all of its contents?

The only correct answer: if the document's structure is a tree, then finding the desired node means traversing that tree.
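
A minimal sketch of such a traversal in PHP, assuming the tree is already in memory as nested arrays. The node shape ['tag' => ..., 'children' => [...]] and the function name findFirst are made up for illustration:

```php
<?php
// Depth-first search over a tree of nested arrays.
// Assumed node shape: ['tag' => string, 'children' => array of nodes or text strings].
function findFirst(array $node, string $tag): ?array
{
    if ($node['tag'] === $tag) {
        return $node; // the node comes back with all of its contents
    }
    foreach ($node['children'] as $child) {
        // Text children are plain strings and cannot match a tag name.
        if (is_array($child)) {
            $found = findFirst($child, $tag);
            if ($found !== null) {
                return $found;
            }
        }
    }
    return null;
}

// Usage on a small hand-built tree.
$tree = [
    'tag' => 'html', 'children' => [
        ['tag' => 'body', 'children' => [
            ['tag' => 'div', 'children' => ['Привет']],
            ['tag' => 'p', 'children' => ['text']],
        ]],
    ],
];

$p = findFirst($tree, 'p');
var_dump($p !== null ? $p['children'] : null);
```

The recursion visits a node, then its children from left to right, so the first match in document order is returned.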
Next, we use the W3C documentation to learn the various ways a tag (a "node") can be opened and closed. These will be our virtual "brackets".
Then we check validity, i.e. that no tags cross: tags inside a node are always closed inside that node, at the same level at which they were opened.
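
The "brackets" check can be sketched with a stack, the same way balanced parentheses are checked. The event format (pairs of 'open' or 'close' and a tag name) and the function name isProperlyNested are assumptions for this example, and void elements such as br are ignored:

```php
<?php
// Checks that tags nest like brackets: every close tag must match the most
// recently opened tag that is still open. The event format is hypothetical.
function isProperlyNested(array $events): bool
{
    $open = []; // stack of currently open tag names
    foreach ($events as [$kind, $tag]) {
        if ($kind === 'open') {
            $open[] = $tag;
        } elseif ($kind === 'close') {
            // array_pop() returns null on an empty stack, which never equals a tag name.
            if (array_pop($open) !== $tag) {
                return false; // crossing tags, or a close without an open
            }
        }
    }
    return $open === []; // anything still open is also invalid
}

// Usage: <b><i></i></b> versus the crossed <b><i></b></i>.
var_dump(isProperlyNested([['open', 'b'], ['open', 'i'], ['close', 'i'], ['close', 'b']]));
var_dump(isProperlyNested([['open', 'b'], ['open', 'i'], ['close', 'b'], ['close', 'i']]));
```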
Then, converting the "bracket" expression into reverse Polish notation gives us the path to the desired node.
As a result, we get an analogue of XPath.
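
One way to get such a path, sketched in PHP. Note that this does not perform a reverse Polish notation conversion; it swaps in a simpler technique, keeping a stack of the currently open tags while scanning the "brackets", which yields the same kind of XPath-like path. The event format and the function name pathTo are hypothetical, and the input is assumed to be properly nested:

```php
<?php
// Returns an XPath-like path to the first occurrence of $target, or null.
// Events are hypothetical ['open'|'close', tagName] pairs, assumed properly nested.
function pathTo(array $events, string $target): ?string
{
    $path = [];       // path segments of the currently open elements
    $counters = [[]]; // per open level: how many children of each tag were seen

    foreach ($events as [$kind, $tag]) {
        if ($kind === 'open') {
            $level = count($counters) - 1;
            $n = ($counters[$level][$tag] ?? 0) + 1; // 1-based position among same-named siblings
            $counters[$level][$tag] = $n;
            $path[] = $tag . '[' . $n . ']';
            $counters[] = []; // fresh counters for this element's children
            if ($tag === $target) {
                return '/' . implode('/', $path);
            }
        } elseif ($kind === 'close') {
            array_pop($path);
            array_pop($counters);
        }
    }
    return null;
}

// Usage: <html><body><div></div><div><p></p></div></body></html>
$events = [
    ['open', 'html'], ['open', 'body'],
    ['open', 'div'], ['close', 'div'],
    ['open', 'div'], ['open', 'p'], ['close', 'p'], ['close', 'div'],
    ['close', 'body'], ['close', 'html'],
];
echo pathTo($events, 'p'), PHP_EOL; // prints /html[1]/body[1]/div[2]/p[1]
```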
I will add that tag properties (attributes) play no direct part in building the "tree" or in converting it into this kind of XPath.
They only take part later, when selecting the desired node.
That is, they concern only the selection.