An HTML parser in PHP without regex, from scratch?
First of all, please don't throw links to PHP extensions at me, and even less so to the sluggish Simple HTML DOM library and the like!
I'm not trying to reinvent the wheel; I only want to gain skills and experience by implementing something relatively simple.
Many argue that writing complex parsers with regexps is a perversion. I completely agree with those people. So I want to understand how, for example, browsers parse HTML code and what algorithms they use, since they surely don't do it with regexps.
What sequence of steps should the analysis of an HTML page in PHP follow? For example, we received a page and cleared it of garbage such as extra spaces and line breaks. And then what? Pages can be huge, and you don't want to keep them in memory. Let's imagine that the received page is valid and we have written it to a file. Since the content itself already has a hierarchy (HTML tags), what algorithm should be used to search for a particular tag and all its contents? Or is it supposed to work in some other way? If so, how? What approaches and algorithms should I apply, where should I dig?
I understand that PHP does not work well with binary files, but I think it should cope with such a task.
I would be grateful for any advice.
First you need to implement a lexer: a module that accepts HTML text as input and returns a list of lexemes (tokens) with their parameters. For example, the input
<div id="test">
Hello
</div>
is turned into
OPEN_TAG_START, DIV, ID, EQUALS, STRING(test), TAG_END, TEXT(Hello),
CLOSE_TAG_START, DIV, TAG_END
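The lexer step can be sketched roughly like this. It is only a minimal sketch under a few assumptions: the function name `tokenize` and the `[type, value]` token shape are made up for illustration, tag and attribute names are both emitted as a generic `NAME` token instead of separate `DIV` / `ID` lexemes, and comments, doctype, `<script>` contents and entities are not handled at all.

```php
<?php
// Hypothetical helper: walks the HTML one character at a time, without regex,
// and returns a list of [type, value] tokens.
function tokenize(string $html): array
{
    $tokens = [];
    $length = strlen($html);
    $i = 0;
    $inTag = false; // true while we are between "<" and ">"

    while ($i < $length) {
        $ch = $html[$i];

        if (!$inTag) {
            if ($ch === '<') {
                // "</" starts a closing tag, a lone "<" an opening one.
                if ($i + 1 < $length && $html[$i + 1] === '/') {
                    $tokens[] = ['CLOSE_TAG_START', '</'];
                    $i += 2;
                } else {
                    $tokens[] = ['OPEN_TAG_START', '<'];
                    $i++;
                }
                $inTag = true;
            } else {
                // Everything up to the next "<" is text; whitespace-only text is dropped.
                $next = strpos($html, '<', $i);
                $end = ($next === false) ? $length : $next;
                $text = trim(substr($html, $i, $end - $i));
                if ($text !== '') {
                    $tokens[] = ['TEXT', $text];
                }
                $i = $end;
            }
            continue;
        }

        // Inside a tag.
        if ($ch === '>') {
            $tokens[] = ['TAG_END', '>'];
            $inTag = false;
            $i++;
        } elseif ($ch === '=') {
            $tokens[] = ['EQUALS', '='];
            $i++;
        } elseif ($ch === '"' || $ch === "'") {
            // Quoted attribute value: read up to the matching quote.
            $close = strpos($html, $ch, $i + 1);
            $end = ($close === false) ? $length : $close;
            $tokens[] = ['STRING', substr($html, $i + 1, $end - $i - 1)];
            $i = $end + 1;
        } elseif (ctype_space($ch)) {
            $i++; // whitespace between attributes carries no meaning
        } else {
            // Tag or attribute name: read until a delimiter character.
            $start = $i;
            while ($i < $length && strpos(" \t\r\n=>\"'", $html[$i]) === false) {
                $i++;
            }
            $tokens[] = ['NAME', strtolower(substr($html, $start, $i - $start))];
        }
    }

    return $tokens;
}

foreach (tokenize("<div id=\"test\">\nHello\n</div>") as [$type, $value]) {
    echo $type, '(', $value, ")\n";
}
```

Since the loop only ever looks at the current character and one character ahead, the same idea should carry over to input read from a file in chunks rather than held in memory as one string, although the sketch above does not do that.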
<b><i>Test</b></i>
"... how, for example, browsers parse HTML code and what algorithms they use, since they surely don't do it with regexps."
Of course not. And of course, they don't do it in PHP either. But that is beside the point: all you need is the ability to read and this link:
"Let's imagine that the received page is valid and we have written it to a file. Since the content itself already has a hierarchy (HTML tags), what algorithm should be used to search for a particular tag and all its contents?"
The only correct answer: if the structure of the document is a tree, then finding the desired node means traversing that tree.
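To make the traversal idea concrete, here is a small depth-first search sketch. The node shape (an array with `tag` and `children` keys, where text children are plain strings) and the function name `findByTag` are assumptions made up for this example; a real parser would build such a tree from the lexer output.

```php
<?php
// Hypothetical node shape: ['tag' => string, 'children' => list of nodes or text strings].
// Returns every node with the given tag name, in document order (depth-first).
function findByTag(array $node, string $tag): array
{
    $found = [];
    if ($node['tag'] === $tag) {
        $found[] = $node; // the node carries its whole subtree, i.e. "all its contents"
    }
    foreach ($node['children'] as $child) {
        // Text children are plain strings and have no subtree to visit.
        if (is_array($child)) {
            $found = array_merge($found, findByTag($child, $tag));
        }
    }
    return $found;
}

// Hand-built tree standing in for a parsed page.
$tree = [
    'tag' => 'html',
    'children' => [
        ['tag' => 'body', 'children' => [
            ['tag' => 'div', 'children' => ['Hello']],
            ['tag' => 'p', 'children' => [
                ['tag' => 'div', 'children' => []],
            ]],
        ]],
    ],
];

$divs = findByTag($tree, 'div');
echo count($divs), "\n"; // prints 2
```

Recursion keeps the sketch short; for very deep documents an explicit stack would avoid hitting the recursion limit.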