How to parse HTML tags, as well as brackets and the like?

Arthur, 2016-03-10 13:17:18

Java

Good afternoon!
How can HTML tags be parsed so that every opening tag is matched with its corresponding closing tag?
This was an assignment.
Given an HTML file, find every fragment in it that corresponds to a given tag, from the opening tag through the matching closing tag.
The content of a tag may include nested tags.
For example, if the desired tag is span, then for the input

<span xml:lang="en" lang="en">Текст текст <b><span>Имя Фамилия</span></b></span>

the output should be

<span xml:lang="en" lang="en">Текст текст <b><span>Имя Фамилия</span></b></span>
<span>Имя Фамилия</span>

I know that ready-made parsers and libraries exist, but I would like to understand for myself how this works: which algorithms are used and how they operate.
I have my own solution. It works, but I feel it could be done better and more precisely.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

public class Solution {
    public static void main(String[] args) throws Exception {
        String tagName = args[0]; // the tag to look for

        // the file name is read from standard input
        String filename;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(System.in))) {
            filename = reader.readLine();
        }

        // read every line of the file and join them into one string
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String data;
            while ((data = br.readLine()) != null) {
                sb.append(data);
            }
        }
        String fileData = sb.toString(); // the string that is searched

        String open = "<" + tagName;
        String close = "</" + tagName + ">";
        Deque<Integer> openStarts = new ArrayDeque<>();  // start indexes of tags not closed yet
        Map<Integer, Integer> found = new TreeMap<>();    // start -> end, sorted by start index

        int n = fileData.indexOf('<'); // indexOf returns -1 when nothing is left
        while (n != -1) {
            if (fileData.startsWith(close, n)) { // closing tag: pair it with the latest open one
                if (!openStarts.isEmpty()) {
                    found.put(openStarts.pop(), n + close.length());
                }
            } else if (fileData.startsWith(open, n)) { // candidate opening tag
                int after = n + open.length();
                // the name must end here, otherwise "<spanner" would count as "<span"
                if (after < fileData.length()) {
                    char c = fileData.charAt(after);
                    if (c == '>' || Character.isWhitespace(c)) {
                        openStarts.push(n);
                    }
                }
            }
            n = fileData.indexOf('<', n + 1);
        }

        // print in order of the opening tags, so an outer tag precedes the tags nested in it
        for (Map.Entry<Integer, Integer> e : found.entrySet()) {
            System.out.println(fileData.substring(e.getKey(), e.getValue()));
        }
    }
}

I would appreciate your help with this.

P.S. I think the same techniques also apply to parsing brackets of various kinds, other markup (XML) and the like, so this would be useful knowledge to have.
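
The bracket case mentioned in the P.S. is the simplest form of the same problem, and a stack is the usual tool for it. Below is a minimal sketch, not a definitive implementation; the class name BracketCheck and the method isBalanced are made up for illustration. Each opening bracket is pushed, and each closing bracket must match the most recently pushed one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BracketCheck {
    // Returns true when every bracket is closed by a bracket of the same kind,
    // in the correct (last opened, first closed) order.
    public static boolean isBalanced(String text) {
        Deque<Character> open = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '(': case '[': case '{':
                    open.push(c); // remember what still has to be closed
                    break;
                case ')': case ']': case '}':
                    if (open.isEmpty()) {
                        return false; // closing bracket with nothing open
                    }
                    char o = open.pop();
                    boolean matches = (o == '(' && c == ')')
                            || (o == '[' && c == ']')
                            || (o == '{' && c == '}');
                    if (!matches) {
                        return false; // wrong kind of bracket closed
                    }
                    break;
                default:
                    break; // any other character is ordinary content
            }
        }
        return open.isEmpty(); // anything left on the stack was never closed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("a[b(c)]{d}"));
        System.out.println(isBalanced("a[b(c])"));
    }
}
```

Tags work the same way, except that the items pushed on the stack are tag names (or their start positions) instead of single characters.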

3 answer(s)

Alexey Ukolov, 2016-03-10
@alexey-m-ukolov

https://ru.wikipedia.org/wiki/iteye.ru/255/konechnyj-avtomat-dlya-parsinga-javascript
taligarsiel.com/Projects/howbrowserswork1.htm#Pars
...
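
To illustrate the finite-state-machine approach that links like these describe, here is a small sketch of a tokenizer that splits markup into tags and text runs. It is an assumption-laden illustration rather than what either page implements: the class TagTokenizer, the three states and the method tokenize are all invented here, and comments, CDATA and script content are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTokenizer {
    // TEXT: between tags; TAG: inside <...>; QUOTED: inside an attribute value
    enum State { TEXT, TAG, QUOTED }

    // Splits the input into tokens: every tag "<...>" and every run of text between tags.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.TEXT;
        char quote = 0; // which quote character opened the current attribute value

        for (char c : input.toCharArray()) {
            switch (state) {
                case TEXT:
                    if (c == '<') { // a tag starts: flush the text collected so far
                        if (current.length() > 0) {
                            tokens.add(current.toString());
                            current.setLength(0);
                        }
                        state = State.TAG;
                    }
                    current.append(c);
                    break;
                case TAG:
                    current.append(c);
                    if (c == '"' || c == '\'') { // a '>' inside quotes must not end the tag
                        quote = c;
                        state = State.QUOTED;
                    } else if (c == '>') { // the tag is complete
                        tokens.add(current.toString());
                        current.setLength(0);
                        state = State.TEXT;
                    }
                    break;
                case QUOTED:
                    current.append(c);
                    if (c == quote) {
                        state = State.TAG;
                    }
                    break;
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // trailing text or an unfinished tag
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("<a title=\"x>y\">hi <b>there</b></a>")) {
            System.out.println(token);
        }
    }
}
```

The token list can then be fed to a stack-based matcher, which keeps the character-level details (quotes, angle brackets) separate from the nesting logic.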

Vov Vov, 2016-03-11
@balamut108

Try using a tool that was specially developed for this task: wiki.python.su/%D0%94%D0%BE%D0%BA%D1%83%D0%BC%D0%B...
It is not Java, but in Python this will be much easier.

Eugene 222, 2016-03-14
@mik222

XML/HTML has long been parsed with every tool imaginable; in Go a parser is even in the stdlib.
The best-known option for Python: lxml.de
I do not recommend Beautiful Soup; it is old and forgotten.
If you want to write a parser yourself, search for packrat parsers and the ABNF metagrammar.
A nice and fast one: https://github.com/Engelberg/instaparse
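
For a feel of the grammar-driven approach: a packrat parser is, roughly, a recursive-descent parser with memoization. The sketch below shows only the recursive-descent core, without memoization, so it is not a packrat parser proper. It assumes a deliberately tiny grammar (content := (text | element)*, element := '<' name ... '>' content '</' name '>') and well-formed input with no void tags such as br, no comments and no '>' inside attribute values. The class TinyMarkupParser and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyMarkupParser {
    private final String src;
    private final String wanted;
    private final List<String> matches = new ArrayList<>();
    private int pos = 0;

    private TinyMarkupParser(String src, String wanted) {
        this.src = src;
        this.wanted = wanted;
    }

    // Returns the source text of every element named `wanted`, outer elements first.
    public static List<String> find(String src, String wanted) {
        TinyMarkupParser p = new TinyMarkupParser(src, wanted);
        p.content();
        if (p.pos != src.length()) {
            throw new IllegalArgumentException("unexpected closing tag at " + p.pos);
        }
        return p.matches;
    }

    // content := (text | element)*   stops at a closing tag or at the end of input
    private void content() {
        while (pos < src.length() && !src.startsWith("</", pos)) {
            if (src.charAt(pos) == '<') {
                element();
            } else {
                pos++; // plain text character
            }
        }
    }

    // element := '<' name ... '>' content '</' name '>'
    private void element() {
        int start = pos;
        pos++; // skip '<'
        String name = name();
        int gt = src.indexOf('>', pos); // attributes are skipped, not parsed
        if (gt < 0) {
            throw new IllegalArgumentException("unterminated tag at " + start);
        }
        pos = gt + 1;
        int slot = matches.size();
        if (name.equals(wanted)) {
            matches.add(null); // reserve a slot so the outer element precedes nested ones
        }
        content();
        String close = "</" + name + ">";
        if (!src.startsWith(close, pos)) {
            throw new IllegalArgumentException("expected " + close + " at " + pos);
        }
        pos += close.length();
        if (name.equals(wanted)) {
            matches.set(slot, src.substring(start, pos));
        }
    }

    // name := one or more letters, digits, ':' or '-'
    private String name() {
        int s = pos;
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (!Character.isLetterOrDigit(c) && c != ':' && c != '-') {
                break;
            }
            pos++;
        }
        return src.substring(s, pos);
    }

    public static void main(String[] args) {
        String html = "<span xml:lang=\"en\" lang=\"en\">Текст текст <b><span>Имя Фамилия</span></b></span>";
        for (String m : find(html, "span")) {
            System.out.println(m);
        }
    }
}
```

Here the call stack of the recursive methods plays the role of the explicit stack: each call to element() returns only after its own closing tag has been consumed, so mismatched nesting is reported as an error instead of being silently mispaired.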