How to parse HTML tags, as well as brackets and the like?
Good afternoon!
How can I parse HTML tags so that every opening tag is matched with its corresponding closing tag?
This was the assignment:
Given an HTML file, find every fragment of the file that corresponds to the given tag.
The content of a tag can itself contain nested tags.
For example, if the desired tag is span and the input is
<span xml:lang="en" lang="en">Текст текст <b><span>Имя Фамилия</span></b></span>
then the output should be
<span xml:lang="en" lang="en">Текст текст <b><span>Имя Фамилия</span></b></span>
<span>Имя Фамилия</span>
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class Solution {
    public static void main(String[] args) throws Exception {
        String tagName = args[0]; // the tag to search for

        // The name of the HTML file is read from standard input.
        String filename;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(System.in))) {
            filename = reader.readLine();
        }

        // Read every line of the file and join them into a single string.
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String data;
            while ((data = br.readLine()) != null) {
                sb.append(data);
            }
        }
        String fileData = sb.toString(); // the string that is searched

        List<int[]> spans = new ArrayList<>();  // {start, end} of each element, in opening order
        Deque<int[]> open = new ArrayDeque<>(); // elements that are opened but not yet closed

        int n = fileData.indexOf(tagName); // indexOf returns -1 when there are no more matches
        while (n != -1) {
            int after = n + tagName.length();
            char next = after < fileData.length() ? fileData.charAt(after) : '\0';
            // Checking the character after the name keeps "<span" from matching "<spanner".
            // Self-closing tags such as <span/> are not handled.
            if (n >= 1 && fileData.charAt(n - 1) == '<'
                    && (next == '>' || Character.isWhitespace(next))) { // opening tag
                int[] span = {n - 1, -1};
                spans.add(span);
                open.push(span);
            } else if (n >= 2 && fileData.startsWith("</", n - 2)
                    && next == '>' && !open.isEmpty()) { // closing tag
                // A closing tag always closes the most recently opened element.
                open.pop()[1] = after + 1;
            }
            n = fileData.indexOf(tagName, after);
        }

        // Print the elements in the order their opening tags appear; skip unclosed ones.
        for (int[] span : spans) {
            if (span[1] != -1) {
                System.out.println(fileData.substring(span[0], span[1]));
            }
        }
    }
}
P.S. I think the same techniques also apply to parsing various kinds of brackets, other tag-based markup (XML), and the like, so this would be useful to know.
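To illustrate the point about brackets: the usual way to check that every opening bracket has a matching closing one is a stack. The sketch below is only an illustration; the class name BracketChecker and the method isBalanced are hypothetical names chosen for this example, and it assumes only the three bracket kinds (), [] and {} matter.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration; not part of the original assignment.
class BracketChecker {

    /** Returns true if every bracket is closed by a bracket of the same kind, in the right order. */
    static boolean isBalanced(String text) {
        Deque<Character> open = new ArrayDeque<>(); // brackets opened but not yet closed
        for (char c : text.toCharArray()) {
            switch (c) {
                case '(': case '[': case '{':
                    open.push(c);
                    break;
                case ')': case ']': case '}':
                    if (open.isEmpty()) {
                        return false; // closing bracket with nothing open
                    }
                    char expected = c == ')' ? '(' : c == ']' ? '[' : '{';
                    if (open.pop() != expected) {
                        return false; // wrong kind of bracket closed
                    }
                    break;
                default:
                    break; // any other character is ignored
            }
        }
        return open.isEmpty(); // leftovers mean something was never closed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("f(a[0], {b})")); // true
        System.out.println(isBalanced("(]"));           // false
    }
}
```

The same idea carries over to tags: push on an opening tag, pop on a closing tag, and the popped entry is the partner of the tag that was just closed.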
https://ru.wikipedia.org/wiki/iteye.ru/255/konechnyj-avtomat-dlya-parsinga-javascript
taligarsiel.com/Projects/howbrowserswork1.htm#Pars
...
Try using a tool that was specially developed for this task: wiki.python.su/%D0%94%D0%BE%D0%BA%D1%83%D0%BC%D0%B...
It is Python rather than Java, but it will be much easier.
XML/HTML parsers have been written many times over by anyone who could be bothered; Go even has one in its standard library.
The best known one for Python: lxml.de
I do not recommend Beautiful Soup; it is old and forgotten.
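Since the question uses Java, the "use an existing parser" advice can also be sketched with the DOM parser that ships with the JDK instead of a Python library. This is a swapped-in technique, not what the answers above link to, and it assumes the input is well-formed XML (as XHTML is); real-world HTML often is not, and would need a dedicated HTML parser. The class name DomTagFinder and the method findElements are hypothetical names for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes well-formed XML input.
class DomTagFinder {

    /** Returns the serialized form of every element named tagName, in document order. */
    static List<String> findElements(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // An identity transformer writes a DOM node back out as text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // getElementsByTagName also returns nested matches.
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
                result.add(out.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or serialize the input", e);
        }
    }

    public static void main(String[] args) {
        String input = "<span lang=\"en\">Text <b><span>Name Surname</span></b></span>";
        for (String element : findElements(input, "span")) {
            System.out.println(element);
        }
    }
}
```

The parser does the open/close matching itself, so the caller never has to track nesting. The serialized text may differ from the source in details such as attribute order or quoting.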
If you want to write it yourself, look up packrat parsers and the ABNF metagrammar.
A nice and fast one is here: https://github.com/Engelberg/instaparse