D
D
Dmitry2018-07-02 21:35:12
Python
Dmitry, 2018-07-02 21:35:12

How to form HTML(dom) data in matrix form for machine learning?

The question sounds like this, there is HTML in the form of text, you need to turn it into tensor, but I don’t know how to do it. I would be grateful for the translation scheme, tips and links.
The task of extracting typical blocks of sites, news announcements, products from the list, etc.

Answer the question

In order to leave comments, you need to log in

2 answer(s)
V
Vyacheslav Shindin, 2018-07-12
@informix

Well, the DOM is a tree structure (roughly speaking, a special case of a graph). The question is how to represent this graph in a convenient way?
In the simplest case: take an array of N by N elements, where N is the number of DOM elements, and represent this graph as an adjacency matrix . Well, the elements themselves and their attributes can be represented as a vector of N elements, where each element will be categorical (tag name). If you need to take into account the attributes of these tags, then in a vector of N elements, make other vectors as elements, which will consist of a categorical tag name, some class names, basic attributes, etc. It depends on which attributes are considered significant and considered as ML features, and which ones can be ignored.
Well, the formation process itself is bypassing the DOM tree and filling in the corresponding data structures.
And also, about 10 years ago, I made the following decision:
- filtered html, left the structure and attributes that were meaningful to me, discarded the rest
- replaced the content in the tags and attribute values ​​with some default values ​​(only the structure was important, not the content itself)
- when traversing such a filtered DOM, it took a hash from html in the string representation of each bypassed node with all children, and accumulated the frequency of occurrence of such hashes from different sites
As a result, in the database I got a list of hashes, the frequency of occurrence and the html template itself from which this hash was generated. But in addition to templates, I also collected additional information, number of words, language, some attribute values, etc. But this is for finer filter settings.

P
Petr Vasiliev, 2018-07-03
@danial72

Very vague description.
In any case, you need to write a parser.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question