Why can't I parse contact details from sites properly?


Marketing

Anatoly Filippov, 2020-04-28 15:40:33

Hello! Here is the problem. I have a list of about 3,000 sites, and I need to collect an email address and a phone number from each one. I am using a utility called Top Lead Extractor, but I can't get it to work properly: it either returns a huge number of phone numbers and emails that have nothing to do with the site's contacts, or finds nothing at all. Yet the details are right there in a prominent place on the sites, most often in the footer or on /contacts. In despair I started collecting the data by hand, but by the second hundred my nerves were giving out. Has anyone run into a similar problem? How did you solve it? For example, take rbc.ru: the phone number is at https://www.rbc.ru/contacts/
The utility does not find it. If I tell it to follow every link, I get complete garbage: it collects everything except the phone number.


1 answer


Ivan Yakushenko, 2020-04-28
@filippovanatoliy

There is no universal solution. The usual approach goes like this: take N sites and write one or more regular expressions based on them. A script then visits all 3,000 sites, searches the main page and /contacts with those expressions, and puts the sites where it finds nothing into a separate list. Take N more sites from that "unsuccessful" list, write regular expressions for them, and repeat until you have all the contact information. Naturally, this is only feasible if you are reasonably proficient in some programming language; the regular expressions themselves are not complicated.
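
One pass of that loop could be sketched in Python roughly as follows. This is only a minimal illustration, not a finished tool: the two patterns are assumptions (a generic email pattern and a Russian-style phone format such as +7 (XXX) XXX-XX-XX), and the names `scan_sites`, `extract_contacts` and `fetch` are made up for the example. The patterns are exactly the part you would keep refining on each pass over the unsuccessful list.

```python
import re
import urllib.request

# Illustrative patterns only (assumptions, not universal rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Russian-style numbers such as "+7 (495) 123-45-67" or "8 495 123 45 67".
# The lookarounds avoid matching inside longer runs of digits (ids, prices).
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+7|8)[\s(-]*\d{3}[\s)-]*\d{3}[\s-]*\d{2}[\s-]*\d{2}(?!\d)"
)

PATHS = ("/", "/contacts")  # main page and the contacts page


def fetch(url, timeout=10):
    """Download a page as text; return an empty string if anything fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        # For bulk scraping, a failed request simply means "no data here".
        return ""


def extract_contacts(html):
    """Return (emails, phones) found in the page text, de-duplicated."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    phones = sorted(set(PHONE_RE.findall(html)))
    return emails, phones


def scan_sites(sites, fetch_page=fetch):
    """Split sites into those with contacts found and those to revisit."""
    found, failed = {}, []
    for site in sites:
        emails, phones = set(), set()
        for path in PATHS:
            page_emails, page_phones = extract_contacts(
                fetch_page(site.rstrip("/") + path)
            )
            emails.update(page_emails)
            phones.update(page_phones)
        if emails and phones:
            found[site] = {"emails": sorted(emails), "phones": sorted(phones)}
        else:
            # Goes to the "unsuccessful" list for the next round of patterns.
            failed.append(site)
    return found, failed
```

You would call `scan_sites(list_of_site_urls)`, look at a handful of sites from the returned `failed` list, adjust or add patterns, and run it again on `failed` only. Restricting the search to the main page and /contacts, rather than following every link, is also what keeps the irrelevant matches down.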