How to get the name of the organization from the html page without entering it manually?

HTML

Igor, 2013-11-14 10:26:05

Most sites show their name in the header or footer.

What algorithm can be used to find text that repeats across a site's pages and try to extract the organization name from it?

For example, there is an address

http://habrahabr.ru/

You need to find the name of the organization. The manual algorithm is as follows: look at the header, then at the footer, and if the name is not there, go to the "Contacts" or "About" page.

The result will be the name: Habrahabr Company "TM"

How to get such data without going to the site manually?

I would like to understand the algorithm

7 answer(s)

Andrey Belov, 2013-11-14
@leotop

I'd probably start by isolating the site template: take several pages of the site and find the text that repeats across them. Then pick the rules for parsing that text empirically. For example, the company name often comes right before or after ©, is often mentioned in the title, and is often preceded by words such as "company", "LLC", etc.
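A minimal sketch of this idea, assuming the pages have already been downloaded as strings. The names `TextLines`, `repeated_lines` and `guess_name`, as well as the copyright pattern, are illustrative assumptions and not a tested rule set. The regular expression is applied to the extracted text, not to the HTML itself.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextLines(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip:
            self.lines.append(text)


def repeated_lines(pages, min_share=1.0):
    """Return text chunks that occur on at least min_share of the pages."""
    counts = Counter()
    for html in pages:
        parser = TextLines()
        parser.feed(html)
        parser.close()
        counts.update(set(parser.lines))  # count each chunk once per page
    need = min_share * len(pages)
    return [line for line, n in counts.items() if n >= need]


# Assumed rule: the name follows "©" and an optional year or year range.
COPYRIGHT = re.compile(r"©\s*(?:\d{4}(?:\s*[-–]\s*\d{4})?)?[\s,]*([^.|©]+)")


def guess_name(pages):
    """Return the first copyright holder found in the repeated text, or None."""
    for line in repeated_lines(pages):
        match = COPYRIGHT.search(line)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None
```

Lowering `min_share` (for example to 0.8) would tolerate pages that use a different template, at the cost of letting more noise through.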

Kirill Firsov, 2013-11-14
@Isis

For example, using regular expressions.

vacuumn, 2013-11-14
@vacuumn

You need to parse HTML, but don't do it through regular expressions. Every time you parse HTML with regular expressions, one developer dies in the world.

For Habr, for example, the footer is easy to find, it has a logical id:

<div id="footer">

Next, you take all the text from the footer and look in it for the company name and a link to the "Contacts" page.

On other sites the footer or header will be harder to find. You will need to analyze a few dozen sites and compile a list of rules that make it easy to locate the blocks with the needed elements in the page source.
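A rough sketch of the footer step using only Python's standard `html.parser`, assuming the page is already available as a string and the footer really has `id="footer"`. The class `BlockText` and the function `block_text` are hypothetical names. Depth tracking here assumes reasonably well-formed markup, so unclosed tags such as a bare `<li>` would throw it off.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlockText(HTMLParser):
    """Collect text and link targets inside the first element with a given id."""

    def __init__(self, block_id):
        super().__init__()
        self.block_id = block_id
        self.depth = 0       # > 0 while we are inside the target element
        self.done = False    # set once the target element has been closed
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0 and (self.done or attrs.get("id") != self.block_id):
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        chunk = " ".join(data.split())
        if self.depth and chunk:
            self.text.append(chunk)


def block_text(html, block_id="footer"):
    """Return (text, links) found inside the element with the given id."""
    parser = BlockText(block_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.links
```

The returned links can then be filtered for something that looks like a contacts page, and the text searched for the company name.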

ChemAli, 2013-11-14
@ChemAli

There is no single algorithm, since there is no single standard for describing organizations on websites.

To avoid visiting the site manually, you have to visit it programmatically. Universal parser programs are written (or reused) for this.

If you write yourself, then the algorithm will need to be multi-step and multi-variant.

In an ideal world, ideal sites have hCard markup, from which you can extract the name of the organization and other data about it, carefully provided by the site owners.

All other cases are best handled manually from the start, as there are too many variants :)
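For the hCard case, a small sketch with the standard `html.parser`: it collects the text of elements whose class list contains `org` and that sit inside (or on) an element with class `vcard`, which are the class names the hCard microformat uses. `HCardOrg` and `hcard_orgs` are hypothetical names, and nested sub-properties such as `organization-name` are simply flattened into the text.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HCardOrg(HTMLParser):
    """Collect the text of class="org" elements found inside class="vcard"."""

    def __init__(self):
        super().__init__()
        self.stack = []       # one set of class names per open element
        self.orgs = []
        self._buf = None      # text of the org element currently being read
        self._org_depth = 0   # stack depth at which that element was opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        inside_vcard = any("vcard" in c for c in self.stack)
        self.stack.append(classes)
        is_org = "org" in classes and (inside_vcard or "vcard" in classes)
        if self._buf is None and is_org:
            self._buf = []
            self._org_depth = len(self.stack)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        if self._buf is not None and len(self.stack) == self._org_depth:
            self.orgs.append(" ".join(self._buf))
            self._buf = None
        self.stack.pop()

    def handle_data(self, data):
        chunk = " ".join(data.split())
        if chunk and self._buf is not None:
            self._buf.append(chunk)


def hcard_orgs(html):
    """Return organization names declared via hCard markup, in page order."""
    parser = HCardOrg()
    parser.feed(html)
    parser.close()
    return parser.orgs
```

An empty result simply means the site has no hCard markup, and one of the heuristic approaches has to take over.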

VeMax, 2013-11-14
@VeMax

You can also try to find, for example, an img with a "logo" class and look at its alt or title. This works as an additional cross-check.
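A possible sketch of the logo check, again with the standard `html.parser`. Matching "logo" as a substring of the `class` or `id` attribute is an assumption made here for illustration, and `LogoText` and `logo_candidates` are hypothetical names.

```python
from html.parser import HTMLParser


class LogoText(HTMLParser):
    """Collect alt/title values of <img> tags that look like a site logo."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        # Assumption: a logo image has "logo" somewhere in its class or id.
        marker = " ".join(filter(None, (attrs.get("class"), attrs.get("id"))))
        if "logo" not in marker.lower():
            return
        for key in ("alt", "title"):
            value = (attrs.get(key) or "").strip()
            if value and value not in self.candidates:
                self.candidates.append(value)


def logo_candidates(html):
    """Return distinct alt/title strings of logo-like images, in page order."""
    parser = LogoText()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

Since alt texts are often generic ("logo", "home"), the result is better treated as a confirmation of a name found elsewhere than as a primary source.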

ChemAli, 2013-11-14
@ChemAli

The project name itself can be found in the meta tags (again, if they are filled in correctly).

You can also look at the domain's whois record: it seems the owner's data is not yet hidden everywhere.
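For the meta-tag part, a short sketch that reads a few commonly used fields and the `<title>`. Which fields a given site fills in varies, so the `WANTED` list (`og:site_name`, `application-name`, `author`) is an assumption about what is worth checking, and `MetaNames` and `name_candidates` are hypothetical names. The whois lookup is not covered here because it needs network access.

```python
from html.parser import HTMLParser

# Assumed priority order of meta fields that may carry the site or owner name.
WANTED = ("og:site_name", "application-name", "author")


class MetaNames(HTMLParser):
    """Collect selected <meta> values and the page <title>."""

    def __init__(self):
        super().__init__()
        self.found = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            if key in WANTED and content:
                self.found.setdefault(key, content)  # keep the first value

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def name_candidates(html):
    """Return meta values in WANTED order, followed by the page title."""
    parser = MetaNames()
    parser.feed(html)
    parser.close()
    result = [parser.found[key] for key in WANTED if key in parser.found]
    title = " ".join("".join(parser.title_parts).split())
    if title:
        result.append(title)
    return result
```

The title usually mixes the page name with the site name, so it is listed last and would still need the kind of cross-page comparison suggested in the other answers.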

Nikolay Eliseev, 2013-11-14
@nelis

There are a lot of open questions and uncertainties here. What if there are several organization names on the page? How exactly is your task formulated? Do you need to collect all the names from the specified pages, or do you need to establish that the site belongs to a particular company?