Collecting images by iterating over links


Programming

entaure, 2011-08-23 16:58:51

The task: collect pictures from a site such as VK (or any similar one) by enumerating links. As far as I know, all the pictures there are simply kept in one shared storage, and only the ID in the picture's address changes.
The actual question: what should the algorithm look like, and how do I implement it? I am particularly interested in steps 2 and 3.

As I understand it, there are three stages:

1) Go to a new page by changing the address by one step (whether that is a letter or a digit does not matter here).
2) Check that the page exists and contains useful information (here: a picture).
3) Save the picture to disk and repeat step 1 with a new address.
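
The three stages above could be sketched roughly like this in Python. It is only a sketch under assumptions: the URL template `https://example.com/photo/{id}.jpg`, the function names, the `.jpg` extension and the idea that IDs are consecutive integers are all hypothetical, since the real address scheme is not given. The `fetch` parameter is there so the loop can be tried with a fake function instead of real requests.

```python
import os
import urllib.error
import urllib.request

# Hypothetical URL template; the real site's address scheme is not known here.
URL_TEMPLATE = "https://example.com/photo/{id}.jpg"


def http_fetch(url):
    """Return (status, content_type, body); body is b'' on an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.headers.get_content_type(), resp.read()
    except urllib.error.HTTPError as err:
        # 404 and similar codes arrive as exceptions in urllib.
        return err.code, "", b""


def collect_images(first_id, last_id, out_dir, fetch=http_fetch):
    """Walk IDs first_id..last_id and save every response that is an image."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for image_id in range(first_id, last_id + 1):
        # Stage 1: build the next address.
        url = URL_TEMPLATE.format(id=image_id)
        # Stage 2: does the page exist and is it a picture?
        status, content_type, body = fetch(url)
        if status != 200 or not content_type.startswith("image/"):
            continue
        # Stage 3: save to disk, then go on to the next ID.
        path = os.path.join(out_dir, f"{image_id}.jpg")
        with open(path, "wb") as fh:
            fh.write(body)
        saved.append(path)
    return saved
```

Network failures other than HTTP errors (timeouts, DNS problems) are not handled in this sketch and would stop the loop. A real run should also add a delay between requests and respect the site's rules.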


3 answer(s)


Ocelot, 2011-08-23
@entaure

A full enumeration will take a lot of time. Suppose the ID is a six-digit number (you did not say what it really looks like) and loading one page takes a second. Then enumerating the whole range takes 1,000,000 s, or about 11.6 days. How to speed the process up:
1) In step 2, check the response headers before loading the whole page.
2) Run the requests in several threads (how many depends on your bandwidth).
3) Try to work out the algorithm by which IDs are assigned to pictures. If they are issued sequentially, there is a good chance that the highest numbers are simply still unused and can be skipped.
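
To make the estimate and the first two suggestions concrete, here is a small Python sketch. The six-digit ID space and one second per request are the assumptions from the answer. The function names, the worker count of 20 and the idea that time divides evenly by the number of threads are illustrative simplifications, because real throughput is capped by bandwidth and by the server. `head_is_image` reads suggestion 1 as "look at the response headers first" and assumes the server answers HEAD requests sensibly.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def estimate_days(id_count, seconds_per_request=1.0, workers=1):
    """Idealised duration of a full scan in days, assuming perfect parallelism."""
    return id_count * seconds_per_request / workers / 86_400


def head_is_image(url):
    """Send a HEAD request so only the headers are transferred, not the body."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.headers.get_content_type().startswith("image/")
    except urllib.error.URLError:
        # Covers HTTP errors such as 404 as well as connection failures.
        return False


def scan(ids, probe, workers=20):
    """Run probe(id) in a thread pool and return the IDs for which it is true."""
    ids = list(ids)  # materialise so the IDs can be paired with the results
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(probe, ids)  # results come back in input order
        return [i for i, ok in zip(ids, results) if ok]
```

With these assumptions, `estimate_days(10**6)` gives roughly 11.6 days for a single thread, and `estimate_days(10**6, workers=20)` gives roughly 0.58 days. A probe for `scan` could be something like `lambda i: head_is_image(f"https://example.com/photo/{i}.jpg")`, where the URL is again hypothetical.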


Nikita Permin, 2011-08-23
@NekitoSP

VK? But what about random IDs? :)
The address of a VK avatar has a URL like this


Ocelot, 2011-08-24
@Ocelot

> How can I make the program determine whether there is a picture on the page or not?
Download two pages (manually, for now), one with a picture and one without, and play "spot the 10 differences". What you can rely on:
1) The HTTP response, or more precisely its status code. There is a chance that for an invalid ID the server will return 404 or something similar.
2) An <IMG> tag in the right place on the page.
3) Keywords such as "no image", "error" and the like in the body of the page.
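
The three checks above might be combined roughly as follows. This is a sketch using only the Python standard library. The marker phrases in `ERROR_MARKERS` and the names `ImgFinder` and `page_has_picture` are made up for illustration, and the real phrases should be taken from the site's own "empty" page. It is also coarser than check 2, because it accepts an `<img>` tag anywhere on the page, not only "in the right place".

```python
from html.parser import HTMLParser

# Hypothetical marker phrases; replace them with what the real site shows.
ERROR_MARKERS = ("no image", "error", "not found")


class ImgFinder(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG SRC=...> matches.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def page_has_picture(status, html):
    """Apply the checks in order: status code, error keywords, <img> tag."""
    if status != 200:
        return False
    lowered = html.lower()
    if any(marker in lowered for marker in ERROR_MARKERS):
        return False
    finder = ImgFinder()
    finder.feed(html)
    return bool(finder.sources)
```

The keyword check is deliberately crude: a word like "error" can also appear in scripts or unrelated text on a valid page, so comparing a real "with picture" page against a "without picture" page, as suggested above, is still needed to pick reliable markers.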