HTML
voix_kas, 2016-06-28 17:58:29

Which components should I use for multi-threaded HTML parsing in VC++ through proxies?

Problem statement: I need to parse a large number of websites daily (> 100 sites, > 1000 pages) and extract product information from them, for example from online shops. The work must be multi-threaded and go through proxies (one proxy per page, not per site).
The actual question: please recommend a complete "bundle" of components for the final solution, with a focus on:

  1. Performance and economy (running multiple threads on the same machine/connection).
  2. Stability and compatibility (pagination, frames, Unicode, and all other layout features).
  3. Security (for example, malicious code or a virus delivered along with the page).

Which components should I use to work with HTML? CsQuery, HtmlAgilityPack...? How do I access sites through pre-purchased proxies? Do the components work with HTTP proxies themselves, or is an additional "shim" needed to use proxies?
I would be very grateful for a detailed description and sequence of steps (I'm not a professional programmer).
Programming language / development environment: VC++ 2015. I understand that this may not be the best language for such tasks, but please do not raise the issue of choosing or changing the language. I am interested only in VC++.


1 answer(s)
Mark Adams, 2016-07-09
@ilyakmet

I processed 1k pages in Python 2.7 using `from multiprocessing import Pool`. Look at my farrows, there were links somewhere.
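
Roughly, the `multiprocessing.Pool` approach could look like the sketch below. It is not the original script: it assumes Python 3 rather than 2.7, the function names (`assign_proxies`, `fetch`, `crawl`) are made up for illustration, and the proxy strings are assumed to be ordinary proxy URLs. It is also Python, not VC++, and HTML parsing itself is left out; it only shows the "one page, one proxy" fan-out across worker processes.

```python
from itertools import cycle
from multiprocessing import Pool
import urllib.request


def assign_proxies(urls, proxies):
    """Pair each page URL with a proxy, reusing the proxy list round-robin."""
    return list(zip(urls, cycle(proxies)))


def fetch(task):
    """Download one page through its assigned proxy (runs in a worker process)."""
    url, proxy = task
    # ProxyHandler routes both http and https requests via the given proxy URL.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    try:
        with opener.open(url, timeout=30) as response:
            return url, response.read(), None
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        # Return the error as text so the result is always safe to pickle.
        return url, None, str(exc)


def crawl(urls, proxies, processes=8):
    """Fetch all pages in parallel, one (url, proxy) task per page."""
    tasks = assign_proxies(urls, proxies)
    with Pool(processes) as pool:
        return pool.map(fetch, tasks)
```

On Windows, call `crawl(...)` from under an `if __name__ == "__main__":` guard, since worker processes there are started by re-importing the main module. Each result is a `(url, body, error)` tuple, so failed pages can be retried with a different proxy.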
