Alexander Savenko, 2015-02-10 10:28:01
PHP

What is the best way to parse a large amount of data in PHP?

Hello!
I'm not sure how best to process a large number of URLs that need parsing. I'm parsing data from travel agencies. I'm choosing between:
1) A multithreading hack in PHP. Which hack is best to use?
2) A task queue. For example, RabbitMQ (it seems convenient to administer, and there is a web panel). Can you recommend anything else?
Thanks in advance for your replies.

2 answer(s)
Dmitry Entelis, 2015-02-10
@DmitriyEntelis

1) Many people recommend Gearman, but it didn't work out for me personally.
In one of our projects we have a parallel parsing task, and in the end the straightforward option still works: at some point the script forks as many child processes as needed, they perform all the parsing tasks and dump the data into Redis, and the main process waits for the forks to finish and then collects everything from Redis.
2) RabbitMQ is a good tool, but it depends on your tasks.
If you need to parse something in real time (while the user is waiting), then queues are, in my opinion, out of place.
If you need to parse in the background, then yes, queues are great. You can use RabbitMQ, or the same Redis (if you don't need complex logic for distributing requests).
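
The fork-and-collect approach from point 1 might look roughly like the sketch below. It assumes the pcntl extension (CLI on a Unix-like system). To stay self-contained it passes results back through temporary files instead of Redis, and `parseUrl()` and `parseInForks()` are hypothetical names invented for the illustration, not code from the project described above.

```php
<?php
// Sketch: fork one child per chunk of URLs, wait for all of them, merge the results.
// Assumption: results travel through temp files here; the answer above uses Redis.

/** Hypothetical stand-in for the real download-and-parse step. */
function parseUrl(string $url): array
{
    return ['url' => $url, 'host' => parse_url($url, PHP_URL_HOST)];
}

/**
 * @param string[] $urls
 * @return array[] parsed results, in the original URL order
 */
function parseInForks(array $urls, int $workers): array
{
    if ($urls === []) {
        return [];
    }
    $chunkSize = (int) ceil(count($urls) / max(1, $workers));
    $children = []; // pid => path of the file the child writes to

    foreach (array_chunk($urls, $chunkSize) as $chunk) {
        $file = tempnam(sys_get_temp_dir(), 'parse_');
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child process: do the work, dump the result, and exit.
            file_put_contents($file, json_encode(array_map('parseUrl', $chunk)));
            exit(0);
        }
        $children[$pid] = $file; // parent process remembers the child
    }

    $results = [];
    foreach ($children as $pid => $file) {
        pcntl_waitpid($pid, $status); // block until this child has finished
        $data = json_decode((string) file_get_contents($file), true);
        unlink($file);
        if (is_array($data)) {
            $results = array_merge($results, $data);
        }
    }
    return $results;
}
```

Calling `parseInForks($urls, 4)` would split the list across at most four child processes. A child that crashes simply contributes no results in this sketch, so real code would want to check `$status` as well.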

Sergey Protko, 2015-02-10
@Fesor

A multithreading hack in PHP

What makes it a hack?
You need a queue either way. Use RabbitMQ, ZeroMQ, Redis, or simply keep an array of tasks in PHP (the SPL has queue classes); it's up to you.
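
For the simplest of these options, an in-process queue, a minimal sketch with `SplQueue` from PHP's Standard PHP Library could look like this. The URLs are placeholders, and such a queue lives only inside one PHP process, so it is not shared between separate workers.

```php
<?php
// Sketch: a FIFO task queue held in memory with SplQueue.

$tasks = new SplQueue();
foreach (['https://example.com/tours/1', 'https://example.com/tours/2'] as $url) {
    $tasks->enqueue($url); // add a task to the end of the queue
}

$processed = [];
while (!$tasks->isEmpty()) {
    $url = $tasks->dequeue(); // take the oldest task first
    $processed[] = $url;      // a real worker would download and parse here
}
```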
You should also have one worker (maybe more than one) that downloads pages in several parallel streams (multi cURL), plus a list of proxies if you need one. If you do have a proxy list, it makes sense to check proxy status in a separate worker (or periodically between downloads).
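
A download worker along these lines could be sketched with PHP's `curl_multi_*` functions, assuming the cURL extension is installed. `fetchAll()` is a hypothetical helper name, and proxy handling is left out apart from a comment showing where it would go.

```php
<?php
// Sketch: download several URLs concurrently with curl_multi.

/**
 * @param string[] $urls
 * @return array<string, string|null> response body per URL, null on failure
 */
function fetchAll(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        // A proxy from the list could be set here with CURLOPT_PROXY.
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the per-transfer result codes.
    $codes = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        foreach ($handles as $url => $ch) {
            if ($info['handle'] === $ch) {
                $codes[$url] = $info['result'];
            }
        }
    }

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $ok = ($codes[$url] ?? null) === CURLE_OK;
        $bodies[$url] = $ok ? curl_multi_getcontent($ch) : null;
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $bodies;
}
```

The returned array maps each URL to its body, so a parsing worker can take it from there. For a very large URL list, the list would need to be fed in batches rather than all at once.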
You also need workers that parse the downloaded responses.
You can arm yourself with pthreads, spread the workers across threads, and make access to the queue thread-safe. If RabbitMQ is used as the queue, each worker connects to its own task queue. In that case the parser is easier to scale.
