PHP
GregIv, 2017-08-07 05:35:42

What approaches are there for parsing big data?

Good afternoon!
There are many PHP and server settings that prevent a big-data parsing script from being executed in a single run.
What are the ways to work around this?
I know one approach using a redirect: the script parses the first n lines, then redirects to the same page, and the next n lines are parsed.
What other options might there be?
Options for console scripts are also of interest…
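
For reference, a minimal sketch of the redirect approach just described, assuming a plain-text input file and a hypothetical offset query parameter:

```php
<?php
// Chunked parsing with a redirect: process CHUNK lines starting at
// ?offset=..., then redirect back to the same script with the next offset.
// The file path, chunk size, and parameter name are all assumptions.
const CHUNK = 1000;

$offset = (int)($_GET['offset'] ?? 0);
$fh = fopen('/tmp/big_input.txt', 'r');

// skip the lines already processed on previous passes
for ($i = 0; $i < $offset; $i++) {
    if (fgets($fh) === false) break;
}

$done = 0;
while ($done < CHUNK && ($line = fgets($fh)) !== false) {
    // ... parse $line here ...
    $done++;
}
fclose($fh);

if ($done === CHUNK) {
    header('Location: ?offset=' . ($offset + CHUNK)); // more lines remain
} else {
    echo 'finished';
}
```

Note that skipping lines this way re-reads the file from the start on every pass; remembering the byte position from ftell() and jumping back with fseek() would scale better.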



3 answers
⚡ Kotobotov ⚡, 2017-08-07
@angrySCV

Usually, if a request is blocking, various restrictions are imposed on it so that the server can still respond somehow and there are no long-term locks. It is better not to try to bypass these restrictions, because they are quite reasonable.
PHP is a scripting language and not really suited to big-data analysis; use it to orchestrate task handling in other products like Apache Spark.

Sergey Sokolov, 2017-08-07
@sergiks

A task queue will help.
Accept the task, put it in the queue, and respond with something like "accepted, job number ZZZ"; this happens almost instantly. Then the client finds out whether job ZZZ is ready, for example by polling the server via AJAX every couple of seconds.
Jobs are executed in one or more threads, on the same or another server, by a "worker" process. The worker is launched not under the web server but from the command line, so it has no limit on execution time. The process either runs continuously and waits for tasks to arrive, or is launched by cron every N minutes (and runs only if no process from the previous run is still active).
For example, the Laravel framework (and its lightweight version Lumen) ships with a very good task queue implementation out of the box.
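
A minimal sketch of this pattern, assuming the phpredis extension and a Redis server on localhost (the queue and status key names are made up for illustration):

```php
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// --- web side: accept the job and answer almost instantly ---
$jobId = uniqid('job_', true);
$redis->lPush('parse_queue', json_encode(['id' => $jobId, 'file' => '/tmp/upload.csv']));
$redis->set("status:$jobId", 'queued');
// the client now polls "is job $jobId ready?" via AJAX

// --- worker side: a separate CLI process with no execution-time limit ---
set_time_limit(0);
while (true) {
    $item = $redis->blPop(['parse_queue'], 5); // block up to 5 s waiting for a job
    if (empty($item)) {
        continue;                              // timed out, keep waiting
    }
    $job = json_decode($item[1], true);
    $redis->set("status:{$job['id']}", 'working');
    // ... parse $job['file'] line by line here ...
    $redis->set("status:{$job['id']}", 'done');
}
```

In Laravel the same idea is a one-liner: define a Job class and dispatch() it, then run php artisan queue:work as the worker.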

Philipp, 2017-08-07
@zoonman

Usually large volumes are parsed using map/reduce. In PHP, it is best to use console scripts and allocate more memory right away. Parsing algorithms depend heavily on the file format. CSV, for example, is very easy to parse; such files are usually parsed in several streams at once.
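A minimal sketch of streaming CSV from a console script, per the above; the input path argument and the 1 GB memory limit are assumptions:

```php
<?php
ini_set('memory_limit', '1G'); // allocate more memory right away
set_time_limit(0);

$fh = fopen($argv[1] ?? 'data.csv', 'r');
if ($fh === false) {
    exit("cannot open input file\n");
}
while (($row = fgetcsv($fh)) !== false) {
    // $row is one parsed line; memory use stays flat regardless of file size
    // ... process $row here ...
}
fclose($fh);
```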
Various kinds of XML are parsed either through SimpleXML or through a DOM parser. Sometimes the tree is built manually, i.e. the file is read character by character and a tag tree is formed. This approach works when the files are very large, the nesting depth is small, and the format is predictable.
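For huge files, PHP's built-in XMLReader gives the same streaming effect without writing a character-by-character parser by hand; a minimal sketch, where the element name 'record' and the file name are assumptions:

```php
<?php
$reader = new XMLReader();
$reader->open('big.xml');

// advance to the first <record> element
while ($reader->read() && $reader->name !== 'record');

while ($reader->name === 'record') {
    // materialize only this one element; the rest of the file
    // is never held in memory
    $record = simplexml_load_string($reader->readOuterXML());
    // ... process $record here ...
    $reader->next('record'); // jump to the next sibling <record>
}
$reader->close();
```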
All sorts of XLS files are parsed through PHPExcel, which can even extract embedded pictures.
In addition, parsing is done through a queuing mechanism. For example, a file is uploaded to the server, a task to parse it is queued, and the file is then parsed by a console script. In particularly perverse cases, such as XLS files with hellish macros inside, it comes down to instantiating OLE objects on a separate Windows machine and pulling the data out through VBS.
