rjsem, 2017-02-08 21:09:20

How will Apache Spark take and process data, in parallel (or not)?

Hello,
I decided to try out Apache Spark, and while getting acquainted with the documentation and examples, the following question came up:
How will Spark take and process data, in parallel or not?
1. The documentation has many examples with sc.textFile("example.txt"), but none with parallelize, so will all of this be processed in a single thread (per spark-submit)?
2. There are examples with HBase and HDFS. Tell me, how will data be taken from HDFS, one piece at a time or many at once (and will they somehow be distributed and then combined)? And how will it all be processed, in parallel (distributed across different workers)?
What happens if you use HBase? And in the case of JDBC (Postgres)? How are tasks distributed in that case?
In addition:
How do you submit work to Spark? I only see spark-submit. Are there other ways, and how do you get back only the result rather than all the noise?



2 answers
Miron, 2021-04-06
@Miron11

Wow... this question was asked three years ago.
So it's probably no longer of interest :)
But I'll try to answer anyway, even knowing that by now the author is probably more skilled and could answer his own question much better.
So.
1. Spark always does everything in parallel. The only question is whether the user lets it execute a query using 2 or more so-called executors.
An executor is simply a JVM process (java.exe). Beyond being a process, it is the one that performs operations such as reading files, writing files, JDBC access, Spark SQL and other queries.
There may be one or more of them, depending on the Spark configuration.
As a rule, these executors run on machines that are distributed across the network and physically separated from each other. There are settings that control exactly how many executors may run on this set of machines.
These machines, in turn, are called "worker nodes".
In addition to the worker nodes hosting the executors, Spark requires a driver. The driver is also a JVM process, but it is not counted among the executors unless you run Spark in local "spark-shell" console mode. When running Spark in local console mode, its parallel, distributed job execution capabilities are reduced to practically nothing.
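As a rough sketch of how this is configured in code (the master URL and the numbers here are hypothetical; real values depend on your cluster):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical configuration, for illustration only.
val spark = SparkSession.builder()
  .appName("parallel-demo")
  // "local[4]" runs 4 worker threads inside the driver JVM (local console mode);
  // real distribution needs a cluster master such as "spark://host:7077" or "yarn".
  .master("local[4]")
  .config("spark.executor.instances", "4") // how many executors to request on a cluster
  .config("spark.executor.cores", "2")     // task slots per executor
  .getOrCreate()
```

The same settings can also be passed on the spark-submit command line instead of being hard-coded.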
Yes, nothing stops you from creating a Thread and running work in parallel yourself, but that is not really Spark; it is your own code.
So how do you get Spark to run a task in parallel using Spark's built-in features?
Part of the answer is already clear: you need to start a Spark cluster with at least one worker, on which 2 or more executors are configured and running. After that, you need to submit a task that Spark can execute in parallel.
At the same time, always keep in mind that Spark will defer execution of the task as long as possible: instead of executing work when a particular function is called, it records an execution plan to be carried out later. Such a plan is called a DAG (directed acyclic graph).
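A minimal sketch of this laziness (assuming an existing SparkSession named spark): the transformations below only record steps in the DAG, and only the final action triggers a job whose tasks run on the executors, one per partition.

```scala
// Transformations are lazy: they only extend the DAG.
val nums    = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8) // 8 partitions
val squares = nums.map(x => x.toLong * x)   // recorded in the DAG, nothing runs yet
val evens   = squares.filter(_ % 2 == 0)    // still nothing runs

// An action forces the DAG to execute: one task per partition, in parallel.
val total = evens.count()
```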
A good example of an operation that Spark performs simultaneously in multiple threads is reading files from a directory:
spark.read.format("json").load("my_path_to_json_files/files_subdirectory/*.json")
Although this does not yet read the files as such, it creates a DAG, using as many threads as there are executors. So some aspects of parallel execution already appear.
Further, if you execute the following line, storing this DAG in a value:
val df = spark.read.format("json").load("my_path_to_json_files/files_subdirectory/*.json")
and as the next step, say, write the same list of files in parquet format to another filesystem, like this:
df.write.format("parquet").save("my_path_to_parquet_files/files_subdirectory")
Spark will take these two lines as a command to create a so-called Job. Within this Job, Spark assigns Tasks to the executors, and as the Tasks are executed on the executors, it moves towards completing the entire Job. Note that the Tasks are executed in parallel and independently of each other, by separate threads.
It is worth noting that the Job is defined by the driver, and the Spark Scheduler handles distributing the Tasks and tracking the stages of their execution (Stages).
This is what the Spark architecture looks like when a task is performed.
There are other ways to access the executor subsystem in order to execute a job in parallel and simultaneously, for example:
df.rdd.foreachPartitionAsync()
but in that case you will have to write code whose degree of parallelism depends on how well you know the job description language for Spark: Scala, and in places Java.
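A hedged sketch of that call (assuming df is an existing DataFrame): foreachPartitionAsync returns a FutureAction immediately, while each partition's iterator is processed by a task on some executor.

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Each executor task receives one partition's iterator of rows.
val action = df.rdd.foreachPartitionAsync { rows =>
  rows.foreach(row => println(row)) // e.g. push each row to an external system instead
}

// FutureAction is a scala.concurrent.Future, so you can block on it if needed.
Await.result(action, Duration.Inf)
```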
In addition to parallel execution of tasks, Spark supports, through its language Scala, ordinary Threads and parallel collections, which are also capable of executing work in parallel. But for the most part these threads run on the driver, that is, they are not distributed by themselves. This greatly limits the usefulness of such parallelism, since the resources of a single driver are finite, and parallel tasks will quickly exhaust them if you try to read, say, large files, or run tasks over large data.
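For contrast, plain Scala concurrency without Spark looks like this; these futures run on driver-local threads only, which is exactly the limitation described above.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// These run on a local thread pool, not on Spark executors.
val futures = (1 to 4).map(i => Future(i * i))
val results = Await.result(Future.sequence(futures), Duration.Inf)
// results: Vector(1, 4, 9, 16)
```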
I tried to find a description of distributed parallel query execution in the documentation, but couldn't find it quickly.
Go ahead and look through the Spark documentation; all of these details are described there.

⚡ Kotobotov ⚡, 2017-04-26
@angrySCV

Data from external sources is loaded into Datasets (a special interface over RDDs); that's why you didn't see parallelize there — the toDF or toDS methods are used instead.
In any case, Spark works only with RDDs and only in parallel / distributed fashion (whether or not you use the additional interfaces).
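A small sketch of both paths (assuming an existing SparkSession named spark): parallelize still exists for in-memory collections, and toDF lifts the result into the Dataset API, while external sources arrive already partitioned.

```scala
import spark.implicits._ // enables toDF / toDS

// Distribute an in-memory collection as an RDD, then lift it to a DataFrame.
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
val df  = rdd.toDF("key", "value")

// External sources go straight to a DataFrame, already split into partitions:
val fromJson = spark.read.json("some_path/*.json")
```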
