Scala
nurzhannogerbek, 2018-12-24 18:00:36

How to read multiple parquet files in Spark?

Hello comrades! Please help me figure something out with Spark.
There is a directory containing a bunch of parquet files. The names of these files all follow the same format, "DD-MM-YYYY", for example: '01-10-2018', '02-10-2018', '03-10-2018', etc. As input parameters I get a start date (dateFrom) and an end date (dateTo). The values of these variables are dynamic.
If I use the following code, the program runs for a very long time:

val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*").filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()

As I understand it, Spark checks every file in the directory, which is why the program takes so long.
How can I not run the program through the entire directory, but take only specific files?
I thought you can split the period into days and read each file separately. Then combine them. For example like this:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018")
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018")

// "final" is a reserved keyword in Scala, so use a different name
val result = mf1.union(mf2).distinct()

As I mentioned, the dateFrom and dateTo variables are dynamic. So what is the best way to organize the code to break the period down into days?
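
Something along these lines is what I have in mind (just a rough sketch: dailyPaths is a helper name I made up, and I'm assuming the folder names really parse with the "dd-MM-yyyy" pattern; spark.read.parquet does accept several paths at once):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: builds the list of per-day folder paths between two dates (inclusive).
// Folder names are assumed to follow the "dd-MM-yyyy" pattern from the question.
def dailyPaths(basePath: String, dateFrom: LocalDate, dateTo: LocalDate): Seq[String] = {
  val fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy")
  Iterator.iterate(dateFrom)(_.plusDays(1))
    .takeWhile(!_.isAfter(dateTo))
    .map(d => s"$basePath/${d.format(fmt)}")
    .toSeq
}

// spark.read.parquet takes multiple paths as varargs, so all the daily folders
// can be read into one DataFrame without scanning the rest of the directory.
val fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy")
val paths = dailyPaths("/PATH_TO_THE_FOLDER",
  LocalDate.parse("01-10-2018", fmt),
  LocalDate.parse("03-10-2018", fmt))

val mf = spark.read.parquet(paths: _*)
mf.show()

Is this a reasonable way to do it, or is there a better approach?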
