How to filter data for a specific period in Spark?
Hello, comrades! Please help me figure this out. I haven't worked with Spark before and I'm trying to understand it with a simple example. Suppose there is a large file with the following structure (see below). It stores a date, a mobile number, and that number's status at that moment.
| CREATE_DATE | MOBILE_KEY | STATUS |
|---------------------|------------|--------|
| 2018-11-28 00:00:00 | 8792548575 | IN |
| 2018-11-29 20:00:00 | 7052548575 | OUT |
| 2018-11-30 07:30:00 | 7772548575 | IN |
val dateFrom = "2018-10-01"
val dateTo = "2018-11-05"
val numbers = "7778529636,745128598,7777533575"
val arr = numbers.split(",") // Создать массив из мобильных номеров
spark.read.parquet("fs://path/file.parquet").filter(???)
You can filter exactly the way you describe; to do that, first map the data onto a specific structure with concrete field types:
dataSource
  .map(createStructure)
  .filter(currentRecord => requiredNumbersList.contains(currentRecord.phone)
    && currentRecord.date is within requiredInterval)
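A minimal sketch of that approach in Scala, assuming the Parquet columns are named as in the question and that CREATE_DATE and MOBILE_KEY are stored as strings (the path, case class name, and values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Typed view of one row, matching the columns from the question
case class PhoneStatus(CREATE_DATE: String, MOBILE_KEY: String, STATUS: String)

val spark = SparkSession.builder().appName("filter-by-period").getOrCreate()
import spark.implicits._

val dateFrom = "2018-10-01"
val dateTo   = "2018-11-05"
val numbers  = "7778529636,745128598,7777533575".split(",").toSet

val filtered = spark.read.parquet("fs://path/file.parquet")
  .as[PhoneStatus]                           // map rows onto the typed structure
  .filter(r => numbers.contains(r.MOBILE_KEY) &&
               r.CREATE_DATE >= dateFrom &&  // ISO date strings compare correctly as text
               r.CREATE_DATE <= dateTo)

filtered.show()
```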
Dates in ISO format can be compared as strings. Turn the list of phone numbers into a set.
Something like:

val arr = numbers.split(",").toSet
spark.read.parquet("fs://path/file.parquet")
  .filter(t => t.getAs[String]("CREATE_DATE") < dateTo
            && t.getAs[String]("CREATE_DATE") > dateFrom
            && arr(t.getAs[String]("MOBILE_KEY")))

I don't remember the exact syntax for accessing record fields in Spark, so this may need a small tweak.
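If the row-by-row lambda is inconvenient, the same filter can be expressed with Column expressions instead. This is only a sketch (column names assumed from the question), but it lets Spark push the predicates down to the Parquet reader; note that `between` is inclusive on both ends:

```scala
import org.apache.spark.sql.functions.col

val arr = numbers.split(",")

val result = spark.read.parquet("fs://path/file.parquet")
  .filter(col("CREATE_DATE").between(dateFrom, dateTo) &&
          col("MOBILE_KEY").isin(arr: _*))
```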