Flume creates a lot of files in HDFS. How can I force it to append to an existing file instead of creating a new one?
I described the problem and the solutions I tried in more detail here: bigdata-intips.blogspot.com/2015/11/hdfs-c-pache-k... but the bottom line is that new files keep being created. For example, if there are no events for n seconds, the idleTimeout parameter fires, the buffered data is flushed, and the current file is closed. When the data flow resumes, a new file is created next to it; the old one is never appended to.
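As far as I know, the Flume HDFS sink never reopens a file once it has been closed, so the only real lever is to keep the current file open longer and roll less aggressively. A minimal sketch of such a sink configuration (the agent/sink/channel names a1, k1, c1 and the path are placeholders):

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    # Disable size- and count-based rolling entirely
    a1.sinks.k1.hdfs.rollSize = 0
    a1.sinks.k1.hdfs.rollCount = 0
    # Roll by time only, e.g. once an hour
    a1.sinks.k1.hdfs.rollInterval = 3600
    # 0 disables the idle timeout, so a pause in the stream does not close the file
    a1.sinks.k1.hdfs.idleTimeout = 0

The trade-off is that an open file keeps its in-progress (.tmp) name until it rolls, so downstream jobs only see the data once an hour in this example.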
I tried that, but the problem remains: if the stream stops (no events) and then resumes even a few seconds later, a new file is created anyway. I collected the different approaches I tried here: bigdata-intips.blogspot.com/2015/11/hdfs-c-pache-k... . The most workable option seemed to be merging the small files as a background job, but that struck me as a poor solution. For now I have switched to Spark Streaming: I write the raw data to Hive tables, and the data I need quickly for realtime analytics goes to HBase. I would be glad to hear about working setups.
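For reference, a rough sketch of the Spark Streaming path mentioned above, assuming events arrive via Kafka; the ZooKeeper quorum, consumer group, topic, and table name are all hypothetical, and the HBase branch is omitted for brevity (Spark 1.x API):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.sql.hive.HiveContext

    object RawEventsToHive {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("raw-events-to-hive")
        val ssc = new StreamingContext(conf, Seconds(30))
        // Receiver-based Kafka stream; topic "events" read with one thread
        val stream = KafkaUtils.createStream(ssc, "zk1:2181", "raw-ingest", Map("events" -> 1))
        val hiveContext = new HiveContext(ssc.sparkContext)
        import hiveContext.implicits._
        stream.map(_._2).foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            // Append each micro-batch to a Hive table; the batch interval
            // (30 s here) controls how often new files appear in the table
            rdd.toDF("raw").write.mode("append").saveAsTable("raw_events")
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that this moves rather than removes the small-files problem: each micro-batch still lands as its own file(s), so periodic compaction of the Hive table may still be needed.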