Flume creates a lot of files in HDFS. How can I force it to append to an existing file instead of creating a new one?
I described the problem and the solutions I tried in more detail here: bigdata-intips.blogspot.com/2015/11/hdfs-c-pache-k... but the bottom line is that new files keep being created. For example, if there are no events for n seconds, the idleTimeout parameter fires, the buffered data is flushed, and the current file is closed. When the data flow resumes, a new file is created next to it; the old one is never appended to.
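As far as I know, the Flume HDFS sink never reopens a file once it has been closed, so the only real lever is to keep the current file open longer and roll less aggressively. A minimal sketch of such a sink configuration (the agent/sink/channel names a1, k1, c1 and the path are placeholders):

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    # Disable size- and count-based rolling entirely
    a1.sinks.k1.hdfs.rollSize = 0
    a1.sinks.k1.hdfs.rollCount = 0
    # Roll by time only, e.g. once an hour
    a1.sinks.k1.hdfs.rollInterval = 3600
    # 0 disables the idle timeout, so a pause in the stream does not close the file
    a1.sinks.k1.hdfs.idleTimeout = 0

The trade-off is that an open file keeps its in-progress (.tmp) name until it rolls, so downstream jobs only see the data once an hour in this example.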
I tried that, but the problem remains: if the stream stops (no events) and then resumes even a few seconds later, a new file is created anyway. I collected the different approaches I tried here: bigdata-intips.blogspot.com/2015/11/hdfs-c-pache-k... . The most workable option seemed to be merging the small files as a background job, but that struck me as a poor solution. For now I have switched to Spark Streaming: I write the raw data to Hive tables, and the data I need quickly for realtime analytics goes to HBase. I would be glad to hear about working setups.
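For reference, a rough sketch of the Spark Streaming path mentioned above, assuming events arrive via Kafka; the ZooKeeper quorum, consumer group, topic, and table name are all hypothetical, and the HBase branch is omitted for brevity (Spark 1.x API):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.sql.hive.HiveContext

    object RawEventsToHive {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("raw-events-to-hive")
        val ssc = new StreamingContext(conf, Seconds(30))
        // Receiver-based Kafka stream; topic "events" read with one thread
        val stream = KafkaUtils.createStream(ssc, "zk1:2181", "raw-ingest", Map("events" -> 1))
        val hiveContext = new HiveContext(ssc.sparkContext)
        import hiveContext.implicits._
        stream.map(_._2).foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            // Append each micro-batch to a Hive table; the batch interval
            // (30 s here) controls how often new files appear in the table
            rdd.toDF("raw").write.mode("append").saveAsTable("raw_events")
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that this moves rather than removes the small-files problem: each micro-batch still lands as its own file(s), so periodic compaction of the Hive table may still be needed.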