How to parse large (>25GB) activity log files and rank the extracted information? Which technologies are best to use?
Hello colleagues.
I need to parse large log files (>25GB), rank the data in a certain way, and present a UI to the end user for analyzing the ranked data.
I have never solved tasks like this and don't know what would work best here (Hadoop, Elasticsearch, MongoDB). I work in the Java ecosystem.
I'm asking the advice of more experienced colleagues!
www.datacenterknowledge.com/archives/2012/03/08/th...
As long as you don't have a constant stream of such logs and don't need to store and process 10+ years of history, you don't have big data.
"I don't know what is better to use for this" - if you don't know which database to use, use Postgres.
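To make the Postgres route concrete, here is a minimal sketch of a JDBC batch loader. The connection string, table name, columns, and the space-delimited log format are all assumptions for illustration; for a 25 GB file, PostgreSQL's COPY command would also be noticeably faster than INSERTs:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LogToPostgres {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; assumes a table created beforehand, e.g.:
        //   CREATE TABLE activity_log (user_id text, action text);
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/logs", "user", "password");
             BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {

            conn.setAutoCommit(false);
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO activity_log (user_id, action) VALUES (?, ?)");

            String line;
            int batched = 0;
            while ((line = reader.readLine()) != null) {
                // Hypothetical format: "<user_id> <action> ..." - adapt to your logs.
                String[] parts = line.split(" ", 3);
                if (parts.length < 2) continue; // skip malformed lines
                insert.setString(1, parts[0]);
                insert.setString(2, parts[1]);
                insert.addBatch();
                if (++batched % 10_000 == 0) { // flush in chunks to bound memory
                    insert.executeBatch();
                    conn.commit();
                }
            }
            insert.executeBatch();
            conn.commit();
            // The "ranking" then becomes plain SQL, e.g.:
            //   SELECT user_id, COUNT(*) AS hits FROM activity_log
            //   GROUP BY user_id ORDER BY hits DESC LIMIT 100;
        }
    }
}
```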
Here is the answer to your question. The main idea:
log file -> parser -> Logstash -> Elasticsearch -> Kibana
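The "parser" stage can be a plain Java pass that turns each raw line into one JSON object per line (NDJSON), which a Logstash file input can then pick up. A minimal sketch; the field layout is an assumption, and a real JSON library (Jackson, Gson) is preferable in practice:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LogParser {
    public static void main(String[] args) throws Exception {
        // args[0] = raw log file, args[1] = NDJSON output consumed by Logstash.
        // Streaming line by line keeps memory flat even on a 25 GB input.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]));
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Hypothetical format: "<timestamp> <user> <action...>"; adapt to yours.
                String[] p = line.split(" ", 3);
                if (p.length < 3) continue; // skip malformed lines
                // One JSON object per line is trivial for Logstash to ingest.
                out.write(String.format("{\"ts\":\"%s\",\"user\":\"%s\",\"action\":\"%s\"}%n",
                        escape(p[0]), escape(p[1]), escape(p[2])));
            }
        }
    }

    // Minimal escaping of backslashes and quotes; a JSON library handles this properly.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```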
Yes, that's right: read -> process.
But most of the classical algorithms we normally use for data processing (for example, sorting) belong to the class of "offline" algorithms: you have to provide all the data at once to get an answer, which is sometimes simply not possible.
Look into the class of online algorithms and streaming data processing, for example here: www.cs.dartmouth.edu/~ac/Teach/CS85-Fall09/Notes/l...
Or try a streaming framework like Spark.
Though for processing logs, of course, it is often easier and faster to write your own algorithm than to cobble together a Spark setup; see the sketch below.
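For instance, a hand-rolled one-pass ranking in plain Java: a bounded min-heap keeps only the current top k lines, so a 25 GB file streams through in O(k) memory instead of being sorted as a whole. This assumes Java 16+ (records) and, purely for illustration, that the score is the last whitespace-delimited field of each line:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopK {
    // One ranked record: the score we rank by plus the original line.
    record Scored(long score, String line) {}

    public static void main(String[] args) throws Exception {
        int k = 100;
        // Min-heap of the k best lines seen so far: one pass, O(k) memory.
        PriorityQueue<Scored> heap =
                new PriorityQueue<>(Comparator.comparingLong(Scored::score));

        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Hypothetical format: last field is a numeric score (e.g. response time).
                String[] p = line.split(" ");
                long score;
                try {
                    score = Long.parseLong(p[p.length - 1]);
                } catch (NumberFormatException e) {
                    continue; // skip malformed lines
                }
                heap.offer(new Scored(score, line));
                if (heap.size() > k) heap.poll(); // evict the current minimum
            }
        }
        // The heap now holds the k highest-scoring lines of the whole file.
        heap.stream()
            .sorted(Comparator.comparingLong(Scored::score).reversed())
            .forEach(s -> System.out.println(s.score() + "\t" + s.line()));
    }
}
```

The same single-pass, bounded-memory idea extends to other rankings (per-user counters, sliding windows), which is exactly the "online" style the linked notes describe.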