Java
Alexander Vasilenko, 2015-10-05 20:51:55

How to start learning BigData?

Hi people. I've decided to turn from a monkey into a man. I actually decided this a long time ago, but I could never figure out how to study algorithms and data structures in a way that would be genuinely fascinating. It turns out I don't need them in day-to-day work, so I keep putting them off.
My background is Java, Android + a little bit of Clojure, just a little bit.
And here is the solution I've come up with, which you, as good people, may want to talk me out of: if there is one field where algorithms and data structures are the central topic, it's BigData. I could be wrong, of course, but the probability seems small to me. But where to start? Google is a good option, of course, but it's easy to pick the wrong materials, and I'd like to be really sure about what I choose.
I want to note that the goal here is not to become a world-class genius in this area; it's just for myself, to exercise my brain, so to speak. But I don't rule out the prospect of moving into this field full-time either - why not. The main thing is to start right and choose the right materials.
I really hope for your advice, friends. To summarize:
Is it worth learning BigData at all?
And where to start?


6 answers
Yuri Yarosh, 2015-10-15
@SanchelliosProg

BigData is not really about data structures as such - mostly it's just a variety of spatial structures; it is much more closely related to NLP, classification, and machine learning algorithms.
First of all, you need to choose a means of processing and storage.
In the case of Java, that means HBase or Cassandra:
HBase - when you write to the database a lot and most of the indexes are "hand-rolled".
Cassandra - when the read/write ratio is around 4:3, since Cassandra already has built-in facilities for column indexing.
Under a really heavy load there is ScyllaDB - it has the same features as HBase, but thanks to C++11 and the shared-nothing approach it is 6-7 times faster.
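To make the HBase option above concrete, here is a minimal write-path sketch with the standard HBase Java client (1.0+ API); the table name, column family, and the userId-plus-reversed-timestamp row key are made-up examples of the "hand-rolled index" idea, not anything prescribed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {
            // The row key design is the "hand-rolled index": here userId + reversed timestamp,
            // so the newest events for a user sort first (purely illustrative).
            String rowKey = "42:" + (Long.MAX_VALUE - System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                          Bytes.toBytes("{\"track\":\"123\"}"));
            table.put(put);
        }
    }
}
```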
For a database up to 200 GB, plain MySQL with an R-tree index and the ARCHIVE storage engine will suffice.
PostgreSQL, when properly configured, quietly builds B-tree indexes over 500-700 GB of data, which is an impossible task for MySQL. In PostgreSQL you also often end up adding C aggregate functions and building various indexes on top of them, sometimes spatial ones (GIN/GiST).
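As an illustration of the GIN indexes just mentioned, here is a rough JDBC sketch against PostgreSQL; the database, credentials, table, and query are hypothetical, and it assumes a jsonb column (PostgreSQL 9.4+) with the standard PostgreSQL JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GinIndexSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/analytics", "app", "secret");
             Statement st = conn.createStatement()) {

            // Semi-structured event payloads in a jsonb column (made-up schema).
            st.execute("CREATE TABLE IF NOT EXISTS events (id bigserial PRIMARY KEY, payload jsonb)");

            // One-time setup: a GIN index with jsonb_path_ops accelerates containment (@>) queries.
            st.execute("CREATE INDEX events_payload_gin ON events USING gin (payload jsonb_path_ops)");

            // This query can be answered through the GIN index instead of a sequential scan.
            try (ResultSet rs = st.executeQuery(
                    "SELECT id FROM events WHERE payload @> '{\"country\": \"UA\"}'::jsonb")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id"));
                }
            }
        }
    }
}
```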
Here is a small overview of the different types of indexes.
From myself, I'll also add the MVP-tree for finding similar perceptual hashes, and the Fusion tree as a more digestible variant of the van Emde Boas tree.
As for the hipster cult around MongoDB, I'll say that PostgreSQL with hash indexes and small document sets is 1.5-3 times faster, thanks to "Building an Index with Vodka". Proper replication and partitioning depend directly on how the consensus problem is solved in each specific application, and without understanding how Raft/Paxos work you should not count on miracles from MongoDB or PostgreSQL - they are nothing more than tools for solving that problem.
MongoDB is very good for Meteor-based reactive projects, and a GoldenHammer™ for everything else.
For indexing, you should definitely read Hanan Samet's books:
Foundations of Multidimensional and Metric Data Structures = Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS + The Design and Analysis of Spatial Data Structures.
In principle, Foundations of Multidimensional and Metric Data Structures should be enough on its own, but you can "top it up" with the more complete descriptions in the older works. In short, Samet's stuff is brilliant, and I don't know why no one has translated it yet.
Well, now that we've sorted out what to store, where, and how, we can think about processing...
There are the old books "Algorithms of the Intelligent Web" and "Programming Collective Intelligence". Their Russian titles were translated rather oddly and sound rather naive, but they are a good introduction to simple data processing and analysis tools.
For machine learning, you can take Andrew Ng's course on Coursera.
There is also the DataScience Central site - it has a lot of useful stuff and is worth reading. It also has some fairly superficial cheat sheets; I've seen better ones but couldn't find them again.
As a DeepLearning adept, I advise looking into Theano and the methods described here. In production that thing is outrageously clumsy, and I've seen comrades who more or less successfully settled on Neon instead. If you are going the Java route, then, following Spotify's example, the bundles used most often are Apache Kafka -> Apache HBase -> Apache Storm -> Apache Spark (mllib) -> Apache HBase -> Apache Phoenix -> Hibernate + any MVC framework, etc.
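For a taste of the front end of that kind of pipeline, here is a minimal sketch of pushing an event into Apache Kafka from Java using the standard producer client; the broker address, topic name, key, and payload are made up.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: a local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical "plays" topic, keyed by user id so a user's events stay in one partition.
            producer.send(new ProducerRecord<>("plays", "user-42", "{\"track\":\"123\"}"));
        }
    }
}
```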
Naturally, really high performance and good vertical scaling are out of the question there; if you take C++11 ScyllaDB -> Neon, well profiled and polished, you can get 3-5 times higher performance and, accordingly, much lower latency, but usually everything just falls apart. The REST API for this is usually attempted in plain C (no pluses) as Nginx extensions, which is a rather purebred perversion - in most cases plain golang/netty will be enough.
It is now customary not to go near the Hadoop stack, since it is so "interactive" that without good support and polishing from the vendors it is simply unusable in real projects, so almost everyone has given up on it to one degree or another - for example, the same Spotify.
You can find plenty of heated arguments about HA and ZooKeeper, especially at Netflix, so for high-availability management it is better to use their solutions - Eureka, or Hystrix for fault tolerance. I can't say these are fully mature projects - they have plenty of flaws of their own - but they are much faster than the other Apache crafts.
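A minimal sketch of the Hystrix idea - wrapping a call so that failures or timeouts degrade to a fallback instead of cascading; the "profile service" and its simulated failure are invented purely for illustration.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class ProfileLookupCommand extends HystrixCommand<String> {
    private final String userId;

    public ProfileLookupCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("ProfileService"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // Hypothetical remote call that may fail or time out.
        return remoteProfileService(userId);
    }

    @Override
    protected String getFallback() {
        // Returned when run() throws, times out, or the circuit is open.
        return "{\"id\":\"" + userId + "\",\"profile\":\"default\"}";
    }

    private String remoteProfileService(String userId) {
        throw new RuntimeException("service unavailable"); // simulate a failure
    }

    public static void main(String[] args) {
        System.out.println(new ProfileLookupCommand("42").execute()); // prints the fallback
    }
}
```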
You can't make an application both fault-tolerant and highly available at the same time - the CAP theorem is still in force.
There is also a very subtle point with Java in general: you need to minimize garbage collection time and go off-heap. It's worth looking at how buffers are implemented in netty - an arena allocator similar to what jemalloc uses, plus assorted sun.misc.Unsafe heresy. You can also try Hazelcast / Terracotta, but it's basically the same thing, only paid and "distributed".
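As a small illustration of that off-heap point, this sketch uses Netty's pooled allocator to take a direct (off-heap) buffer from an arena and hand it back via reference counting, so the payload bytes never become garbage the collector has to trace; the payload itself is arbitrary.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class OffHeapBufferSketch {
    public static void main(String[] args) {
        // Pooled, direct buffer: allocated off-heap from Netty's arena allocator.
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        try {
            buf.writeBytes("hello off-heap".getBytes());
            byte[] out = new byte[buf.readableBytes()];
            buf.readBytes(out);
            System.out.println(new String(out));
        } finally {
            // Reference-counted: releasing returns the memory to the arena instead of the GC.
            buf.release();
        }
    }
}
```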
For the REST API, I most often use Vert.x and vanilla Java.
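For reference, a bare-bones Vert.x 3 HTTP endpoint in vanilla Java looks roughly like the sketch below; a real REST API would use vertx-web's Router, and the port and response body here are arbitrary.

```java
import io.vertx.core.Vertx;

public class RestSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        // Single non-blocking handler for every request; routing omitted for brevity.
        vertx.createHttpServer()
             .requestHandler(req -> req.response()
                     .putHeader("content-type", "application/json")
                     .end("{\"status\":\"ok\"}"))
             .listen(8080);
    }
}
```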
Scala's overhead is pretty big, and the compile time is just outrageous.
It's safe to use Groovy with @Immutable and @CompileStatic to minimize copy-paste.
But in Vert.x it's all "dynamic" :|
I can't say anything about Clojure's performance - there's a bit too much invokedynamic going on in places. Naturally, vanilla Java will be faster, but I have no idea by how much.
I wish you a pleasant evening.
P.S. I didn't put links everywhere simply because I want to sleep.

Dimonchik, 2015-10-05
@dimonchik2013

1) Read the Big Data book: www.mann-ivanov-ferber.ru/books/paperbook/big-data
3) The ShAD lectures: habrahabr.ru/company/yandex/blog/206058 (look around the Yandex blog on Habr for links to the rest).
And so as not to give up ahead of time, you can also look through the conference materials at bigdataconf.com.ua/2015/agenda - though I have no idea where to get the videos, and some of the talks there are a year old, but still.

xmoonlight, 2015-10-05
@xmoonlight

BigData is a store of large volumes of constantly collected data of the same kind, possibly related to each other in some way, usually "laid out" along a time axis.
Why are you collecting data in such volumes? You need to decide that BEFORE "diving" into this area!
And it depends on your ultimate goal: marketing, perhaps.
What can BigData give you in its "raw" form? A large amount of useless data.
What can you do with BigData? For example, you can discover how some parameters depend on others over a selected time period.
Practical solutions using BigData? Marketing, risk forecasting, all kinds of filtering, and forecasts and predictions of how any of the parameters contained in the BigData store will change.

abs0lut, 2015-10-06
@abs0lut

just like that, for myself, I want to exercise my brain, so to speak.

maybe learn Haskell, then?

globuser, 2015-10-17
@globuzer

IMHO, to grasp the essence of BigData you need to live and breathe statistical processes and probabilistic models, to understand them from the inside, to feel the mathematics and not be afraid of mathematical analysis as a science...
That is the base - the foundation of data analysis - and only then come the tools: programming languages and environments, technologies, and so on...

Mark Adams, 2015-12-12
@ilyakmet

Where I started:
1. www.pvsm.ru/klassifikatsiya/40336
2. habrahabr.ru/post/264241 - I'm currently at this stage.
3. https://yandexdataschool.ru/edu-process/courses/ma... - planned next.
