Distributed Computing
Anton Martsen, 2013-11-14 10:52:50

Assistance in designing a distributed architecture

Hello!

The task at hand is to build a distributed storage system. Starting conditions:
1) several (5 or more) heterogeneous, geographically distributed sites;
2) each site generates gigabytes to terabytes of content (text, audio, video, records in various databases);
3) users need equal access to data from all sites to do their work;
4) we need fast search across all files, and eventually a data analysis system on top;
5) high availability and fault tolerance are required.

Now we are planning to collect all this data into one single repository that everyone can work with.

For now I am gradually studying the topic, and it is time to choose the technology we will build on. I am leaning toward deploying Hadoop, because of HDFS and the ability to develop the software we need on top of it.

Questions:
1) Is Hadoop the best choice here? Are there other suitable technologies?
2) The data currently lives on different servers. Will we have to move all of it into HDFS, or can Hadoop somehow be "pointed at" the existing data without transferring it? What about data stored in relational databases: will we have to pull it into HDFS through something like Sqoop every time we want to process it? In general, do I need to stock up on a whole pile of hard drives?
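As I understand it, pulling a relational table into HDFS with Sqoop boils down to an import job, which can also be driven from Java. A minimal sketch of what I have in mind, assuming Sqoop 1.x on the classpath; the connection string, credentials, table, and target directory are all made-up placeholders:

```java
// Minimal sketch: running a Sqoop 1.x import from Java.
// Connection string, credentials and paths are hypothetical placeholders.
import org.apache.sqoop.Sqoop;

public class OrdersImport {
    public static void main(String[] args) {
        int exit = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db-host/warehouse", // hypothetical source DB
            "--table", "orders",                           // hypothetical table
            "--username", "etl",
            "--password", "etl_password",                  // placeholder credential
            "--target-dir", "/data/warehouse/orders"       // HDFS destination
        });
        System.exit(exit);
    }
}
```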

I would be grateful for answers, criticism and links to useful articles and publications on this topic.


3 answers
relgames, 2013-11-14
@relgames

We are using Cassandra. Cons: it is difficult to maintain and difficult to work with.

Hadoop is fundamentally storage-independent: YARN (their MapReduce 2.0) lets you run computations over any data.
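To make the storage-independence point concrete: the Hadoop FileSystem API resolves the backend from the URI scheme, so the same code runs against HDFS, the local filesystem, or S3 without changes. A minimal sketch (the default path is made up):

```java
// Minimal sketch: the FileSystem implementation is resolved from the
// URI scheme, so code is not tied to HDFS specifically.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StorageAgnostic {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs://..., file://..., s3a://... all resolve to different backends
        Path p = new Path(args.length > 0 ? args[0] : "file:///tmp/sample.txt");
        FileSystem fs = p.getFileSystem(conf);
        System.out.println(p + " exists: " + fs.exists(p));
    }
}
```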

joann, 2013-12-09
@joann

Have a look at Spark (spark.incubator.apache.org) and the Hadoop distribution from MapR.
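To give a feel for Spark's programming model, here is a minimal sketch in its Java API, run locally over a hypothetical log path; Spark keeps intermediate data in memory, which is what makes it attractive for the analysis part of the question:

```java
// Minimal Spark (Java API) sketch: count error lines in logs.
// The input path is a hypothetical placeholder.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogScan {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-scan").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/*.log"); // hypothetical
            long errors = lines.filter(l -> l.contains("ERROR")).count();
            System.out.println("error lines: " + errors);
        }
    }
}
```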

plinyar, 2014-08-20
@plinyar

Your questions make the goals of the project look murkier.
On the one hand you say "we need high availability and fault tolerance", and then you ask: "The data is currently on different servers. Will we have to move all of it into HDFS, or can Hadoop somehow be 'pointed at' the existing data without transferring it?"
Accordingly, some derived questions:
* Do the data and the data-access services at the heterogeneous sites themselves satisfy the availability and fault-tolerance requirements?
* Do the interfaces for accessing the sites from Hadoop (or an equivalent) meet those requirements? Doesn't the usual reliability problem of systems with distributed data (data federation) arise here?
If they do not, then it is logical to concentrate on centralized storage in Hadoop and, accordingly, pour all the data into it.
If they do, then a combined solution seems reasonable, consisting of three subsystems:
* Indexing - provides fast search over unstructured data. See, for example, SolrCloud as part of Cloudera's Hadoop distribution. You can index the data directly from the sources (see the SolrJ sketch after this list).
* Data virtualization - a system that provides a single view of tabular data over a set of heterogeneous, distributed databases (even in the clouds). It is needed for detailed drill-down analysis without having to drag everything into a central repository (Hadoop?). As far as I know, SAS, SAP BI, and Red Hat JBoss Data Virtualization offer such solutions.
* BigData analysis - a system for analyzing very large volumes. This could also be Hadoop. The key point is that you load into it only the very large data sets you actually need to analyze, not everything indiscriminately.
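As an illustration of the indexing subsystem from the first point, here is roughly what pushing a document into SolrCloud and querying it back looks like with a recent SolrJ client; the ZooKeeper address, collection name, and field names are assumptions:

```java
// Minimal SolrJ sketch: index one document into SolrCloud, then query it.
// ZooKeeper host, collection and fields are hypothetical placeholders.
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexSketch {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            solr.setDefaultCollection("docs"); // hypothetical collection
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "site1-report-42");
            doc.addField("text", "quarterly maintenance report");
            solr.add(doc);
            solr.commit();
            long hits = solr.query(new SolrQuery("text:maintenance"))
                            .getResults().getNumFound();
            System.out.println("hits: " + hits);
        }
    }
}
```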
And let's not forget that Hadoop is not very convenient for storing video/audio data. You cannot put too many files on HDFS (there is a limit on the number of files), and if the files are small they will still each take away a 256 MB block (and yes, multiply that by 3 for replication). If, on the contrary, you put large files into HBase, then as far as I know it does not support streaming data out of a binary field: you will always have to read the entire byte array of a video in one go. IMHO, object stores such as Swift (OpenStack) are better suited for this purpose. Although it all depends on the use cases.
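To make the streaming contrast concrete: HDFS hands back a seekable stream, so a client can read a large file in chunks starting from any offset, which is exactly what a whole-cell HBase read does not give you. A minimal sketch (the file path and offset are made up):

```java
// Minimal sketch: stream a large file from HDFS in chunks, starting at an
// arbitrary offset, instead of buffering the whole thing in memory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamVideo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path video = new Path("hdfs:///media/talk.mp4"); // hypothetical path
        FileSystem fs = video.getFileSystem(conf);
        byte[] buf = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(video)) {
            in.seek(1024L * 1024L); // random access: start 1 MB into the file
            int n;
            while ((n = in.read(buf)) > 0) {
                // forward each chunk to the client instead of buffering it all
            }
        }
    }
}
```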
