Azure HDInsight Vs Azure Databricks

Posted by Sumit Kumar

Before discussing Azure HDInsight and Azure Databricks, let's look at Hadoop, Spark and Databricks.

Hadoop:-
Hadoop is a tool for solving big data problems.
A big data problem has two parts: storing a huge amount of data and processing it.
Apache Hadoop provides a solution for both.
Hadoop has HDFS to store the data and MapReduce to process it.
Hadoop is a scalable, fault-tolerant, open-source programming framework for distributed storage and distributed processing of large datasets on a cluster of commodity hardware.
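
To make the MapReduce idea concrete, here is a minimal sketch of the classic word count written in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration and only imitate, on a single machine, the steps that Hadoop would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data on a cluster"]))
```

In a real Hadoop job the input lines would come from HDFS, and the map and reduce functions would run in parallel on different nodes.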

Hadoop is open source and is maintained as Apache Hadoop, an open-source product for solving big data problems.
Some vendors have commercialized Apache Hadoop, selling it to clients and providing support for it.
Below are some well-known Hadoop vendors.
1)Cloudera
2)Hortonworks
3)MapR


Spark:-
Spark is a lightning-fast unified analytics engine.
Spark builds on the Hadoop MapReduce model and runs it in a more optimized way.
It supports in-memory computation.
According to the Spark project, it can run up to 100 times faster than Hadoop MapReduce when working in memory and up to 10 times faster when working on disk.
Spark does not provide storage. It is just a computation engine.

 

Databricks:-
Databricks is the company founded by the creators of Spark.

The Databricks team keeps optimizing the Apache Spark engine to run even faster.
Databricks states that its Unified Analytics Platform provides up to 5 times the performance of open-source Apache Spark.
It provides collaborative notebooks, integrated workflows, and enterprise security, all in a fully managed cloud platform.

What is Azure HDInsight:

It is Apache Hadoop running on Microsoft Azure.
In other words, Apache Hadoop is available as a cloud service.
Therefore we don't have to spend time building a cluster ourselves.
We tell Azure which type of cluster we want, and Azure creates the Hadoop cluster for us.
HDInsight uses the Hortonworks Data Platform (HDP) configuration to create the Hadoop cluster.
HDP is the distribution from Hortonworks, one of the well-known vendors that modify the original Apache Hadoop code and provide an optimized Hadoop.
HDInsight must be configured with either Azure Blob Storage or Azure Data Lake Storage (ADLS) as its HDFS-compatible storage.
On-premises Hadoop deployments commonly use MySQL to store metadata, but HDInsight uses Azure SQL Database.
We can configure a SQL database to store the Hive/Oozie metadata.

We can select one of the following cluster types:-
1)Hadoop
Petabyte-scale processing with Hadoop components like MapReduce, Hive (SQL on Hadoop), Pig, Sqoop and Oozie.
2)Spark
Fast data analytics and cluster computing using in-memory processing.
3)Kafka
Build a high throughput, low-latency, real-time streaming platform using a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
4)HBase
Fast and scalable NoSQL database.
5)Interactive Query
Build Enterprise Data Warehouse with in-memory analytics using Hive (SQL on Hadoop) and LLAP (Low Latency Analytical Processing).
Note that this feature requires high memory instances.
6)Storm
Reliably process infinite streams of data in real-time.
7)ML Services (R Server)
Analyze data at scale, build intelligent apps and discover valuable insights across your business using both R and Python. Adds 1.05754 INR per Core-Hour.

While configuring HDInsight we can select only one cluster type per cluster.
For example, if we need both Hadoop and Spark, we need to create two separate clusters.
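
The one-type-per-cluster rule can be sketched as a small planning helper. This is only an illustration in plain Python: it does not call any Azure API, and the names plan_clusters and HDINSIGHT_CLUSTER_TYPES, as well as the generated cluster names, are hypothetical.

```python
# Cluster types taken from the list above.
HDINSIGHT_CLUSTER_TYPES = {
    "Hadoop", "Spark", "Kafka", "HBase",
    "Interactive Query", "Storm", "ML Services",
}


def plan_clusters(required_workloads):
    """Return one cluster spec per distinct workload type.

    Each HDInsight cluster has exactly one type, so N distinct
    workload types mean N separate clusters.
    """
    unknown = set(required_workloads) - HDINSIGHT_CLUSTER_TYPES
    if unknown:
        raise ValueError(f"Unsupported cluster type(s): {sorted(unknown)}")
    # dict.fromkeys removes duplicates while keeping the original order.
    distinct = dict.fromkeys(required_workloads)
    return [
        {"name": f"{t.lower().replace(' ', '-')}-cluster", "type": t}
        for t in distinct
    ]


if __name__ == "__main__":
    for spec in plan_clusters(["Hadoop", "Spark"]):
        print(spec)
```

Asking for Hadoop and Spark yields two specs, which mirrors the two clusters we would have to create in the portal.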

Azure Databricks:-

Microsoft recently collaborated with Databricks to bring Azure Databricks to Azure.
We can say it is Databricks running on Azure.
Azure Databricks provides a premium Spark that is faster than open-source Spark.
It is suitable for use cases where the client needs data engineers, data scientists and business analysts to work together and easily share their workspaces, clusters and jobs through a single interface.

Azure Databricks is a fully managed PaaS offering on Azure. It does not require any admin work, and it provides security through Azure Active Directory (AAD) with no custom configuration.
In Azure HDInsight, however, security cannot be provided by AAD alone. It also requires Apache Ranger to configure Role Based Access Control (RBAC) to secure Apache Hive, Kafka, HBase, etc.

Conclusion:-

Choosing between Azure HDInsight and Databricks is tricky and varies from case to case. If the project is a typical big data project that requires heavy data processing and tools like Hive, Kafka, Sqoop and HBase, and does not require much collaboration with data scientists, we can choose HDInsight over Databricks.

On the other hand, if our project needs collaboration with data scientists and processes data with Spark, Databricks is the better choice.
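
The rule of thumb above can be written down as a small decision helper. This is a rough sketch of the reasoning in this post, not an official selection guide; the function name choose_platform and the tool list are assumptions made for illustration.

```python
# Tools named in the conclusion as reasons to prefer HDInsight.
HDINSIGHT_TOOLS = {"hive", "kafka", "sqoop", "hbase"}


def choose_platform(tools, needs_collaboration):
    """Suggest a platform from the required tools and collaboration needs."""
    tools = {t.lower() for t in tools}
    # Typical big data project: Hadoop ecosystem tools, little collaboration.
    if tools & HDINSIGHT_TOOLS and not needs_collaboration:
        return "HDInsight"
    # Spark-only processing with data scientists working together.
    if needs_collaboration and tools <= {"spark"}:
        return "Databricks"
    # Mixed requirements: the choice varies from case to case.
    return "Evaluate case by case"


if __name__ == "__main__":
    print(choose_platform(["Hive", "Kafka"], needs_collaboration=False))
    print(choose_platform(["Spark"], needs_collaboration=True))
```

Mixed requirements, such as Spark plus HBase with heavy collaboration, fall through to the last branch because neither rule clearly applies.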

For pricing, see https://azure.microsoft.com/en-in/pricing/details/databricks/

Note:-HDInsight provides the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. Databricks does not provide any of the above-mentioned open-source frameworks apart from Spark.

Reference:-

https://docs.microsoft.com/en-us/azure/azure-databricks/

https://docs.microsoft.com/en-us/azure/hdinsight/
