In February 2020, Gartner released its Magic Quadrant for Data Science and Machine Learning Platforms. A pleasant surprise was to see Databricks among the leaders. Interestingly, it made a swift transition from the Visionaries quadrant to the Leaders quadrant within a year.
It is a well-deserved placement, since Databricks is steadily growing into a major analytics vendor. One might wonder how it has grown so fast when giants like Google and Microsoft sit in the Visionaries quadrant and the grand old IBM is still a Challenger. The primary reason is that Databricks is a unified analytics platform. This brings us to our first and foremost point:
1. Unified Analytics platform
If I were asked to pick a single reason to choose Databricks over anything else, this would be it: the fact that it is a unified analytics platform. A state-of-the-art analytics system is built by a team of Data Engineers, Data Analysts, Data Scientists and Machine Learning Engineers. The Data Engineers can build cutting-edge data pipelines by realising data architectures like the Lambda Architecture and the Delta Architecture.
Furthermore, Data Analysts can leverage the built-in visuals or connect to Databricks from tools like Power BI to analyze the data, while Data Scientists build ML models. Lastly, Machine Learning Engineers can leverage tools like MLflow to manage the end-to-end ML lifecycle.
This makes Databricks a one-stop solution for the entire analytics team, as opposed to giant vendors like Microsoft, where multiple services must be stitched together to build an end-to-end analytics system. Such a setup leads to high coupling and low cohesion, which in turn drives up the cost of integration and maintenance. Admittedly, tools like Azure Synapse Analytics show a similar promise to Databricks, but they are still in their nascent stages. So what makes Databricks such a versatile platform? The answer is simplified Apache Spark!
2. Apache Spark simplified
I can clearly remember the days when installing Spark was a nightmare. Spinning up a Spark cluster on cloud services like Azure HDInsight wasn't easy either. With Databricks, however, creating and leveraging a Spark cluster is a matter of a few clicks, and hosting on AWS and Azure makes the platform easily accessible. A key advantage is autoscaling: clusters are scaled automatically based on the compute requirements, which reduces operational and maintenance costs to a great extent.
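As a hedged illustration of what this looks like in practice, a cluster specification sent to the Databricks Clusters API can declare an `autoscale` range instead of a fixed worker count. The sketch below shows such a specification as a Python dict; the cluster name, runtime version and node type are placeholder values, not recommendations:

```python
# A sketch of a Databricks cluster specification with autoscaling.
# Instead of a fixed "num_workers", an "autoscale" block gives a range
# within which Databricks adds or removes workers as load changes.
cluster_spec = {
    "cluster_name": "autoscaling-demo",      # placeholder name
    "spark_version": "7.3.x-scala2.12",      # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",       # placeholder Azure VM type
    "autoscale": {
        "min_workers": 2,   # the cluster never shrinks below this
        "max_workers": 8,   # ...and never grows beyond this
    },
}
```

Because scaling stays between the two bounds, you pay for the large configuration only while the workload actually needs it.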
3. Multi-language and multiple platform support
Since Databricks is based on Spark, all the benefits of Apache Spark, the modern in-memory distributed computing platform, come naturally. For instance, the multi-language support of Spark can be leveraged by default: as of now, four programming languages, namely Python, Scala, R and SQL, form the core of the platform. A key advantage Databricks offers on top of this is language interoperability, which comes in especially handy when traditional ETL developers move to a Big Data environment like Databricks. For instance, a developer might read data from a data store into a PySpark DataFrame, leverage the full power of Spark SQL on it, and get the result of the SQL query back as a PySpark DataFrame to write to a data store. This lets us leverage the best of both the traditional and the Big Data worlds.
Moreover, there are hosts of Data Engineers and Data Scientists who are comfortable with a particular toolset. For instance, Informatica, a popular ETL tool, has thousands of developers, and these developers are candidate data engineers. To facilitate their smooth transition to the Big Data world while letting them retain their skill set, Databricks has partnered with Informatica for data ingestion into Delta Lakes.
Similarly, MATLAB is a famous tool for creating models, but it has its own language and environment, making it difficult for its users to migrate to Spark. Hence, Databricks has come up with a MATLAB integration, bringing out the best of the two tools. Although this integration is still in preview, it holds a lot of promise.
4. Rich Notebooks and Dashboards
The icing on the cake is the rich UI experience of Databricks. Notebook usage has risen exponentially in the Data Science and Data Engineering community, and Databricks offers the same Notebook experience with rich visualizations embedded into it. This has extra appeal for Data Scientists, since they can skip writing plotting code when exploring data.
5. Contributions to the open-source community
Last but not least, Databricks as a company has contributed immensely to the community. Two examples of this are MLflow and the Koalas package.
While MLflow helps ML engineers deploy, track and maintain models, Koalas helps them scale pandas code to a distributed environment without rewriting it in PySpark. More importantly, both projects are open source. To know more about Koalas, read this article: Databricks Koalas: the bridge between pandas and spark.
With all the above advantages, is it a surprise that Databricks has made it to the coveted position?
- Dashboard: https://databricks.com/blog/2016/02/17/introducing-databricks-dashboards.html
- This is adapted from my article which originally appeared in Analytics India Magazine.