Having spent five years in data analytics, I have come across a few challenges that can hamper an organization’s efforts to mature into a data-driven one. Here is a list of problems in modern data processing and information systems that I have personally run into; it may help others on their journey.
Infrastructure and velocity of data for data processing
The primary challenge in handling modern data processing requirements (especially streaming) is setting up infrastructure that can cope with high volumes and velocity of data. Cloud services such as Microsoft Azure can handle this efficiently. Two PaaS services stand out: Azure Stream Analytics and Azure Databricks.
The former is a first-party streaming service that works well with messaging services such as Azure IoT Hub or Event Hubs. The article An Introduction to Azure IoT with Machine Learning elaborates on this. The latter, Azure Databricks, is a unified analytics platform that can be used to implement a Lambda Architecture.
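To make the idea of a Lambda Architecture concrete, here is a minimal, service-agnostic sketch in plain Python. It is not how Azure Databricks or Stream Analytics implement anything; the class name LambdaCounter and the device names are made up for illustration. It only shows the three roles: a batch layer that recomputes from the full history, a speed layer that covers recent events, and a serving layer that merges both.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture that counts events per device (hypothetical example)."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # accurate view, rebuilt from the full log
        self.speed_view = Counter()  # incremental view of events since the last batch run

    def ingest(self, device_id):
        # Every event is stored for the batch layer and counted by the speed layer.
        self.master_log.append(device_id)
        self.speed_view[device_id] += 1

    def run_batch(self):
        # Batch layer: recompute the view from the whole log, then reset the
        # speed layer because the batch view now covers everything it held.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, device_id):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[device_id] + self.speed_view[device_id]


pipeline = LambdaCounter()
for device in ["sensor-a", "sensor-b", "sensor-a"]:
    pipeline.ingest(device)
pipeline.run_batch()
pipeline.ingest("sensor-a")        # arrives after the batch run
print(pipeline.query("sensor-a"))  # 3: two from the batch view, one from the speed view
```

In a real deployment the log, the batch job and the streaming job would be separate services; the sketch keeps them in one object only to show how the views combine.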
As organizations grow in size and functions, different teams maintain their information systems differently. This leads to silos, which in turn increase the effort needed to discover, collect and consolidate data for building analytical models.
When life was simple with traditional databases, we had data dictionaries. However, with the complexity of modern data platforms, we need a more interactive solution such as a data catalog. A data catalog is a collection of metadata along with tools to search and curate data. It gives a single view of all the data sources present in the system so that Data Engineers and Scientists can leverage them to speed up their development cycles.
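As a rough illustration of "a collection of metadata along with tools to search and curate data", the sketch below models a tiny in-memory catalog. The names DatasetEntry and DataCatalog, the fields and the sample datasets are all assumptions made for this example; a real catalog product offers far more (lineage, access control, automated scanning).

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one data source (hypothetical fields)."""
    name: str
    source_system: str
    owner: str
    description: str
    tags: set = field(default_factory=set)


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Add or overwrite the metadata record for a dataset.
        self._entries[entry.name] = entry

    def tag(self, name, *tags):
        # Curation: attach free-form tags to an existing entry.
        self._entries[name].tags.update(tags)

    def search(self, keyword):
        # Case-insensitive match on name, description or tags.
        keyword = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if keyword in e.name.lower()
            or keyword in e.description.lower()
            or any(keyword in t.lower() for t in e.tags)
        )


catalog = DataCatalog()
catalog.register(DatasetEntry("sales_orders", "ERP", "finance-team", "Daily order lines"))
catalog.register(DatasetEntry("web_clicks", "clickstream", "marketing-team", "Raw page-view events"))
catalog.tag("web_clicks", "customer", "streaming")
catalog.tag("sales_orders", "customer")
print(catalog.search("customer"))  # ['sales_orders', 'web_clicks']
```

Two teams with separate systems now show up under one search, which is the point of the single view described above.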
Avoiding Data Swamps
There is little doubt that big data empowers us. However, as the classic saying goes, with great power comes great responsibility; the boon of big data can easily turn into a bane if mishandled.
One of the classic problems encountered in big data projects is the data swamp: an uncontrolled state of a Data Lake. Because ingesting data into a Data Lake is so easy, you can quickly lose control over it, making it worthless.
To overcome this, we adopted strict governance and security practices, a process that transforms the Data Lake into a Data Hub.
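One simple governance practice is to refuse data that arrives without basic metadata. The sketch below is only an illustration of that idea, not the process we actually used; the required fields in REQUIRED_METADATA and the GovernedLake class are hypothetical.

```python
# Hypothetical policy: every dataset must declare these fields before ingestion.
REQUIRED_METADATA = ("owner", "description", "classification")


class GovernedLake:
    def __init__(self):
        self.datasets = {}

    def ingest(self, name, records, metadata):
        # Reject the dataset if any required metadata field is missing or empty.
        missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
        if missing:
            raise ValueError(f"Rejected '{name}': missing metadata {missing}")
        self.datasets[name] = {"records": records, "metadata": dict(metadata)}


lake = GovernedLake()
lake.ingest(
    "sensor_readings",
    [{"temp": 21.5}],
    {"owner": "iot-team", "description": "Hourly temperature readings", "classification": "internal"},
)

try:
    # No description or classification, so the gate turns it away.
    lake.ingest("mystery_dump", [{"x": 1}], {"owner": "unknown"})
except ValueError as err:
    print(err)
```

The check is deliberately strict: data nobody can describe or classify is exactly what turns a lake into a swamp.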
Focus on tools instead of fundamental concepts
This is more of a people problem. I have often seen people holding on to tools rather than the fundamental concepts. Once you are addicted to a tool or a technique, stagnation is on the horizon.
A classic example amongst ML beginners is a bias (which should be controlled in ML anyway) towards one algorithm over others. This is even more evident amongst potential Data Engineers, who prefer sticking to their favourite tools for ETL (Extract, Transform, Load). We need to understand that solving the problem at hand is much more important than using a particular tool.
Analysts and developers also tend to stick to their favourite visualization tools, hampering their own career growth as well as their organization’s growth.
Addressing Tech Debt
This is true of any domain, including data processing. With fast-paced changes in tools and technologies, sticking to older technologies might expose systems to serious security flaws. Hence, as and when feasible, organizations need to upgrade their infrastructure to pay down that debt. One example might be moving from Microsoft Azure Data Lake Gen1 to Gen2.
Scaling to a Distributed Environment like Spark
Data Scientists usually create models using libraries like Pandas, generally working on single-node machines. ML engineers, however, operationalize the models in a distributed environment like Spark. To scale the models to big datasets on Spark, ML engineers need to rewrite them in PySpark, which leads to a lot of rework. Fortunately, Databricks has come up with libraries like Koalas, which acts as a bridge between Pandas and Spark. Read this article to know more.
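The sketch below shows why such a bridge reduces rework. It runs on plain Pandas so that it works on a single machine; the function uses only groupby and mean, which the Pandas-style API on Spark also provides. Koalas was later folded into PySpark itself as pyspark.pandas (from Spark 3.2), so on a cluster the intended change would be the import line shown in the comment. The function name and sample data are made up for illustration.

```python
import pandas as pd
# On a Spark cluster, the intended swap would be one of:
#   import pyspark.pandas as pd      (Spark 3.2 and later)
#   import databricks.koalas as pd   (the original Koalas package)


def average_amount_by_region(df):
    # Only groupby/mean/sort_index are used, so the same code is meant to run
    # against either a Pandas DataFrame or a pandas-on-Spark DataFrame.
    return df.groupby("region")["amount"].mean().sort_index()


orders = pd.DataFrame(
    {"region": ["east", "west", "east", "west"], "amount": [10.0, 20.0, 30.0, 40.0]}
)
result = average_amount_by_region(orders)
print(result.to_dict())  # {'east': 20.0, 'west': 30.0}
```

Not every Pandas call has a distributed equivalent, so treat this as a starting point rather than a guarantee that existing notebooks will port unchanged.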
These are a few of the problems and challenges that organizations tend to face in their journey towards data maturity. Hope this was helpful. Please note that this is for information purposes and comes from personal experience, so we cannot guarantee its completeness. Moreover, challenges vary from organization to organization, so we cannot vouch for 100% accuracy either.
Note: Inspired by my insights for Analytics India Magazine