Having spent six years in data analytics, I have come across several challenges that can hamper an organization’s efforts to mature into a data-driven one. Here is a list of the problems in modern data processing and information systems that I have run into, in the hope that it helps others on their journey.
Infrastructure and velocity of data for data processing
The primary challenge in handling modern data processing requirements, especially streaming, is setting up infrastructure that can cope with the high volume and velocity of data. Cloud services such as Microsoft Azure let us handle this efficiently. Two PaaS services stand out: Azure Stream Analytics (ASA) and Azure Databricks.
ASA is an Azure-native streaming service that pairs well with messaging services such as Azure IoT Hub and Event Hubs. The article An Introduction to Azure IoT with Machine Learning explains more about this. Azure Databricks is a unified analytics platform on which a Lambda Architecture can be implemented.
As organizations grow in size and functions, different teams maintain their information systems differently. The result is data silos, which increase the effort needed to discover, collect, and merge data for building analytical models.
When life was simple with traditional databases, data dictionaries were enough. With the complexity of modern data platforms, however, we need a more interactive solution such as a data catalog. A data catalog is a collection of metadata together with tools to search and curate data. It gives a single view of all the data sources in the system, so that Data Engineers and Data Scientists can use them to speed up their development cycles.
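To make the idea concrete, here is a minimal sketch of a catalog as "metadata plus a way to search it". It is only an in-memory illustration: the class names, fields, and sample sources below are all hypothetical and do not reflect the API of any real cataloging product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data source (all fields are illustrative)."""
    name: str
    owner: str
    location: str
    description: str = ""
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """A toy in-memory catalog: register sources, then search their metadata."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Entries are keyed by name, so re-registering a name overwrites it.
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[CatalogEntry]:
        """Return entries whose name, description, or tags contain `term`."""
        needle = term.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]


# Hypothetical sources owned by two different teams.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    "sales_orders", "finance-team", "sql://erp/orders",
    "Daily order lines", {"sales", "erp"},
))
catalog.register(CatalogEntry(
    "web_clicks", "marketing-team", "lake://raw/clicks",
    "Clickstream events", {"web"},
))
matches = [e.name for e in catalog.search("sales")]
```

A real catalog would add lineage, access control, and a shared store, but even this small version shows why a single searchable index helps against silos: an engineer can find the finance team's table without knowing which team owns it.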
Avoiding Data Swamps
Big data undeniably empowers us. But with great power comes great responsibility: the boon of big data can easily turn into a bane if mishandled.
Big data projects often face the problem of data swamps, that is, Data Lakes in an uncontrolled state. Because ingesting data into a Data Lake is so easy, it is just as easy to lose control over what ends up in it. To overcome this, we adopted strict governance and security practices, which are transforming the Data Lake into a Data Hub.
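One simple governance practice is to refuse ingestion of any dataset that arrives without basic metadata. The sketch below illustrates that idea only. The required fields and function names are assumptions made for the example, not the specific rules we adopted.

```python
from __future__ import annotations

# Hypothetical minimum metadata a dataset must carry before it enters the lake.
REQUIRED_METADATA = ("owner", "source_system", "schema", "retention_days")


def missing_metadata(dataset: dict) -> list[str]:
    """List the required fields that are absent or empty for a dataset."""
    return [key for key in REQUIRED_METADATA if not dataset.get(key)]


def admit_to_lake(dataset: dict) -> bool:
    """Admit a dataset only when every required field is filled in."""
    return not missing_metadata(dataset)


documented = {
    "name": "sales_orders",
    "owner": "finance-team",
    "source_system": "erp",
    "schema": ["order_id", "amount"],
    "retention_days": 365,
}
undocumented = {"name": "tmp_export", "owner": "unknown-user"}

print(admit_to_lake(documented))       # admitted
print(missing_metadata(undocumented))  # fields still to be supplied
```

A check like this is cheap to run at the ingestion boundary, and it is much easier to enforce there than to reconstruct ownership and schema for files that are already in the lake.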
Focus on tools instead of fundamental concepts
This is a people problem. I have often seen people hold on to tools rather than fundamental concepts. Once someone becomes attached to a single tool or technique, stagnation is on the horizon.
A classic example among ML beginners is a bias towards one algorithm over all others. The problem is even more evident among aspiring Data Engineers, who prefer sticking to their favourite ETL (Extract, Transform, Load) tools. Likewise, analysts and developers stick to their favourite visualization tools, which hampers both their own career growth and their organization’s growth.
Addressing Tech Debt
Tech debt affects every domain, including data processing. With tools and technologies changing fast, sticking to older ones might expose systems to serious security flaws. Hence, whenever feasible, organizations need to upgrade their infrastructure to pay down that debt. One example is moving from Microsoft Azure Data Lake Gen1 to Gen2.
Scaling to a Distributed Environment like Spark
Data Scientists usually create models using libraries like pandas, which run on a single node. ML engineers, however, operationalize those models in a distributed environment like Spark. To scale the models to big datasets on Spark, ML engineers need to rewrite them in PySpark, which leads to a lot of rework. Fortunately, Databricks has come up with libraries like Koalas, which acts as a bridge between pandas and Spark. Read this article to know more.
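The sketch below shows the kind of rework such a bridge is meant to avoid. The function uses only pandas-style calls, and the assumption is that it would accept a pandas-on-Spark DataFrame unchanged once the import is swapped (Koalas has since been folded into PySpark as `pyspark.pandas`). The column names and data are hypothetical, and only the single-node pandas path is exercised here.

```python
def total_by_region(df):
    """Sum `amount` per `region` using only pandas-style calls.

    Nothing here is specific to one engine, so the same function is
    expected to work on a pandas or a pandas-on-Spark DataFrame.
    """
    return df.groupby("region")["amount"].sum().sort_index()


def demo():
    # Single-node version. On a Spark cluster the assumption is that only
    # this import changes, e.g. `import pyspark.pandas as pd`
    # (or `import databricks.koalas as pd` on older setups).
    import pandas as pd

    sales = pd.DataFrame(
        {"region": ["north", "south", "north"], "amount": [120.0, 80.0, 30.0]}
    )
    return total_by_region(sales).to_dict()
```

Calling `demo()` returns the per-region totals as a plain dictionary. Not every pandas operation carries over to Spark unchanged, and some behave differently on distributed data, so it is worth testing on a small sample before scaling up.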
These are a few of the challenges that organizations face on their journey to data maturity. I hope this was helpful. Please note that this list is for information only and comes from personal experience, so I cannot guarantee its completeness or accuracy. Challenges will vary from organization to organization.
Note: Inspired by my insights for Analytics India Magazine