The V’s of Big Data: Before Data Hub
Life was simpler when we restricted ourselves to spreadsheets and relational data stores. Data was structured, databases were simple, and ETL processes were well defined, generally running in batches.
However, with changing times, data started flowing in multiple formats, at high speeds and in high volumes. This high volume, velocity and variety ushered in the era of Big Data. These three V’s dominate every CIO’s mind, and rightly so. Furthermore, new analytical and reporting needs crop up every day, which brings us to the fourth V: veracity, the accuracy and trustworthiness of the data. Notably, as systems grow more complex, veracity is often the first casualty. Let us delve a bit deeper to understand how the changing landscapes and paradigms affect the veracity of data.
The changing landscape
With this change in paradigm, methodologies have been changing too. ETL processes are transforming into ELT processes. In an ETL process, we extract data from source systems, transform the data on the fly and load it into the destination systems. These destination systems are called ‘Data Warehouses’.
In contrast, in an ELT process, transformation does not immediately follow extraction. We load the extracted data directly into destination systems called ‘Data Lakes’ and then transform the data in the lake as reporting requirements arise, as opposed to the pre-defined transformations of ETL. The image below clarifies the two concepts.
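The difference in ordering can be sketched in plain Python. This is a minimal illustration, not a real pipeline: all function and variable names are hypothetical.

```python
# Sketch contrasting ETL and ELT; plain Python lists stand in for
# real source, warehouse and lake systems.

def extract():
    # Raw records as they arrive from a source system.
    return [{"name": " Alice ", "amount": "100"},
            {"name": "Bob", "amount": "250"}]

def transform(rows):
    # Cleanse and type-convert the records.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])}
            for r in rows]

def etl(warehouse):
    # ETL: transform *before* loading; the warehouse only ever
    # sees clean, structured rows.
    warehouse.extend(transform(extract()))

def elt(lake):
    # ELT: load the raw extract as-is; transformation happens
    # later, inside the lake, per reporting need.
    lake.extend(extract())

warehouse, lake = [], []
etl(warehouse)
elt(lake)
print(warehouse[0]["name"])   # "Alice" - already cleansed
print(lake[0]["name"])        # " Alice " - still raw
```

The point is purely about where the `transform` step sits: before the load (warehouse) or deferred until query time (lake).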
Apart from ELT, a paramount advantage that a Data Lake offers over a Data Warehouse is flexibility. Data comes in three broad varieties: structured, semi-structured and unstructured. Additionally, we can receive sensor data and real-time feeds at high velocity and in high volumes. Data Lakes can accommodate any kind of data, whereas traditional data warehouses are limited to structured data. Furthermore, with cheap storage on the cloud, volume is no longer a constraint. Thus, Data Lakes are a one-stop solution to emerging Big Data needs.
However, a Data Lake is no exception to the strength-weakness paradox: your greatest strength can become your greatest weakness. The aforementioned boon of flexibility can easily turn into a bane if not handled with care. It is only a matter of time before your Data Lake turns into a Data Swamp.
Data Lake to Data Swamp
Going by the concept, a Data Lake is typically a NoSQL or Hadoop-based datastore that can accommodate any kind of data. Hence, we can ingest unrelated raw data for undetermined later use. The infographic below illustrates a data lake.
However, the very ease with which data can be ingested into a data lake can make an organisation lose control over it. This uncontrolled state of a Data Lake is called a Data Swamp. The next infographic illustrates a Data Swamp.
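The mechanics of that loss of control can be sketched in a few lines. This is a hypothetical illustration: a lake accepts any payload, and because metadata is optional, the contents quickly become undiscoverable.

```python
# Illustrative sketch of why frictionless ingestion degrades a lake:
# anyone can drop anything in, and without enforced metadata the
# lake's contents lose owner and provenance. Names are hypothetical.

data_lake = []

def ingest(payload, metadata=None):
    # A lake accepts any payload; metadata is optional - and that
    # optionality is exactly the problem.
    data_lake.append({"payload": payload, "metadata": metadata or {}})

ingest({"order_id": 1, "total": 99.5},
       {"source": "orders-db", "owner": "sales"})
ingest("<xml>legacy dump</xml>")          # no metadata at all
ingest(b"\x00\x01 binary sensor blob")    # no metadata at all

undocumented = [d for d in data_lake if not d["metadata"]]
print(f"{len(undocumented)} of {len(data_lake)} datasets have no owner or source")
# prints: 2 of 3 datasets have no owner or source
```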
The Data Hub
This problem of the data swamp, which is essentially degraded data veracity, is solved by the concept of a Data Hub. Although a data swamp serves the purpose of unifying data under one roof, it is practically useless, since data scientists and analysts have to work extremely hard to extract insights from it. Moreover, swamps are unfit to serve as master data, owing to their unreliability and lack of standardisation. As a result, business decisions cannot rely on them.
To combat the inefficacy of these swamps, the concept of the Data Hub emerged. A Data Hub is an extension of a data lake: a centralised location to store data. However, in addition to centralised storage, Data Hubs support the following:
- Governance: To structure data processing and maintain its veracity.
- Security: To maintain access control.
- Indexing: To ensure fast retrieval of data.
- Transactional Integrity: ACID properties for the Data Lake.
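These four capabilities can be pictured as a thin layer over raw lake storage. The toy class below sketches governance, security and indexing; all names are illustrative, and in practice transactional integrity would be delegated to the backing store.

```python
# Toy sketch of a Data Hub layering governance, security and
# indexing on top of raw storage. Not a real product's API.

class DataHub:
    def __init__(self):
        self._store = []    # centralized storage (the "lake" part)
        self._index = {}    # dataset name -> position, for fast lookup
        self._acl = {}      # dataset name -> set of allowed readers

    def ingest(self, name, payload, owner, readers):
        # Governance: refuse datasets without an owner (veracity).
        if not owner:
            raise ValueError("governance: every dataset needs an owner")
        self._index[name] = len(self._store)   # indexing
        self._acl[name] = set(readers)         # security
        self._store.append(payload)

    def read(self, name, user):
        # Security: enforce access control on retrieval.
        if user not in self._acl.get(name, set()):
            raise PermissionError(f"{user} may not read {name}")
        return self._store[self._index[name]]

hub = DataHub()
hub.ingest("sales_2024", [{"order": 1}], owner="sales", readers={"analyst"})
print(hub.read("sales_2024", "analyst"))
```

An `ingest` without an owner, or a `read` by an unauthorised user, fails loudly instead of silently polluting the lake.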
A deeper analysis might reveal an intersection between Data Warehouses and Data Hubs. In essence, we can call a Data Hub a ‘Data Warehouse in a Big Data ecosystem’. However, traditional data warehouses, though reliable, are known for their homogeneous, non-agile characteristics, whereas data lakes, although heterogeneous and agile, can degrade into data swamps. Hence, MarkLogic defines a Data Hub as ‘Data Lakes done right’. Nonetheless, the key question remains: how do we realise a Data Hub? Let us walk through a brief set of steps or guidelines.
Implementation of a Data Hub
Ideally, an organisation should avoid getting into a data swamp state in the first place. However, even if the Data Lake turns into a swamp, all hope of resurrection is not lost. The following broad steps might be helpful in implementing a Data Hub.
- Identify business objectives.
- Identify data assets.
- Semi-structured data to NoSQL databases.
- Structured data to SQL databases.
- Security for data assets.
- Appropriate data engineering processes to maintain the integrity of data.
Let us walk through each of these steps.
Identify business objectives
There is no magic trick to solve a data swamp problem. A data hub is realised progressively, by targeting one business objective at a time. Define the analytical goals you want to achieve and prioritise them according to business needs.
Identify data assets
After defining and prioritising analytical goals, identify the related data assets. This helps you find the data assets available to you and expose any gaps. If you are implementing your data hub in the Azure ecosystem, you might consider using Azure Data Catalog, which helps you discover your data assets and their metadata in a single view.
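Conceptually, a catalog is a searchable mapping from assets to metadata and objectives. The sketch below is a hypothetical in-memory stand-in for that role, not the Azure Data Catalog API; asset and goal names are made up.

```python
# Minimal stand-in for a data catalog: one view of which assets
# exist, what kind they are, and which business goal they serve.

catalog = [
    {"asset": "orders.csv",       "kind": "structured",      "goal": "sales forecast"},
    {"asset": "clickstream.json", "kind": "semi-structured", "goal": "churn analysis"},
    {"asset": "call_audio/",      "kind": "unstructured",    "goal": None},  # gap
]

def assets_for(goal):
    # Which assets back a given analytical goal?
    return [e["asset"] for e in catalog if e["goal"] == goal]

def gaps():
    # Assets with no business objective attached are swamp candidates.
    return [e["asset"] for e in catalog if e["goal"] is None]

print(assets_for("sales forecast"))   # ['orders.csv']
print(gaps())                         # ['call_audio/']
```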
Semi-structured data to NoSQL databases
After identifying your data assets, classify them as structured, semi-structured or unstructured. This helps you decide on the appropriate data stores. For instance, for semi-structured data such as XML, JSON or graph data, we can opt for NoSQL databases like Azure Cosmos DB.
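The key property of such stores is that documents need no fixed schema. The sketch below uses an in-memory dictionary as a stand-in for a document store like Cosmos DB (it is not the Cosmos DB SDK); the document shapes are invented.

```python
import json

# Sketch: semi-structured records treated as schema-free documents,
# the way a NoSQL document store would hold them.

collection = {}   # id -> document

def upsert(doc_json):
    doc = json.loads(doc_json)    # documents arrive as JSON text
    collection[doc["id"]] = doc   # no fixed schema required

# Two documents with completely different shapes coexist happily:
upsert('{"id": "u1", "name": "Alice", "tags": ["vip"]}')
upsert('{"id": "e1", "event": "click", "meta": {"page": "/home"}}')

print(collection["u1"]["name"])           # Alice
print(collection["e1"]["meta"]["page"])   # /home
```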
Structured data to SQL databases
Likewise, structured data assets such as CSV and Excel files can be stored in a relational SQL database, for example Azure SQL Database or Azure SQL Data Warehouse. Moreover, SQL databases (and many NoSQL databases) provide indexing and ACID properties for fast retrieval and transactional integrity.
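Both properties are easy to demonstrate with SQLite as a stand-in for a managed SQL store (table and column names are illustrative): a fixed schema with an index for retrieval, and a transaction in which either both inserts commit or neither does.

```python
import sqlite3

# Sketch: a fixed schema, an index for fast retrieval, and an
# atomic transaction for integrity, using SQLite as a stand-in.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_region ON sales (region)")   # indexing

with conn:   # ACID: the block commits as a unit, or rolls back on error
    conn.execute("INSERT INTO sales (region, amount) VALUES ('EU', 100.0)")
    conn.execute("INSERT INTO sales (region, amount) VALUES ('US', 250.0)")

total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'EU'"
).fetchone()[0]
print(total)   # 100.0
```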
Security to data assets
In the era of the data explosion, security is a key concern. CIOs and architects need to spend considerable time designing for the security of data assets. Fortunately, cloud offerings like Azure Data Lake Storage Gen2 come with built-in security features such as encryption and access control.
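At its core, access control is a policy mapping assets to the roles allowed to touch them, which services like Data Lake Storage Gen2 enforce natively through ACLs. The sketch below shows only that idea; the roles and asset paths are hypothetical.

```python
# Hypothetical sketch of role-based access control over data assets.
# Real platforms enforce this in the storage layer via ACLs.

PERMISSIONS = {
    "raw/sensor-feed": {"data-engineer"},
    "curated/sales":   {"data-engineer", "analyst"},
}

def can_read(role, asset):
    # Unknown assets default to "no access".
    return role in PERMISSIONS.get(asset, set())

print(can_read("analyst", "curated/sales"))    # True
print(can_read("analyst", "raw/sensor-feed"))  # False
```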
Appropriate data engineering processes to maintain the integrity of data
Lastly, architects have to set up appropriate governance practices to onboard data assets into the data hub. This entails efficient data engineering (ETL/ELT) processes, whose design flows from the first step: identifying business objectives.
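One simple governance practice is a validation gate at onboarding time: a dataset enters the hub only if it carries the metadata the business objective requires. The required fields below are illustrative assumptions, not a standard.

```python
# Sketch of a governance gate at onboarding time. The required
# metadata fields are hypothetical examples.

REQUIRED = {"owner", "source", "objective"}

def onboard(hub, name, payload, metadata):
    missing = REQUIRED - metadata.keys()
    if missing:
        raise ValueError(f"cannot onboard {name}: missing {sorted(missing)}")
    hub[name] = {"payload": payload, **metadata}

hub = {}
onboard(hub, "sales_2024", [{"order": 1}],
        {"owner": "sales", "source": "orders-db", "objective": "forecast"})
print(sorted(hub))   # ['sales_2024']
```

Contrast this with the `ingest` of a plain lake, where metadata is optional: here the gate is what keeps the hub from regressing into a swamp.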
It is inevitable that paradigms change with technological landscapes and business needs. The Data Hub is an outgrowth of the emerging Big Data landscape. This article is a brief introduction to the concept of Data Hubs, along with a rough roadmap to realise them. Please note that it is not exhaustive in any measure. Furthermore, it is for information purposes only, and we do not warrant its completeness or accuracy.
Image credit: ETL vs ELT
Data Lake and Data swamp image credits: Cleaning the Data Lake with Operational Data Hub