Log Analytics using STRING_SPLIT function


Motivating Log Analytics

Being a Data Engineer comes with its own set of challenges and opportunities: around 80 per cent of the time is spent cleaning and munging raw data, which arrives in ever-increasing variety, to support complex analytical needs like log analytics. Logs come in two types: event logs and transaction logs. Event logs, such as Active Directory (AD) logs, contain vital information that can help us track user behaviour or flag suspicious activity. Let us walk through an example in which we analyse an AD log using the built-in SQL Server function STRING_SPLIT().

STRING_SPLIT function in SQL Server:

The image below shows an example of an AD log for a logoff activity. It contains the following log inside a column:

Now, we want to process the Message column to get a tabular view like this:

Fortunately, log messages are structured and delimited by a separator. In this case, every entity in a log message is separated by a newline character, CHAR(10) in SQL. As a result, we can split the Message column on CHAR(10). Here comes the STRING_SPLIT function, which was introduced in SQL Server 2016.
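As a minimal illustration (the sample message below is made up for demonstration and is not taken from the actual AD log), STRING_SPLIT turns a newline-delimited string into one row per element:

```sql
-- Illustrative only: a two-line message split on CHAR(10).
DECLARE @msg VARCHAR(200) =
    'Security ID: S-1-5-21-100' + CHAR(10) + 'Account Domain: CONTOSO';

-- STRING_SPLIT returns a single-column table named "value",
-- with one row per delimited element.
SELECT value
FROM STRING_SPLIT(@msg, CHAR(10));
-- Returns two rows: 'Security ID: S-1-5-21-100' and 'Account Domain: CONTOSO'
```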

1. Load raw data in staging:

With Azure Data Factory, load the data residing in Azure Data Lake Storage Gen2 into an Azure SQL Database staging table called ADLog. You can use this template to create a copy activity in Azure Data Factory, or you can create your own copy activity in ADF. The table structure of ADLog is as follows:

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[ADLog](
[LogNumber] [int] IDENTITY(1,1) NOT NULL,
[Keywords] [varchar](100) NULL,
[DateandTime] [varchar](100) NULL,
[Source] [varchar](500) NULL,
[EventID] [int] NULL,
[TaskCategory] [varchar](50) NULL,
[User] [varchar](50) NULL,
[Computer] [varchar](200) NULL,
[Message] [varchar](4000) NULL
) ON [PRIMARY]
GO

We have an identity column, LogNumber. Once the data is loaded into the SQL database, the message looks something like this.

2. Use STRING_SPLIT() to split the message:

To split the message into separate cells, we use the built-in STRING_SPLIT() function introduced in SQL Server 2016.

To invoke the STRING_SPLIT function, we use CROSS APPLY, which is similar to an inner join. This is necessary because we need to retain the other columns of the row whose message is being split. Let us take one row (LogNumber = 4) and apply the split function.

SELECT TOP 1 * FROM ADLog WHERE TaskCategory = 'Logoff'

SELECT T.*, S.VALUE
FROM DBO.ADLOG T
CROSS APPLY STRING_SPLIT(T.[MESSAGE], CHAR(10)) S
WHERE T.LogNumber = 4
ORDER BY T.LogNumber

3. Data cleansing and transformation:

Firstly, we load the above result in a temp table with the following code.

SELECT T.*, S.VALUE
INTO #TEMP_LOGS
FROM DBO.ADLOG T
CROSS APPLY STRING_SPLIT(T.[MESSAGE], CHAR(10)) S
WHERE T.LogNumber = 4

Secondly, we cleanse the undesired data in the VALUE column, namely the blank rows and the rows that do not contain the ':' separator.

DELETE FROM #TEMP_LOGS WHERE VALUE = '' OR VALUE NOT LIKE '%:%'

Thirdly, we use the CHARINDEX function to split the VALUE column on the character ':' with the following code:

SELECT *, LTRIM(RTRIM(LEFT(VALUE, CHARINDEX(':', VALUE) - 1))) AS TITLE, 
LTRIM(RTRIM(REVERSE(LEFT(REVERSE(VALUE), CHARINDEX(':', REVERSE(VALUE)) - 1)))) AS [DATA] INTO #FINALDATA
FROM #TEMP_LOGS WHERE LOGNUMBER =4

SELECT TITLE, DATA FROM #FINALDATA

Lastly, we insert the base rows and then update each attribute column to get the target view.

INSERT INTO LogoffData
SELECT DATEANDTIME,COMPUTER,NULL, NULL FROM #FINALDATA GROUP BY DATEANDTIME,COMPUTER,LOGNUMBER

UPDATE LogoffData
SET SECURITYID = B.[DATA]
FROM #FINALDATA B WHERE B.TITLE LIKE '%SECURITY ID%' AND B.LOGNUMBER = 4

UPDATE LogoffData
SET [ACCOUNTDOMAIN] = B.[DATA]
FROM #FINALDATA B WHERE B.TITLE LIKE '%ACCOUNT DOMAIN%' AND B.LOGNUMBER = 4

SELECT * FROM LogoffData

Conclusion

Note that the final cleansing and transformation will vary according to the data received at the source. Furthermore, the cleansing and transformation can be automated using dynamic SQL. Having said that, the driving point here is the STRING_SPLIT() function.
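As a rough sketch of that automation (assuming the LogoffData and #FINALDATA structures used above, and a hand-maintained mapping from message titles to column names, both of which you would adapt to your own log format), the per-attribute UPDATE statements could be generated and executed dynamically instead of hand-written one per attribute:

```sql
-- Hypothetical sketch: map each message title to its target column.
DECLARE @map TABLE (Title VARCHAR(50), ColName SYSNAME);
INSERT INTO @map VALUES
    ('SECURITY ID',    'SecurityID'),
    ('ACCOUNT DOMAIN', 'AccountDomain');

-- Build one UPDATE statement per mapped attribute.
DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql = @sql
    + N'UPDATE LogoffData SET ' + QUOTENAME(m.ColName) + N' = B.[DATA] '
    + N'FROM #FINALDATA B WHERE B.TITLE LIKE ''%' + m.Title + N'%'';'
    + CHAR(10)
FROM @map m;

EXEC sp_executesql @sql;
```

Adding a new attribute then only requires a new row in the mapping table rather than another hand-written UPDATE.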

Also read: Azure Databricks tutorial: an end to end analytics

 



