Thursday, June 23, 2011

SSIS:Data Validation

VALIDATION IN SSIS

Why we need Validation ?
I have not thought of validating the input in SSIS package until recently I found myself in a position to do it. In my previous packages I found a way to redirect the error outputs while doing a data conversion. But what about situations where you need to validate the input file row by row before updating in to the destination. The advantage of validating input file is mainly to have custom error messages logged in to my logging table.

For example when the format of my expected input column “A” is not valid then I could log the error message “The Column A is not valid in the Row 120” .This could save a lot of my time to correct the input file or for further analysis.

Pros and Cons



I can validate so much more apart from data types and null values. For example I could validate the mobile number format, my area code format or email id and so on. In .NET we normally do this validation to check the data. Similarly when we are moving the data from one source to another source we can implement the validation to check data quality. Again this comes under the cost of performance. These validation are normally done in a script component task in Dataflow Transformation task.
We could also have these validation methods in a common dll and use this across all your packages which would let you re use the code instead of writing it again and again.

EXAMPLE
The following sample shows how we could implement validation
My input file looks something like this

*Click the image for a clear view

A simple input file with columns Name, Mobile Number , DOB , EmployeeID and Salary.
This is my Package


*Click the image for a clear view

My package will read the data columns in the input textfile as strings.
My destination table is Employee with the following data types for the columns.


*Click the image for a clear view

I need to convert DOB from String to Date , EmployeeId from string to integer and Salary from string to decimal.Also I need to validate the input file data and log the error messages appropriately and send the error records to the error text file.

I have created two output paths in my script component


*Click the image for a clear view

Output 0 is the validated output where as Error Output has the error records.At this point I was stuck for a while on how to redirect the rows using methods in script. Finally after some googling I figured out the ExclusionGroup Property.
You need to manually set this to 1 .After doing so the system will generate 2 methods called DirectOutput0 and DirectErrorOutput .You can then use this method in your script component to redirect the output row to the respective output.



I am firing the warning messages here for each error in the script.This will be logged in the database table.I have enabled the logging messages for “OnWarning”


*Click the image for a clear view

I have a Dataconversion component to convert the input columns to the respective data types for the destination table and finally updating this to the Employee table.


This is just a simple example to show you how we could implement logging.In reality the validations will change as per the requirement.This approach is good as long as your number of records is less.Incase if there are millions of records , this might well affect the performance of the package.I used this in my case where I need to import some very crucial limited data.

Hope this article was useful.


Following is the code in script component which takes care of the validation
int RowNum = 1;

public override void PreExecute()
{
base.PreExecute();
}

public override void PostExecute()
{
base.PostExecute();
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
IList resultList = new List();
bool errorOccured = false;
// Validating the fields
if (string.IsNullOrEmpty(Row.DOB))
{
resultList.Add("The DOB is NULL for the Row " +RowNum.ToString() );
errorOccured = true;
}

if (!(isDateFormat(Row.DOB)))
{
resultList.Add("The DOB date FORMAT IS INCORRECT in the Row " + RowNum.ToString());
errorOccured = true;
}

if (string.IsNullOrEmpty(Row.EmployeeId))
{
resultList.Add("The EmployeeId is NULL in the Row " + RowNum.ToString() );
errorOccured = true;
}

if (!(checkIfNumber(Row.EmployeeId)))
{
resultList.Add("The EmployeeId is NOT AN INTEGER in the Row " + RowNum.ToString() );
errorOccured = true;
}

if (string.IsNullOrEmpty(Row.Name))
{
resultList.Add("The Employee Name is NULL in the Row " + RowNum.ToString() );
errorOccured = true;
}

if (string.IsNullOrEmpty(Row.Salary))
{
resultList.Add("The Salary is NULL in the Row " + RowNum.ToString() );
errorOccured = true;
}

if (!(isDecimal(Row.Salary)))
{
resultList.Add("The Salary is NOT A DECIMAL in the Row " + RowNum.ToString() );
errorOccured = true;
}

if (string.IsNullOrEmpty(Row.Mobile))
{
resultList.Add("The Mobile Number is NULL in the Row " + RowNum.ToString());
errorOccured = true;
}

if (isMobilePhone(Row.Mobile))
{
resultList.Add("The Mobile Number is NOT VALID in the Row " + RowNum.ToString());
errorOccured = true;
}

//Throw warning messages in to the log table
foreach (string message in resultList )
{
this.ComponentMetaData.FireWarning(0, this.ComponentMetaData.Name, message, "", 0);
}

//Directing the row
if (errorOccured)
{
Row.DirectRowToErrorOutput();
}
else
{
Row.DirectRowToOutput0();
}

RowNum = RowNum + 1;
}

public bool isDecimal(string value)
{
decimal number;
bool canConvert = true;
canConvert = decimal.TryParse(value, out number);
return canConvert;

}
public bool isDateFormat(string Transactiondate)
{

string month = Transactiondate.Substring(3, 3).ToLower();
string day = Transactiondate.Substring(0, 2).ToLower();
string year = Transactiondate.Substring(7, 2).ToLower();

if (month == "jan" || month == "feb" || month == "mar" || month == "apr" || month == "may" || month == "jun" || month == "jul" || month == "aug" ||
month == "sep" || month == "oct" || month == "nov" || month == "dec")
{

if (checkIfNumber(year))
{
if (Int16.Parse(year) < 100)
{

if (checkIfNumber(day))
{
if (Int16.Parse(day) < 32)
{
return true;

}
else
{
return false;
}


}
else
{

return false;
}

}
else
{
return false;

}

}
else
{

return false;
}

}

else
{

return false;

}

}
public bool checkIfNumber(String integer)
{

try {

int.Parse(integer);

}
catch (Exception e)
{
return false;
}

return true;
}
public bool isMobilePhone(string Mobile)
{

string areaCodeRegExp = @"\(?\s*\d{3}\s*[\)\.\-]?\s*\d{3}\s*[\-\.]?\s*\d{4}";

if (System.Text.RegularExpressions.Regex.IsMatch(Mobile, areaCodeRegExp))
{
return true;
}
else
{
return false;
}
}

7 comments:

  1. Very good article. How about reference data validation?

    ReplyDelete
  2. I agree with you… this is a great posted article.

    ReplyDelete
  3. great article; but when i use your code there is always an error:

    IList resultList = new List(); "type or namespace name List could not be found" Can you help me?

    ReplyDelete
    Replies
    1. check for using System.Collections.Generic;
      in the list of namespaces

      Delete
  4. This idea is mind blowing. I think everyone should know such information like you have described on this post. Thank you for sharing this explanation.Your final conclusion was good. We are sowing seeds and need to be patiently wait till it blossoms.

    Hadoop Training in Chennai

    Base SAS Training in Chennai

    MSBI Training in Chennai

    ReplyDelete
  5. Row.DirectRowToErrorOutput(); is not recognised

    ReplyDelete
  6. Thanks for sharing this informations.
    CCNA Training Institute in Coimbatore

    CCNA Course in Coimbatore

    Java training in coimbatore

    Selenium Training in Coimbatore

    ios training in coimbatore

    aws training in coimbatore

    big data training in coimbatore

    hadoop training in coimbatore

    ReplyDelete

Creating a Meta Data Driven SSIS Solution with Biml

Biml Biml ( Business Intelligence Markup Language ) is a markup language that enables you to quickly represent a variety of database ...