SQL for Data Quality
I gave a workshop on managerial and technical challenges of integration projects at Sage Insights in Nashville last week.
One of the areas I covered was some useful SQL queries for checking data integrity during a migration; the example used was from a Siebel to SageCRM migration project we are currently engaged in. I had a number of requests from audience members for details of the queries, so here they are.
An Inaport profile had been used to move Accounts and Contacts from the Siebel database to SageCRM. After the profile had been run, a simple sanity check was used:
select count(*) from accounts select count(*) from crm_db.dbo.company
The first point to note is that we can do a cross database query – the second query above is against the SageCRM database, even though the query window is currently logged into the Siebel database.
The more important point is that the count of Accounts in Siebel is 16,538, but only 7,794 in SageCRM – a difference of 8,744, This is probably not correct, and needs investigation. The first step is to confirm the problem:
select count(id) from accounts where id not in (select comp_siebelrowid from crm_db.dbo.company)
This query is an example of using a sub-query – it pulls all the siebel ids from the SageCRM table, then finds which ids from the source Siebel table have not made it across. The result shows that 8,744 ids are not there, which correlates with the simple counts above.
The most likely problem s the query being used in the Inaport profile is not getting all the Accounts. The query used was:
select * from accounts a inner join contacts c on a.id = c.acntid
Doing a count on this returns 11,282 rows. Initially this does not look like it matches against the numbers above, but the explanation is simple – there are more than one contact per account, so the inner join is resulting on more than one row per account. A quick look at the Inaport log file for details on the Company table shows 7,794 inserts and 3,488 updates, which ties in with the numbers above.
The experienced reader will see the isuse by now; a left outer join should have been used instead of an inner join. An inner join will return only rows from the Accounts table that have a matching row in the Contacts table i.e. Accounts with Contacts.
A left outer join will return all rows from the Accounts table (even if the Account does not have a Contact), and any rows from Contacts that match. Running the following query shows the effect:
select count(*) from accounts a left outer join contacts c on a.id = c.acntid
This returns 20,026 rows, which looks a lot more reasonable. To be absolutely sure, the following query gives us a count of the number of distinct account ids in the query (notice that the sub-query needs to be named as a virtual table):
select count(*) from (select distinct a.id from accounts a left outer join contacts c on a.id = c.acntid) sq
This query retuns 16,538, which is the correct number of accounts. So we have now confirmed that a left outer join query will return the correct number of accounts; this means that there are some accounts with no contacts.
Now that we are checking data integrity, the obvious question is, are there any contacts that are not linked to accounts? The following query tells us:
select count(*) from contacts where (acntid is null) or (acntid not in (select id from accounts))
This shows us that there are 2099 contacts that are not linked to accounts – they will therefore not be picked up by any query that has a join on account, and will need to ba handled seperately.
A final test – do we have any duplicates in the account table? A query to check for duplicates on company name is:
select acntName, count(acntName) from accounts group by acntName having count(acntName) > 1
This shows us each account which has more than one record with the same name; it also shows us how many records there are for each name.
An important note on NULLs in records. When checking whether there are contacts no linked to accounts, you might try a query like:
select count(id) from accounts where id not in (select acntId from contacts)
However, this query actually returns no records, whereas we know it should really return 2099. The problem is the NULL values in the accountid field, and their integration with the IN predicate.
The expression “value IN(a, b, c)” can only return TRUE or FALSE. If the set of values being tested contains NULL, however, then the expression “value IN(a, b, NULL)” can only return TRUE or UNKNOWN; it cannot return FALSE. Therefore NOT IN() can only return NOT TRUE or NOT UNKNOWN, neither or which is TRUE. So the WHERE clause always return FALSE, and we count zero records.
If any of you have favourite SQL for data checking, or you see better ways of achieving the results shown here, please feel free to comment.