Lift & Shift … with added Brains
Guest blogger, David Stewart from Aurise Consulting Ltd (www.aurise.com), outlines an application of Inaport to import and de-duplicate data records. David is a software consultant, based in Edinburgh, who recently delivered a project for a public sector organisation. The project was to deliver a database, built on the Dynamics CRM platform, to store consumer information. Inaport was a key component of the application fabric and fulfilled the requirement to import many millions of source data records.
Over to David …
I was recently engaged by a public sector organisation to build a series of databases consisting of individuals from around the world who asked to receive tailored information. The driver for the project was the replacement of a third party supplier, and the decision was taken to rebuild the new database from scratch rather than purchasing the existing database. This decision initiated the requirement to import ten years of interaction history for each customer.
The interaction history was sourced from 7 external agencies, with data being of varying degrees of quality. As a result of the data capture process, records associated with a single customer can span many years of communication both within and across each data feed. This data is cleansed and enhanced via an initial 5 stage process, after which an import tool is required to load the records into the Dynamics CRM database.
The import tool must inspect incoming records and search for a match within existing data records previously imported into the database. If a match is found then both records are merged. If no match is found a new record is created. Although an in house solution could have been developed, with time being of the essence, several existing products were assessed. Most vendors were discarded as they were found to update the CRM database tables directly. Inaport was selected as updates are performed via the Dynamics CRM API. In addition, Inaport provides enhanced matching functionality that goes beyond simple string matching.
The business team defined a key set of data fields that would establish the matching criteria. These data fields are Title, First Name, Surname, Postal Address, Email Address and Gender.
Inaport offers three principal modes of operation; Standard, SQL and Fuzzy matching. Standard matching builds an index of all data fields within the matching criteria. SQL matching uses a SQL statement, built from the matching criteria, to identify duplicate records. Fuzzy matching utilises ‘Fuzzy logic’ to build a match string consisting of the match data fields. This third mode allows for the matching of records against subtle variations in postal address e.g. 1a Station Road vs. 1/1 Station Road.
Inaport allows the fine tuning of the fuzzy matching via a confidence metric, to achieve the correct balance of matching.
A confidence measure of 100% does not tolerate any variation in characters between the match strings of the two records being compared and is equivalent to an exact string match. A confidence measure of 0% allows for any variation of characters and will result in two records being matched (incorrectly) every time. The recommended confidence measure of 95% allows a difference of one or two differing characters between the two match strings. It should be noted that the business did not approve the use of the email address field in the fuzzy logic match string as there is a higher degree of uncertainty as to whether these refer to the same person (i.e. “David Stewart david1@foo.com” and “David Stewart david2@foo.com”) .
Following a series of test runs, a combination of SQL and Fuzzy matching was determined to deliver the required results. Testing revealed that neither mode can handle NULL data within the match criteria. As none of the match fields are mandatory within the data capture process, an Inaport profile had to be created for each combination of match fields in order to handle the existence of NULL data. For the UK database, eleven profiles are created. For the International database, sixteen profiles are created. The first seven of the eleven UK profiles are SQL matching profiles that use a combination of the match fields to match and update the source records against existing consumers within the database. The eighth UK profile, the final match profile, is operated in Fuzzy Matching mode and the match string is built from Name and Postal address data fields. Following the match and update profiles, the final three profiles (2 x SQL and 1 x Fuzzy) are used to create any records that have not been matched to existing records within the database.
Instead of utilising the profile chaining functionality within Inaport, to automate the execution of the profiles, SSIS is used to instigate and track the progress of each profile. By wrapping the execution of the Inaport profiles within an SSIS package, the package is triggered on a weekly basis by a Scheduled Task on the server.
In terms of performance, each profile processes records at an average rate of 1 million records per day. The limitation in performance is not attributed to Inaport, but linked to the hardware hosting the Dynamics CRM API. The API needs to apply a series of business rules and security checks before updating the database. A notable performance improvement is experienced by using a checksum of the match fields within the profiles operating under SQL Matching mode. Additionally, the Estimated Execution Plan feature within SQL Management Studio has proven invaluable in recommending custom indexes to add to both the source and target tables.
In order to achieve a greater throughput of data, it is possible to run Inaport profiles in parallel. The source table is split in two and batches of 10,000 records are presented to separate instances of the SSIS package. The ability to run two sets of profiles in parallel allows the team to process 1.7 times as many records; compared with running a single instance of the package.
And so to the title of this post. The de-duplication capabilities of the product, Inaport, have led to the creation of a database that is focused on data quality rather than data quantity. It is clear the available modes of operation allow the team to automate the processing of millions of rows of data, while delivering a match success that strikes a balance between speed and an accuracy that is comparable to analysing each record with the human eye. The business sponsor is confident they now have a database with a single customer view of each customer that captures all inbound and outbound interaction.
The support from Inaport throughout the development process has played a significant contribution to the successful delivery of the project. Already the project team have started applying Inaport to scenarios never initially considered. I look forward to working with InaPlex in the future and continuing to discover new applications for the Inaport product.