How to Prevent Data Duplication in Bold Data Hub Pipelines
Overview
This article provides a guide to prevent data duplication in Bold Data Hub pipelines. It explains how to configure pipelines to either insert only new records or truncate the destination table before inserting fresh data. Following these steps will ensure data integrity and eliminate redundancy in your transformation tables.
Understanding Pipeline Behavior
Data Extraction
Bold Data Hub pipelines the destination table will be removed and re-extract data from the source every time they run.
While this ensures that fresh data is pulled, it does not inherently prevent the duplication of records in the destination table.
Transformation Tables
-
Without Primary Key: If no primary key is defined in a transformation table, Data Hub treats every incoming row as new, leading to duplicate entries on each pipeline run.
-
With Primary Key: Creating a transform table by defining a primary key allows Data Hub to match existing records. The transformation table will be updated for existing rows or inserted newly, maintaining data integrity and avoiding duplication.
Steps to Prevent Duplication
1. Define a Primary Key
A primary key helps uniquely identify each row in the transformation table and ensures deduplication.
-
Steps to Define a Primary Key:
-
After completing the data extraction, go to the Transform tab and click Add Table. Enter the table name to move into the destination database.
-
Select a column (or combination of columns) that uniquely identifies each row.
-
Use Run button to preview the data and Click Save and Transform to move the data into destination.
-
Note: When running the same pipeline again, a new property called isDropTable is added in the YAML template. By default, it is set to true.
If isDropTable is set to true, the existing table will be dropped and recreated before the data is moved, ensuring that no duplicate rows are added.
If isDropTable is set to false, the data will be moved again without deleting or modifying the existing data.
Note: isDropTable is supported only for Snowflake, Google Big Query and Amazon Redshift data stores.
Best Practices
- Always define a primary key when working with Data Hub using transform tables to ensure data consistency.
Additional Resources
https://help.boldbi.com/working-with-data-sources/working-with-bold-data-hub/transformation-preview/