Data Cleansing / Scrubbing

  • Data Cleansing
  • Data cleansing (also known as data scrubbing) is the process of correcting and, if necessary, eliminating inaccurate records from a database.  Its purpose is to detect so-called dirty data (incorrect, irrelevant or incomplete parts of the data) and either modify or delete it, so that a given set of data is accurate and consistent with other sets in the system.

    This procedure can be performed within a single set of data or across multiple sets, either manually (where possible, in simple cases) or automatically (in complex operations).

    Manual data cleansing is usually done by people who read through a set of records to verify their accuracy, correct spelling errors and complete missing entries.  During this operation, unnecessary or unwanted data is also removed to make data processing more efficient.

    In automated data cleansing, people are replaced by computer programmes, which are faster and can handle a larger and more complex workload in a given time, but the purpose does not change.  In some cases, it is possible to combine the two procedures.  After cleansing, a set of data is consistent with the rest of the system and, because it meets the standards and expectations of the business community, it can be delivered to that community.
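
    A minimal sketch of automated cleansing, assuming records are plain Python dictionaries with hypothetical name and email fields (neither comes from the text above).  It trims whitespace, lower-cases e-mail addresses, drops duplicates and sets incomplete records aside for a person to complete, which is one possible way of combining the automated and manual procedures.

```python
def clean_records(records, required=("name", "email")):
    """Return (cleaned, needs_review) from a list of dict records.

    The field names and the duplicate rule (same e-mail address)
    are illustrative assumptions, not a fixed standard.
    """
    cleaned, needs_review, seen = [], [], set()
    for rec in records:
        # Normalise: trim surrounding whitespace on every string value.
        norm = {k: v.strip() if isinstance(v, str) else v
                for k, v in rec.items()}
        if isinstance(norm.get("email"), str):
            norm["email"] = norm["email"].lower()

        # Incomplete records are set aside for manual completion.
        if any(not norm.get(field) for field in required):
            needs_review.append(norm)
            continue

        # Skip records whose e-mail address was already seen.
        if norm["email"] in seen:
            continue
        seen.add(norm["email"])
        cleaned.append(norm)
    return cleaned, needs_review


records = [
    {"name": " Ann ", "email": "ANN@Example.com "},
    {"name": "Ann", "email": "ann@example.com"},   # duplicate after normalising
    {"name": "", "email": "bob@example.com"},      # incomplete
]
cleaned, needs_review = clean_records(records)
print(cleaned)
print(needs_review)
```

    Records in needs_review are not deleted automatically; leaving that decision to a person keeps the automated step conservative.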

    The importance of regular data cleansing is unquestionable in any data-based or data-dependent business, as using inaccurate and inconsistent data can cause serious problems on various levels (e.g. a government making wrong fiscal decisions based on unreliable data, or a company losing business partners by relying on outdated contact information).

    In the process of source data selection (for the BI application), five steps can be distinguished:

    • data identification
    • analysis of the content
    • selection of data for BI
    • preparation of data-cleansing specifications
    • selection of tools

    When the operational data for the BI target databases is identified and selected, the following key points should be considered:

    • integrity (the internal integrity of the data; this is the most crucial criterion)
    • precision (the precision of the data)
    • accuracy (the correctness of the data)
    • reliability (how far the source of the data and the way it was generated can be trusted)
    • format (the source and target format of the data - the closer they are, the fewer conversions they require)
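
    Two of these criteria lend themselves to simple automated checks.  The sketch below is only an illustration under stated assumptions: integrity is reduced to one kind of check (every reference in one set must resolve in another), and format is reduced to comparing declared field types.  The function names, field names and type labels are hypothetical.

```python
def orphaned_references(child_rows, parent_rows, fk, pk):
    """Integrity check: rows in child_rows whose fk value
    matches no pk value in parent_rows."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def fields_needing_conversion(source_types, target_types):
    """Format check: target fields whose declared type differs
    from the source, or which the source does not have at all."""
    return sorted(field for field, ftype in target_types.items()
                  if source_types.get(field) != ftype)


customers = [{"id": 1}, {"id": 2}]
orders = [{"order": "A", "customer_id": 1},
          {"order": "B", "customer_id": 3}]   # customer 3 does not exist
print(orphaned_references(orders, customers, "customer_id", "id"))

source = {"id": "int", "date": "str"}
target = {"id": "int", "date": "date", "total": "decimal"}
print(fields_needing_conversion(source, target))
```

    The shorter the list returned by fields_needing_conversion, the closer the source and target formats are and the fewer conversions are required.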

  • Data cleansing process
  • Data cleansing is not an easy process.  It is time-consuming, requires a considerable amount of work and is expensive.  This may be why some organizations underestimate the importance of data cleansing, which can lead to business failures and other adverse effects caused by inaccurate or inconsistent data.

    The data cleansing process includes a few stages:

    • Auditing - statistical detection of anomalies
    • Workflow specification - consideration of the detected anomalies and specification of the operations that will correct them
    • Workflow execution - running the specified workflow to correct the data
    • Post-processing and controlling - manual checking, and correction of the data that the automatic process could not correct
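
    The four stages can be sketched on a single numeric column.  This is an illustration, not a prescribed method: the choice of statistic for auditing (a modified z-score based on the median and the median absolute deviation, with the commonly used 3.5 cut-off) and the rule of filling missing values with the median are assumptions made for the example.

```python
import statistics


def audit(values, threshold=3.5):
    """Stage 1: flag missing values and statistical outliers by index."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)
    anomalies = {}
    for i, v in enumerate(values):
        if v is None:
            anomalies[i] = "missing"
        elif mad and 0.6745 * abs(v - med) / mad > threshold:
            anomalies[i] = "outlier"
    return anomalies


def specify_workflow(values):
    """Stage 2: map each anomaly type to a correction rule.
    Outliers get no rule here, so they fall through to manual review."""
    med = statistics.median(v for v in values if v is not None)
    return {"missing": lambda _old: med}


def execute(values, anomalies, workflow):
    """Stage 3: apply the rules; collect what could not be corrected."""
    corrected, unresolved = list(values), []
    for i, kind in anomalies.items():
        rule = workflow.get(kind)
        if rule is None:
            unresolved.append(i)      # Stage 4 input: needs a person
        else:
            corrected[i] = rule(values[i])
    return corrected, unresolved


values = [10, 12, 11, None, 13, 500]
anomalies = audit(values)
corrected, unresolved = execute(values, anomalies, specify_workflow(values))
print(anomalies)
print(corrected)
print(unresolved)   # indices left for post-processing and controlling
```

    The indices in unresolved correspond to the post-processing and controlling stage: they are the entries a person still has to check and correct by hand.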

    Data cleansing is especially important when a large amount of data is stored.  The goal of corrective action on dirty data is then to make any errors as insignificant as possible.  Unless data cleansing is undertaken regularly, mistakes can accumulate and reduce the efficiency of work.