The 5-point check for cleaning data

Every analyst knows that cleaning data is a mission. In fact, it's often the most time-consuming part of the process. However, a clean, structured dataset can determine whether your project succeeds or fails. We've developed a checklist to help you get that data in order.

Validity

  • Data-Type Constraints: Each column should be assigned a particular data type, such as numeric, date, or string. Always check that columns are classified correctly, so you don't end up with numbers stored as strings, for example.

  • Unique Constraints: A field, or a combination of fields, must be unique across the dataset. An example is the order ID in a transactions table.

  • Range Constraints: This ensures that values in a column fall within a specified range. For example, a date column can have a range constraint that excludes dates before 1980.

  • Mandatory Constraints: Certain columns cannot be null or empty.

  • Set-Membership Constraints: Values in the column must come from a fixed set of options. For example, a person's gender may be recorded as male, female, or other.

  • Foreign-Key Constraints: As in relational databases, a foreign-key column cannot contain a value that does not exist in the referenced primary key.

  • Regular Expression Patterns: Fields (usually text fields) that must match a specific pattern. For example, emails may be required to match the pattern xxx@domain.com.

  • Cross-Field Validation: Logical conditions that span multiple fields must hold. For example, a customer's delivery date cannot be earlier than the date the order was placed.
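Several of these checks can be sketched in pandas. The orders table below is hypothetical, used only to illustrate a few of the constraints above:

```python
import pandas as pd

# Hypothetical orders table with deliberate problems to catch.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "email": ["a@shop.com", "b@shop.com", "not-an-email", None],
    "order_date": pd.to_datetime(["2021-01-05", "2021-02-10", "1975-06-01", "2021-03-01"]),
    "delivery_date": pd.to_datetime(["2021-01-08", "2021-02-09", "1975-06-03", "2021-03-04"]),
})

# Unique constraint: order_id must not repeat.
duplicate_ids = orders[orders["order_id"].duplicated(keep=False)]

# Range constraint: exclude orders dated before 1980.
out_of_range = orders[orders["order_date"] < "1980-01-01"]

# Mandatory constraint: email must not be null.
missing_email = orders[orders["email"].isna()]

# Regex pattern: a rough email shape (not a full RFC-compliant check).
bad_email = orders[~orders["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)]

# Cross-field validation: delivery cannot precede the order.
delivered_early = orders[orders["delivery_date"] < orders["order_date"]]

print(len(duplicate_ids), len(out_of_range), len(missing_email),
      len(bad_email), len(delivered_early))
```

Each filter returns the offending rows, so they can be reviewed rather than silently dropped.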

Accuracy

Your data needs to be as accurate, or as close to accurate, as possible.

In some instances this can easily slip away. For example, a customer's address may pass every validity check and still not actually exist.

This is where you need to set up a data quality audit and ensure that the data is checked against its original source. Inaccurate data can then either be flagged in the database or removed entirely.
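A minimal sketch of the "flag rather than delete" approach, assuming a hypothetical `known_addresses` source of truth to audit against:

```python
import pandas as pd

# Hypothetical customer table and an authoritative address list.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "address": ["12 Oak St", "99 Fake Rd", "5 Elm Ave"],
})
known_addresses = {"12 Oak St", "5 Elm Ave"}

# Flag rows whose address is not confirmed by the source of truth,
# rather than deleting them outright.
customers["address_verified"] = customers["address"].isin(known_addresses)

print(customers[~customers["address_verified"]])  # rows needing review
```

Keeping the flag in a separate column preserves the original data while the audit is ongoing.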

Precision is another important factor. A customer entering their address as simply "USA" may be telling the truth, but that level of precision won't support the deeper analysis you want.

Completeness

The degree to which all required data is known.

Consistency

Inconsistency occurs when two values in the dataset contradict each other logically. These rules often aren't built into the database logic, and violations really mess up your data.

For example, a customer recorded with an age of 10 and a marital status of divorced. Which value do you trust?
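A cross-field consistency check like the age/marital-status example can be sketched in pandas (the `people` table below is made up for illustration):

```python
import pandas as pd

# Hypothetical customer records containing a logical contradiction.
people = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 10, 52],
    "marital_status": ["married", "divorced", "single"],
})

# A 10-year-old cannot be divorced: flag rows where the two fields conflict.
conflict = (people["age"] < 16) & (people["marital_status"] != "single")
inconsistent = people[conflict]
print(inconsistent["customer_id"].tolist())  # [2]
```

Flagged rows still need a human (or the original source) to decide which field is wrong.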

Uniformity

Uniformity is the degree to which the data is specified using the same unit of measure. Units of measure should always be consistent across the whole database.

For example, weight can be recorded in kilos, pounds, or stone, and currency in USD, EUR, or JPY. Make sure one convention is propagated through the entire database: convert everything to a single unit of measure to avoid confusion.
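Normalizing mixed units to a single one can be sketched like this, using a hypothetical weights table and standard conversion factors:

```python
import pandas as pd

# Hypothetical weights recorded in mixed units; convert everything to kilograms.
weights = pd.DataFrame({
    "value": [70.0, 154.0, 11.0],
    "unit": ["kg", "lb", "st"],
})

# Standard conversion factors to kilograms.
TO_KG = {"kg": 1.0, "lb": 0.453592, "st": 6.35029}

weights["kg"] = weights["value"] * weights["unit"].map(TO_KG)
print(weights)
```

Keeping the original value and unit alongside the converted column makes the transformation auditable.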

Final thoughts…

Data cleaning is hard: you not only need to clean historical and present data, you also need to set up rules and logic to clean future data as it comes in.

The earlier you start implementing a thorough plan on how your data will be cleaned, the better!

