[Image: Practical Data Analysis Cookbook]
Removing duplicates
We can safely assume that all the data that lands on our desks is dirty (until proven otherwise). It is a good habit to check that everything in our data is in order. The first thing I always check for is duplicated rows.
Getting ready
To follow this recipe, you need to have OpenRefine and virtually any Internet browser installed on your computer.
We assume that you have followed the previous recipes, so your data is already loaded into OpenRefine and the data types are representative of what the columns hold. No other prerequisites are required.
How to do it…
First, we assume that, within the seven days of property sales, a row is a duplicate if the same address appears two or more times in the dataset; it is quite unlikely that the same house is sold more than once within such a short period. We therefore Blank down the observations that repeat (available under Edit cells in the column menu). Blank down compares each cell with the one directly above it, so this relies on repeated addresses sitting in adjacent rows:
[Screenshot: How to do it…]
This keeps only the first occurrence of each set of repeated observations and blanks the rest (see the fourth row in the following screenshot):
[Screenshot: How to do it…]
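For readers who want to see the idea outside the OpenRefine interface, the following is a minimal Python sketch of what blanking down does. It is only a rough imitation, not OpenRefine's own implementation. The `blank_down` helper, the `street` and `price` column names, and the sample rows are all made up for illustration.

```python
def blank_down(rows, column):
    """Return copies of `rows` in which `column` is set to None whenever
    it repeats the most recent value kept above it.

    Hypothetical helper: a rough imitation of blanking down, not
    OpenRefine's actual code.
    """
    result = []
    last_seen = None
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        if new_row[column] == last_seen:
            new_row[column] = None  # a repeat: blank it
        else:
            last_seen = new_row[column]  # a new value: remember it
        result.append(new_row)
    return result


# Made-up sample rows, already sorted so that repeats are adjacent.
sales = [
    {"street": "1 Example St", "price": 100000},
    {"street": "2 Sample Ave", "price": 150000},
    {"street": "2 Sample Ave", "price": 150000},
    {"street": "3 Placeholder Rd", "price": 200000},
]

blanked = blank_down(sales, "street")
for row in blanked:
    print(row)
```

Only adjacent repeats are caught, which is why the sample rows are sorted by address before the helper is called.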
We can now create a Facet by blank, which lets us quickly select the blanked rows:
[Screenshot: How to do it…]
With the facet in place, we can select all the rows that are blank and remove them from the dataset:
[Screenshot: How to do it…]
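The facet-and-remove step can be pictured as splitting the rows into two groups and keeping only one of them. Below is a small Python sketch under that reading. The `split_by_blank` helper and the sample rows are hypothetical, and treating both `None` and the empty string as blank is an assumption made for this illustration.

```python
def split_by_blank(rows, column):
    """Partition `rows` into (blank, not_blank) by the value in `column`.

    Hypothetical helper: None and "" are both treated as blank here,
    which is an assumption of this sketch.
    """
    blank, not_blank = [], []
    for row in rows:
        value = row.get(column)
        if value is None or value == "":
            blank.append(row)
        else:
            not_blank.append(row)
    return blank, not_blank


# Made-up rows as they might look after blanking down the address column.
blanked_rows = [
    {"street": "1 Example St", "price": 100000},
    {"street": "2 Sample Ave", "price": 150000},
    {"street": None, "price": 150000},
    {"street": "3 Placeholder Rd", "price": 200000},
]

# `flagged` plays the role of the rows selected by the facet;
# `kept` is what remains once those rows are removed.
flagged, kept = split_by_blank(blanked_rows, "street")
print(len(flagged), "row(s) removed;", len(kept), "row(s) kept")
```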
Our dataset now has no duplicate records.
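If you export the cleaned data, a quick sanity check can confirm that no address still appears more than once. The sketch below counts values with `collections.Counter`; the `repeated_values` helper, the `street` column name, and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def repeated_values(rows, column):
    """Return {value: count} for every value of `column` seen more than once.

    Hypothetical helper for a post-cleaning sanity check.
    """
    counts = Counter(row[column] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up rows before and after the duplicate was removed.
raw = [
    {"street": "1 Example St"},
    {"street": "2 Sample Ave"},
    {"street": "2 Sample Ave"},
    {"street": "3 Placeholder Rd"},
]
cleaned = [raw[0], raw[1], raw[3]]

leftover_raw = repeated_values(raw, "street")
leftover_cleaned = repeated_values(cleaned, "street")
print("before:", leftover_raw)
print("after:", leftover_cleaned)
```

Unlike blanking down, this check does not depend on row order, so it also catches repeats that were not adjacent when the data was cleaned.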