How to Remove Duplicate Data in R

During data cleaning, you often need to remove duplicate values from a data set. Finding and removing duplicates is a very useful application of subsetting. R has a useful function, duplicated(), that returns a logical vector telling you whether each value is a duplicate of a previous value. This means that duplicated() returns FALSE for the first occurrence of a value and TRUE for every following occurrence of that value, as in the following example:

> duplicated(c(1,2,1,6,1,8))
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
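
As a small sketch of how this logical vector is typically used, the following reuses the values from the example above; the variable names x, first_only, and from_last are only illustrative. Negating the result keeps the first occurrence of each value, which for a plain vector matches what base R's unique() returns, and the fromLast argument reverses the direction of the check.

```r
# Illustrative vector (same values as the example above; the name x is an assumption)
x <- c(1, 2, 1, 6, 1, 8)

# Keep only the first occurrence of each value
first_only <- x[!duplicated(x)]
first_only                         # 1 2 6 8

# unique() gives the same result for a plain vector
identical(first_only, unique(x))   # TRUE

# fromLast = TRUE treats the LAST occurrence as the original instead
from_last <- duplicated(x, fromLast = TRUE)
from_last                          # TRUE FALSE TRUE FALSE FALSE FALSE
```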

If you call duplicated() on a data frame, R checks whole observations: it treats every row as a single value and compares all of its columns. So, for example, with the built-in data frame iris:

> duplicated(iris)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
....
[136] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE

If you look carefully, you notice that row 143 is a duplicate (because the 143rd element of your result has the value TRUE). You can also find it with the which() function, which returns the positions of the TRUE values:

> which(duplicated(iris))
[1] 143
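
Before deleting anything, it can help to look at the duplicate next to the row it repeats. The following is a minimal sketch of one way to do that, assuming you want every copy of a repeated row rather than only the later occurrences; the name all_copies is illustrative.

```r
# Flag every copy of a repeated row, not only the later occurrences:
# the first call marks later copies, the second (fromLast = TRUE) marks earlier ones
all_copies <- duplicated(iris) | duplicated(iris, fromLast = TRUE)

# Row numbers of all rows that have an identical twin
which(all_copies)

# Show those rows side by side for inspection
iris[all_copies, ]
```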

Now, to remove the duplicate from iris, you need to exclude this row from your data. Remember that there are two ways to exclude data using subsetting:

  • Specify a logical vector, where FALSE means that the element will be excluded. The ! (exclamation point) operator is a logical negation. This means that it converts TRUE into FALSE and vice versa. So, to remove the duplicates from iris, you do the following:

> iris[!duplicated(iris), ]

  • Specify negative index values, which exclude the rows at those positions. In other words:

> index <- which(duplicated(iris))
> iris[-index, ]

In both cases, you’ll notice that your instruction has removed row 143.
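
The two approaches can be compared directly. The sketch below uses illustrative names (by_logical, by_negative, clean, empty_index) and also shows one caveat of the negative-index approach: when there are no duplicates, which() returns an empty vector, and subsetting with a negated empty index selects no rows at all. For that reason the logical-vector form is usually the safer choice in scripts.

```r
# Approach 1: logical vector
by_logical <- iris[!duplicated(iris), ]

# Approach 2: negative index
index <- which(duplicated(iris))
by_negative <- iris[-index, ]

# Both leave the same number of rows
nrow(by_logical)                      # 149
nrow(by_negative)                     # 149

# unique() on a data frame is a base R shortcut that also drops repeated rows
nrow(unique(iris))                    # 149

# Caveat: 'clean' has no duplicates left, so which() returns integer(0)
clean <- by_logical
empty_index <- which(duplicated(clean))
nrow(clean[-empty_index, ])           # 0   (every row is dropped)
nrow(clean[!duplicated(clean), ])     # 149 (the logical form is unaffected)
```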

How to Install R & RStudio

R is an open-source, case-sensitive programming language. RStudio is an integrated development environment (IDE) for R, and its developers are active members of the R community.

You need to install both R and RStudio on your system before you can get started with R. This page guides you through the installation process and introduces you to both of them.

Install R

Step 1: Download the package for your system (Windows, Mac, or Linux) from the Comprehensive R Archive Network (CRAN) website.

Step 2: Install R like you normally install any new software package.

Install RStudio

Step 1: Download the RStudio Desktop package from the RStudio website.

Step 2: Install RStudio by following the installer's setup process.

Before you move on, make sure you have installed both R and RStudio on your system. In this lecture, you will be introduced to the different components of RStudio.
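
One quick way to confirm the installation is to run a couple of commands in the R console. This is only a sketch: R.version.string is part of base R, while the RSTUDIO environment variable check is an assumption about how RStudio marks the sessions it starts, and the name in_rstudio is illustrative.

```r
# Print the version of R that is currently running
R.version.string

# Assumption: RStudio sets the RSTUDIO environment variable to "1"
# in sessions it launches; outside RStudio this is expected to be FALSE
in_rstudio <- identical(Sys.getenv("RSTUDIO"), "1")
in_rstudio
```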