376

I have a large data set and I would like to read specific columns or drop all the others.

data <- read.dta("file.dta")

I select the columns that I'm not interested in:

var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]

and than I'd like to do something like:

for(i in 1:length(var.out)) {
   paste("data$", var.out[i], sep="") <- NULL
}

to drop all the unwanted columns. Is this the optimal solution?

2
  • 1
    sleeping over the problem, i was thinking that subset(data, select=c(...)) helps in my case for dropping vars. the question though was mainly about the paste("data$",var.out[i],sep="") part to access columns of interest inside the loop. how can i paste or somehow compose a column name? Thanks to everyone for your attention and your help
    – leroux
    Mar 9, 2011 at 8:24
  • 9
    Possible duplicate of Drop columns in R data frame
    – jangorecki
    Dec 8, 2015 at 18:47

12 Answers 12

477

You should use either indexing or the subset function. For example :

R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
  x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8

Then you can use the which function and the - operator in column indexation :

R> df[ , -which(names(df) %in% c("z","u"))]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Or, much simpler, use the select argument of the subset function : you can then use the - operator directly on a vector of column names, and you can even omit the quotes around the names !

R> subset(df, select=-c(z,u))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Note that you can also select the columns you want instead of dropping the others :

R> df[ , c("x","y")]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

R> subset(df, select=c(x,y))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
9
  • 4
    the select argument of the subset function did the job perfectly! Thank you juba!
    – leroux
    Mar 8, 2011 at 15:13
  • 2
    which is not neccessary, see Ista's answer. But the subset with - is nice! Didn't know that!
    – Tomas
    Jul 25, 2013 at 19:03
  • 6
    subset looks good, but the way it silently drops missing values seems pretty dangerous to me. Jun 5, 2014 at 15:33
  • 5
    subset is indeed very convenient, but remember to avoid using it unless you're using R interactively. See the Warning in the function's documentation and this SO question for more. Aug 13, 2014 at 18:45
  • 8
    "you can even omit the quotes around the names !", you actually have to omit the quotes, otherwise you'll get invalid argument to unary operator. If you have certain characters (for example "-") in your names you cannot use this method at all since dropping quotes will cause R to be unable to properly parse your code.
    – oh54
    Jul 19, 2017 at 18:10
150

Do not use -which() for this, it is extremely dangerous. Consider:

dat <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
dat[ , -which(names(dat) %in% c("z","u"))] ## works as expected
dat[ , -which(names(dat) %in% c("foo","bar"))] ## deletes all columns! Probably not what you wanted...

Instead use subset or the ! function:

dat[ , !names(dat) %in% c("z","u")] ## works as expected
dat[ , !names(dat) %in% c("foo","bar")] ## returns the un-altered data.frame. Probably what you want

I have learned this from painful experience. Do not overuse which()!

1
  • 44
    setdiff is also useful: setdiff(names(dat), c("foo", "bar"))
    – hadley
    Mar 9, 2011 at 2:20
57

First, you can use direct indexing (with booleans vectors) instead of re-accessing column names if you are working with the same data frame; it will be safer as pointed out by Ista, and quicker to write and to execute. So what you will only need is:

var.out.bool <- !names(data) %in% c("iden", "name", "x_serv", "m_serv")

and then, simply reassign data:

data <- data[,var.out.bool] # or...
data <- data[,var.out.bool, drop = FALSE] # You will need this option to avoid the conversion to an atomic vector if there is only one column left

Second, quicker to write, you can directly assign NULL to the columns you want to remove:

data[c("iden", "name", "x_serv", "m_serv")] <- list(NULL) # You need list() to respect the target structure.

Finally, you can use subset(), but it cannot really be used in the code (even the help file warns about it). Specifically, a problem to me is that if you want to directly use the drop feature of susbset() you need to write without quotes the expression corresponding to the column names:

subset( data, select = -c("iden", "name", "x_serv", "m_serv") ) # WILL NOT WORK
subset( data, select = -c(iden, name, x_serv, m_serv) ) # WILL

As a bonus, here is small benchmark of the different options, that clearly shows that subset is the slower, and that the first, reassigning method is the faster:

                                        re_assign(dtest, drop_vec)  46.719  52.5655  54.6460  59.0400  1347.331
                                      null_assign(dtest, drop_vec)  74.593  83.0585  86.2025  94.0035  1476.150
               subset(dtest, select = !names(dtest) %in% drop_vec) 106.280 115.4810 120.3435 131.4665 65133.780
 subset(dtest, select = names(dtest)[!names(dtest) %in% drop_vec]) 108.611 119.4830 124.0865 135.4270  1599.577
                                  subset(dtest, select = -c(x, y)) 102.026 111.2680 115.7035 126.2320  1484.174

Microbench graph

Code is below :

dtest <- data.frame(x=1:5, y=2:6, z = 3:7)
drop_vec <- c("x", "y")

null_assign <- function(df, names) {
  df[names] <- list(NULL)
  df
}

re_assign <- function(df, drop) {
  df <- df [, ! names(df) %in% drop, drop = FALSE]
  df
}

res <- microbenchmark(
  re_assign(dtest,drop_vec),
  null_assign(dtest,drop_vec),
  subset(dtest, select = ! names(dtest) %in% drop_vec),
  subset(dtest, select = names(dtest)[! names(dtest) %in% drop_vec]),
  subset(dtest, select = -c(x, y) ),
times=5000)

plt <- ggplot2::qplot(y=time, data=res[res$time < 1000000,], colour=expr)
plt <- plt + ggplot2::scale_y_log10() + 
  ggplot2::labs(colour = "expression") + 
  ggplot2::scale_color_discrete(labels = c("re_assign", "null_assign", "subset_bool", "subset_names", "subset_drop")) +
  ggplot2::theme_bw(base_size=16)
print(plt)
3
  • 2
    I like your second alternative using NULL, but why when you put more than two names is necessary to assign it with list(NULL) ? I'm only curious to know how it works, because I tried with only one name and I don't need list()
    – Darwin PC
    Jul 28, 2015 at 10:07
  • 3
    @DarwinPC Yes. If you access directly one vector element (with $ or [[), using <- list(NULL) will actually lead to wrong results. If you access a subset of the dataframe with one or multiple columns, <- list(NULL) is the way to go, even if it is not needed for a one column dataframe (because df['myColumns'] will get casted to a vector if needed). Aug 3, 2015 at 18:08
  • Has this behaviour changed? I get identical results with both NULL and list(NULL).
    – J. Mini
    Mar 19, 2021 at 21:46
34

You can also try the dplyr package:

R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
  x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8
R> library(dplyr)
R> dplyr::select(df2, -c(x, y))  # remove columns x and y
  z u
1 3 4
2 4 5
3 5 6
4 6 7
5 7 8
2
  • 5
    Using dplyr::select(df2, -one_of(c('x','y'))) will still work (with a warning) even if some of the named columns don't exist
    – divibisan
    Mar 23, 2018 at 16:58
  • This was exactly the solution I was searching for @divibisan, thanks!
    – JJGabe
    Sep 23, 2020 at 17:31
19

Here's a quick solution for this. Say, you have a data frame X with three columns A, B and C:

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6))
> X
  A B C
1 1 3 5
2 2 4 6

If I want to remove a column, say B, just use grep on colnames to get the column index, which you can then use to omit the column.

> X<-X[,-grep("B",colnames(X))]

Your new X data frame would look like the following (this time without the B column):

> X
  A C
1 1 5
2 2 6

The beauty of grep is that you can specify multiple columns that match the regular expression. If I had X with five columns (A,B,C,D,E):

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
  A B C D  E
1 1 3 5 7  9
2 2 4 6 8 10

Take out columns B and D:

> X<-X[,-grep("B|D",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10

EDIT: Considering the grepl suggestion of Matthew Lundberg in the comments below:

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
  A B C D  E
1 1 3 5 7  9
2 2 4 6 8 10
> X<-X[,!grepl("B|D",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10

If I try to drop a column that's non-existent,nothing should happen:

> X<-X[,!grepl("G",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10
1
  • 4
    X[,-grep("B",colnames(X))] will return no columns in the case where no column name contains B, rather than returning all the columns as would be desired. Consider with X <- iris for an example. This is the problem with using negative indices with computed values. Consider grepl instead. Dec 25, 2015 at 2:33
11
df2 <- df[!names(df) %in% c("c1", "c2")]
1
7

I tried to delete a column while using the package data.table and got an unexpected result. I kind of think the following might be worth posting. Just a little cautionary note.

[ Edited by Matthew ... ]

DF = read.table(text = "
     fruit state grade y1980 y1990 y2000
     apples Ohio   aa    500   100   55
     apples Ohio   bb      0     0   44
     apples Ohio   cc    700     0   33
     apples Ohio   dd    300    50   66
", sep = "", header = TRUE, stringsAsFactors = FALSE)

DF[ , !names(DF) %in% c("grade")]   # all columns other than 'grade'
   fruit state y1980 y1990 y2000
1 apples  Ohio   500   100    55
2 apples  Ohio     0     0    44
3 apples  Ohio   700     0    33
4 apples  Ohio   300    50    66

library('data.table')
DT = as.data.table(DF)

DT[ , !names(dat4) %in% c("grade")]    # not expected !! not the same as DF !!
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

DT[ , !names(DT) %in% c("grade"), with=FALSE]    # that's better
    fruit state y1980 y1990 y2000
1: apples  Ohio   500   100    55
2: apples  Ohio     0     0    44
3: apples  Ohio   700     0    33
4: apples  Ohio   300    50    66

Basically, the syntax for data.table is NOT exactly the same as data.frame. There are in fact lots of differences, see FAQ 1.1 and FAQ 2.17. You have been warned!

2
  • 1
    Or you could use DT[,var.out := NULL] to delete the columns you wish to do so.
    – mnel
    Feb 13, 2013 at 0:21
  • The subset(x, select=...) method works for both data.frame and data.table classes
    – momeara
    Sep 25, 2014 at 18:32
6

I changed the code to:

# read data
dat<-read.dta("file.dta")

# vars to delete
var.in<-c("iden", "name", "x_serv", "m_serv")

# what I'm keeping
var.out<-setdiff(names(dat),var.in)

# keep only the ones I want       
dat <- dat[var.out]

Anyway, juba's answer is the best solution to my problem!

2
  • Why do you want to do this in a loop? The answers juba's answer shows you how to do it in one step. Why make it more complicated?
    – Ista
    Mar 9, 2011 at 15:38
  • of course i use the select argument of the subset function in my code. i just wanted to see how i could access arbitrary columns in a loop in case i wanted to do something else than just dropping the column. the original data set has about 1200 vars and i'm only interested in using 4 of them without knowing where exactly they are.
    – leroux
    Mar 9, 2011 at 16:02
3

If you know exactly the names of the columns in original dataframe called "df":

cols_to_drop <- c("A", "B", "C")
df_clean = df[,!(names(df) %in% cols_to_drop)]

Src: https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html

2

Here is another solution that may be helpful to others. The code below selects a small number of rows and columns from a large data set. The columns are selected as in one of juba's answers except that I use a paste function to select a set of columns with names that are numbered sequentially:

df = read.table(text = "

state county city  region  mmatrix  X1 X2 X3    A1     A2     A3      B1     B2     B3      C1      C2      C3

  1      1     1      1     111010   1  0  0     2     20    200       4      8     12      NA      NA      NA
  1      2     1      1     111010   1  0  0     4     NA    400       5      9     NA      NA      NA      NA
  1      1     2      1     111010   1  0  0     6     60     NA      NA     10     14      NA      NA      NA
  1      2     2      1     111010   1  0  0    NA     80    800       7     11     15      NA      NA      NA

  1      1     3      2     111010   0  1  0     1      2      1       2      2      2      10      20      30
  1      2     3      2     111010   0  1  0     2     NA      1       2      2     NA      40      50      NA
  1      1     4      2     111010   0  1  0     1      1     NA      NA      2      2      70      80      90
  1      2     4      2     111010   0  1  0    NA      2      1       2      2     10     100     110     120

  1      1     1      3     010010   0  0  1    10     20     10     200    200    200       1       2       3
  1      2     1      3     001000   0  0  1    20     NA     10     200    200    200       4       5       9
  1      1     2      3     101000   0  0  1    10     10     NA     200    200    200       7       8      NA
  1      2     2      3     011010   0  0  1    NA     20     10     200    200    200      10      11      12

", sep = "", header = TRUE, stringsAsFactors = FALSE)
df

df2 <- df[df$region == 2, names(df) %in% c(paste("C", seq_along(1:3), sep=''))]
df2

#    C1  C2  C3
# 5  10  20  30
# 6  40  50  NA
# 7  70  80  90
# 8 100 110 120
-1

I can´t answer your question in the comments due to low reputation score.

The next code will give you an error because the paste function return a character string

for(i in 1:length(var.out)) {
   paste("data$", var.out[i], sep="") <- NULL
}

Here is a possible solution:

for(i in 1:length(var.out)) {

  text_to_source <- paste0 ("data$", var.out[i], "<- NULL") # Write a line of your
                                                  # code like a character string
  eval (parse (text=text_to_source)) # Source a text that contains a code
}

or just do:

for(i in 1:length(var.out)) {
  data[var.out[i]] <- NULL
}
-6
df = mtcars 
remove vs and am because they are categorical. In the dataset vs is in column number 8, am is in column number 9

dfnum = df[,-c(8,9)]

Not the answer you're looking for? Browse other questions tagged or ask your own question.