How to rename individual columns in a data frame, based on the previous names and without using previous names.
This post has been slightly modified from its original form on woodpeckR.
How do I change the name of just one column in a data frame?
This is a simple one that keeps coming up. Sometimes, whoever put together my data decided to capitalize the first letter of some column names and not others. Sometimes I’ve merged several data frames together and I need to distinguish the columns from each other.
Say my data frame is p8_0
and I’d like to change the column Area
to area
.
In the past, I’ve done this in one of two ways. Either I change all of the column names at once (if all of them need to be changed), or I use numerical column indexing. The latter makes a lot more sense if I have a lot of columns to deal with, but it means I have to know the number of the column whose name I have to change.
To find this out, I first have to look at all of the column names. Okay, no problem.
# See column names and numerical indices
names(p8_0)
[1] "FID" "Join_Count"
[3] "TARGET_FID" "Field1"
[5] "barcode" "stratum"
[7] "lcode" "sdate"
[9] "utm_e" "utm_n"
[11] "snag" "OBJECTID"
[13] "uniq_id" "aa_num"
[15] "AQUA_CODE" "AQUA_DESC"
[17] "pool" "Area"
[19] "Perimeter" "bath_pct"
[21] "max_depth" "avg_depth"
[23] "sd_depth" "tot_vol"
[25] "area_gt50" "area_gt100"
[27] "area_gt200" "area_gt300"
[29] "avg_fetch" "shoreline_density_index"
[31] "econ" "sill"
[33] "min_rm" "max_rm"
[35] "len_met" "len_prm_lotic"
[37] "pct_prm_lotic" "num_lotic_outl"
[39] "len_prm_lentic" "pct_prm_lentic"
[41] "num_lentic_outl" "pct_aqveg"
[43] "pct_opwat" "len_terr"
[45] "pct_terr" "pct_aq"
[47] "len_wetf" "pct_prm_wetf"
[49] "pct_terr_shore_wetf" "len_wd"
[51] "wdl_p_m2" "num_wd"
[53] "scour_wd" "psco_wd"
[55] "len_revln" "rev_p_m2"
[57] "num_rev" "pct_terr_shore_rev"
[59] "pct_prm_rev" "area_tpi1"
[61] "pct_tpi1" "area_tpi2"
[63] "pct_tpi2" "area_tpi3"
[65] "pct_tpi3" "area_tpi4"
[67] "pct_tpi4" "sinuosity"
[69] "year_phot" "NEAR_TERR_FID"
[71] "NEAR_TERR_DIST" "NEAR_TERR_CLASS_31"
[73] "NEAR_TERR_CLASS_15" "NEAR_TERR_CLASS_7"
[75] "NEAR_TERR_CLASS_31_N" "NEAR_TERR_CLASS_15_N"
[77] "NEAR_TERR_CLASS_7_N" "NEAR_TERR_HEIGHT_N"
[79] "NEAR_FOREST_FID" "NEAR_FOREST_DIST"
[81] "NEAR_FOREST_CLASS_31" "NEAR_FOREST_CLASS_15"
[83] "NEAR_FOREST_CLASS_7" "NEAR_FOREST_CLASS_31_N"
[85] "NEAR_FOREST_CLASS_15_N" "NEAR_FOREST_CLASS_7_N"
[87] "NEAR_FOREST_HEIGHT_N" "year.p"
[89] "depth.p" "current.p"
[91] "gear.p" "stageht.p"
[93] "substrt.p" "wingdike.p"
[95] "riprap.p" "trib.p"
[97] "snagyn" "area_le50"
[99] "area_le100" "area_le200"
[101] "area_le300" "pct_area_le100"
[103] "pct_area_le50" "pct_area_le200"
[105] "pct_area_le300" "stratum_name"
Okay, yes problem.
It’s not that hard to see that Area
is the 18th column. But there are a bunch of columns that start with NEAR_TERR_
and NEAR_FOREST_
that would be easy to confuse. And what if I later modify my data cleaning script, insert new columns, and mess up the numerical indexing?
The first solution I came up with is simple but pretty clunky. At least it solves the problem of numerical indices getting misaligned. And if you mistype the column name or try to change the name of a column that doesn’t exist, it doesn’t throw an error.
This works well, but it gets annoying if you have more than one column name to change. Every column requires typing names(p8_0)
twice, and that adds up to a lot of lines of code.
To no one’s surprise, dplyr
has a more elegant solution, using the rename
function.
A quick note on rename: somewhat counterintuitively, the new name comes before the old name. General example:
# General syntax for rename
#df %>%
# rename(newname = oldname)
rename
saves a whole bunch of keystrokes and also scales very well to multiple columns.
Let’s say I wanted to change Area
and Perimeter
to area
and perimeter
, respectively, and I also wanted to change the rather clunky shoreline_density_index
to sdi.
And while we’re at it, snagyn
, a factor variable that indicates whether a large piece of wood was present at the site (“yes” or “no”), might be clearer as snag_yn
, and sinuosity
could be shortened to sinu
Without dplyr
:
# Change each column name individually
names(p8_0)[names(p8_0) == "Area"] <- "area"
names(p8_0)[names(p8_0) == "Perimeter"] <- "perimeter"
names(p8_0)[names(p8_0) == "shoreline_density_index"] <- "sdi"
names(p8_0)[names(p8_0) == "snagyn"] <- "snag_yn"
names(p8_0)[names(p8_0) == "sinuosity"] <- "sinu"
With dplyr
:
# Change any column names you want to, all at once
p8_0 <- p8_0 %>% rename(area = Area,
perimeter = Perimeter,
sdi = shoreline_density_index,
snag_yn = snagyn,
sinu = sinuosity)
So pretty. As an added bonus, you’re saved from both quotation marks and the dreaded double equals sign (!!!).
In case anyone was counting, that’s 102 characters vs. 238 (spaces not included). 116 if you include loading dplyr
, but you already had it loaded because you’re using it throughout your code, of course.
Now I can rename only the columns I want, by name instead of numerical index, without fear of having to change everything if I insert or delete some columns later on.
More thoughts on changing individual variable names, including a couple other packages if you feel like trying them: https://stackoverflow.com/questions/7531868/how-to-rename-a-single-column-in-a-data-frame
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Gahm (2018, July 11). Kaija Gahm: Changing individual column names. Retrieved from https://kaijagahm.netlify.app/posts/2018-07-11-changing-individual-column-names/
BibTeX citation
@misc{gahm2018changing, author = {Gahm, Kaija}, title = {Kaija Gahm: Changing individual column names}, url = {https://kaijagahm.netlify.app/posts/2018-07-11-changing-individual-column-names/}, year = {2018} }