Data Preparation and Data Cleaning With R Programming | Data Cleaning Using R | Realcode4you

Reading Data to R

Reading data in R is fairly simple. We’ll be looking at the function read.csv, which reads data from flat file formats. Flat files are files which you can open in a plain text editor such as Notepad and view the data. Excel files, or files in any other proprietary format, are NOT flat files and cannot be read using read.csv. Each proprietary file format has dedicated packages and functions. For example, Excel files can be read using the function read.xlsx found in the package xlsx.

Function read.csv comes with a lot of options, which you can see in the documentation. Most of the time you don’t need to pass values to all of them, and the defaults work fine. However, we are going to discuss a few which you might use from time to time.

file : This is the name of the file to be read. If the file is in your working directory, the file name alone is enough. If it is not in your working directory, you need to include the entire path to the folder containing the file, along with the file name.

header : For read.csv this is set to TRUE by default, so variable names are read and assigned from the first row of the file. (The more general read.table defaults to FALSE.) Set it to FALSE if your file has no header row.

sep : This is set to “,” by default. It tells R which symbol separates the columns in a row. For example, a semicolon-separated file has “;” as its separator.

row.names : You can pass a vector of row names if you want to set row names for your data. You generally leave it as is.

col.names : If you want to force your own variable names, you can pass them as a vector to this option. The length of this vector must match the number of columns. If you don’t supply it, the names are taken from the first row of the file when header is TRUE; when header is FALSE, R assigns default column names V1, V2, V3, ...

stringsAsFactors : In R versions before 4.0.0 this is set to TRUE by default; since R 4.0.0 the default is FALSE. Setting it to FALSE explicitly while reading a flat file is a safe habit whichever version you run. With FALSE, character columns are imported as character columns. You can convert them to factors later if you want, after pre-processing the data.

na.strings : This is by default set to “NA”. This means that any value written as “NA” in the file will be treated as missing (NA) after reading. You can change this to other strings, or pass a vector of several strings.

colClasses : By default this is set to NA, meaning no forced classes. However, you can pass a vector to force classes on the incoming columns. Without forcing, a column which contains only numbers or NA strings will be read as numeric. A column which contains even a single character value will be read as character (provided stringsAsFactors is FALSE; otherwise it’ll be stored as a factor).

nrows : By default it is set to -1, which means all the rows from the file will be read. You can restrict that by passing a number smaller than the number of rows in the file.

skip : By default this is set to 0. By assigning some number you can force R to skip the first few rows of the file.

We will skip discussion of the rest of the rarely used options. Let’s look at one example. We’ll be using function read.csv.
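As a small warm-up before the real file, here is a minimal sketch of several of these options working together. The toy file contents, the column names id, city and score, and the marker "missing" are all invented for this illustration; they are not part of the bank data.

```r
# Write a tiny semicolon-separated file to a temporary location
# (toy data invented for this illustration).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "exported by some tool",
  "a;b;c",
  "1;Pune;10.5",
  "2;missing;7",
  "3;Delhi;missing",
  "4;Mumbai;3.2"
), toy_path)

toy <- read.csv(
  toy_path,
  sep = ";",                              # columns are separated by semicolons
  skip = 1,                               # ignore the junk first line
  header = TRUE,                          # the next line is a header row ...
  col.names = c("id", "city", "score"),   # ... whose names we override
  na.strings = "missing",                 # treat "missing" as NA in every column
  colClasses = c("integer", "character", "numeric"),
  nrows = 3                               # read only the first three data rows
)

print(toy)
str(toy)
```

Printing the result and calling str() on it is a quick way to confirm that the names, classes and missing values came out the way you intended.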

Remember, if you are going to pass just the file name, you need to set your working directory to the folder which contains the file. You can do this with the function setwd, which is short for “set working directory”.

setwd("Here/Goes/Path/To/Your/Data/Folder/")

Also note that if you are working on a Windows machine, you’d need to replace every “\” in your path with “/” or “\\”.

You can check your current working directory by typing getwd():

getwd()
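A minimal sketch of these two functions in action is shown here. The temporary directory stands in for your real data folder, and the Windows-style path is a made-up example, not a path from this tutorial.

```r
# setwd() invisibly returns the previous working directory,
# so it can be restored afterwards.
data_dir <- tempdir()          # stand-in for your real data folder
old_wd <- setwd(data_dir)

# Both sides are normalised because some systems report temp paths via symlinks.
same_dir <- normalizePath(getwd()) == normalizePath(data_dir)
print(same_dir)

# file.path() joins pieces with "/", which R also accepts on Windows.
example_path <- file.path("C:", "Users", "me", "data", "bank-full.csv")
print(example_path)

# Inside an R string a single backslash has to be written as "\\".
windows_style <- "C:\\Users\\me\\data"
converted <- gsub("\\", "/", windows_style, fixed = TRUE)
print(converted)

setwd(old_wd)                  # go back to where we started
```

Saving the value returned by setwd is a handy way to avoid leaving your session pointed at a different folder than the one you started in.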

We are going to import the data file bank-full.csv here. Let’s begin by passing just the file name and letting all the other options take their defaults. We’ll change some as we come across issues with the imported data.

bd=read.csv("bank-full.csv") 
head(bd,2)

You can see that we have been fooled by the file extension: we assumed that the separator for the data is a comma, whereas in reality it is “;”. Let’s tell that to R by using the option sep.

bd=read.csv("bank-full.csv",sep=";") 
head(bd,2)
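One way to avoid being fooled like this is to look at the raw text before importing. The sketch below uses a small made-up file rather than bank-full.csv, so the rows shown are illustrative only.

```r
# A toy semicolon-separated file with a misleading .csv extension.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("age;job;balance",
             "30;teacher;100",
             "41;clerk;250"), toy_path)

# Look at the raw lines first: the separator is usually obvious.
first_lines <- readLines(toy_path, n = 2)
print(first_lines)

# count.fields() reports how many fields each line has for a given separator.
fields_comma <- count.fields(toy_path, sep = ",")
fields_semicolon <- count.fields(toy_path, sep = ";")
print(fields_comma)
print(fields_semicolon)

# With the wrong separator everything lands in a single column.
wrong <- read.csv(toy_path)
right <- read.csv(toy_path, sep = ";")
print(ncol(wrong))
print(ncol(right))
```

A field count of one per line is a strong hint that the separator you assumed is not the one the file uses.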

OK, this looks better. Now let’s look at our data.

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.2

glimpse(bd)

You can see that all of our character columns have been stored as factors (this is what happens on R versions before 4.0.0, where stringsAsFactors defaults to TRUE). This needs to be avoided, and we can do so by using the option stringsAsFactors. Even the best of us make the mistake of misspelling that option. Don’t get frustrated. It’ll become a habit to make the mistake, realise it and correct it.

bd=read.csv("bank-full.csv",sep=";",stringsAsFactors = FALSE)
glimpse(bd)
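Keeping columns as character first makes string clean-up straightforward, and the conversion to factor can happen afterwards. Here is a minimal sketch on a toy data frame; the values, including the stray space, are invented for illustration.

```r
# Toy data: note the stray leading space in the second job value.
toy <- data.frame(
  age = c(58, 44, 33),
  job = c("management", " technician", "management"),
  stringsAsFactors = FALSE
)
classes_before <- sapply(toy, class)
print(classes_before)

# String clean-up works directly on a character column.
toy$job <- trimws(toy$job)

# Convert to factor only once the values are clean.
toy$job <- factor(toy$job)
classes_after <- sapply(toy, class)
print(classes_after)
print(levels(toy$job))
```

Had the column been a factor from the start, the stray space would have produced a separate level that needs to be merged later.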

So that’s taken care of, a big relief. Next, let’s look at what values our variable job takes.

table(bd$job)

You can see that there are 288 observations where the value is unknown. If you want, you can set it to missing by using the option na.strings. But remember, this will set the value unknown as missing in all the columns. If you want to do it for only one of the columns, do that after you have imported the data.

bd=read.csv("bank-full.csv",sep=";",stringsAsFactors = FALSE,na.strings = "unknown")
sum(is.na(bd$job))

## [1] 288

You can see that column job now has 288 missing values. This was to show you how to use the option na.strings. In general it is not good practice to set an arbitrary value as missing. So for practice it’s all right, but don’t set unknown to missing unless you have a good reason to do so. In fact, for many categorical variables, unknown itself can be taken as a valid category, as you’ll realise later.
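To mark a value as missing in just one column, you can do the replacement after import instead of using na.strings. Here is a minimal sketch on a toy data frame; the column values are invented and only mimic the shape of the bank data.

```r
# Toy data with "unknown" appearing in two different columns.
toy <- data.frame(
  job = c("admin.", "unknown", "services"),
  education = c("unknown", "primary", "tertiary"),
  stringsAsFactors = FALSE
)

# Replace "unknown" with NA in the job column only.
toy$job[toy$job == "unknown"] <- NA

job_missing <- sum(is.na(toy$job))
education_missing <- sum(is.na(toy$education))
print(job_missing)
print(education_missing)

# useNA = "ifany" makes table() show the NA count as well.
print(table(toy$job, useNA = "ifany"))
```

The education column keeps its unknown value as an ordinary category, which is often what you want for categorical variables.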

We don’t need to change the default values of the other options for this import. The same will be true for most of your data as well. If it is not, feel free to use any of the options described above.