Sunday 21 January 2018

Introduction to R - Part 4 : Factors


Factors


If you have a background in statistics, you may have heard of categorical variables. If not, the idea is simple.

Unlike a numerical variable, a categorical variable can only take one of a limited number of categories. R has a specific data structure for these variables, known as a factor.

Blood type is a typical example of a categorical variable that can be stored as a factor.

Let's see how we can create a factor in R.

Create Factor
We will first create a character vector called blood_vector and then create the factor from it.

> blood_vector <- c("B", "AB", "O", "A", "O", "O", "A", "B")
> blood_vector
[1] "B"  "AB" "O"  "A"  "O"  "O"  "A"  "B"

> blood_factor <- factor(blood_vector)
> blood_factor
[1] B  AB O  A  O  O  A  B
Levels: A AB B O

The output looks different from the original vector: there are no double quotes, and there is an extra line listing the factor levels. These levels correspond to the different categories.

R does two things when you call the factor function on a character vector:

  1. It scans through the vector to find the different categories in it. In our example that's "A", "AB", "B" and "O". Notice that R sorts the levels alphabetically.
  2. Then it converts the character vector to a vector of integers. Each integer is the position of a level, and R uses the matching level as the character value when the factor is displayed.

Let's look at the structure of the factor to confirm this.

> str(blood_factor)
 Factor w/ 4 levels "A","AB","B","O": 3 2 4 1 4 4 1 3

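So in this example there are 4 unique categories, and each element is stored as an integer that points to one of them: "B" is stored as 3, "AB" as 2, "O" as 4 and "A" as 1.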
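
To make the two steps concrete, here is a small sketch that pulls the levels and the integer codes apart and then puts them back together. The names lv, codes and rebuilt are only illustrative and are not part of the original example.

```r
# Same data as the blood_vector used above.
blood_vector <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_factor <- factor(blood_vector)

# Step 1: the distinct categories, sorted alphabetically.
lv <- levels(blood_factor)

# Step 2: each element stored as an integer position into the levels.
codes <- as.integer(blood_factor)

# Indexing the levels by the codes recovers the original strings.
rebuilt <- lv[codes]
identical(rebuilt, blood_vector)
```

The last line should return TRUE, which shows that the factor carries the same information as the character vector, just stored as levels plus integer codes.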

Rename Factor Labels
To rename the levels you can use the levels function. It works in a similar way to the names function for vectors.
> levels(blood_factor) <- c("BT_A","BT_AB","BT_B","BT_O")
> blood_factor
[1] BT_B  BT_AB BT_O  BT_A  BT_O  BT_O  BT_A  BT_B
Levels: BT_A BT_AB BT_B BT_O

You can also do this when you define the factor, by passing the labels argument to the factor function.
> factor(blood_vector, labels = c("BT_A","BT_AB","BT_B","BT_O"))
[1] BT_B  BT_AB BT_O  BT_A  BT_O  BT_O  BT_A  BT_B
Levels: BT_A BT_AB BT_B BT_O

For both of these options you need to list the labels in the same order as the factor levels, which is alphabetical by default. This is error-prone: labels are matched by position, so if the order is different, elements get incorrect labels without any warning.
To avoid this problem you can set the levels as well as the labels explicitly when defining the factor.
> factor(blood_vector,
+        levels = c("O","A","B","AB"),
+        labels = c("BT_O","BT_A","BT_B","BT_AB"))
[1] BT_B  BT_AB BT_O  BT_A  BT_O  BT_O  BT_A  BT_B
Levels: BT_O BT_A BT_B BT_AB
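
The ordering pitfall is easy to reproduce. The sketch below passes the same labels with and without explicit levels; the names wrong and safe are hypothetical and only used for illustration.

```r
blood_vector <- c("B", "AB", "O", "A", "O", "O", "A", "B")

# Labels given in a different order than the default alphabetical levels
# (A, AB, B, O). R pairs them by position and raises no warning.
wrong <- factor(blood_vector,
                labels = c("BT_O", "BT_A", "BT_B", "BT_AB"))

# Same labels, but with the levels spelled out in the matching order.
safe <- factor(blood_vector,
               levels = c("O", "A", "B", "AB"),
               labels = c("BT_O", "BT_A", "BT_B", "BT_AB"))

# The second element of the data is "AB".
as.character(wrong[2])  # mislabeled as "BT_A"
as.character(safe[2])   # correctly labeled "BT_AB"
```

Because the mismatch produces no error, spelling out both levels and labels is the safer habit whenever the label order is not alphabetical.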

Nominal versus Ordinal
Nominal categorical variables have no implied order. For example, blood type A is not greater than or less than blood type O.

Comparing two values of such a factor with < or > gives NA and a warning. Let's try that out in R.

> blood_factor[1] < blood_factor[2]
[1] NA
Warning message:
In Ops.factor(blood_factor[1], blood_factor[2]) :
  ‘<’ not meaningful for factors
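
Equality tests, on the other hand, are still meaningful for nominal factors. A minimal sketch, assuming a fresh factor with the original, unrenamed blood types (blood_plain and the other names here are hypothetical):

```r
blood_plain <- factor(c("B", "AB", "O", "A", "O", "O", "A", "B"))

# == and != work on unordered factors; only <, >, <= and >= are rejected.
same_type <- blood_plain[1] == blood_plain[8]   # both are "B"
diff_type <- blood_plain[1] == blood_plain[2]   # "B" versus "AB"

# A factor can also be compared against a plain string.
n_type_o <- sum(blood_plain == "O")

# is.ordered reports whether a factor has an implied order.
is.ordered(blood_plain)
```

Here same_type should be TRUE, diff_type FALSE, and is.ordered should return FALSE because the factor is nominal.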

But there are cases where the order matters. These are ordinal variables. An example is T-shirt sizes: L > M > S.

> vector_tshirt <- c("M","L","S","S","L","M","L","M")

> factor_tshrt <- factor(vector_tshirt , ordered = TRUE , levels = c("S","M","L"))
> factor_tshrt
[1] M L S S L M L M
Levels: S < M < L

If we now compare two elements, the result is TRUE or FALSE instead of a warning. Here the first element (M) is smaller than the second (L).

> factor_tshrt[1] < factor_tshrt[2]
[1] TRUE
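
Once a factor is ordered, other order-based operations work too. The following sketch reuses the T-shirt example; at_least_m, largest and sorted are illustrative names that do not appear in the original post.

```r
vector_tshirt <- c("M", "L", "S", "S", "L", "M", "L", "M")
factor_tshrt <- factor(vector_tshirt, ordered = TRUE,
                       levels = c("S", "M", "L"))

# Compare every element against a level given as a string.
at_least_m <- factor_tshrt >= "M"

# max and min follow the level order, not the alphabet.
largest <- max(factor_tshrt)

# sort arranges the elements from the smallest to the largest level.
sorted <- sort(factor_tshrt)
sorted
```

Note that the order comes from the levels argument: without it, the levels would be sorted alphabetically as L < M < S, which is not the intended size order.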

Summary

  • Factors are used to store categorical variables in R
  • Factors are stored as integer vectors, with the levels kept as the matching character values
  • You can change the factor levels using the levels function or the labels argument of the factor function
  • R distinguishes between unordered and ordered factors, which correspond to nominal and ordinal variables

