Categorical data is data that takes one of a predefined set of values. Recording a person's age as “Child”, “Adult” or “Senior” instead of as a number is one example of treating age as categorical. However, before using categorical data, one must know about its various forms.
First of all, categorical data may or may not have an order. Saying that the size of a box is small, medium or large implies an order: small < medium < large. The same does not hold for, say, sports equipment, which is also categorical data but is distinguished by names like dumbbell, grippers or gloves; there is no natural basis on which to order these items. Categories that can be ordered are known as “ordinal”, while those with no such ordering are “nominal”.
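The ordinal/nominal distinction can be expressed directly with R factors. The snippet below is a minimal sketch using made-up values for the box sizes and equipment names mentioned above; the variable names are only illustrative.

```r
# Ordinal: the levels carry an order, so comparisons are meaningful
box_size <- factor(c("medium", "small", "large", "small"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)
is_smaller_than_large <- box_size < "large"   # TRUE TRUE FALSE TRUE

# Nominal: the levels are just labels, with no meaningful order
equipment <- factor(c("dumbbell", "gloves", "grippers", "gloves"))

is_ordinal <- is.ordered(box_size)     # TRUE
is_nominal <- !is.ordered(equipment)   # TRUE
```

Comparing values of an unordered factor with < is not meaningful, which is why the order has to be declared explicitly for ordinal data.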
Many a time, an analyst converts data from numerical to categorical to make things easier. Besides using an “Adult”, “Child” or “Senior” class instead of age as a number, there can also be special cases such as using “regular item” or “accessory” for equipment. In many problems, the output is also categorical: whether a customer will churn or not, whether a person will buy a product or not, whether an item is profitable, and so on. All problems where the output is categorical are known as classification problems. R provides various ways to transform and handle categorical data.
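As a small illustration of turning a number into a category, the sketch below bins ages with cut(). The ages and the cut-off points (18 and 65) are assumptions chosen for the example, not values from any dataset.

```r
ages <- c(4, 17, 18, 35, 64, 65, 80)

# Hypothetical cut-offs: under 18 = Child, 18 to 64 = Adult, 65 and over = Senior
age_group <- cut(ages,
                 breaks = c(0, 18, 65, Inf),
                 labels = c("Child", "Adult", "Senior"),
                 right = FALSE)   # intervals are [0,18), [18,65), [65,Inf)

age_labels <- as.character(age_group)
age_labels
```

With right = FALSE each interval includes its lower bound, so an age of exactly 18 falls in the Adult group and 65 in the Senior group.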
A simple way to transform data into classes is by using the split and cut functions available in R or the cut2 function in Hmisc library.
Let’s use the iris dataset to categorize data. This dataset ships with R and can be loaded using the data() function. It consists of 150 observations of 5 features: Sepal Length, Sepal Width, Petal Length, Petal Width and Species.
data(iris) #Load the iris dataset
x=iris #store a copy of the dataset into x
#using the split function
list1=split(x, cut(x$Sepal.Length, 3)) #This will create a list of 3 split on the basis of sepal.length
summary(list1) #View the class ranges for list1
          Length Class      Mode
(4.3,5.5] 5      data.frame list
(5.5,6.7] 5      data.frame list
(6.7,7.9] 5      data.frame list
#using Hmisc library
library(Hmisc)
list2=split(x, cut2(x$Sepal.Length, g=3)) #This will also create a similar list but with left boundary included
summary(list2) #View the class ranges for list2
          Length Class      Mode
[4.3,5.5) 5      data.frame list
[5.5,6.4) 5      data.frame list
[6.4,7.9] 5      data.frame list
The first list, list1, divides the dataset into 3 groups by splitting the range of sepal length into equal intervals. The second list, list2, also divides the dataset into 3 groups based on sepal length, but it tries to keep an equal number of values in each group. We can check the overall range using the range function.
#Range of sepal.length
range(x$Sepal.Length) #The output is 4.3 to 7.9
We can see that list1 consists of three groups: the first has the range 4.3-5.5, the second 5.5-6.7 and the third 6.7-7.9, so each group is 1.2 wide. In list2 the groups are 4.3-5.5, 5.5-6.4 and 6.4-7.9. This is the difference between the two outputs: list1 keeps the width of the three groups equal, whereas list2 keeps the number of values in each group balanced. An alternative to the code above is to just add the group range as another feature in the dataset.
x$class <- cut(x$Sepal.Length, 3) #Add the class label instead of creating a list of data
x$class2 <- cut2(x$Sepal.Length, g=3) #Add the cut2 class label; g=3 asks for 3 quantile groups
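The contrast between equal-width and roughly equal-count groups can also be seen with base R alone. The sketch below is an approximation that breaks at the tertiles of the data; it is not how cut2 works internally, and cut2 may choose and label its cut points differently.

```r
x <- iris

# Equal-width bins: the range 4.3 to 7.9 is split into 3 intervals of equal length
equal_width <- cut(x$Sepal.Length, 3)

# Roughly equal-count bins: break at the tertiles of the data
# (a base-R approximation, not the cut2 algorithm)
tertiles <- quantile(x$Sepal.Length, probs = seq(0, 1, length.out = 4))
equal_count <- cut(x$Sepal.Length, breaks = tertiles, include.lowest = TRUE)

table(equal_width)   # group sizes differ widely
table(equal_count)   # group sizes are closer to each other
```

Because many flowers share the same sepal length, the equal-count groups cannot be made exactly equal; ties at a cut point all land in the same group.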
If the classes are to be indexed as numbers 1,2,3… instead of their actual range, we can convert the output to numeric. The indexes are also easier to work with than the range of each group. Here we store them in a new feature, group, so that the range labels in class stay available.
x$group=as.numeric(x$class)
In our example, the group values will now be one of 1, 2 or 3. Suppose we now want to find the number of values in each class. How many rows fall into class 1? Or class 2? We can use the table() function in R to get that count.
class_length=table(x$group)
class_length #The sizes are 59, 71 and 20 as indicated in the output below
1 2 3
59 71 20
This is a good way to get a quick summary of the classes and their sizes. However, this is where it ends. A table is awkward to use in further computations or to combine with our dataset, so class_length needs to be transformed to a Data Frame before it becomes useful. The issue is that the transformation names the variables Var1 and Freq, because the table does not retain the original feature name.
#Transforming the table to a Data Frame
class_length_df=as.data.frame(class_length)
class_length_df #The output is:
Var1 Freq
1 1 59
2 2 71
3 3 20
#Here we see that the variable is named Var1. We need to rename it using the names() function
names(class_length_df)[1]="group" #Changing the first variable Var1 to group
class_length_df
group Freq
1 1 59
2 2 71
3 3 20
In this case, where we have only a few variables, we can easily rename the variable by position, but this is risky in a large dataset, where one can accidentally rename another important feature.
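One way to reduce that risk, sketched below with base R only, is to name the argument passed to table() so that the name carries through to the Data Frame, and to match on a column's name rather than its position when a rename is still needed. The target name freq is just an example.

```r
x <- iris
x$group <- as.numeric(cut(x$Sepal.Length, 3))

# Naming the argument to table() carries the name through to the data frame
class_length_df <- as.data.frame(table(group = x$group))
original_names <- names(class_length_df)   # "group" "Freq"

# If a rename is still needed, match on the old name instead of a position
names(class_length_df)[names(class_length_df) == "Freq"] <- "freq"
class_length_df
```

If no column has the old name, the name-based assignment changes nothing, whereas a positional rename would silently overwrite whichever column happens to sit in that position.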
As I said, there is more than one way to do the same thing in R. All this hassle could have been avoided if there were a function that generates our class sizes as a Data Frame to start with. The “plyr” package has the count() function, which accomplishes this task. Using it is as simple as passing the original Data Frame and the name of the variable we want to count.
#Using the plyr library
library(plyr)
class_length2=count(x,"group") #Using the count function
class_length2 #The output is:
group freq
1 1 59
2 2 71
3 3 20
The same output, in fewer steps. Let’s verify our output.
#Checking the data type of class_length2
class(class_length2) #Output is data.frame
The plyr package is very useful when it comes to categorical data. As we see, the count() function is really flexible and can generate the Data Frame we want. It is now easy to add the frequency of the categorical data to the original Data Frame x.
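Adding the frequency back to the original Data Frame could look like the sketch below. To keep it runnable without plyr, the counts are built here with base R in the same shape as the count() result (columns group and freq); the Data Frame returned by count(x, "group") could presumably be merged in the same way.

```r
x <- iris
x$group <- as.numeric(cut(x$Sepal.Length, 3))

# Counts per group, shaped like the count() result (columns: group, freq)
counts <- as.data.frame(table(group = x$group), responseName = "freq",
                        stringsAsFactors = FALSE)
counts$group <- as.numeric(counts$group)   # table() stores the labels as text

# Join the counts back so every row carries the size of its own group
x_with_freq <- merge(x, counts, by = "group")
head(x_with_freq)
```

Note that merge() sorts the result by the key column, so the row order of x_with_freq differs from the original x.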
Comparison
The table() function is really useful as a quick summary and, with a little work, can produce an output similar to that given by the count() function. When we go a little further towards N-way tables, the table() function transformed to a Data Frame works much like the count() function.
#Using the table for 2 way
two_way=as.data.frame(table(subset(x,select=c("class","class2"))))
two_way
class class2 Freq
1 (4.3,5.5] [4.3,5.5) 52
2 (5.5,6.7] [4.3,5.5) 0
3 (6.7,7.9] [4.3,5.5) 0
4 (4.3,5.5] [5.5,6.4) 7
5 (5.5,6.7] [5.5,6.4) 49
6 (6.7,7.9] [5.5,6.4) 0
7 (4.3,5.5] [6.4,7.9] 0
8 (5.5,6.7] [6.4,7.9] 22
9 (6.7,7.9] [6.4,7.9] 20
two_way_count=count(x,c("class","class2"))
two_way_count
class class2 freq
1 (4.3,5.5] [4.3,5.5) 52
2 (4.3,5.5] [5.5,6.4) 7
3 (5.5,6.7] [5.5,6.4) 49
4 (5.5,6.7] [6.4,7.9] 22
5 (6.7,7.9] [6.4,7.9] 20
The difference is still noticeable. While both outcomes are similar, the count() function omits the combinations whose frequency is zero. Hence, count() gives a cleaner output than table(), which lists every possible combination of the variables. What if we want the N-way frequency table of the entire Data Frame? In this case, we can simply pass the entire Data Frame into the table() or count() function. However, table() will be very slow here, as it allocates and fills a cell for every possible combination of feature values, whereas count() only calculates and displays the combinations where the frequency is non-zero.
#For the entire dataset
full1=count(x) #much faster
full2=as.data.frame(table(x))
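If only base R is available, the zero-frequency rows of a table() result can be dropped after the fact. The sketch below uses class and Species (rather than class2, which needs Hmisc) so that it runs on its own; it mimics the shape of the count() output but still builds the full table first, so it does not share count()'s speed advantage.

```r
x <- iris
x$class <- cut(x$Sepal.Length, 3)

# Full frequency table: one row per possible combination, zeros included
all_cells <- as.data.frame(table(class = x$class, Species = x$Species))

# Keep only the combinations that actually occur, as count() does
nonzero_cells <- subset(all_cells, Freq > 0)

nrow(all_cells)       # 9 possible combinations
nrow(nonzero_cells)   # 8, since no setosa falls in the widest-sepal class
```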
What if we want to display our data in a cross-tabulated format instead of displaying as a list? We have a function xtabs for this purpose.
cross_tab = xtabs(~ class + class2, x)
cross_tab
           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]        52         7         0
  (5.5,6.7]         0        49        22
  (6.7,7.9]         0         0        20
However, the object returned by this function has the classes xtabs and table.
class(cross_tab)
"xtabs" "table"
Converting it to a Data Frame regenerates the same output as the table() function does.
y=as.data.frame(cross_tab)
y
class class2 Freq
1 (4.3,5.5] [4.3,5.5) 52
2 (5.5,6.7] [4.3,5.5) 0
3 (6.7,7.9] [4.3,5.5) 0
4 (4.3,5.5] [5.5,6.4) 7
5 (5.5,6.7] [5.5,6.4) 49
6 (6.7,7.9] [5.5,6.4) 0
7 (4.3,5.5] [6.4,7.9] 0
8 (5.5,6.7] [6.4,7.9] 22
9 (6.7,7.9] [6.4,7.9] 20
There is another difference when we use cross-tabulated output for N-way classification when N is 3 or more. As only 2 features can be shown in a cross-tabulated format, xtabs splits the data on the third variable and displays a cross-tabulated output for each of its values. Illustrating this for class, class2 and Species:
threeway_cross_tab = xtabs(~ class + class2 + Species, x)
threeway_cross_tab
, , Species = setosa

           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]        45         2         0
  (5.5,6.7]         0         3         0
  (6.7,7.9]         0         0         0

, , Species = versicolor

           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]         6         5         0
  (5.5,6.7]         0        28         8
  (6.7,7.9]         0         0         3

, , Species = virginica

           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]         1         0         0
  (5.5,6.7]         0        18        14
  (6.7,7.9]         0         0        17
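The three-way result is a 3-dimensional array, and each printed block is one slice of it along the third variable. The sketch below shows how a single slice could be pulled out. The petal_bin feature is a hypothetical second grouping introduced only so the example runs without Hmisc.

```r
x <- iris
x$class <- cut(x$Sepal.Length, 3)
x$petal_bin <- cut(x$Petal.Length, 2)   # hypothetical second grouping, for illustration

three_way <- xtabs(~ class + petal_bin + Species, data = x)
dim(three_way)   # 3 classes x 2 petal bins x 3 species

# One 2-way cross-tabulation: the slice for a single species
setosa_slice <- three_way[, , "setosa"]
setosa_slice
sum(setosa_slice)   # 50, the number of setosa flowers
```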
The output becomes larger and more difficult to read as N increases for an N-way cross-tabulated output. In this situation again, the count() function produces a clean output which is easy to read.
threeway_cross_tab_df = count(x, c('class', 'class2', 'Species'))
threeway_cross_tab_df
class class2 Species freq
1 (4.3,5.5] [4.3,5.5) setosa 45
2 (4.3,5.5] [4.3,5.5) versicolor 6
3 (4.3,5.5] [4.3,5.5) virginica 1
4 (4.3,5.5] [5.5,6.4) setosa 2
5 (4.3,5.5] [5.5,6.4) versicolor 5
6 (5.5,6.7] [5.5,6.4) setosa 3
7 (5.5,6.7] [5.5,6.4) versicolor 28
8 (5.5,6.7] [5.5,6.4) virginica 18
9 (5.5,6.7] [6.4,7.9] versicolor 8
10 (5.5,6.7] [6.4,7.9] virginica 14
11 (6.7,7.9] [6.4,7.9] versicolor 3
12 (6.7,7.9] [6.4,7.9] virginica 17
The same output is presented in a concise way by count(). The count() function in plyr package is thus very useful when it comes to counting frequencies of categorical variables.