Saturday, September 12, 2015

Example of Exploratory Data Analysis Using R

I used the AADHAAR data set from the Indian Government website. Since the data was huge, I used only two days of data for this analysis. The data contains: Registrar, Enrolment Agency, State, District, Sub District, Pin Code, Gender, Age, Aadhaar generated, Enrolment Rejected, Residents providing email, Residents providing mobile number.
We will use R code to do our analysis here:

Let's first load our libraries.
suppressMessages(library(tidyr))
## Warning: package 'tidyr' was built under R version 3.2.2
suppressMessages(library(dplyr))
## Warning: package 'dplyr' was built under R version 3.2.2
suppressMessages(library(ggplot2))
To plot the total Aadhaar generated by each state and see the distribution, let's first read the data for the two days and bind them together.
df_05<-read.csv(file=file.path("E:","DataScienceWithR/Nano Degree Udacity/Data Analysis using R/Lesson 5/Files/UIDAI-ENR-DETAIL-20150905-20150911/UIDAI-ENR-DETAIL-20150905.csv"))
df_06<-read.csv(file=file.path("E:","DataScienceWithR/Nano Degree Udacity/Data Analysis using R/Lesson 5/Files/UIDAI-ENR-DETAIL-20150905-20150911/UIDAI-ENR-DETAIL-20150906.csv"))

df_05<-rbind(df_05,df_06)
Now, let's look at the distribution of the data for each state. To do this, we have to group the data by state. We will use the dplyr package to do this.
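The grouping step itself is not shown in this post. A minimal sketch of what it might look like, assuming it mirrors the group-by-gender pipeline used later; the `df_toy` data frame and its values are hypothetical stand-ins for `df_05`:

```r
library(dplyr)

# Toy stand-in for df_05 (hypothetical values; the real data comes from the CSV files)
df_toy <- data.frame(
  State = c("Karnataka", "Karnataka", "Tamil Nadu"),
  Aadhaar.generated = c(10, 5, 20),
  Enrolment.Rejected = c(1, 0, 4)
)

# Total generated and rejected enrolments per state
df_toy_bystate <- df_toy %>%
  group_by(State) %>%
  summarise(AadhaarGenerated = sum(Aadhaar.generated),
            AadhaarRejected = sum(Enrolment.Rejected))

print(df_toy_bystate)
```

Running the same pipeline on df_05 instead of df_toy would give a per-state table like the df_05_bystate data set that the plots in this post use.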
Now, let's draw a bar chart with one bar per state.

ggplot(aes(x=State,y=AadhaarGenerated),data=df_05_bystate)+geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))




Now, let's see the distribution by gender. We have to group the data again, this time by state and gender.
df_05_bystate_gender<-df_05 %>%
        group_by(State,Gender) %>%
        summarise(AadhaarGenerated=sum(Aadhaar.generated),
                  AadhaarRejected=sum(Enrolment.Rejected)) %>%
        mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Our data is now grouped in df_05_bystate_gender. Now let's draw the plot.

ggplot(aes(x=State,y=AadhaarGenerated,fill=Gender),data=df_05_bystate_gender)+geom_bar(stat="identity",position="dodge")+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))




Now let's see the ratio of rejection in each state. To do this, we have to create a new variable, rejection_ratio.
df_05_bystate<-df_05_bystate %>%
        mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Now we have the new variable in our state-level data set. We can use this data set and plot the ratio to see the distribution.

ggplot(aes(x=State,y=rejection_ratio),data=df_05_bystate)+geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))



Now, let's see the rejection ratio by gender. See what we discovered here: the rejection ratio for Transgender is 100% for Uttar Pradesh. What's going on here? We already have our data grouped by state and gender, so we will use the same data set.

ggplot(aes(x=State,y=rejection_ratio,fill=Gender),data=df_05_bystate_gender)+geom_bar(stat="identity",position="dodge")+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))



Since it is just a ratio, it is very important to know the underlying numbers. If we subset our data to see how many transgender cases were rejected, we see that Uttar Pradesh had only 1 case and it was rejected, which could have happened for several reasons. We can check the records for transgender enrolments to cross-check this.
df_05_bystate_gender[df_05_bystate_gender$Gender=='T',]

## Source: local data frame [7 x 5]
## Groups: State [7]
##
##            State Gender AadhaarGenerated AadhaarRejected rejection_ratio
##           (fctr) (fctr)            (int)           (int)           (dbl)
## 1      Karnataka      T                1               0             0.0
## 2 Madhya Pradesh      T                1               0             0.0
## 3    Maharashtra      T                2               0             0.0
## 4     Tamil Nadu      T                6               0             0.0
## 5      Telangana      T                1               0             0.0
## 6  Uttar Pradesh      T                0               1             1.0
## 7    West Bengal      T                1               1             0.5
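To go one step further and look at the raw rows behind a surprising group, the ungrouped data can be filtered directly. A minimal sketch, assuming the Gender and State columns hold the values shown in the output above; the `df_toy` rows are hypothetical stand-ins for df_05:

```r
library(dplyr)

# Toy stand-in for df_05 (hypothetical rows)
df_toy <- data.frame(
  State = c("Uttar Pradesh", "Uttar Pradesh", "West Bengal"),
  Gender = c("T", "M", "T"),
  Aadhaar.generated = c(0, 12, 1),
  Enrolment.Rejected = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Keep only the raw rows behind the surprising group
trans_up <- df_toy %>%
  filter(Gender == "T", State == "Uttar Pradesh")

print(trans_up)
```

On the real data, the remaining columns of such rows (Registrar, Enrolment Agency, District and so on) may hint at why the enrolment was rejected.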
Now, let's see the Aadhaar card distribution by age. To do this, let's group our data by age.
df_05_byage<-df_05 %>%
        group_by(Age) %>%
        summarise(AadhaarGenerated=sum(Aadhaar.generated),
                  AadhaarRejected=sum(Enrolment.Rejected)) %>%
        mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Our data is now grouped, so we can draw the plot. Age is numeric, so the axis breaks are set on a continuous scale.
ggplot(aes(x=Age,y=AadhaarGenerated),data=df_05_byage)+geom_line()+scale_x_continuous(breaks=seq(0,100,5))+geom_point()+geom_smooth()






There seems to be a strong negative correlation between age and the number of Aadhaar generated. Let's see the correlation coefficient for these two variables.
cor.test(df_05_byage$Age,df_05_byage$AadhaarGenerated)

##
##  Pearson's product-moment correlation
##
## data:  df_05_byage$Age and df_05_byage$AadhaarGenerated
## t = -18.472, df = 111, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9077159 -0.8146311
## sample estimates:
##        cor
## -0.8686419
The coefficient of about -0.87 confirms a strong negative correlation between age and the number of Aadhaar generated: as age increases, the number of Aadhaar generated decreases.
Let's also see the distribution by gender. Here too we can see a similar trend of male dominance over female. However, to confirm this assumption we would have to perform statistical tests; this exploratory data analysis only gives a good clue.

Now, let's see the distribution by age and gender. To do this, let's group the data again, this time by age and gender.
df_05_byage_gender<-df_05 %>%
        group_by(Age,Gender) %>%
        summarise(AadhaarGenerated=sum(Aadhaar.generated),
                  AadhaarRejected=sum(Enrolment.Rejected)) %>%
        mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Let's now create the plot.
ggplot(aes(x=Age,y=AadhaarGenerated,color=Gender),data=df_05_byage_gender)+geom_line()+scale_x_continuous(breaks=seq(0,100,5))







Let's add one more variable to our analysis and see the distribution for each state.
df_05_bystate_age_gender<-df_05 %>%
        group_by(State,Age,Gender) %>%
        summarise(AadhaarGenerated=sum(Aadhaar.generated),
                  AadhaarRejected=sum(Enrolment.Rejected)) %>%
        mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Let's draw the plot with one facet for each state.
ggplot(aes(x=Age,y=AadhaarGenerated,color=Gender),data=df_05_bystate_age_gender)+geom_line()+scale_x_continuous(breaks=seq(0,100,10))+facet_wrap(~State)+ylim(c(0,5000))


Notice the trend for the state of Tamil Nadu here: there is a large difference by gender. However, we can't confirm this until we run statistical tests on the data.
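One possible follow-up is a chi-squared test of independence between region and gender. The sketch below is only an illustration: the counts are hypothetical, not taken from the data set, and the choice of test is an assumption rather than something this analysis has validated.

```r
# Hypothetical counts: rows are regions, columns are genders
counts <- matrix(c(3000, 1800,
                   52000, 48000),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Region = c("Tamil Nadu", "Other states"),
                                 Gender = c("M", "F")))

# Chi-squared test of independence between region and gender
result <- chisq.test(counts)
print(result$p.value)
```

A small p-value would suggest that the gender split in Tamil Nadu differs from the rest of the country by more than chance. With the real data, the table would be built from the per-state, per-gender totals instead.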