I used the Aadhaar data set from the Indian Government website. Because the data is huge, I used only two days of data for this analysis. The data contains: Registrar, Enrolment Agency, State, District, Sub District, Pin Code, Gender, Age, Aadhaar generated, Enrolment Rejected, Residents providing email, and Residents providing mobile number.
We will use R for the analysis.
Let's first load our libraries.
suppressMessages(library(tidyr))
## Warning: package 'tidyr' was built under R version 3.2.2
suppressMessages(library(dplyr))
## Warning: package 'dplyr' was built under R version 3.2.2
suppressMessages(library(ggplot2))
To plot the total number of Aadhaar generated by each state and see the distribution, let's first read the data for the two days and bind them into one data frame.
df_05 <- read.csv(file=file.path("E:","DataScienceWithR/Nano Degree Udacity/Data Analysis using R/Lesson 5/Files/UIDAI-ENR-DETAIL-20150905-20150911/UIDAI-ENR-DETAIL-20150905.csv"))
df_06 <- read.csv(file=file.path("E:","DataScienceWithR/Nano Degree Udacity/Data Analysis using R/Lesson 5/Files/UIDAI-ENR-DETAIL-20150905-20150911/UIDAI-ENR-DETAIL-20150906.csv"))
df_05 <- rbind(df_05,df_06)
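If more than two days of data were needed, reading and binding the files one by one would get tedious. A minimal sketch of one way to stack every CSV in a folder is shown here; the helper name read_daily_files and the two tiny demo files are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: read every CSV in a folder and stack the rows.
read_daily_files <- function(dir) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(paths, read.csv))
}

# Demo on two small made-up files written to a fresh temporary folder.
demo_dir <- tempfile("uidai_demo")
dir.create(demo_dir)
write.csv(data.frame(State = "Karnataka", Aadhaar.generated = 3),
          file.path(demo_dir, "day1.csv"), row.names = FALSE)
write.csv(data.frame(State = "Kerala", Aadhaar.generated = 5),
          file.path(demo_dir, "day2.csv"), row.names = FALSE)

combined <- read_daily_files(demo_dir)
nrow(combined)  # one row per demo file
```

This only works when all files share the same columns, which is what rbind requires.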
Now, let's look at the distribution of the data for each state. To do this, we have to group the data by state. We will use the dplyr package to do this.
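The plot that follows expects a per-state summary named df_05_bystate. A minimal sketch of how such a summary can be built with dplyr, assuming the same summarise pattern used for the later groupings in this post; the toy data frame is made up so that the snippet runs on its own.

```r
library(dplyr)

# Made-up rows standing in for df_05; the column names follow the post.
toy <- data.frame(
  State = c("Kerala", "Kerala", "Goa"),
  Aadhaar.generated = c(10, 5, 4),
  Enrolment.Rejected = c(1, 0, 2)
)

# One row per state with the total generated and rejected counts.
toy_bystate <- toy %>%
  group_by(State) %>%
  summarise(AadhaarGenerated = sum(Aadhaar.generated),
            AadhaarRejected = sum(Enrolment.Rejected))

toy_bystate
```

Running the same pipeline on df_05 instead of toy would give a df_05_bystate with the columns State, AadhaarGenerated and AadhaarRejected.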
Now, let's draw a bar chart of Aadhaar generated per state.
ggplot(aes(x=State,y=AadhaarGenerated),data=df_05_bystate)+
  geom_bar(stat="identity",position="dodge")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Now, let's look at the distribution by gender. We have to group the data again, this time by state and gender.
df_05_bystate_gender <- df_05 %>%
  group_by(State,Gender) %>%
  summarise(AadhaarGenerated=sum(Aadhaar.generated),
            AadhaarRejected=sum(Enrolment.Rejected)) %>%
  mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Our data is now grouped in df_05_bystate_gender. Now let's draw the plot.
ggplot(aes(x=State,y=AadhaarGenerated,fill=Gender),data=df_05_bystate_gender)+
  geom_bar(stat="identity",position="dodge")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Now let's look at the ratio of rejections in each state. To do this, we have to create a new variable, rejection_ratio: the number of rejected enrolments divided by the total of rejected and generated.
df_05_bystate <- df_05_bystate %>%
  mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Now we have the new variable in our state-level data set. We can use this data set and plot the ratio to see its distribution across states.
ggplot(aes(x=State,y=rejection_ratio),data=df_05_bystate)+
  geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Now, let's look at the rejection ratio by gender. Look at what we discover here: the rejection ratio for transgender applicants is 100% in Uttar Pradesh. What is going on here? We already have our data grouped by state and gender, so we can use the same data set.
ggplot(aes(x=State,y=rejection_ratio,fill=Gender),data=df_05_bystate_gender)+
  geom_bar(stat="identity",position="dodge")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Since this is just a ratio, it is very important to know the underlying numbers. If we subset our data to see how many transgender cases were rejected, we see that Uttar Pradesh had only one case and it was rejected, which could have happened for several reasons. We can check the records for transgender applicants to verify this.
df_05_bystate_gender[df_05_bystate_gender$Gender=='T',]
## Source: local data frame [7 x 5]
## Groups: State [7]
##
## State Gender AadhaarGenerated AadhaarRejected rejection_ratio
## (fctr) (fctr) (int) (int) (dbl)
## 1 Karnataka T 1 0 0.0
## 2 Madhya Pradesh T 1 0 0.0
## 3 Maharashtra T 2 0 0.0
## 4 Tamil Nadu T 6 0 0.0
## 5 Telangana T 1 0 0.0
## 6 Uttar Pradesh T 0 1 1.0
## 7 West Bengal T 1 1 0.5
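To make the point about small numbers concrete, here is a minimal sketch that recomputes the ratio for the transgender rows listed above and flags groups with very few cases. The cutoff min_cases is a hypothetical value chosen for illustration and is not part of the original analysis.

```r
library(dplyr)

# Transgender rows copied from the output above.
trans <- data.frame(
  State = c("Karnataka", "Madhya Pradesh", "Maharashtra", "Tamil Nadu",
            "Telangana", "Uttar Pradesh", "West Bengal"),
  AadhaarGenerated = c(1, 1, 2, 6, 1, 0, 1),
  AadhaarRejected  = c(0, 0, 0, 0, 0, 1, 1)
)

min_cases <- 5  # hypothetical cutoff, an assumption for this sketch

trans_checked <- trans %>%
  mutate(total = AadhaarGenerated + AadhaarRejected,
         rejection_ratio = AadhaarRejected / total,
         too_few = total < min_cases)  # TRUE when the ratio rests on very few cases

trans_checked
```

With these numbers, the 100% ratio for Uttar Pradesh rests on a single case, so it is flagged along with every other state except Tamil Nadu.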
Now, let's look at the Aadhaar distribution by age. To do this, let's group our data by age.
df_05_byage <- df_05 %>%
  group_by(Age) %>%
  summarise(AadhaarGenerated=sum(Aadhaar.generated),
            AadhaarRejected=sum(Enrolment.Rejected)) %>%
  mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Our data is now grouped, so we can draw the plot. Age is numeric here, so a continuous x scale is used for the axis breaks.
ggplot(aes(x=Age,y=AadhaarGenerated),data=df_05_byage)+
  geom_line()+
  geom_point()+
  geom_smooth()+
  scale_x_continuous(breaks=seq(0,100,5))
There seems to be a strong negative correlation between age and the number of Aadhaar generated. Let's compute the correlation coefficient for these two variables.
cor.test(df_05_byage$Age,df_05_byage$AadhaarGenerated)
##
## Pearson's product-moment correlation
##
## data: df_05_byage$Age and df_05_byage$AadhaarGenerated
## t = -18.472, df = 111, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9077159 -0.8146311
## sample estimates:
## cor
## -0.8686419
The test shows a strong negative correlation (about -0.87) between age and the number of Aadhaar generated. As age increases, the number of Aadhaar generated decreases.
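The individual numbers in that output can also be pulled out of the object that cor.test returns. A minimal sketch on made-up counts (these are not the real UIDAI figures):

```r
# Made-up counts that fall as age rises, for illustration only.
age <- 0:9
generated <- c(100, 92, 85, 80, 70, 66, 50, 48, 30, 25)

res <- cor.test(age, generated)  # Pearson correlation by default

res$estimate  # correlation coefficient
res$p.value   # p-value for the null hypothesis of zero correlation
res$conf.int  # 95 percent confidence interval
```

Keep in mind that the Pearson coefficient measures linear association only, so it is worth reading it alongside the plot rather than on its own.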
Let's also look at the distribution by gender. Here, too, we can see a similar trend of more Aadhaar being generated for males than for females. However, to confirm this we would have to perform statistical tests. This exploratory data analysis only gives a good clue.
Now, let's look at the distribution by age and gender. To do this, let's group the data again, this time by age and gender.
df_05_byage_gender <- df_05 %>%
  group_by(Age,Gender) %>%
  summarise(AadhaarGenerated=sum(Aadhaar.generated),
            AadhaarRejected=sum(Enrolment.Rejected)) %>%
  mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Let's now create the plot.
ggplot(aes(x=Age,y=AadhaarGenerated,color=Gender),data=df_05_byage_gender)+
  geom_line()+
  scale_x_continuous(breaks=seq(0,100,5))
Let's add one more variable to our analysis and look at the distribution for each state.
df_05_bystate_age_gender <- df_05 %>%
  group_by(State,Age,Gender) %>%
  summarise(AadhaarGenerated=sum(Aadhaar.generated),
            AadhaarRejected=sum(Enrolment.Rejected)) %>%
  mutate(rejection_ratio=AadhaarRejected/(AadhaarRejected+AadhaarGenerated))
Let's draw the plot with one facet for each state.
ggplot(aes(x=Age,y=AadhaarGenerated,color=Gender),data=df_05_bystate_age_gender)+
  geom_line()+
  scale_x_continuous(breaks=seq(0,100,10))+
  facet_wrap(~State)+
  ylim(c(0,5000))
Notice the trend for Tamil Nadu here: there is a large difference by gender. However, we can't confirm this until we run statistical tests on the data.
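One possible way to check such a difference is a chi-squared test of independence between state and gender. The sketch below uses invented counts, and the choice of test is an assumption for illustration, not something the analysis above actually ran.

```r
# Invented counts of Aadhaar generated, for illustration only.
tab <- matrix(c(3000, 1500,
                2000, 1900),
              nrow = 2, byrow = TRUE,
              dimnames = list(State = c("Tamil Nadu", "Other"),
                              Gender = c("M", "F")))

# Null hypothesis: the gender split is the same in both rows.
res <- chisq.test(tab)

res$statistic  # chi-squared statistic
res$p.value    # a small p-value suggests the gender split differs by state
```

With the real data, the table could be built from df_05_bystate_gender instead of being typed in by hand.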