Monday, November 16, 2015

Machine Learning: Bias and Variance Trade-off


In the machine learning world, there is hardly anyone who has not heard of the bias-variance trade-off. In this post, I will briefly revisit the concept for those who are new to machine learning.
Let’s start with bias. What is bias in terms of machine learning? Bias is essentially the model’s error; in other words, bias refers to the ability of the model to fit, or approximate, the data. The higher the bias, the lower the model’s ability to approximate the data, i.e. the higher the error. So how should a perfect model behave? It should certainly have low bias, meaning low error. Sounds simple so far, hmm? OK, let me ask you another question: how low should the bias be for a model to be considered ideal? Well, there is no single answer to that question (I will explain why), and that is where I introduce the second term, variance.
Variance refers to how consistent a model’s accuracy is from data set to data set. In other words, the model should have consistent accuracy across different but similar data sets. The lower the variance, the more effective the model. So in an ideal situation the model would have both low bias and low variance, and this is where the bias-variance trade-off comes in.
Unfortunately, there is always a trade-off between bias and variance. If you try to achieve low bias on training data, you may suffer from high variance on test data; if you try to achieve low variance, it comes at the cost of higher bias. Let’s try to understand this in more detail with the help of an example.
Consider the linear regression model in the example below. This model will have low variance because it is a smoother predictor, which means it should behave consistently across different but similar data sets, since it is not trying to fit each and every training point. However, this model has high bias because it has a higher error rate.


Now consider another example, with low bias. Here the model is trying to fit each and every training data point (overfitting), so it will have low bias but higher variance. In other words, this model will behave perfectly on training data but will not predict well on test data.

On test data, it would have errors as shown below.


So the question is: what is the best state? The objective of any machine learning algorithm is to handle this trade-off so that there is neither too much bias nor too much variance. The goal is to hit the sweet spot where your model fits the data well enough to describe it, but does not overfit and increase variance.
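To make the trade-off concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data is synthetic and purely illustrative, not taken from the figures above) that fits a straight line and a high-degree polynomial to the same noisy data and compares training and test error:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=40)  # noisy "truth"

X_train, X_test = X[::2], X[1::2]   # alternate points as train / test
y_train, y_test = y[::2], y[1::2]

for degree in (1, 15):  # degree 1: high bias (underfit); degree 15: high variance (overfit)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

The straight line typically shows similar (but high) error on both sets, while the high-degree fit shows a very low training error and a much larger test error, which is exactly the high-variance behaviour described above.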

Reference:
Applied Predictive Analytics, Dean Abbott, Wiley


Monday, November 9, 2015

A/B Testing: An Introduction

AB Testing
Overview:
A/B testing, at a high level, compares two versions of a web page that differ by one change. The change could be minor, such as changing a button color, or somewhat significant, such as allowing free access to the resources of a content-based online company. In A/B testing there are always two variants, one without the change and one with the change. These two variants of the web page are called A and B, hence the name. The basic idea behind A/B testing is to show both variants of the web page to similar (but not the same) audiences and see which version gives better results.

Figure 1[i]
The expected results can vary based on the business objectives for conducting the A/B test. For example, the business objective could simply be improving the click-through probability for a certain button on the web page. In another case, the business objective could be to increase revenue by making more people buy a service or a given product.
In an A/B test, the traffic is divided into two parts, a control group and an experiment group. The control group sees the page without the change and the experiment group sees the page with the change. The choice of how much traffic is assigned to the control versus the experiment group depends on business objectives and other considerations.
Though it looks straightforward, it requires a good understanding of the mathematical and statistical concepts behind conducting an A/B test and analyzing the results in order to make a correct recommendation.


Example:


Figure 2[i]




Let’s unpack it a little more and look at the main design considerations behind conducting an A/B test.

How to Design an A/B Test?
It is very important to design the A/B test correctly in order to avoid confusion or incorrect interpretation at later stages. At a high level, the following are the main design components of an A/B test:
  • Generate a Hypothesis
  • Metric Choice
  • Sizing
  • Duration
  • Sanity Checks
  • Result Analysis
  • Making a decision or recommendation.

Let’s try to understand each of these design phases.

Generate a Hypothesis

The first step in conducting an A/B test is to understand what you want to test and formulate a hypothesis. This hypothesis is then used to decide whether the test gave the expected results or not.
For example: An online education company tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that these courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.
The hypothesis for this test is that this change might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis held true, the company could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.
Metric Choice:
The most important part of any A/B test is to identify the correct evaluation and invariant metrics.
An evaluation metric measures a quantity that is expected to change between the control and experiment groups. Invariant metrics, used for sanity checking (explained in a later section), measure quantities that are not expected to change between the control and experiment groups. For example, consider a change in the color of a "Start Now" button. The business objective is to make more users click on the Start Now button. In this case, the evaluation metric can be the click-through probability, i.e. total clicks on Start Now divided by total page views. Since, per the hypothesis, changing the color of the button should have a significant impact on users clicking it, this metric is a good way to measure the change.
On the other hand, an invariant metric in this experiment could be total page views. Since total page views are not affected by the color of the Start Now button, this metric should not change between the control and experiment groups. In other words, how many users arrive on the page (where the Start Now button is located) is not influenced by the choice of color; however, clicking on that button might be, as mentioned earlier.
Choosing the right evaluation and invariant metrics is the most crucial part of A/B testing, more crucial than the actual change itself. Depending on the business objective, there are several choices for these metrics. Let’s consider an example and see what the available choices are.
For the education-company example described in the previous section, we can choose the following metrics:
  • Invariant Metrics: Number of Cookies, Number of Clicks
  • Evaluation Metrics: Gross Conversion, Retention, Net Conversion

Let’s see why we choose these metrics out of the following candidates:

Number of Cookies:
This is the number of unique cookies to visit the course overview page.
The unit of diversion is a cookie, and the number of cookies is not going to be affected by the change the company is launching at the time of enrollment. Therefore, the number of cookies is well suited as an invariant metric.

Number of Clicks:
This is the number of unique cookies to click the "Start free trial" button.
Since the page asking for the number of hours the student can devote to the course appears only after clicking the "Start free trial" button, the course overview page remains the same for both the control and experiment groups, so the number of clicks is also a suitable invariant metric.

Number of User-ids:

As per the experiment, the new pop-up message is likely to affect the total number of user-ids who enroll in the program. For this reason, this metric cannot be used as an invariant metric, because it is likely to differ between the control and experiment groups.

Click-through Probability:
That is, the number of unique cookies to click the "Start free trial" button divided by the number of unique cookies to view the course overview page.
Since the page asking for the number of hours the student can devote to the course appears only after clicking the "Start free trial" button, the click-through probability should remain the same for both the control and experiment groups. Therefore, it can be chosen as an invariant metric.
Gross Conversion:
Number of users who enrolled in the free trial / number of users who clicked the "Start free trial" button.
After clicking the "Start free trial" button, a pop-up appears for the experiment-group users asking for the amount of time the student can devote to the course. Based on the user's choice, it then suggests whether the student should enroll in the course or continue with the free course material. In the experiment group the user can make a decision based on the pop-up message and choose to continue exploring the material only, whereas for the control group no pop-up appears and the user enrolls in the course anyway. Hence gross conversion can differ between the control and experiment groups, so it can be used as an evaluation metric.
Retention:
Number of user-ids that remain enrolled past the 14-day trial period and make their first payment / number of users who enrolled in the free trial.
Retention can also be a good evaluation metric: the retention ratio in the experiment group is expected to be higher because of lower enrollment, if the experiment's assumption holds. After seeing the message, fewer users would enroll, and retention should therefore be higher, because the message should filter out users who would otherwise leave the course frustrated. This ratio can be used as an evaluation metric because it should differ between the control and experiment groups, and for the same reason it cannot be chosen as an invariant metric.

Net Conversion:
Number of user-ids that remain enrolled past the 14-day trial period and make their first payment / number of users who clicked the "Start free trial" button.

As per the intention and assumption of the experiment, experiment-group users are made aware that the course requires a minimum number of hours each week by showing the pop-up message at the time of enrollment. This message should filter out users who cannot devote the required hours and are prone to frustration later on. This ratio should differ between control-group and experiment-group users if the experiment's assumption holds true. Hence it can be used as an evaluation metric and, for the same reason, it cannot be chosen as an invariant metric.

Gross conversion will show us whether we lower our costs by introducing the new pop-up. Net conversion will show how the change affects our revenue. After the experiment, we expect gross conversion to show a practically significant decrease, and net conversion not to show a significant decrease.
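As a quick illustration, here is a minimal sketch of how these three ratios are computed from raw counts. The counts below are made up for demonstration; they are not the experiment's actual data.

# Hypothetical counts, for illustration only
clicks      = 5000   # unique cookies that clicked "Start free trial"
enrollments = 850    # user-ids that enrolled in the free trial
payments    = 450    # user-ids still enrolled after 14 days who made their first payment

gross_conversion = enrollments / clicks    # enrollments per click
retention        = payments / enrollments  # payments per enrollment
net_conversion   = payments / clicks       # payments per click

print(f"Gross conversion: {gross_conversion:.4f}")
print(f"Retention:        {retention:.4f}")
print(f"Net conversion:   {net_conversion:.4f}")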

Number of Samples vs. Power:

The next step is to size the experiment: given a set of baseline probabilities (historical values), determine how many samples, and therefore how many days, would be required to conduct the experiment. Though there are numerous ways to calculate this based on business objectives, I have shown one example for the above scenario using an online calculator.
Using the online calculator, we calculated the number of samples required as follows:
  • Probability of enrolling, given click: 20.625% base conversion rate, 1% minimum detectable effect. Samples needed: 25,835
  • Probability of payment, given enroll: 53% base conversion rate, 1% minimum detectable effect. Samples needed: 39,115
  • Probability of payment, given click: 10.93125% base conversion rate, 0.75% minimum detectable effect. Samples needed: 27,413
However, the number of samples calculated will differ if your click-through probability is different, so these numbers need to be adjusted based on the observed probabilities.
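For reference, a rough version of the first calculation above can be reproduced in code. This is a sketch using statsmodels' normal-approximation power analysis with an assumed significance level of 0.05 and power of 0.8; it will not match the online calculator exactly, because the two use slightly different formulas, but it lands in the same ballpark.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20625           # probability of enrolling, given click
d_min = 0.01                 # minimum detectable effect
alpha, power = 0.05, 0.80    # assumed significance level and statistical power

effect = proportion_effectsize(baseline, baseline - d_min)
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                           power=power, alternative='two-sided')
print(f"Samples needed per group: ~{n_per_group:,.0f}")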

Duration vs. Exposure

Once you know how many samples are needed for the experiment and roughly how many days it will take to finish, you need to decide what percentage of traffic you want to put into the control versus the experiment group. This could vary from 50-50 to 20-80 based on risk and other considerations.
In the above example, the number of days needed to perform the experiment is:
685,325 page views / 20,000 page views per day ≈ 35 days. (These numbers come from the data collected for the experiment and are shown here for demonstration only.)
Thirty-five days is a long period of time, so we may have to rethink our decision, which is very subjective and can depend on various factors. In any given situation, this decision is not in the hands of the experiment designer alone; other groups should be involved, and the decision should be taken by mutual consent and align with the business objectives. For now, we can assume that we can run this experiment for 35 days.

Experiment Analysis                                                                                 

Sanity Checks
Once you have the results, you need to do a couple of sanity checks. What are sanity checks? Remember the invariant metrics we discussed. Sanity checks ensure that nothing went wrong in your experiment; they are a kind of safety valve in A/B testing. For example, in a given A/B test, the click counts could be wrong due to a JavaScript bug, and if this sort of thing is not caught, the results are of no use.
It is therefore necessary to do the sanity checks by analyzing the invariant metrics before jumping to the evaluation metrics. Invariant metrics, by definition, should not change between the control and experiment groups. I will not get into the mathematics of calculating the confidence intervals of the invariant metrics and checking whether the observed values are within range.
If you want to see the details, you can refer to the detailed report on my GitHub profile.
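Still, the core idea is simple enough to sketch. Assuming the traffic was meant to be split evenly, the observed fraction of an invariant metric (cookies or pageviews, say) that landed in the control group should fall inside a 95% confidence interval around 0.5; the counts below are placeholders, not the experiment's actual numbers.

import math

def sanity_check(count_control, count_experiment, expected=0.5, z=1.96):
    """Check whether the observed control-group share of an invariant metric
    lies within the confidence interval around the expected split."""
    total = count_control + count_experiment
    se = math.sqrt(expected * (1 - expected) / total)
    lower, upper = expected - z * se, expected + z * se
    observed = count_control / total
    return lower, upper, observed, lower <= observed <= upper

# Hypothetical pageview counts, for illustration only
print(sanity_check(345000, 344000))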

Result Analysis
Last but not least is the result analysis step. In this step we analyze the evaluation metrics to see whether we got the expected results. The observed results have to be both statistically and practically significant in order to be confident that the experiment had the expected outcome.
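A minimal sketch of that check for a single evaluation metric, using a pooled standard error for the difference in proportions between experiment and control; the counts and d_min below are placeholders, not the experiment's actual numbers.

import math

def analyze_metric(x_cont, n_cont, x_exp, n_exp, d_min, z=1.96):
    """95% CI for the difference (experiment - control) plus significance flags."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    lower, upper = diff - z * se, diff + z * se
    statistically_sig = lower > 0 or upper < 0            # CI excludes zero
    practically_sig = lower > d_min or upper < -d_min     # CI lies beyond the practical threshold
    return diff, (lower, upper), statistically_sig, practically_sig

# Hypothetical gross-conversion counts, for illustration only
print(analyze_metric(x_cont=3800, n_cont=17000, x_exp=3400, n_exp=17000, d_min=0.01))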


Recommendation
Based on the results, it is time to recommend whether or not to launch the change. If we are considering multiple evaluation metrics, it is very important to make sure that all of them are both statistically and practically significant. In our example:
The gross conversion evaluation metric is both statistically and practically significant, which means the change (the pop-up) has the intended impact on the experiment group, i.e. reducing the number of users who enroll after viewing the message. As expected, it should reduce the number of frustrated users who leave the course midway, and it should also allow the coaches to spend more time helping those students who are really likely to complete the course. However, the net conversion results were neither statistically nor practically significant. This means there is a risk that introducing the trial screener may lead to a decrease in revenue, and the company should consider testing other designs before deciding whether to release the feature or abandon the idea entirely.
In this case, the online education company should not launch this change, even though it could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course. This decision is in line with the hypothesis: the change might set clearer expectations for students upfront, but it can impact revenue (the net conversion metric). From the experiment results, we can see that the second part of the hypothesis, "without significantly reducing the number of students to continue past the free trial and eventually complete the course", which pertains to the net conversion metric, does not hold true. Hence, it is not recommended to take the risk and launch the change.




Note: If you want to know more about the calculations and the data used in this article, you can click here to view the complete report for the case study.




[1] https://s3.amazonaws.com/snaprojects/blog/recsysab/1_Diane_RecSys+AB+Workshop+Oct+2014+--+Shared+with+External+Organizers.pdf

Monday, November 2, 2015

Future of Moore’s, Kryder’s, and Roberts’ Laws

Computers, as originally designed, were not meant for personal computing and entertainment. In the early days of computing, computers were mainly used for complex computations by researchers and scientists, and setting up these monsters was tedious and cumbersome. A use case for home computing, or for using computers for entertainment, seemed an insane idea in those times. However, the home-computing revolution changed the industry and business expectations. Computer designs were made simple enough for the general public to embrace computers and their advantages, and computers soon became part of home computing and personal devices used for education, calculation, document keeping, and so on. The industry was witnessing a revolution before a revolution. Later, the introduction of the internet gave birth to the "dot-com boom" and changed how we perceive the world around us. Businesses embraced the internet, took advantage of online business opportunities, and changed how traditional businesses run and operate. Where the internet provided a playground of information resources, the information era gathered those resources to draw conclusions and inferences and landed us in a whole new, smarter world.
The changes we have seen so far in the industry have become possible because of three main factors, as captured by three laws:
  • Moore’s Law, according to which "the number of transistors on a chip will double about every two years".
  • Kryder’s Law, according to which there will be a "40% per year decrease in the cost performance of hard disks".
  • Roberts’ Law, according to which the cost of data transmission over a network decreases every year.

As stated in Moore’s Law, we have witnessed a significant increase in the number of transistors on a chip, and that has helped the industry store more data, build more compact storage devices, and find new use cases for technology. Today, a compact memory card or flash drive can store as much data as a whole hard disk could a decade ago.
The use of computers and storage devices has increased at an exponential rate because of one important factor: cost. The cost of storage devices, relative to their performance, has dropped to a point where industry can store almost any amount of information and data, as stated by Kryder’s Law.
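To put rough numbers on those two claims, here is a tiny back-of-the-envelope sketch; the rates are the nominal ones quoted above, not measured data.

years = 10

# Moore's Law: transistor count doubles roughly every two years
transistor_multiplier = 2 ** (years / 2)

# Kryder's Law (as quoted above): ~40% per year decrease in cost per unit of storage
storage_cost_factor = (1 - 0.40) ** years

print(f"Transistor count after {years} years: ~{transistor_multiplier:.0f}x")
print(f"Storage cost after {years} years: ~{storage_cost_factor:.1%} of the original")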
On the other hand, the information era has become possible due to the increased efficiency of networks and computers. Today we live in a world that is always connected; any piece of information is just a click away.

Where is the Industry Moving?

Increased speed, power, and efficiency of computers, networks, and storage devices have shaped every industry and the world around us, but what’s next? Where does the industry go from here if we follow the same track of progress we have been on for the last decade? What is the future of technology if we continue to see the same progress over the coming years? To answer these questions, let’s unpack the discussion a little more.
So far the computer world and the human world have remained largely separate. Technology helps in making our day-to-day decisions, but we still can’t use information and technology the way we use the rest of our physical objects; those objects are still largely disconnected from computers and networks. For instance, the candle placed in front of me as I write this is not connected to our information and computer world. What if the technology world and the physical world started talking to each other? If our physical objects such as chairs, dishwashers, refrigerators, homes, keys, doors, plants, cars, dining tables, and clothes were all connected to computers and networks, we would land in a whole new, smarter world where we could use and manage these devices efficiently.

So what is the next big thing the industry is anticipating? The "Internet of Things" is going to revolutionize the industry and make the world ever smarter. The Internet of Things is a set of connected devices that can not only talk to certain applications but also talk to each other in order to make decisions and handle day-to-day human activities more efficiently, reducing human intervention. These devices are equipped with sensors and/or actuators to send information to other devices and can process information received from other devices to make intelligent decisions. We are already paving the path toward this future world: the market has started to fill with smart things such as fitness bands, smart watches, health trackers, baby-monitoring devices, smart homes, smart cars, smart parking, smart bins, smart buildings, smart lighting, smart locks, smart doors, and many more. Billions of such devices are anticipated to be connected to the network, and as we all know, once any device is connected to a network it opens up unprecedented opportunities to utilize the information it generates. Hence, the Internet of Things is going to land us in a whole new smarter world, like the ones we have seen in sci-fi movies or even better. Big market players have already started betting on its potential and have begun the journey toward that future.

Future of Manufacturing:

Even though the application area of the Internet of Things is very broad, covering the health industry, home appliances, the energy industry, manufacturing, and more, we will pick one industry vertical to discuss how the future of the industry can be imagined even beyond the current state of the Internet of Things.
The Internet of Things holds true potential to bring a fourth industrial revolution, following the steam engine, the conveyor belt, and the first phase of IT and automation technology (Löffler 2013). Smart sensor- and actuator-based devices can not only automate much of a factory; they can also become independent processes in the manufacturing life cycle of a product. We could see another revolution in supply chain management if we can intelligently access warehouse data and manufacturing-unit data and monitor logistics to make the right business decisions at the right time. What we have talked about so far is already reality, and industry is already using these devices to manage resources efficiently. If we think beyond the current state of progress, we can imagine linking all the physical objects used in manufacturing, such as raw material and packing material, to the network; we could further reduce human intervention if these raw materials were intelligent enough to decide their final destination, location, and position in the product based on their physical characteristics. In an extreme case, these materials could be intelligent enough to know which customer order they are going to be part of and how to arrange themselves in the required fashion within the given timelines for that order. All of this seems possible looking at the current growth of the industry and how it is utilizing the potential of reduced cost, better systems, intelligence built into smaller devices, and the falling cost of data transmission.

Challenges and concerns:


As we all know, convenience does not come without a cost, so there are certainly some disadvantages to this new future. One of the major concerns we can anticipate along the journey is privacy and security. We have seen a history of data breaches during the information era, and with billions more devices coming onto the network, this situation will only worsen. Therefore, security and privacy are the major questions the industry is trying to answer. Another associated social problem is the digital divide: the gap between the people who can access information using modern resources and those who cannot. It limits one's ability to take part in new market dynamics. New technological tools give businesses new opportunities to compete in a whole different way; however, the digital divide can limit this participation.

It will be interesting to witness this journey. 




The Need for a Modern Data Architecture

Data architecture, one of the pillars of enterprise architecture, is among the most important parts of any organization's architecture. Data architecture consists of data models, sets of rules, policies to govern data, standards to store and access data, and so on[1]. From a conservative IT perspective, data architecture describes the data structures used by the business and its corresponding software. From a holistic perspective, however, data architecture provides the methods to design, develop, and implement a complete, business-driven data architecture, which includes not only standards, policies, and rules but also the mapping of real-world objects to the underlying environments in the organization. Within overall enterprise architecture, data architecture provides a blueprint to guide the implementation of a physical database; it describes the way data will be processed, stored, and used by the organization[2]. In the Zachman framework, one of the most popular enterprise architecture frameworks, the data column (the "What" column) describes the data aspect of an organization[3].
View                  | Data (WHAT)                | Stakeholder
Scope / Contextual    | Material List              | Planner
Business / Conceptual | Entity Relationship Model  | Owner
System / Logical      | Data Model Diagram         | Designer
Technology / Physical | Data Entity Specification  | Builder
Detailed              | Data Details               | Subcontractor

Figure 1


As we can see, data architecture consists of the following major steps[4]:
  • Creating a list of things and architectural standards important to the business.
  • Creating a semantic model or conceptual enterprise data model.
  • Creating an enterprise logical data model.
  • Creating a physical data model.
  • Creating the actual database.
At a high level, data architecture is the process of breaking a given subject down to its lowest level and then building the architecture back up. At each level, data architecture is viewed from one of the following aspects[5]:
  • Conceptual – includes all business entities.
  • Logical – provides the logical relationships between entities.
  • Physical – implements the conceptual and logical views in an actual database.
The following data elements must be considered while engineering the data architecture for an organization[6]:
  • An administrative structure to manage the data
  • A methodology to describe the data
  • A description of the database technology
  • A description of the processes that act on the data
  • Interfaces to other systems
  • Standards for common data operations
Modeling the “as-is” data architecture is extremely useful for gaining insight into the current situation; however, in order to continuously improve the data architecture, it is very important to have a data strategy in place that provides enough guidance to realize the “to-be” architecture. This is a very important task, which should involve both business managers and data architects, and should define[7]:
  • How the data is collected, managed, and used
  • Data models, both “as-is” and “to-be”
  • Data governance and change control processes
  • Policies for data management, such as how data is collected, managed, and stored; how long it should be stored; access rights; appropriate security measures; etc.
It is important to note that within each organization, various constraints will have an impact on the overall data architecture. These constraints include specific enterprise requirements, laws, data processing needs, business policies, etc.[8]
From its theoretical definition, data architecture may sound simple, but the reality is quite different. We are living in the information era; the exponential growth of data through advancements in technology has increased both the importance and the complexity of managing such huge volumes of data. Moreover, most of the data in an organization is either held in legacy systems without any documentation or is dispersed among various non-standard tools and applications[9], including personal worksheets, personal Microsoft Access databases, etc. Additionally, some key data resources may lie with vendors or other stakeholders, outside the organization's boundaries. This dispersed data not only lacks quality but may also contain a lot of duplication.
As per Gartner, by 2020 the number of new devices connected to the internet will exceed 26 billion[10]. This exponential growth in internet-connected devices will bring unprecedented challenges and complexity in managing the data generated by such devices. Industry has already started its journey toward the Internet of Things.[11] Traditional data warehousing solutions are being complemented by data lakes, and advanced analytics solutions are used to derive insight from the data. The task of data architecture has therefore become one of the most complicated parts of the entire architecture. Given how far the industry has come, it is very important to understand that traditional data solutions alone cannot solve current business problems. Industry cannot sustain itself on traditional data architectural frameworks or solutions anymore; it is essential to devise new solutions and address the new challenges in order to have an efficient data architecture for the modern information era. For instance, data warehousing was one of the most important efforts toward centralizing an organization's data and using it through unified processes. However, with the increase in data, data warehouses need to be complemented by data lakes[12], where data is stored in its raw format and the responsibility to clean and use the data is left to the end user. New database technologies and frameworks such as big data platforms, Hadoop, Hive, MongoDB, etc. are changing the way data is traditionally used.[13]
In the next sections, we discuss some of the issues that exist in traditional data architecture.

1.     Inefficient in keeping pace with data growth:          

In the 21st century, data is produced and consumed every day through our routine activities. Online shopping, internet browsing, credit card usage, smart phones, smart watches, activity trackers, and more all generate huge amounts of data[14]. This data varies not only in variety but also in volume, and industry expects even more growth with the realization of the Internet of Things. It is estimated that by 2020, 26 billion devices will be connected to the internet[15]. These devices will not only connect to the internet but will also communicate with each other in order to accomplish certain tasks, which adds a new dimension to data growth. Businesses are working toward effective data storage and usage strategies to deal with the data challenges of the information era. Of course, there is no point in saving data if the business can't use it to mine useful information, so businesses are continuously investing in new data analytics initiatives to reap the true potential of this data.[16]
With increasing data, the importance of an efficient data architecture has grown even more. This exponential growth requires businesses to devise new frameworks and architectures. Traditional data architecture was not meant to handle such huge data; it does not address the volume and variety of modern data. Current (traditional) data architecture provides enough groundwork to devise strategies, policies, and procedures to manage data within the organization, but the boundaries of that data are mostly limited to the organization itself. In today's world, organizational data is shared and used quite extensively outside the organization as well. The need for an efficient data architecture geared toward this situation is not limited to the volume of data; it also includes the complexity and variety of data. Data is no longer generated by a limited set of applications[17]; it comes from varying sources and differs in structure. Moreover, current data architecture is geared more toward storing and managing data than toward mining or analyzing it: it focuses on entity-relationship models, data relationships, and data models, whereas with the latest trends and technology frameworks, data is often stored in its raw form rather than in a well-defined, clean form.
It is also very important to understand that the value of data lies in the information it contains, and enough consideration should be given to how to mine and analyze the data right from the conceptual phase of the data architecture[18]. Below is an example of a new data architecture built around Hadoop, one of the leading big data frameworks. As evident from the figure, the new architecture places sufficient focus on statistical analysis and business intelligence.[19] Current data architecture is limited in its ability to deal with the 21st century's volume and variety of data[20].


Figure 2

2.     Inefficient in defining strict security principles:

The increasing number of data breaches over the last decade is evidence that the industry has not yet succeeded in incorporating sufficient security principles into data architecture. Data architecture does an outstanding job of providing guidance for data governance and data access policies[21]. However, data is growing by leaps and bounds and is no longer limited to the premises of an organization where principles and policies can be enforced to secure it. Data travels outside the organization across different networks, which adds considerable complexity and importance to defining security principles for the data[22]. The Target breach during the 2013 holiday season originated at point-of-sale systems; it made evident that data architects need to design security into every possible point of communication[23].
Even though current data architecture provides some guidance in terms of authorization and access rights, it does not address security in the required detail. The increasing importance of securing data has led to a whole separate security architecture[24], which should be used along with the data architecture process. Current data architecture fails to provide enough detail on what policies and procedures should be put in place to ensure the required security, and there is not enough guidance on categorizing an organization's data and defining security principles to control access to it. Data security covers a wide variety of tools and techniques, such as tools to ensure data compliance, security solutions such as firewalls, data encryption, physical security, and data analytics solutions for security[25].
Big data brings a whole new complexity to the security aspect of data[26]. It has proven extremely difficult to secure such huge volumes of data if proper consideration is not given to security right from the inception, and the industry faces new security challenges every day. The increasing popularity of service-oriented architecture and cloud services adds yet another perspective to data security in the data architecture: security of data has moved beyond the hands of the organization to the cloud service provider, and it is very important to capture the required security level in the service level agreement with the cloud vendor. The importance of security cannot be ignored in this information era. Current data architecture not only fails to provide enough guidance for data security as market needs change, but also fails to address data recovery in the event of a failure.
Even though data recovery is a separate issue altogether, it forms an integral part of any data architecture when devising data security principles and policies. Due to the increasing dependence on data, businesses can't afford to lose any of it, yet current data architecture does not provide enough guidance to recover data in the event of a security mishap.

3.     No guidance for real time analytics:

The value of data lies in the information it contains. In today's world, data is worth little if it is not being used to derive useful information, and data analytics has become integral to the organization. With the changing dimensions of data, the importance of data analytics cannot be ignored in the data architecture. The increasing popularity of data analytics tools and the potential of data science have led the industry to invest heavily in analytics. However, in order to reap the true potential of the data, it is necessary to consider data analytics right from the conceptual phase of data architecture. Data analytics is not limited to inferential analytics; it also provides tools for predictive modeling to help businesses make useful decisions[27]. The exponential growth in data in recent years has led to a whole new perspective on storing, managing, mining, and using data to draw conclusions. Traditional database systems can neither support this huge volume of data nor provide the right analytical tools to deal with it. Future data architecture needs to treat data analytics as an integral part of the architecture. Below is a pictorial representation of current data architecture[28]. As we can see, it does not provide direct support for emerging new data types or for the volume of data, which is expected to grow to 40 ZB by 2020[29]. This huge and varied data cannot drive analytics if the required consideration is not given right from the conceptual phase. Even though current data architecture provides some guidance for envisioning data analytics and business applications, traditional solutions do not fit the modern situation.

Figure 3

Below is one example of a new data architecture with a focus on varying data types and volumes of data.[30]

                                                Figure 4

Current data architecture fails to address this new dimension of the changing market. It not only fails to give enough consideration to the huge volume and variety of data, but also fails to address how to process data that is generated at very high speed[31]; there is no guidance for real-time data processing.
Current data architecture does not provide enough guidance for data analytics, and it becomes very difficult to retrofit data analytics initiatives into an existing architecture. This issue must therefore be addressed during the design of the data architecture.

4.     Change management:

Current data architecture does not provide enough guidance for managing changing data needs within an organization[32]. It is focused more on selecting a given database technology or solution and using that to devise the architecture[33]; however, with changing market and data needs, it has become very important that the data architecture be flexible enough to accommodate new technologies and solutions. It should be adaptable to future changes in technology and solutions, but current data architecture is neither flexible nor adaptable enough to accommodate change. The whole baseline of choosing one particular data solution or technology needs to be replaced with an architecture that can support varying data structures and types and can provide plug-and-play solutions without worrying about the underlying environment. For instance, when NoSQL arose in the market, the data architecture should have been able to accommodate it. (8-Steps-to-Building-a-Modern-Data-Architecture-101417.aspx)
Current data architecture is not only inflexible at a high level in accommodating new data solutions, but also inflexible at a low level in accommodating new data types and structures. In today's world, data is generated from a varying number of sources and its structure varies widely. Each day, more such devices and sources come to market; under this situation, a business can't afford to work with any single data type or structure. The success of a good architecture lies in the collaboration of the business and technology departments toward a common purpose: information has value only when it meets business needs, and it is very important to understand the underlying data needs of the business when devising the data architecture. In current data architecture, there is no focus on data integration architecture[34]. An efficient data integration architecture makes change management a simple task, adds flexibility to the overall architecture, improves reusability and consistency, and reduces the number of interfaces, thereby reducing complexity.[35]
Change management is not limited to incorporating a change in the system; at a broad level it includes other important aspects of data management such as defining the lifetime of the data, its volatility, its reusability, and its CRUD (create, read, update, delete) cycle[36]. The data architecture, from its foundation, needs to consider this fundamental issue of data variety and provide sufficient tools to accommodate these data types with minimal impact on the overall architecture. There is only one way to meet this future need of the market: making the data architecture flexible to change.

5.     Incapability to offer Data as a Service:

Current data architecture provides no guidance on accessing data across the existing range of databases within an organization. Data is often spread across different databases and legacy systems, and it is often challenging to pull data from these dispersed systems (McKendrick, 2015). The increasing dependence on data within any organization makes it very important that data is treated as a service and that the required tools are provided to access it as easily as possible. The growing importance of data in business decision making and the increasing popularity of data analytics platforms make it essential that the required design considerations are taken into account while designing the data architecture for an organization[37].
Gone are the days when data was accessed from a limited number of devices. Nowadays data is accessed from a variety of devices such as mobile phones, tablets, laptops, smart watches, computers, and virtualized networks. It is therefore very important to provide access through a virtualized data access layer[38]. Providing such a layer not only offers a unified access method to data but also helps standardize data management tools, platforms, and applications across the organization. It is equally important to provide data services as reusable components that can be integrated with applications. Data as a service (DaaS) provides the following benefits in terms of data management (http://www.dataversity.net/data-as-a-service-101-the-basics-and-why-they-matter/, 2013):
  • Agility
  • Quality of data
  • Cost effectiveness
According to Gartner, DaaS is going to act as a launchpad for business intelligence and big data analytics; the market for BI and big data analytics is expected to reach $17.1 billion by 2016 (Gartner, 2013).
The high importance of offering data as a service makes it essential that this aspect is taken into account right from the conceptual phase of data architecture. Current data architecture fails to address this fundamental issue and thus fails to cope with changing market dynamics. The inability of the current architecture to treat data as a service not only makes it difficult to manage the overall data, but also proves incapable of providing a unified access method across the different business intelligence tools and applications within an organization.
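As a purely hypothetical illustration of the idea, a data service can be as simple as a thin, read-only HTTP layer over whatever store actually holds the data, so that consumers never need to know where or how it is physically kept. The sketch below uses Flask; the endpoint names and the in-memory "store" are assumptions for demonstration, not part of any real system.

from flask import Flask, abort, jsonify

app = Flask(__name__)

# Stand-in for whatever backend actually holds the data (warehouse, data lake, ...)
CUSTOMER_STORE = {
    "c-1001": {"name": "Example Corp", "segment": "enterprise"},
    "c-1002": {"name": "Sample LLC", "segment": "smb"},
}

@app.route("/api/v1/customers/<customer_id>", methods=["GET"])
def get_customer(customer_id):
    """Return one customer record, hiding how and where it is stored."""
    record = CUSTOMER_STORE.get(customer_id)
    if record is None:
        abort(404)
    return jsonify(record)

if __name__ == "__main__":
    app.run(port=5000)

A consumer (a BI tool, a mobile app, another department's application) would then call the same versioned endpoint regardless of the underlying storage technology, which is the reusable-component quality described above.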

6.     No Guidance to promote Self Service Environments

In today's world, data is ubiquitous, and so are the methods to process it and refine information from it. The same data set can contain numerous kinds of information, and it is up to the user how to process the data set to fetch what is required. For example, a business user and a software developer can have different requirements of the same data set; what information is needed depends on the underlying problem and task. With the huge amount of data being generated each second in varying forms, it is not a good idea to build the foundation of a data architecture on a limited set of data structures and models. Data architecture needs to consider the raw form of the data, and the end users should be responsible for processing and mining the data the way they want[39]. The concept of the data lake in big data is based on raw data, whereas the data warehouse is a cleaner but more limited and abstract way of storing data; in today's world the traditional data warehouse alone cannot fulfill changing business needs.
Traditional data architecture solutions are not designed to provide guidance for self-service environments (8-Steps-to-Building-a-Modern-Data-Architecture-101417.aspx, 2015). With the increasing demand for real-time data analysis, interoperable data interfaces need to be developed and deployed for self service by different departments inside as well as outside the organization, for easy access by all stakeholders. Gartner defines self-service business intelligence as "end users designing and deploying their own reports and analyses within an approved and supported architecture and tools portfolio."[40] Self-service environments not only augment existing BI environments but also help business users, often called power users, become producers of information; this information can then be consumed by different departments if required (COATES, 2013). Current data architecture fails to address changing business needs in terms of self-service environments and solutions; it is based on a one-size-fits-all assumption, and this approach can lead to various business problems in today's world.[41] Below is an example of a self-service environment.

Figure 5

In order to develop the right solution, it is essential that due importance is given to making data as easily accessible as possible, with minimal dependency on other stakeholders for access. (Recipe-for-self-service-BI-calls-for-flexibility-governance-user-aid)

7.     Data Redundancy

Data redundancy, as the name suggests, is the issue of duplication of data within an organization (redundant-data.html, 2009): the same data set is stored in multiple systems, resources, or applications. This data may have different structures depending on where it is used and processed, but at a low level it is essentially the same data. It not only costs storage space but also leads to redundant effort to reconcile the information produced by these different systems[42], as well as data synchronization issues. By using the current data architecture, data redundancy can be avoided to an extent: data models and entity-relationship diagrams help produce a non-redundant data architecture, but they provide no guidance on how redundancy can be avoided with varying data types and structures, which is the problem of the modern era. Current data architecture does not provide enough guidance to deal with the three basic characteristics of growing data (BIG DATA, BIG DEMANDS):
  • It is voluminous
  • It is highly unstructured
  • It is constantly changing
With the increasing volume and variety of data, it has become very important to be able to categorize the data to avoid redundancy. It is also important to note that the issues of redundancy and data variety need to be handled simultaneously without confusing one for the other: it is essential to devise a unified method to deal with varying data sources to avoid redundancy, but that does not mean varying data sources necessarily produce redundant data[43]. The data strategy should be able to identify and deal with such situations. Current data architecture does not provide any direct guidance for dealing with varying data sources and volumes, and with changing data demands, the issue of data redundancy needs to be scrutinized in much more detail than current data architecture provides.

8.     Complex, lengthy and inflexible process:

As per the Zachman framework, the implementation of data architecture spans multiple levels: scope, business, system, technology, and detailed[44]. The overall data architecture requires a long-term data strategy to be defined at the early stages, which becomes extremely difficult for mid-size and small companies; in fact, even for big organizations it is increasingly difficult to define long-term goals and objectives. Additionally, the inability of data architecture to absorb change makes it a complex, lengthy, and inflexible process: any change in the systems is coupled with lengthy reiteration of the documentation and remodeling of the various artifacts. With frequent changes in the system, this can become a never-ending, documentation-oriented process.
The inability of current data architecture to deal with varying data sources and structures is one of the underlying issues related to change[45]. Given that the current architecture does not directly support varying data sources and structures, change becomes inevitable, and the provided set of principles and guidelines is not sufficient to meet changing market needs and the increasing dependence on data. In order to align with the future goals of the enterprise, it is very important that the issues of complexity and inflexibility be scrutinized right from the most fundamental level of the overall architecture[46]. Adoption of enterprise architecture, or of data architecture in particular, cannot increase if this fundamental issue of complexity is not addressed at its core.

Conclusion:

Data architecture is the most important pillar in the overall enterprise architecture. It has many uses:
  • It acts as a key artifact for devising a data governance strategy.
  • It guides cross-system developments such as data warehousing solutions.
  • It helps in providing insight and an end-to-end view of organization data.
Enterprise data architecture is essentially a collection of blueprints to align IT initiatives and information resources with the overall business strategy within an organization.
The increasing dependence on IT functions a few years ago led enterprises to adopt large-scale systems, enterprise data warehouses, to manage and process data. Today, nearly every organization has an EDW to serve various data needs across the enterprise. However, in recent years the introduction of new data types and the sheer volume of data have put enormous pressure on the data warehouse, and it has become very important to store and process this growing data efficiently with a modern data architecture. Even though current data architecture does an outstanding job of providing a solid framework for any organization to deal with its data needs, it presents some serious limitations given current market conditions and fails to directly support growing data needs. The basic idea behind traditional data architecture is based on relational database systems, and the steps for developing data models and entity-relationship diagrams are centered on relational databases; however, relational database systems are not scalable enough to support the exponential growth of data, and organizations are choosing to shift to new database technologies and frameworks such as Hadoop, NoSQL, MongoDB, Hive, etc. These new technologies and frameworks are not based on the relational model but on storing data in its raw format in order to accommodate the growing volume and the new data types. Traditional data architecture has provided a firm foundation for enterprises in past years, but with changing data needs it has become important to make the required changes and provide new capabilities to deal with the modern situation. Data architects are therefore facing new challenges of data quality, data governance, end-to-end views of the organization, and big data.
The issue of change needs to be addressed at its core. With the current data architecture, changes in the systems lead to rework in developing the artifacts, data models, and entity-relationship models. The data architecture needs to be flexible enough to support change without a lot of rework and effort, and this fundamental issue needs to be considered right from the conceptual phase. In order to be adaptable to the future, data architecture needs to emphasize the data value chain: discovery of data, processing of data, analysis of data, and integration of and access to data. The data value chain in an organization needs to be re-analyzed in order to provide the required flexibility in the architecture; it not only helps in providing insight for future capabilities but also helps drive technology choices. For example, a requirement for real-time data analytics not only imposes certain performance requirements on data processing but also dictates the technology solutions for presenting the results and any considerations related to service agreements. By its nature, enterprise data architecture is an iterative and continuous process, but the issue of change management needs to be addressed explicitly so that the architecture is adaptive to changes with minimal effort and impact.
Additionally, in the current era it is not sufficient to only support existing business processes; it has become extremely important to innovate and iterate to provide future capabilities. An efficient future data architecture needs to:
  • Support current systems.
  • Provide capabilities for easy access and use of data.
  • Support future changes, technologies, and frameworks.
  • Provide a flexible implementation plan to migrate from the current data architecture.
  • Provide guidance for selecting the required technology.

Also, the issues of data management, data security, self-service environments, data as a service, real-time data analytics, change management, and complexity need to be worked on in order to develop a modern data architecture. Traditional data architecture lays down a good foundation for a modern architecture, so there is no need to reinvent the wheel; the current data architecture needs to be improved to support growing data needs and changing market dynamics.
On a separate note, the increasing popularity of new database technologies and frameworks has added a new dimension to the skill sets the market requires. The future data architecture will require new skills and capabilities to manage it. Practitioners will have to either extend their existing skill set to stay relevant in this changing market or learn tools such as Pig, Hive, Python, R, and SAS to help the business reach its objectives. With the increasing dependence on data, a new job title, "data scientist," has emerged in recent years. Although a data scientist does not play a significant role in the data architecture process itself, he or she is responsible for reaping the benefits of a successful architecture by drawing insights from the organization's data. A data scientist needs appropriate knowledge of the underlying business domain, statistical concepts, forecasting models, and programming languages, that is, a blend of computer science, statistics, and mathematics. The importance of data-related initiatives and of this new architectural trend can be gauged from the anticipation that data scientist will be among the most in-demand job titles by 2020.
To summarize, data architecture is evolving to keep pace with changing demands and technologies. Because so much of the new data being generated is both voluminous and unstructured, data architecture across the industry is going through a revolutionary stage. Companies have already started evolving their existing data environments to handle high-volume, unstructured data and to add analytical capabilities; a significant percentage have begun adopting a new data architecture, and many more are preparing to start. The current data architecture still provides a solid foundation from which an organization can evolve from its "as-is" state to a "to-be" state. Most early adopters of the new data architecture are building an abstraction layer over their data so that end users do not need to know many details to access the piece of data they require. On the technology front, there is a clear shift toward open-source tools, languages, and frameworks; current proprietary products will continue to evolve, but demand for open-source tools and packages in the new data architecture will keep growing. Many such tools, frameworks, and languages, such as Hadoop, Pig, Hive, R, Python, and MongoDB, have already gained wide popularity and adoption in the market.
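As a rough illustration of such an abstraction layer, the sketch below exposes logical dataset names and hides whether the data physically lives in a flat-file extract or a MongoDB collection. The dataset names, file path, and connection string are all hypothetical.

```python
# A toy sketch of a data abstraction layer, assuming pandas and pymongo are
# installed; the caller asks for a named dataset and never learns where it lives.
import pandas as pd
from pymongo import MongoClient

def get_dataset(name: str) -> pd.DataFrame:
    """Return a DataFrame for a logical dataset name, hiding the physical source."""
    if name == "sales":
        # Structured extract kept in a (hypothetical) flat file.
        return pd.read_csv("/data/sales.csv")
    if name == "web_events":
        # Semi-structured documents kept in a (hypothetical) MongoDB collection.
        client = MongoClient("mongodb://localhost:27017/")
        docs = client["analytics"]["web_events"].find({}, {"_id": 0})
        return pd.DataFrame(list(docs))
    raise KeyError(f"Unknown dataset: {name}")

# The end user works with one simple call, regardless of where the data lives.
sales = get_dataset("sales")
```

The point of this design is that the storage technology underneath can change without affecting how end users request their data.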
Needless to say, the new data architecture will have disadvantages and shortcomings of its own as market demands keep changing, but it should provide a good foundation for an organization's data needs for a long time to come.

Recommendations:

• Data architecture needs to handle voluminous and versatile data by incorporating big data frameworks and tools.
• Data architecture should provide a separate perspective (a dedicated row, in terms of existing enterprise architecture frameworks) for security, to deal with the security issues of the modern era.
• Data architecture should incorporate data analytics and business intelligence from the lowest, most fundamental block of the architecture.
• Data architecture should provide flexibility and adaptability to varying data types and structures by incorporating a general, flexible, plug-and-play style of framework.
• Data architecture needs to be more adaptive to change, without lengthy processes and redevelopment of artifacts.
• Data architecture should address data redundancy, anticipating a future with many and varied data sources.
• Data architecture should encourage self-service environments by offering data as a service across the organization.
• Data architecture should clearly depict the future state of the data architecture.
• The data architecture process should bring business and technology people together to identify current and future types of data.
• Data architecture needs to be designed to encourage change.
• Data architecture needs to consider the real-time processing needs of the modern era.
• Data architecture should support data as a service; a minimal sketch of such a service follows this list.
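Purely as a toy sketch of what "data as a service" can look like in practice, the example below exposes one invented dataset over a small HTTP endpoint so consumers can pull it on demand instead of receiving file extracts; the dataset, route, and port are hypothetical.

```python
# A bare-bones data-as-a-service sketch, assuming Flask is installed.
from flask import Flask, jsonify

app = Flask(__name__)

# In a real service this would come from a governed, curated data store.
CUSTOMERS = [
    {"id": 1, "segment": "retail"},
    {"id": 2, "segment": "wholesale"},
]

@app.route("/datasets/customers")
def customers():
    # Serve the dataset as JSON records to any consumer that calls the endpoint.
    return jsonify(CUSTOMERS)

if __name__ == "__main__":
    app.run(port=5000)
```

In a real deployment the endpoint would sit behind authentication and access controls, consistent with the security and governance recommendations above.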

