Wednesday, March 16, 2016

Machine Learning - A Getting Started Guide

Abstract

Today, one of the most rapidly growing fields is machine learning, a mix of computer science and statistics. Increased computational power and access to massive data sets have allowed applications of machine learning to grow by leaps and bounds, and they can now be found in virtually every major industry or business vertical. In simple terms, machine learning employs statistical models that learn from underlying data and derive useful insights without being explicitly told where to find them[i]. These algorithms are not only good at learning from the given data but are also good at generalizing to new or unseen data. This paper discusses the various aspects of machine learning: it provides a background on what machine learning is, followed by the challenges involved in a machine learning project.




[i] http://www.sas.com/en_us/insights/analytics/machine-learning.html

General Overview

Thesis Statement

This paper provides an introduction to Machine learning techniques and key considerations for any machine learning project.

Context/Framework

In this information era, increasing computational power and the ability to generate and store massive data sets have accelerated the necessity and growth of analytics applications in every business. Machine learning techniques can not only be used to solve the business problem at hand but can also be leveraged to uncover hidden patterns. However, machine learning at its core requires a deep understanding of quantitative methods and programming techniques to reap its full potential.

Justification/Argument

            The increasing dependence on machine learning techniques in business has made it important to have a firm understanding of machine learning and its potential. We have witnessed a convergence of software engineering and statistical methods in machine learning, and hardly any business remains untouched by machine learning and business analytics. It is therefore increasingly important to understand machine learning and what it can do.

Literature Review

            The research materials used for this paper include:
1. Blog posts from acknowledged industry leaders in the field of machine learning
2. Online articles from companies working in the field of machine learning and analytics
3. Online articles from machine learning and analytics publications / magazines
4. Press releases / announcements from companies
5. Machine learning training web sites
6. Standards sites (e.g. for the PMML standard)
7. Information from analytics companies
8. Articles from print journals and magazines

Background:

Machine learning is a subfield of computer science that evolved from pattern recognition and computational learning theory, and it overlaps heavily with statistics and quantitative methods. It is essentially a branch of artificial intelligence that allows computers to make decisions without being explicitly programmed. Machine learning provides a set of algorithms that can be used to learn from, predict, and classify data. The goal is to devise models and algorithms that learn with little or no human intervention. These algorithms iteratively learn from the data and allow computers to find hidden insights without being explicitly told where to find that information. The computational power of the modern era and the ability to store and process huge amounts of data have acted as catalysts in the growth of machine learning and analytics applications. Machine learning applications are now commonplace in fraud detection, web search results, real-time ads on web pages and mobile devices, text-based sentiment analysis, credit scoring and next-best offers, prediction of equipment failures, new pricing models, network intrusion detection, pattern and image recognition, and email spam filtering. There is hardly any industry vertical that has remained untouched by the influence of machine learning.
At a high level, machine learning can be divided into four major categories:
· Supervised learning
· Unsupervised learning
· Semi-supervised learning
· Reinforcement learning

Supervised learning:

In supervised learning there is always a target variable, a column that represents the values to predict from the other columns in the data. This target variable generally represents the question the business is trying to answer. If the target variable is continuous, regression techniques are used; if the target variable is categorical, classification techniques are used. Supervised learning algorithms are trained on past data that contains a set of input variables and the corresponding value of the target variable; models built on this data generalize from its characteristics so that predictions can be made on unseen data. For instance, supervised learning techniques can be used to detect fraudulent transactions by learning from the historical transactions of a credit card holder. Predicting the behavior of future events based on learning from past events is the core capability of supervised learning methods. Supervised learning methods can further be classified into two groups:
· Regression (continuous target variable)
· Classification (categorical target variable)
At a high level, the choice of a supervised learning algorithm depends on whether the target variable is continuous or categorical and on the size and nature of the data.
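To make this concrete, here is a minimal sketch of supervised classification in Python, assuming scikit-learn and its bundled breast cancer data set are available (neither is prescribed by this paper); a model is trained on labeled data and then scored on unseen records:

```python
# A minimal supervised learning sketch (assumes scikit-learn is installed).
# A classifier is trained on labeled historical data and then used to
# predict the target for unseen records.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)    # input variables and target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000)     # classification: categorical target
model.fit(X_train, y_train)                   # learn from past (labeled) data
print("Accuracy on unseen data:", model.score(X_test, y_test))
```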

Unsupervised learning:

Unsupervised learning, also called descriptive modeling, has no target variable. In unsupervised learning, inputs are analyzed and clustered based on common characteristics. In simple words, there is no historical data with labels, and the objective is to discover the hidden structures within the data. The most common application of unsupervised learning algorithms is in marketing, where customers are divided into different segments based on common characteristics and these segments are then used strategically for email or marketing campaigns. Unsupervised learning algorithms such as PCA (principal component analysis) can also be used for feature selection and dimensionality reduction. Some of the most commonly used unsupervised learning methods are clustering techniques such as k-means and hierarchical clustering, and dimensionality reduction techniques such as PCA.
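As an illustration, the following is a minimal clustering sketch in Python, assuming scikit-learn and numpy are available; the customer data and its segment structure are synthetic, invented for the example:

```python
# A minimal unsupervised learning sketch (assumes scikit-learn is installed).
# No target variable is given; k-means groups the records into segments
# based on common characteristics, as in customer segmentation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[200, 5], [800, 20], [1500, 2]])   # hypothetical spend / visit profiles
customers = np.vstack([rng.normal(c, [50, 2], size=(100, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)       # cluster label for each customer
print("Segment sizes:", np.bincount(segments))
print("Segment centers:\n", kmeans.cluster_centers_)
```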
 

Semi supervised Learning:

Semi-supervised learning algorithms deal with acquiring knowledge in the presence of both labeled and unlabeled data. In situations where labels are hard to obtain and unlabeled data is plentiful, semi-supervised learning algorithms can significantly improve the accuracy of models. They are used where labeled data is scarce or very difficult and expensive to obtain.
Semi-supervised learning finds a better classifier from both labeled and unlabeled data by identifying, rather than specifying, the relationship between the labeled and unlabeled data. Some of the commonly used semi-supervised learning algorithms are:

· Self-training
· Mixture models
· Graph-based methods
· Co-training
· Multi-view learning

Reinforcement learning:

Reinforcement learning algorithms are a special kind of algorithm in which the learner is not told what actions to take but is instead rewarded for taking the right actions. Each wrong action results in negative feedback and each right action in positive feedback, so the learner improves much as humans do by learning from past actions. Trial-and-error search and delayed reward are the main characteristics of reinforcement learning algorithms.[1] Unlike supervised learning, the learner is not given labeled examples of correct behavior. Reinforcement learning is generally preferred over supervised learning in situations where it is impractical to obtain examples of the desired behavior that are both accurate and representative of all possible situations. The key feature of reinforcement learning is the ability to make decisions under uncertainty. However, reinforcement learning generally involves a trade-off between exploration and exploitation: the learner needs to exploit what it already knows to obtain rewards, but at the same time it needs to explore so that it can make better selections in the future.
Some of the examples of reinforcement learning are:
  • In a game of chess, a master player makes a move. The choice is informed both by planning--anticipating possible replies and counter-replies--and by immediate, intuitive judgments of the desirability of particular positions and moves[1].
  • A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.
Next, we will discuss some of the challenges and key issues associated with the implementation of machine learning algorithms.



[1] https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node8.html


Analysis

Defining the problem:

The first step towards solving any business problem with machine learning techniques is to gain an understanding of the business and the problem. Even the most powerful machine learning algorithms become meaningless if the underlying problem is not well defined and articulated. This is the most important part of any machine learning project, and it helps to identify whether machine learning techniques are useful for solving the problem at all. The process of business understanding and problem definition is subjective and depends on the individual approach, but in general it consists of:
· Defining the problem
· Why the problem needs to be solved
· How to solve the problem

Defining the problem:

Informal description:

At a high level, it can be very useful to give an informal description of the problem. This could be just a couple of sentences that act as a high-level understanding and starting point. For example: I need to design a voice assistant program that can understand emotions, or I need a program that will identify which tweets will be retweeted[i]. Gaining a high-level understanding of the problem is as important as the low-level details.

Formal description:

The next step is to define the problem in more detail, formally. This step gives you an opportunity to think about how the problem would fit into a machine learning paradigm (Brownlee, How to Define Your Machine Learning Problem, 2013). In the language of machine learning, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E (Brownlee, 2013). In this step, we can use T, P, and E to formalize our problem.
Example:
· Task: Classify all tweets that have not yet been classified as likely to be retweeted.
· Experience: A data set for a given account that contains data points for tweets and retweets.
· Performance: The number of tweets correctly classified as ones that will be retweeted.

Assumptions:

The next step is to list the assumptions that are key to the underlying problem. These assumptions may be business rules or prior information. Even though assumptions are sometimes taken for granted, it is very important to question them for any machine learning problem, because doing so often leads to a better understanding of the problem and identifies parameters that need to be customized. For example:
· The type and tone of the words used in a tweet are important.
· The specific user who retweets may not be important.
· The total number of retweets might help the model identify strong patterns.
· Recent tweets might be more informative than older tweets.
It can also be very useful to identify references to similar problems. These can save a lot of time and provide lessons from problems others have already faced.

Why problem needs to be solved?

The next step is to understand the compelling business reasons why the given problem needs to be solved. In other words, what is the motivation behind solving the problem, how would the business benefit from the solution, and how will the solution be used?
Understanding the benefits or improvements the business would gain allows one to focus more closely on achieving the objectives. It is also very important to understand how the solution will be used. Sometimes the focus is on high accuracy rather than simplicity; in other situations it is more important for the solution to be interpretable, even if it is not highly accurate.

How to solve the problem?


Conceptualizing and documenting the various steps to solve the problem is essential for foreseeing the challenges involved in data collection and data preparation. In this step, a list of all required steps is generated, which helps in understanding how the system will be designed and what the dependencies in the project are. Additionally, the process and the various experiments are listed and explained in detail. Sometimes prototyping and manual solutions are also used to explore the best solution. This process often uncovers hidden requirements and complexities.

Data understanding:

Once you have defined the problem, the next step is to understand the available data. Data understanding allows you to get familiar with the data and build a good sense of its distribution. Data exploration is the term used in the machine learning paradigm for exploring and understanding the data. The quality of the output generated by any machine learning algorithm depends directly on the quality of the input variables. If the input variables have issues such as many missing values, invalid categories, magnitude issues, or outliers, the output of the machine learning algorithm can be highly deceptive. In the data exploration stage, the data is analyzed to find any such underlying issues, which can then be fixed in the data preparation stage. At a high level, data exploration can be divided into three major categories (Ray, 2016):
· Univariate analysis
· Bivariate analysis
· Multivariate analysis


Before choosing the right set of methods or techniques, it is essential to determine the type of variable we are trying to analyze. Input variables and target variables fall under one of the following categories. (types-of-variables-87-4406, 2013)


[i] http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Continuous variables: Continuous variables are numeric variables that can take any value within a range of real numbers. Examples: height, time, age, and temperature.
Discrete variables: Discrete variables are numeric variables that cannot take a fractional value between one value and the next closest value. Examples: number of registered cars, number of children in a family.[i]
Categorical variables: Categorical variables are qualitative variables and represent non-numeric values. They can further be divided into ordinal and nominal variables. Ordinal variables are categorical variables that represent ordered values, such as grades. Nominal variables, on the other hand, do not represent any order within their values, such as gender, eye color, or religion. (Vistasc - Analytical journey starts Here, 2016)

Univariate Analysis:

In univariate analysis, each variable is analyzed one at a time. Depending on the variable type (categorical or continuous), different techniques are used to analyze the data.
Continuous variables: If the variable is continuous, it is important to analyze the central tendency and spread of the data using descriptive statistics and visualization methods.



Categorical variables: For categorical variables, frequency tables are used to understand the distribution of data among the categories. Bar charts are also used to visualize the distribution of data in each category.
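A minimal univariate analysis sketch in Python, assuming pandas is available; the columns and values are hypothetical:

```python
# A minimal univariate analysis sketch (assumes pandas is installed).
# Continuous variables: central tendency and spread; categorical variables:
# frequency tables. The column names here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, 31, 52, 46, 29, 41, 38],
    "gender": ["M", "F", "F", "M", "M", "F", "M", "F"],
})

print(df["age"].describe())                       # count, mean, std, min, quartiles, max
print(df["age"].skew())                           # a quick check of skewness
print(df["gender"].value_counts())                # frequency table
print(df["gender"].value_counts(normalize=True))  # share of each category
```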

Bivariate Analysis: 

In bivariate analysis, the relationship between two variables is analyzed. The objective of this analysis is to find any association between the variables. Different techniques are used depending on the types of variables being analyzed, i.e., whether a continuous vs. continuous, continuous vs. categorical, or categorical vs. categorical pair is under examination. (Afsar, 2012)

Continuous vs. continuous variable: Scatter plots are used to analyze the relationship between two continuous variables.



[i] http://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

Correlation is defined by the formula Correlation(X, Y) = Covariance(X, Y) / sqrt(Var(X) * Var(Y)). The value of correlation varies between -1 and 1. The closer the value is to 1, the stronger the positive correlation between the variables; the closer the value is to -1, the stronger the negative correlation. A value close to 0 means there is no linear correlation between the variables. (http://src-bd.weebly.com/, 2014)
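The following sketch, assuming numpy and pandas are available, computes the correlation both from the formula above and with the built-in Pearson correlation on synthetic data:

```python
# A minimal sketch of bivariate analysis for two continuous variables
# (assumes pandas/numpy). Correlation = Cov(X, Y) / sqrt(Var(X) * Var(Y)).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)    # y is positively related to x

manual = np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
builtin = pd.Series(x).corr(pd.Series(y))        # Pearson correlation
print(round(manual, 3), round(builtin, 3))       # the two values agree
```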

Categorical vs. categorical variables:

In order to analyze the relationship between two categorical variables, the following techniques can be used.
· Two-way table: By creating two-way tables of counts and count percentages, we can analyze the relationship between two categorical variables. This is a good way to understand the distribution of data across each pair of categories.
· Stacked column chart: This method visualizes the two-way table as a stacked column chart. It provides the same information but in a more visual manner.
· Chi-square test: The chi-square test helps in identifying whether there is an association between two categorical variables, i.e., whether the evidence present in the sample is significant enough to generalize to the entire population. It is given by the formula chi-square = sum over all cells of (observed count - expected count)^2 / expected count.
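For example, a minimal chi-square test sketch, assuming scipy is available; the two-way table is hypothetical:

```python
# A minimal sketch of a chi-square test of independence for two categorical
# variables (assumes scipy). The two-way table below is hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

# rows: gender (M, F); columns: responded to campaign (yes, no)
two_way_table = np.array([[30, 70],
                          [45, 55]])

chi2, p_value, dof, expected = chi2_contingency(two_way_table)
print("chi-square:", round(chi2, 3), "p-value:", round(p_value, 3))
# A small p-value (e.g. < 0.05) suggests the two variables are associated.
```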




Categorical vs. Continuous variables:

In order to analyze the relationship between a continuous and a categorical variable, we can use visualization techniques such as box plots, with the category levels on the x-axis and the values on the y-axis. In order to test whether the difference between the given groups is statistically significant, we use the following statistical tests. (Vistasc - Analytical journey starts Here, 2016)
Z-Test/ T-Test:
This test helps in identifying whether the difference in the means of two groups (or categories) is statistically significant. For two groups it is given by the formula z = (mean1 - mean2) / sqrt(sigma1^2/n1 + sigma2^2/n2). (http://src-bd.weebly.com/, 2014)
The t-test is very similar to the z-test but is generally used when the population parameters are unknown or the number of observations in the sample is less than 30.
ANOVA:
If we want to verify the statistical significance of differences among more than two groups, we use the ANOVA test. (MANOVAnewest.pdf, 2014)
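A minimal sketch of both tests, assuming scipy is available and using synthetic groups:

```python
# A minimal sketch of a t-test (two groups) and one-way ANOVA (three groups)
# for a continuous variable across categories (assumes scipy). Data is synthetic.
import numpy as np
from scipy.stats import ttest_ind, f_oneway

rng = np.random.default_rng(2)
group_a = rng.normal(50, 5, size=40)
group_b = rng.normal(55, 5, size=40)
group_c = rng.normal(53, 5, size=40)

t_stat, p_two_groups = ttest_ind(group_a, group_b)
f_stat, p_three_groups = f_oneway(group_a, group_b, group_c)
print("t-test p-value:", round(p_two_groups, 4))
print("ANOVA p-value:", round(p_three_groups, 4))
```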

Multivariate Analysis:

In multivariate analysis, the relationship between more than two variables is examined. The following techniques are available for multivariate analysis (Afsar, 2012):
Multiple regression: Often referred to simply as regression, this technique helps in identifying the effect of each variable in the equation on the target variable. Regression estimates the coefficient of each independent variable as well as the statistical significance of its observed effect on the target variable, with the other predictors held constant. For example: examining the effects of gender, age, race, and education on income. (Analysis_of_variance, 2015)

Factor analysis: Factor analysis, a dimensionality reduction technique, is used for data reduction as well as for finding interrelated variables. It helps in identifying patterns among variables and grouping highly interrelated variables into clusters known as factors. (multivariate-statistical-analysis-2448.html, 2012)

Path analysis: Path analysis is an extension of multiple regression. It helps in estimating the magnitude and significance of hypothesized causal connections between sets of variables with the help of a path diagram.




[i] http://userwww.sfsu.edu/efc/classes/biol710/manova/MANOVAnewest.pdf

Manova:

            Multivariate analysis of variance (MANOVA) is an extension of simple ANOVA, with the difference that in MANOVA we can have multiple dependent variables. For example, we may be interested in a study with two different textbooks where we want to see the improvement of students in math and chemistry. In this case, improvement in math and improvement in chemistry are the two dependent variables, and we are trying to find out whether both dependent variables together are affected by the difference in textbooks. (MANOVAnewest.pdf, 2014)

Data Preparation:

After gaining a firm understanding of the data distribution and the underlying issues with the data, the next step is to prepare the data for the modeling stage. This step consists of a series of methods and techniques to deal with the issues present in the data and is considered the most labor-intensive step in the entire process. Data is generally not present in the format expected by machine learning algorithms; in this step, data transformations, scaling, missing value imputation, and outlier handling are applied so that the data can be brought into the expected format. The following techniques are used to prepare the data in the columns of a data set (Abbott, Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst, 2014):
· Variable cleaning
· Variable selection
· Feature creation

Data preparation related to rows consists of the following techniques:
· Selection
· Sampling
· Feature creation

Preparing data in columns:

Variable cleaning:

Variable cleaning refers to fixing problems with the values of the variables themselves, including incorrect or miscoded values, outliers, and missing values. These issues should already have been identified during the data understanding stage.[i]
Incorrect values: Incorrect values are values that are either mistyped or miscoded. In a categorical variable, an incorrect value can represent a level that is not valid. For example, for a gender variable, the value 23 can be considered incorrect because it does not correctly represent any gender. To identify such values we can run frequency counts, examine the distribution of data among the categories, and decide which categories do not make sense. For continuous variables, incorrect values generally show up as outliers in the data. The decision on how to fix an incorrect value is subjective and largely depends on the underlying problem. In some situations, if the number of records with such values is low, these records are deleted from the data set; in other situations, they are replaced by some meaningful value determined by domain experts. (Kerravala, 2015)
Consistency of data formats: Another major issue with data is inconsistency of formats. The format of data within a single column needs to be consistent, otherwise it can generate unexpected results during the modeling stage. Columns that represent dates and amounts are most often subject to this cleaning; date and amount fields can appear in many formats, and it is mandatory to bring all these values into a single format before moving forward.
Outliers: Outliers are unusual or extreme values present in the data. They are generally measured in terms of standard deviations: any value more than three standard deviations above or below the mean of the data is considered an outlier. There are different techniques to deal with outliers, but the choice of method depends on the problem at hand. For example, an age of 141 can be considered an outlier, but at the same time it could actually be a correct age, and the record could in fact provide useful information. In another example, if the problem at hand is to detect fraudulent credit card activity, then these unusual values are the most important ones for the model and cannot be ignored. The following techniques can be used to deal with outliers (a sketch of outlier flagging follows the list); the selection again depends on the type of problem:
· Remove the outliers from the data
· Separate the outliers and create a separate model
· Transform the outliers
· Bin the data
· Leave the outliers in the data
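A minimal sketch of flagging outliers with the three-standard-deviation rule described above, assuming pandas and numpy are available; the data is synthetic and the flagged records are only reviewed, not automatically removed:

```python
# A minimal sketch of flagging outliers as values more than three standard
# deviations from the mean (assumes pandas/numpy). Whether to remove, transform,
# bin, or keep them depends on the problem at hand.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
ages = pd.Series(np.append(rng.normal(40, 10, size=200).round(), 141))  # 141 is suspicious

z_scores = (ages - ages.mean()) / ages.std()
outliers = ages[z_scores.abs() > 3]
print(outliers)   # flagged for review, not blind deletion
```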

Missing values: Missing values are the most problematic and can create a lot of trouble in the modeling stage. They are generally coded as NULL or empty values, but they can also be represented by predefined abbreviations. Missing values fall under one of the following categories (Abbott, Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst, 2014):
MCAR (missing completely at random): There is no way to logically deduce the value.
MAR (missing at random): There is a conditional relationship between the column with missing values and another column; in other words, the value is missing because of a value present in some other column. For example, in a survey, a follow-up question that requires further explanation only when the respondent answers yes will contain missing values whenever the answer to the preceding question was no.
MNAR (missing not at random): In this case, something about the missing records can be inferred from the mere fact that the values are not present in the column.
The following techniques can be used to fix missing values in a column (a sketch of several of them follows the list); the choice among them is subjective and depends on the nature of the problem and the data at hand:
· Listwise and column-wise deletion
· Imputation with a constant
· Mean or median imputation for continuous variables
· Imputation with distributions
· Random imputation from the variable's own distribution
· Imputation of missing values from a model
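A minimal sketch of a few of these treatments, assuming pandas is available; the data frame and its columns are hypothetical:

```python
# A minimal sketch of common missing value treatments (assumes pandas).
# The right choice is subjective and depends on the problem and the data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 58000, np.nan, 75000],
    "city":   ["NY", "SF", None, "NY", "SF", "NY"],
})

dropped  = df.dropna()                                                   # listwise deletion
const    = df.fillna({"income": 0, "city": "Unknown"})                   # impute with a constant
mean_imp = df.assign(income=df["income"].fillna(df["income"].mean()))    # mean imputation
mode_imp = df.assign(city=df["city"].fillna(df["city"].mode()[0]))       # most frequent category
print(mean_imp)
```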

Feature creation: Feature creation refers to the process of deriving new columns from the existing columns (the-problem-solver-approach-to-data-preparation-for-analytics.html, 2014). This process requires a fair understanding of the data and the business domain. The following techniques are commonly used:

· Simple variable transformations
o   Fixing skew: Positively or negatively skewed data can bias numerical models such as linear regression, k-nearest neighbors, and k-means, so it is often necessary to fix the skew in the data. The following transformations are available:

Positive skew:
§  Log transform: log(x), ln(x), log10(x)
§  Multiplicative inverse: 1/x
§  Square root: sqrt(x)
Negative skew:
§  Power transform: x^n
§  Log transform: -log10(1 + abs(x))
o   Binning continuous variables
o   Numeric variable scaling: the following techniques are available (see the sketch after this list):
§  Magnitude scaling
§  Sigmoid
§  Min-max normalization
§  Z-score
§  Rank binning

· Nominal variable transformation: Nominal variables can create problems in numerical algorithms because their codes may be interpreted as numbers rather than categories. It is a good idea to transform these variables into dummy (binary indicator) variables.
· Ordinal variable transformations: Ordinal variables have the same problem as nominal variables, but since ordinal variables represent an order, different techniques such as the thermometer scale are used to transform them.
· Date and time variable features: Fixing date and time variables usually refers to transforming them into a consistent format.
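A minimal sketch of the skew-fixing and scaling transformations listed above, assuming pandas and numpy are available; the income values are hypothetical:

```python
# A minimal sketch of simple variable transformations (assumes pandas/numpy):
# a log transform for positive skew, min-max normalization, and z-score scaling.
import numpy as np
import pandas as pd

income = pd.Series([32000, 41000, 38000, 52000, 47000, 250000])  # positively skewed

log_income = np.log10(income)                                     # reduce positive skew
min_max    = (income - income.min()) / (income.max() - income.min())
z_score    = (income - income.mean()) / income.std()
print(pd.DataFrame({"log10": log_income, "min_max": min_max, "z": z_score}).round(3))
```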

Variable selection:

Once the data has been cleaned and any new features have been derived, the next step is to carefully examine which variables are best suited to solving the given problem. Variable selection reduces the total number of variables by removing those that are irrelevant or are not good predictors. Certain algorithms, such as decision trees and stepwise linear regression, perform variable selection automatically (adaptive-data-preparation/, 2011). Otherwise, the following methods are used:
· Removing irrelevant variables
· Removing redundant variables
· Selecting variables when there are too many: When the number of input variables is very large, it becomes important to use techniques that identify the best predictor variables. Several such techniques are available, as summarized below (preparing-data-for-analytics.aspx, 2012).



[i] Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst

Technique and type of variables:
· Chi-square test: categorical input vs. categorical target
· Association rules: categorical input vs. categorical target
· ANOVA: categorical input vs. continuous target
· Linear regression forward selection: continuous input vs. continuous target
· Principal component analysis: continuous inputs (no target required)

Preparing data in Rows:

Apart from performing the required transformations on the data in the columns, it is very important to carefully select the data in the rows. Preparing data in rows means using the right techniques to sample and partition the data. (Abbott, Applied Predictive analytics, 2012)

Sampling:

Sampling refers to the process of selecting random records to train and test a machine learning algorithm. There are various sampling techniques that can be used depending on the nature of the problem. (http://sceweb.uhcl.edu/boetticher/ML_DataMining/SAS-SEMMA.pdf, 2011)
Random sampling: The simplest form of sampling, which does not take into account the distribution of the data.
Bootstrap sampling: If relatively little data is available and we cannot afford to partition it into training and validation sets, bootstrap sampling can be used. It is also referred to as sampling with replacement.
Stratified sampling: When the target variable is skewed, it becomes very important that any partition created through sampling has a comparable proportion of target values. Stratified sampling balances counts based on one or more variables.
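A minimal sampling sketch, assuming pandas and scikit-learn are available; the data frame and skewed target are synthetic:

```python
# A minimal sketch of random, stratified, and bootstrap sampling
# (assumes pandas and scikit-learn). The data frame and target are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(1000), "target": [1] * 100 + [0] * 900})  # skewed target

# random split vs. stratified split that preserves the 10% positive rate
train, test = train_test_split(df, test_size=0.3, random_state=0)
train_s, test_s = train_test_split(df, test_size=0.3, random_state=0,
                                   stratify=df["target"])
print(test["target"].mean(), test_s["target"].mean())

# bootstrap sample: sampling with replacement, same size as the original data
bootstrap = df.sample(n=len(df), replace=True, random_state=0)
```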

Feature Engineering:

            Feature engineering is one of the most important steps in solving a machine learning problem. The outcome of a machine learning algorithm depends directly on the quality of its inputs, so how you present the data matters for achieving the best results. Feature engineering is the art of creating and testing the most relevant features for a given problem. The power of useful features can be judged from the fact that even simple models can outperform the most complex models when given the right set of features; in other words, better features mean better results. Feature engineering can be defined as the process of transforming raw data into features that represent the problem in a better way and improve model accuracy and performance. It can be thought of as a representation problem, since it derives a new representation from an existing set of features. Before we dive deeper into the process, it is important to understand what a feature is: if data is present in tabular format (rows and columns), then any attribute, or column, can be considered a feature. The process of feature engineering can be divided into three major categories (http://pslcdatashop.org/KDDCup/workshop/papers/kdd2010ntu.pdf):
· Feature extraction
· Feature selection
· Feature construction

Feature Extraction:

A variety of statistical methods can be used to automatically extract the most important features from a data set. Feature extraction is the process of reducing the dimensionality of the data into a smaller set of features in an automatic way; it is centered on the concept of dimensionality reduction. Within a voluminous data set, feature extraction can help in identifying an abstracted set of features that best represents the problem. The following techniques are available for feature extraction (Brownlee, Discover Feature Engineering, How to Engineer Features and How to Get Good at It, 2014):
· Principal component analysis
· Clustering methods
The key idea is to aggregate the most important and related features into a subset of features in order to improve the performance of the machine learning algorithm.
(http://blog.bigml.com/2013/02/21/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-part-two/, 2013)
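A minimal feature extraction sketch using principal component analysis, assuming scikit-learn and its bundled breast cancer data set are available:

```python
# A minimal feature extraction sketch using principal component analysis
# (assumes scikit-learn). Thirty correlated inputs are reduced to a handful
# of components that retain most of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to scale

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
print("Variance explained:", round(pca.explained_variance_ratio_.sum(), 3))
```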

Feature Selection:

Each feature in a data set has a different impact on the target variable. Some features are more important than others, and some are redundant and provide no additional benefit in predicting the outcome. Feature selection is the process of identifying these redundant, irrelevant, or highly correlated features and keeping only the features relevant to the given problem. Feature selection is an automatic process in most of the statistical tools available in the market. Feature selection algorithms use scoring methods, such as correlation and chi-square statistics, to rank features. More advanced feature selection algorithms employ trial-and-error techniques, in which multiple models are created using various combinations of features with the objective of achieving optimum performance with the least number of features. Stepwise regression is one example of such an algorithm; it performs automatic feature selection during modeling. Regularization methods such as ridge and lasso regression are other examples; they perform automatic feature selection by penalizing irrelevant features while the model is fit. (learning-data-science-feature-engineering, 2016)
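A minimal sketch of automatic feature selection through L1 (lasso) regularization, assuming scikit-learn is available; the data is synthetic, with only a few truly informative features:

```python
# A minimal automatic feature selection sketch using L1 (lasso) regularization
# (assumes scikit-learn): irrelevant features are penalized toward zero weight.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)          # features with non-zero weight
print("Kept", len(selected), "of", X.shape[1], "features:", selected)
```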

Feature Construction

Feature construction is the most manual and indirect of the feature engineering categories. It requires a good understanding of the problem at hand and the business domain: it is the manual process of deriving new features through aggregation or other empirical methods in order to improve the accuracy and performance of the machine learning algorithm. It is by far the most time-consuming process and requires intensive brainstorming and understanding of the problem. If data is present in a tabular format, feature construction may involve aggregating or combining some of the columns to create new columns that represent the data in a more abstract but effective way (http://go.sap.com/docs/download/2015/09/2a16f496-3f7c-0010-82c7-eda71af511fa.pdf, 2015). The process generally involves the following steps:
1. Brainstorm features
2. Devise features
3. Select features
4. Evaluate models

Each of the above steps is largely self-explanatory, so let us consider an example of feature construction in which a categorical variable is decomposed so that a numeric algorithm can interpret it correctly. Say we have an item_color column that contains values such as Red, Blue, and Unknown. Here, Unknown represents missing information, but to a machine learning algorithm it can look like just another valid color. We can create a new feature called has_color that contains 1 for Red or Blue and 0 for the Unknown category. We could also create three binary features, is_red, is_blue, and is_unknown (http://machinelearningmastery.com/an-introduction-to-feature-selection/), which contain 1 (true) if the color is red, blue, or unknown respectively. These new features can be used to train and test the machine learning algorithm instead of the raw item_color feature, as sketched below. (Abbott, Applied Predictive analytics, 2012)
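A minimal sketch of this item_color example, assuming pandas is available; the values are hypothetical:

```python
# A minimal sketch of the item_color feature construction described above
# (assumes pandas). New binary features are derived from the raw column.
import pandas as pd

items = pd.DataFrame({"item_color": ["Red", "Blue", "Unknown", "Red", "Unknown"]})

items["has_color"] = (items["item_color"] != "Unknown").astype(int)
dummies = pd.get_dummies(items["item_color"], prefix="is").astype(int)
items = pd.concat([items, dummies], axis=1)     # is_Blue, is_Red, is_Unknown
print(items)
```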

Model Selection:

Model selection is a tedious job and requires a firm understanding of the problem at hand. The reason it is tedious is that there is usually more than one machine learning algorithm that can fit the underlying problem, and there are numerous algorithms to try and test. Within each algorithm there are several settings that can be tuned to better fit the problem, so the number of options to try really explodes, and the task can take a considerable amount of time. It is true that the type of problem, such as classification, regression, or clustering, gives an initial hint to try only a subset of algorithms; however, the number of different settings within each algorithm can still make it difficult to reach the optimum solution. Algorithm cheat sheets, such as the one referenced below, depict the available algorithms and a roadmap for selecting among them. (http://machinelearningmastery.com/why-you-should-be-spot-checking-algorithms-on-your-machine-learning-problems/, 2014)


Machine learning algorithms can be grouped by learning style and by similarity. We have already discussed grouping by learning style in the Background section; we can also group algorithms by similarity, which yields the following major categories. (machine-learning-cheat-sheet-for-scikit.html, 2013)
Regression algorithms: These model the relationship between variables, iteratively refining the model using a measure of error.
The most often used regression algorithms are: (http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/, 2015)
    • Ordinary Least Squares Regression (OLSR)
    • Linear Regression
    • Logistic Regression
    • Stepwise Regression
    • Multivariate Adaptive Regression Splines (MARS)
    • Locally Estimated Scatterplot Smoothing (LOESS)

Instance Based Algorithms:

These algorithms make decisions based on instances or examples from the training data, using some similarity measure. They are also sometimes referred to as memory-based learning.[i]

Following are most often used instance based algorithms.
    • k-Nearest Neighbour (kNN)
    • Learning Vector Quantization (LVQ)
    • Self-Organizing Map (SOM)
    • Locally Weighted Learning (LWL)

Regularization algorithms: Regularization algorithms are extensions of regression algorithms that penalize models that are too complex. They measure the impact of each variable added to the regression and help in creating simpler, more generalized models.

Following are the most often used regularization algorithms.
    • Ridge Regression
    • Least Absolute Shrinkage and Selection Operator (LASSO)
    • Elastic Net
    • Least-Angle Regression (LARS)

Decision tree algorithms: Decision tree algorithms create a set of rules laid out in a tree structure. They are often fast to train and produce models that are easy to interpret.


Following are most often used decision tree algorithms
    • Classification and Regression Tree (CART)
    • Iterative Dichotomiser 3 (ID3)
    • C4.5 and C5.0 (different versions of a powerful approach)
    • Chi-squared Automatic Interaction Detection (CHAID)
    • Decision Stump
    • M5
    • Conditional Decision Trees

Bayesian Algorithms:

These algorithms are based on Bayes' theorem and are used for classification as well as regression; a posterior probability is calculated from the prior probability and the likelihood of the observed data.


[i] http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/


Following are most often used Bayesian algorithms
    • Naive Bayes
    • Gaussian Naive Bayes
    • Multinomial Naive Bayes
    • Averaged One-Dependence Estimators (AODE)
    • Bayesian Belief Network (BBN)
    • Bayesian Network (BN)

Clustering Algorithms: Clustering algorithms are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.      



Following are most often used clustering algorithms
    • k-Means
    • k-Medians
    • Expectation Maximisation (EM)
    • Hierarchical Clustering

Association Rule learning algorithms:

These algorithms help in extracting the rules that best represent the relationships among variables.[i]



[i] http://machinelearningmastery.com/why-you-should-be-spot-checking-algorithms-on-your-machine-learning-problems/


Following are the most often used association rule learning algorithms.
    • Apriori algorithm
    • Eclat algorithm

Artificial neural network Algorithms:

Inspired by the structure and function of biological neural networks, this class comprises hundreds of algorithms and variations that can be adapted to a given problem.
Following are most often used Algorithms under this class.
    • Perceptron
    • Back-Propagation
    • Hopfield Network
    • Radial Basis Function Network (RBFN)

Dimensional Reduction algorithms:

Like clustering methods, these algorithms fall under unsupervised learning. The objective is to summarize the data and provide an abstraction of the underlying features.
Following are most often used dimensionality reduction algorithms.
    • Principal Component Analysis (PCA)
    • Principal Component Regression (PCR)
    • Partial Least Squares Regression (PLSR)
    • Sammon Mapping
    • Multidimensional Scaling (MDS)
    • Projection Pursuit
    • Linear Discriminant Analysis (LDA)
    • Mixture Discriminant Analysis (MDA)
    • Quadratic Discriminant Analysis (QDA)
    • Flexible Discriminant Analysis (FDA)

Ensemble Algorithms: Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction. 
Following are most often used ensemble algorithms.
    • Boosting
    • Bootstrapped Aggregation (Bagging)
    • AdaBoost
    • Stacked Generalization (blending)
    • Gradient Boosting Machines (GBM)
    • Gradient Boosted Regression Trees (GBRT)
    • Random Forest

A systematic approach to model selection can certainly help in making the choice, and spot checking techniques can be used to simplify the process. Spot checking refers to making a quick assessment of a handful of algorithms and deciding which ones to focus on for the given problem. For a given machine learning problem, it is very helpful to be able to quickly determine which class or type of algorithm is good at explaining the patterns in the problem. The benefits of using spot checking to narrow down the search for an optimum solution are:
· Speed: Speed is the major benefit of spot checking, as it can save a lot of time by narrowing down the number of algorithms to focus on.
· Objectivity: Spot checking helps you objectively discover the suitable algorithms so that you can focus your attention on them.
· Results: Spot checking helps you get to optimum results faster; you obtain a list of potentially suitable algorithms and can concentrate on improving their accuracy and performance.
Below are some guidelines for choosing the right approach to spot checking algorithms for your problem; a sketch of the process follows the list.
1. Algorithm diversity: Include a good mix of algorithm types, such as some instance-based methods, kernel functions, decision trees, and rule systems.
2. Best foot forward: Give each algorithm a fair chance to prove its credibility by adjusting a few quick parameters.
3. Formal experiment: Execute the entire process in a formal and systematic way. The idea is to gain a first-level understanding of which algorithms are suitable.
4. Jumping-off point: The suitable algorithms are just pointers to the right direction, not the solution. Once you have the direction, explore it and find the optimum solution.
5. Build your short list: As you shortlist algorithms, keep iterating on your list and try different algorithms of the same category to achieve the best results.
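A minimal spot checking sketch, assuming scikit-learn and its bundled breast cancer data set are available; the candidate list is only an example of a diverse mix:

```python
# A minimal spot checking sketch (assumes scikit-learn): a diverse set of
# algorithms is given a quick, fair evaluation with default settings so that
# the most promising candidates can be shortlisted.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic":      LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "kNN":           KNeighborsClassifier(),
    "naive Bayes":   GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:14s} mean accuracy {scores.mean():.3f}")
```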

Model Evaluation:

Model evaluation is another important step in any machine learning project. It not only provides an opportunity to gauge the performance and accuracy of a given model but also allows comparing and contrasting various models; in fact, by using model evaluation techniques an optimum solution for the given problem can be reached. Model evaluation is performed using data that has not yet been seen by the model. Using the training data during model evaluation can lead to overfitting, wherein the model memorizes the training data set and does not generalize well to new data. The following techniques are used to avoid overfitting. (model_evaluation.htm, 2013)

Hold out:

In this method, the data set is divided into two, or sometimes three, sets: training, testing, and validation data sets.

Cross Validation:

When the available data is limited, i.e., the data set contains few observations, k-fold cross-validation is used to avoid overfitting. In k-fold cross-validation, the data is divided into k subsets of equal size. The model is built k times, each time leaving out one of the subsets from training and using it as the test set. If k equals the sample size, this is called "leave-one-out" cross-validation.
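A minimal sketch of hold-out and k-fold cross-validation, assuming scikit-learn and its bundled breast cancer data set are available:

```python
# A minimal sketch of hold-out and k-fold cross-validation (assumes scikit-learn).
# None of the evaluation data is used to fit the model, which guards against
# an over-optimistic, overfit estimate of performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# hold-out: a single train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: each fold is left out once and used as the test set
for fold, (tr, te) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    acc = LogisticRegression(max_iter=5000).fit(X[tr], y[tr]).score(X[te], y[te])
    print(f"fold {fold}: accuracy {acc:.3f}")
```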
At a high level, model evaluation can be divided into two groups (http://hillside.net/plop/plop2002/final/PLoP2002_jtsouza0_1.pdf):
· Classification evaluation
· Regression evaluation

Classification Evaluation

The following evaluation techniques are used for classification problems. (http://www.saedsayad.com/model_evaluation_c.htm)

Confusion Matrix:

A confusion matrix is a tabular format for presenting the correct and incorrect cases predicted by the model.
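A minimal confusion matrix sketch, assuming scikit-learn is available; the actual and predicted labels are made up:

```python
# A minimal confusion matrix sketch (assumes scikit-learn): actual labels are
# compared against predicted labels in a small table.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_actual, y_predicted))
# rows = actual (0, 1), columns = predicted (0, 1):
# [[4 1]     4 true negatives, 1 false positive
#  [1 4]]    1 false negative, 4 true positives
```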


Gain Charts

Gain, or lift, is a measure of the effectiveness of a model, calculated as the ratio between the results obtained with and without the model. Gain and lift charts are a way of visualizing these results for performance evaluation: the predictions are divided into deciles, and model performance is measured against the baseline at each level. This evaluation technique is very useful in marketing applications.[i]



[i] http://www.saedsayad.com/model_evaluation_c.htm

Lift Charts:

Lift charts provide a way to evaluate how much more likely a positive response is with the model as compared to without the model. As with gain charts, the predictions are divided into equal partitions and the lift is calculated at each level. For example, by contacting only 10% of customers based on the predictive model we might reach three times as many respondents as we would with no model. (http://stats.stackexchange.com/questions/116585/confusion-matrix-of-classification-rules)


ROC Charts:

The ROC chart shows the false positive rate on the x-axis against the true positive rate (sensitivity) on the y-axis. (http://sceweb.uhcl.edu/boetticher/ML_DataMining/SAS-SEMMA.pdf, 2011)


Area Under Curve:

The area under the ROC curve is a measure of the effectiveness of the model. A random classifier would have an area of 0.5 under the curve, while a perfect model would have an area of 1; most models fall between 0.5 and 1.
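A minimal ROC and AUC sketch, assuming scikit-learn and its bundled breast cancer data set are available:

```python
# A minimal ROC / area under curve sketch (assumes scikit-learn). A random
# classifier scores about 0.5; a perfect one scores 1.0.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # points of the ROC chart
print("AUC:", round(roc_auc_score(y_te, probs), 3))
```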

Regression Evaluation:

The following evaluation techniques are used for regression problems.[i]

Root Mean Square Error:

Root mean square error measures the difference between the predicted and actual values and provides a gauge of the performance of the regression model. It is given by the formula RMSE = sqrt( (1/n) * sum of (predicted - actual)^2 ).

Relative squared Error:

Relative squared error can be used to compare two models whose errors are measured in different units. It is given by the formula RSE = sum of (predicted - actual)^2 / sum of (mean(actual) - actual)^2.

Mean Absolute Error:

Mean absolute error is similar to RMSE in that it calculates the differences between the predicted and actual values; it then averages the absolute differences: MAE = (1/n) * sum of |predicted - actual|. MAE is expressed in the same units as the data.



[i] http://www.saedsayad.com/model_evaluation_r.htm

Relative absolute Error:

Similar to RSE, RAE allows comparing the absolute errors of two models whose errors are measured in different units. It is given by the formula RAE = sum of |predicted - actual| / sum of |mean(actual) - actual|.

Coefficient of Determination:

The coefficient of determination gives a measure of the predictive power of the model. It compares the variance explained by the model against the overall variance present in the target variable: R^2 = 1 - SSE/SST, where SSE is the sum of squared errors of the model and SST is the total sum of squares.
If the regression model is "perfect", SSE is zero and R^2 is 1. If the regression model is a total failure, SSE is equal to SST, no variance is explained by the regression, and R^2 is zero. (http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-14-evaluation-and-credibility)
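A minimal sketch that computes the regression measures above directly from their formulas, assuming numpy is available; the actual and predicted values are made up:

```python
# A minimal sketch of the regression evaluation measures above (assumes numpy),
# computed directly from their formulas on a small set of predictions.
import numpy as np

actual    = np.array([10.0, 12.0, 15.0, 11.0, 14.0])
predicted = np.array([11.0, 11.5, 14.0, 12.5, 13.0])
baseline  = actual.mean()                       # the "no model" prediction

rmse = np.sqrt(np.mean((predicted - actual) ** 2))
mae  = np.mean(np.abs(predicted - actual))
rse  = np.sum((predicted - actual) ** 2) / np.sum((baseline - actual) ** 2)
rae  = np.sum(np.abs(predicted - actual)) / np.sum(np.abs(baseline - actual))
r2   = 1 - np.sum((predicted - actual) ** 2) / np.sum((actual - baseline) ** 2)

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  RSE={rse:.3f}  RAE={rae:.3f}  R2={r2:.3f}")
```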

Model Deployment:

Model deployment is necessary in order to integrate the model into day-to-day decision making: it places the model in the production environment so that it can actually be used in business scenarios. This activity can take a significant amount of time depending on the situation, and it can be very challenging because many organizations lack an integrated technical infrastructure for model deployment. (predictive-modeling-production-deployment/)
After the model has been evaluated and validated, it is moved to production by implementing a scoring system in which the model is fed new data that does not contain the target variable. The following approaches to deployment are common (/predictive-analytics-model-deployment-monitoring-graham-smith-phd):
· Scoring: The model is used as a scoring model, and the scores are provided to the respective departments to support decisions.
· Integration with reporting: The model is used as a reference point and is combined with business intelligence tools.
· Integration with an application: The model is integrated into an existing application and is used to improve its processes or decisions.
Once models are deployed, they are monitored in the production environment periodically, because models tend to deteriorate with use and need to be rebuilt; the data fed to them changes over time while the model remains fit to the original training data. (p5V11n1.pdf)
Sometimes it is necessary to move models from one application or platform to another. Most model building software provides an option to export models in a standard language, PMML (Predictive Model Markup Language). PMML is an XML-based language, which makes it easier to move models between different applications and platforms[i]. It is the leading standard for predictive software and is supported by over 20 vendors (pmml.html). The different sections of a PMML document are:
· Data Dictionary
· Mining Schema
· Data Transformations
· Model Definition
· Outputs
· Targets
· Model Explanation
· Model Verification

There are numerous open source and proprietary software options that take a PMML file as input and allow the model to be deployed as a service. One such option is Zementis ADAPA. (https://www.sas.com/content/dam/SAS/en_us/doc/whitepaper2/three-steps-put-predictive-analytics-work-105837.pdf)



[i] http://www.predictiveanalyticstoday.com/deployment-predictive-models/

ADAPA provides the following variants to suit various business needs:[i]
·         ADAPA in the Cloud
o   Private, virtual ADAPA Instance
o   Self-service Scoring Engine
·         ADAPA on Site
·         ADAPA as a Library

Problem with over Reliance on Machine learning:

Machine learning algorithms hold great potential for generalizing a problem scenario and providing a good direction for decision making. However, over-reliance on machine learning algorithms can lead to severe consequences. Machine learning algorithms are, in fact, probabilistic models that can only give the likelihood of events; this information should be used to further understand the underlying situation before any decision is taken.





[i] https://www-01.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/models_import_pmml.htm

Modern analytical tools are very good at extracting hidden patterns from massive data sets, but over-reliance on computer algorithms can lead to inaccurate end results in certain situations. For instance, Google had to apologize when the Google Photos picture tagging service labeled a black woman as a 'gorilla'.


In another instance, Indian Prime Minister Narendra Modi appeared in search results for a list of top 10 criminals.

Google also issued an apology after a translation tool suggested a number of offensive translations for the word "gay" in January of the previous year.
Machine learning has proved its potential in AI, marketing, medicine, healthcare, and other major industries. Dependence on decision science using machine learning algorithms is anticipated to grow along with the growth in data generation; it is estimated that by 2020 there will be more than 21 billion devices connected to the internet.


These devices will not only be able to communicate with the internet but will also communicate with each other. The increasing dependence on such devices will further strengthen the role of machine learning in problem solving, as sophisticated algorithms will be needed to derive insights from the data they generate. As discussed above, machine learning requires a firm understanding of the underlying algorithms and the business problem. Machine learning did not evolve from traditional software engineering, and the methods used to develop, integrate, and deploy models are therefore different from traditional approaches.
One of the misconceptions about machine learning algorithms relates to performance. The objective of a machine learning algorithm is not always the best possible performance on training data; in fact, pushing performance on the training data beyond a certain point leads to overfitting, wherein the model memorizes the training data points and does not generalize to new data. The objective is to find the sweet spot between model complexity and model accuracy. Moreover, even the most accurate machine learning algorithms are not useful if they cannot be deployed to solve the business problem, so model complexity is an important factor to keep in mind when building them. Many newcomers to the field fail to understand the importance of model simplicity and focus only on improving model performance. We have also discussed how over-reliance on machine learning algorithms can have severe consequences. Machine learning algorithms are good at providing an answer to the problem, but it is often not clear why that answer is the one suggested. Given the nature of discriminative models (such as linear regression, logistic regression, neural networks, boosting, and support vector machines), the algorithm develops a mathematical formula or equation that effectively distinguishes the different classes, which can produce great answers even though you may not understand why. Generative models, on the other hand, do a better job of explaining the why. Since the output of machine learning algorithms may not always be easy to decipher, it is important in certain situations for humans to evaluate the results. Human-in-the-loop computing can help overcome several of these problems.
In human-in-the-loop computing, a machine learning algorithm first takes a pass over the data and assigns a confidence score to each predicted label. If the confidence score is below a certain value, the case is sent to a human annotator to make the decision. This human judgment is then used by the business, and also fed back to the machine learning algorithm to improve it further. In simple words, humans complement the machine learning algorithm in situations where it is not very confident about a given prediction. Self-driving cars are a good example of human-in-the-loop computing. Tesla has introduced an automated driving mode that illustrates the pattern: the car mostly drives itself, but under certain conditions it alerts the driver to keep their hands on the wheel. When the system cannot confidently score the surrounding environment of the car due to an unexpected condition, such as construction, snow, or anything unusual on the road, it alerts the driver to take back control. The car can drive automatically in most situations, but it needs human intervention to be foolproof. A minimal sketch of the confidence-threshold routing described above follows.
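The sketch assumes scikit-learn is available; the threshold value of 0.9 is an arbitrary assumption for illustration:

```python
# A minimal human-in-the-loop sketch (assumes scikit-learn): predictions below
# a confidence threshold are routed to a human annotator instead of being
# accepted automatically. The threshold value here is an arbitrary assumption.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)
confidence = probs.max(axis=1)                  # confidence of the predicted label

THRESHOLD = 0.9
auto = confidence >= THRESHOLD                  # accepted automatically
review = ~auto                                  # sent to a human annotator
print(f"auto-accepted: {auto.sum()}, routed to human review: {review.sum()}")
```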
Another interesting aspect of machine learning is the direction in which it is moving. The growth of, and dependence on, data has increased exponentially in recent years. Given the improved computational power of modern computers, machine learning can now handle enormous data sets with hundreds of features, as in natural language processing and image recognition. However, selecting the best features for an algorithm is still largely a manual process. With more and more features in a data set, it becomes increasingly difficult and tedious to identify the optimal features for a machine learning algorithm. Models built on these huge data sets are not only computationally expensive but also harder to interpret. Moreover, model creation and evaluation is an iterative process, and it takes many iterations to discover the right set of parameters and the right algorithm for a given business problem; with huge data sets, this process becomes very time consuming. Even though techniques such as principal component analysis (PCA) can help discover the most relevant features of a given data set, the process is intensive and requires a deep understanding of the business domain and statistical methods. This is where deep learning, another branch of machine learning, pitches in. Deep learning abstracts features through multiple levels of representation, providing a richer generalization of the input features and helping to find the optimal features for the underlying problem. Deep learning can help with the following major issues in machine learning:
  • Inefficiency in dealing with high-dimensional data
  • Large-scale models
  • Problems with incremental learning
Deep learning holds the true potential of automating the model-creation process even for massive data sets. It can provide solutions to many of the problems faced in machine learning, but a lot of research work is still required to make deep learning more broadly useful.
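As a point of comparison for the first issue above, the classical remedy for high-dimensional data is dimensionality reduction. The sketch below is illustrative only, assuming scikit-learn and synthetic data: PCA keeps just enough components to explain 95% of the variance before a simple classifier is fit, and both the 95% target and the data-set sizes are arbitrary choices.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 500 features, but only a handful carry real signal
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=10, random_state=0)

# Keep enough principal components to explain 95% of the variance,
# then fit a simple classifier on the reduced representation
model = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validated accuracy with PCA features:", round(scores.mean(), 3))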
We have witnessed a trend in the job market centered around data analytics. With the increasing dependence on data, we have seen new job titles such as data scientist, data engineer, and business intelligence analyst. Given the shortage of data professionals, these skills are in very high demand; data scientist has even been called the sexiest job of the 21st century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/). Below is a snapshot of the increase in jobs with the titles data science and data scientist.
[Figure: Trend in job postings with the titles "data science" and "data scientist", 2011 onward]
From the graph, it is evident that jobs for data professionals have been increasing at a significant rate since 2011. To meet this market demand, more courses need to be taught at the university level so that fresh graduates can excel in their careers. Various certificate and degree courses have been introduced in universities; however, these courses have still not become part of the core curriculum. Given the importance of data in every industry, it has become very important that introductory data courses are taught in every degree program. The availability of open-source languages and tools has made it very easy for anyone to start acquiring these skills; a number of tools are available in the market, such as SAS, IBM SPSS, R, Python, and MATLAB.
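As an illustration of how low the barrier to entry has become with the open-source Python stack, the sketch below (assuming scikit-learn, which ships the classic Iris sample data set) trains and evaluates a simple classifier in a handful of lines; the nearest-neighbour model is an arbitrary choice for demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled Iris data set and hold out a test split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Fit a 3-nearest-neighbour classifier and report held-out accuracy
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("Test-set accuracy:", round(model.score(X_test, y_test), 3))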
It is true that machine learning is already a powerful tool that can do an incredible job of solving really hard problems, but a lot of research is still required before it can truly replicate human behavior. We also have not yet seen many stable, high-quality applications of machine learning in sentiment analysis; for instance, speech-assistant applications still cannot pick up the pitch of a speaker to figure out the emotion in the speech, and they give the same results irrespective of the emotion. Nevertheless, machine learning applications have come a long way and have provided numerous fruitful results in all the major industries. The field is still growing, and it will be interesting to see how industry takes advantage of this vast field and how machine learning shapes the future of technology.



Bibliography

predictive-analytics-model-deployment-monitoring-graham-smith-phd. (n.d.). Retrieved Feb 25, 2016, from https://www.linkedin.com/pulse/predictive-analytics-model-deployment-monitoring-graham-smith-phd
adaptive-data-preparation. (2011, Jan 7). Retrieved Feb 14, 2016, from http://www.paxata.com/adaptive-data-preparation/
SAS-SEMMA. (2011, Aug 23). Retrieved Feb 21, 2015, from http://sceweb.uhcl.edu/boetticher/ML_DataMining/SAS-SEMMA.pdf
multivariate-statistical-analysis. (2012, Sep 16). Retrieved Feb 23, 2016, from http://classroom.synonym.com/multivariate-statistical-analysis-2448.html
preparing-data-for-analytics. (2012, Feb 11). Retrieved Feb 19, 2016, from https://tdwi.org/articles/2015/04/14/preparing-data-for-analytics.aspx
everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-part-two. (2013, Aug 12). Retrieved Feb 21, 2016, from http://blog.bigml.com/2013/02/21/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-part-two/
machine-learning-cheat-sheet-for-scikit. (2013, Jan). Retrieved Feb 18, 2016, from http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html
model_evaluation. (2013, Feb 6). Retrieved Feb 16, 2016, from http://www.saedsayad.com/model_evaluation.htm
types-of-variables. (2013, Dec 6). Retrieved Feb 23, 2016, from https://www.boundless.com/statistics/textbooks/boundless-statistics-textbook/visualizing-data-3/the-histogram-18/types-of-variables-87-4406/
why-you-should-be-spot-checking-algorithms-on-your-machine-learning-problems. (2014, Feb 7). Retrieved Feb 19, 2016, from http://machinelearningmastery.com/why-you-should-be-spot-checking-algorithms-on-your-machine-learning-problems/
src-bd.weebly.com. (2014, Jan 22). Retrieved Feb 12, 2016, from http://src-bd.weebly.com/
MANOVAnewest.pdf. (2014).
the-problem-solver-approach-to-data-preparation-for-analytics. (2014, Mar 5). Retrieved Feb 26, 2016, from http://www.sas.com/en_us/insights/articles/data-management/the-problem-solver-approach-to-data-preparation-for-analytics.html
Analysis_of_variance. (2015, Oct 15). Retrieved Feb 12, 2016, from https://en.wikipedia.org/wiki/Analysis_of_variance
2a16f496-3f7c-0010-82c7-eda71af511fa.pdf. (2015, Jan 11). Retrieved Feb 20, 2016, from http://go.sap.com/docs/download/2015/09/2a16f496-3f7c-0010-82c7-eda71af511fa.pdf
a-tour-of-machine-learning-algorithms. (2015, Jun 13). Retrieved Feb 13, 2016, from http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
learning-data-science-feature-engineering. (2016, Jan 19). Retrieved Feb 20, 2016, from http://www.simafore.com/blog/learning-data-science-feature-engineering
Abbott, D. (2012). Applied Predictive Analytics.
Abbott, D. (2014). Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst. Wiley.
Afsar, J. (2012). Univariate, Bivariate and Multivariate Data. Engineering Intro.
Brownlee, J. (2013). How to Define Your Machine Learning Problem. Machine Learning Mastery.
Brownlee, J. (2013, Nov 17). Machine Learning Mastery.
Brownlee, J. (2014, Sep 26). Discover Feature Engineering, How to Engineer Features and How to Get Good at It. Machine Learning Mastery.
PLoP2002_jtsouza0_1.pdf. (n.d.). Retrieved Feb 19, 2016, from http://hillside.net/plop/plop2002/final/PLoP2002_jtsouza0_1.pdf
an-introduction-to-feature-selection. (n.d.). Retrieved Feb 21, 2016, from http://machinelearningmastery.com/an-introduction-to-feature-selection/
kdd2010ntu.pdf. (n.d.). Retrieved Feb 22, 2016, from http://pslcdatashop.org/KDDCup/workshop/papers/kdd2010ntu.pdf
confusion-matrix-of-classification-rules. (n.d.). Retrieved Feb 22, 2016, from http://stats.stackexchange.com/questions/116585/confusion-matrix-of-classification-rules
model_evaluation_c. (n.d.). Retrieved Feb 19, 2016, from http://www.saedsayad.com/model_evaluation_c.htm
model_evaluation_r. (n.d.). Retrieved Feb 20, 2016, from http://www.saedsayad.com/model_evaluation_r.htm
machine-learning-and-data-mining-14-evaluation-and-credibility. (n.d.). Retrieved Feb 18, 2016, from http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-14-evaluation-and-credibility
data-scientist-the-sexiest-job-of-the-21st-century. (n.d.). Retrieved Mar 5, 2016, from https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
three-steps-put-predictive-analytics-work-105837.pdf. (n.d.). Retrieved Feb 22, 2016, from https://www.sas.com/content/dam/SAS/en_us/doc/whitepaper2/three-steps-put-predictive-analytics-work-105837.pdf
Kerravala, Z. (2015, Jan 30). Data preparation is the unsung hero of big data analytics.
p5V11n1.pdf. (n.d.). Retrieved Feb 22, 2016, from http://kdd.org/exploration_files/p5V11n1.pdf
pmml. (n.d.). Retrieved Feb 24, 2016, from http://www.kdnuggets.com/faq/pmml.html
predictive-modeling-production-deployment. (n.d.). Retrieved Feb 23, 2016, from http://insidebigdata.com/2014/10/08/predictive-modeling-production-deployment/
Ray, S. (2016, Jan 11). guide-data-exploration. Retrieved Feb 25, 2016, from http://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
Vistasc - Analytical journey starts here. (2016, Jan 12). Retrieved Feb 15, 2016, from http://www.simafore.com/vistasc-univariate-bivariate-multivariate


[i] http://www.sas.com/en_us/insights/analytics/machine-learning.html
[ii] http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/
[iii] http://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
[iv] http://userwww.sfsu.edu/efc/classes/biol710/manova/MANOVAnewest.pdf
[v] Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst
[vi] http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
[vii] http://machinelearningmastery.com/why-you-should-be-spot-checking-algorithms-on-your-machine-learning-problems/
[viii] http://www.saedsayad.com/model_evaluation_c.htm
[ix] http://www.saedsayad.com/model_evaluation_r.htm
[x] http://www.predictiveanalyticstoday.com/deployment-predictive-models/
[xi] https://www-01.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/models_import_pmml.htm