Saturday, February 20, 2016

Deep Learning and Big Data Analytics



Over the last decade, data mining, or big data analytics, has become increasingly important for both public and private organizations. Recent years have seen a surge of new technologies, frameworks, and hardware built to support the wave of big data. As companies collect more and more data, there is a pressing need to mine that data for useful insights and decisions. Some of the most valuable applications of data mining are in cyber security, marketing, fraud detection, and customer relationship management. Data mining techniques focus on extracting generalized patterns from massive data sets and using those patterns to support future decision making.
However, the underlying foundation of any data mining or machine learning technique is the quality of the data. A complex and efficient technique can fall short if the features used to train the model are not representative. On the other hand, even a simple algorithm can do a wonderful job if it is given the right features. So the question that pops up is, "How do we find the right set of features in massive data sets with millions of data points?" In the following discussion, we will look at some unique challenges involved in mining big data and see how deep learning can help overcome them.

KEY ISSUES:
  • Feature engineering
  • Information retrieval and indexing
  • Dealing with massive unlabeled/unsupervised data

ANALYSIS OF ISSUES:

Feature Engineering:
The performance and effectiveness of data mining and machine learning techniques, whether supervised or unsupervised, depend largely on the underlying data. No technique can produce fruitful results if the underlying data is incorrect, non-representative, or used improperly. Although existing tools and technologies provide many options for discovering relationships between input variables and target variables, these options are designed for a small number of features. The most commonly used data analysis techniques are univariate, bivariate, and multivariate analysis, which are confined to a handful of variables and focus on single-dimensional features. In the real world, however, features are not generally single-dimensional; input features can be multidimensional, and identifying and deriving them is essential to getting satisfactory results from the underlying algorithms. Feature engineering is the most time-consuming yet important part of any data mining project. Many useful multidimensional features can be discovered with the help of business domain experts, but in massive data sets that are highly unstructured and unorganized, many such features go undetected and the resulting models are suboptimal. Linear transformation techniques such as principal component analysis (PCA) are incapable of dealing with nonlinear features.
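
Below is a minimal sketch of this limitation using scikit-learn (the data set, parameters, and classifier are illustrative assumptions, not from the original post): linear PCA merely rotates nonlinearly structured data, while a nonlinear method such as kernel PCA can untangle it.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: the classes are not linearly separable.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# Linear PCA only rotates the data, so a linear classifier still fails.
X_pca = PCA(n_components=2).fit_transform(X)
acc_linear = LogisticRegression().fit(X_pca, y).score(X_pca, y)

# Kernel PCA (RBF kernel) maps the circles into a space where they separate.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
acc_kernel = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)

print(f"accuracy after linear PCA: {acc_linear:.2f}")   # typically near 0.5
print(f"accuracy after kernel PCA: {acc_kernel:.2f}")   # typically near 1.0
```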

Deep learning helps in this situation by providing techniques for the automatic extraction of the most representative features of a data set. Deep learning algorithms try to emulate the hierarchical learning of the human brain. They have the ability to generalize in non-local ways and to detect patterns beyond nearest neighbors. Through a multilayer abstraction approach, they provide a richer generalization of the input features: at each layer, features are abstracted and generalized further, capturing their multidimensionality. For example, an image is composed of different sources of variability such as lighting, object shape, and material. The multilayer abstraction provided by deep learning can help separate these different sources of variation in the data.
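
As a hedged illustration of this layer-by-layer abstraction, here is a small autoencoder sketched in TensorFlow/Keras (the layer sizes and placeholder data are assumptions for illustration only): each successive layer compresses the input into a more abstract representation, and no labels are required.

```python
import numpy as np
from tensorflow.keras import layers, models

input_dim = 784  # e.g. a flattened 28x28 image

autoencoder = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(input_dim,)),  # low-level features
    layers.Dense(64, activation="relu"),                             # mid-level features
    layers.Dense(32, activation="relu"),                             # abstract code
    layers.Dense(64, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),                   # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")

# Trained to reconstruct its own input, the network learns a hierarchy of
# features from unlabeled data.
X = np.random.rand(1000, input_dim).astype("float32")  # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)
```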

Information retrieval and indexing:
Although not directly a data mining problem, another major issue with the underlying data is the efficiency of information retrieval. Data in today's world has exceeded the typical storage, processing, and computing capacity of traditional databases and data analysis tools. The rise of big data technologies has made it possible to store the massive data generated every hour. In addition to volume, big data brings other complexities, commonly described as variety, velocity, and veracity.
With the growing dependence on data, efficient storage and retrieval of information has become increasingly important. Traditional indexing solutions are no longer sufficient, because this data is huge and not organized as a relational model. Data collected from sources such as video streams, images, and audio requires more than traditional indexing: it needs semantic indexing, so that the data can be presented more efficiently and used as a source for knowledge discovery and comprehension. Deep learning provides a way to implement semantic indexing for efficient information retrieval. It generates high-level abstract representations of the data, which can be used for indexing instead of the raw data itself. Beyond semantic indexing, deep learning can also help uncover the complex relationships and factors that lead to knowledge and understanding. Abstract representations make it possible to store similar items close to each other, enabling fast retrieval.
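
A minimal sketch of this idea follows, in plain NumPy; the `embed` function is a hypothetical stand-in for a trained deep model's encoder. Items are indexed by their embeddings and retrieved by vector similarity rather than by raw content.

```python
import numpy as np

def embed(item: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a deep encoder; here it just L2-normalizes."""
    return item / (np.linalg.norm(item) + 1e-9)

# Build the semantic index: one embedding per stored document or image.
corpus = np.random.rand(10_000, 128).astype("float32")  # raw feature vectors
index = np.stack([embed(x) for x in corpus])

def search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k items most similar to the query."""
    q = embed(query)
    scores = index @ q  # cosine similarity, since embeddings are normalized
    return np.argsort(scores)[::-1][:k]

print(search(corpus[42]))  # item 42 should be its own nearest neighbor
```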

Dealing with massive unlabeled/unsupervised data:
Another challenge, apart from the volume and velocity of data, is the ability to deal with massive unlabeled and unsupervised data. Such data contains complicated nonlinear features. As explained earlier, deep learning can help with feature engineering by providing multilayer abstraction, but the task becomes even more daunting when the underlying data is unlabeled. It therefore becomes essential to decode these complex nonlinear features and use their simpler forms in the algorithms. The technique of discovering these complex nonlinear features is called a discriminative task.
Discriminative tasks not only help with discriminative analysis but can also be used for data tagging to improve search. For example, MAVIS, the Microsoft Research Audio Video Indexing System, uses deep learning to enable searching speech content. Discriminative tasks have become increasingly important with the growth of digital media collections, most of which come from social networks, GPS devices, medical imaging, and image sharing systems. It is important to organize and store these images so that they can be browsed and retrieved efficiently. A huge collection of images is an example of unlabeled data because, in technical terms, a picture is only a collection of pixels. We need efficient methods to store and organize this unlabeled data, and text-based search alone can no longer cope with collections of this size. One solution is automated tagging: extracting semantic information from the images themselves. Deep learning provides useful techniques for constructing representations of image and video data in real time, which can then be used for image indexing and retrieval.
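
As a hedged sketch of automated tagging (the post's MAVIS example is for speech; here an ImageNet-trained ResNet50 from Keras serves as an illustrative stand-in, and the input batch is a placeholder): the predicted labels can be stored alongside each image to make a collection of raw pixels text-searchable.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)

model = ResNet50(weights="imagenet")  # downloads pretrained weights

# Placeholder batch; in practice, load real images resized to 224x224.
images = np.random.rand(1, 224, 224, 3) * 255.0
preds = model.predict(preprocess_input(images))

# decode_predictions turns raw scores into human-readable tags that can be
# indexed for text-based search over an image collection.
for _, tag, score in decode_predictions(preds, top=3)[0]:
    print(f"{tag}: {score:.2f}")
```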

CONCLUSION:

With growing business opportunities in the field of data, our dependence on data will only increase. We have witnessed exponential growth in this vertical in recent years: companies try to leverage every opportunity the data provides, storing every bit and byte in raw format for later use. Although data is king in today's market, the industry also needs to discover efficient ways of storing only the relevant data. This runs counter to a fundamental premise of big data, but dumping massive data into data lakes can make retrieval difficult, with the most important pieces of data buried deep underneath. Big data presents further challenges:
  • Inefficiency in dealing with high-dimensional data
  • Large-scale models
  • Problems with incremental learning

Models built on these huge data sets are not only computationally expensive but also difficult to interpret. Moreover, model creation and evaluation is an iterative process: it takes many iterations to discover the right set of parameters and the right algorithm for a given business problem. With huge data sets, this process becomes very time consuming. Although techniques such as principal component analysis and deep learning can help discover the most relevant features of a data set, the process is intensive and requires a deep understanding of both the business domain and statistical methods.
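
The sketch below illustrates why this iteration is expensive, using scikit-learn's cross-validated grid search (the data set, model, and parameter grid are illustrative assumptions): every parameter combination is trained and scored several times, a cost that multiplies on massive data sets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 2 x 3 parameter combinations, each trained and scored 5 times (cv=5).
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_, f"{search.best_score_:.3f}")
```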
Recent years have produced a huge number of business cases for leveraging the power of data, but the industry is still working toward optimal solutions for mining data in real time. Much of the information in these huge data sets is time sensitive. For example, a company trying to predict stock prices needs to observe the trend line based on historical events, but it is even more important to be able to make predictions in real time.
With the increasing use of deep learning in data mining and artificial intelligence, we can expect solutions for incremental learning with real-time analytics. Deep learning can address many of these problems, but a lot of research is still required to make it more broadly useful. Open source tools, technologies, and frameworks such as R, Python, F#, and scikit-learn have played a significant role in bringing the data industry to its current level. We have seen major companies such as Google, Facebook, and Yahoo share their proprietary frameworks to help revolutionize the industry. Google's release of the TensorFlow library last year should improve the business use cases of deep learning for real-world problems such as image recognition and natural language processing with even higher accuracy. It will be interesting to witness this journey, and even more interesting to be part of it.

References:
Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1). http://journalofbigdata.springeropen.com/articles/10.1186/s40537-014-0007-7
