Over the last decade, data mining, or big data analytics, has become increasingly important for both public and private companies. Recent years have seen a huge increase in new technologies, frameworks and hardware to support the wave of big data. With companies collecting more and more data, there is a huge need to mine this data to derive useful insights and decisions. Some of the most useful applications of data mining are in cyber security, marketing, fraud detection and customer relationship management. Data mining techniques are highly focused on extracting generalized patterns from massive data sets and using these patterns to help in future decision making.
However, the underlying foundation of any data mining or machine learning technique is the goodness of the data. A complex and efficient technique can fall short if the features used to train the models are not representative. On the other hand, even a simple algorithm can do a wonderful job if it is provided with the right features. So the question that pops up is, "How do we find the right set of features in massive data sets with millions of data points?" In the following discussion we will look at some unique challenges involved with data mining and big data, and see how deep learning can help overcome these challenges.
KEY ISSUES:
- Feature engineering
- Information retrieval and indexing
- Dealing with massive unlabeled/unsupervised data
ANALYSIS OF ISSUES:
Feature engineering:
The performance and effectiveness of data mining and machine learning techniques, i.e. supervised or unsupervised algorithms, largely depend on the underlying data. No technique can lead to fruitful results if the underlying data is incorrect, non-representative or not used properly. Even though the existing tools and technologies provide many options to discover relationships between the input variables and the target variables, these options and techniques are meant to deal with a small number of features. The most often used data analysis techniques are limited to univariate, bivariate and multivariate analysis, which are confined to a few variables and are focused on single-dimensional features. In the real world, however, features are not generally single dimensional; input features can be multidimensional, and identifying and deriving them is essential to getting satisfactory results from the underlying algorithms. Feature engineering is the most time-consuming but important part of any data mining project. Many useful multidimensional features can be discovered with the help of business domain experts. However, in massive data sets that are highly unstructured and unorganized, many such useful multidimensional features go undetected, and the resulting models are not optimal. Linear transformation techniques such as principal component analysis (PCA) are incapable of dealing with nonlinear features, as the sketch below illustrates.
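A minimal sketch of that limitation, assuming scikit-learn is installed; the synthetic two-circles data set is purely illustrative. The classes lie on nonlinear structure, so a single linear principal component cannot separate them:

```python
# PCA is a linear projection: on concentric circles it finds a direction of
# maximum variance, but the two classes overlap almost completely along it.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print("explained variance ratio:", pca.explained_variance_ratio_)
# The class means on the single component are nearly identical, i.e. the
# linear feature carries almost no class information.
print("class means on PC1:", X_1d[y == 0].mean(), X_1d[y == 1].mean())
```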
Deep learning helps in dealing with this situation and provides techniques for automatic extraction of the most representative features of a data set. Deep learning algorithms try to emulate the hierarchical learning of the human brain. They have the ability to generalize in non-local ways and to detect patterns beyond nearest neighbors. Deep learning algorithms provide richer generalization of input features through a multilayer abstraction approach: at each layer, the features are abstracted and generalized to capture their multidimensionality. For example, an image is composed of different sources of variability in terms of light, shapes of objects and materials; the multilayer abstraction provided by deep learning algorithms can help separate these different sources of variation in the data.
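As a concrete illustration, here is a minimal sketch of automatic feature extraction with a small autoencoder, assuming TensorFlow/Keras is available; the layer sizes and the random stand-in data are illustrative, not prescriptive:

```python
# A small fully connected autoencoder: each layer abstracts the previous one,
# and the bottleneck ("code") layer is the learned nonlinear feature vector.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim = 784                      # e.g. a flattened 28x28 image
inputs = tf.keras.Input(shape=(input_dim,))
h = layers.Dense(128, activation="relu")(inputs)
code = layers.Dense(32, activation="relu", name="code")(h)
h = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1024, input_dim).astype("float32")   # stand-in for real data
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# The encoder alone maps raw inputs to compact, automatically learned features.
encoder = Model(inputs, code)
features = encoder.predict(X, verbose=0)
print(features.shape)                # (1024, 32)
```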
Information retrieval and indexing:
Although not directly related to data mining, another major issue with the underlying data is the efficiency of information retrieval. Data in today's world has exceeded the typical storage, processing and computing capacity of traditional databases and data analysis tools. The rise of big data technologies has made it possible to store the massive data generated each hour. In addition to volume, big data is also associated with other complexities such as variety, velocity and veracity.
With the growing dependence on data, efficient storage and retrieval of information has become increasingly important. Traditional indexing solutions are no longer enough to improve the situation, because this data is huge and not organized as a relational model. Data collected from sources such as video streams, images and audio requires more than traditional indexing. This huge amount of data needs semantic indexing so that it can be presented in a more efficient manner and used as a source for knowledge discovery and comprehension. Deep learning provides a solution for implementing semantic indexing for efficient information retrieval: it generates high-level abstract data representations, which can be used for semantic indexing instead of indexing the raw data. Deep learning can not only provide semantic indexing but can also help uncover the complex relationships and factors leading to knowledge and understanding. Data abstraction and representation make it possible to store similar representations close to each other in memory for fast retrieval, as sketched below.
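A minimal sketch of this idea, with random vectors standing in for the abstract representations a trained deep network would produce: items are indexed by their representation vectors and retrieved by cosine similarity, so semantically similar items come back together:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the abstract representations a deep network would produce.
index_vectors = rng.normal(size=(10000, 32)).astype("float32")
index_vectors /= np.linalg.norm(index_vectors, axis=1, keepdims=True)

def semantic_search(query_vec, index, k=5):
    """Return indices of the k items whose representations are closest."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q               # cosine similarity on unit vectors
    return np.argsort(-scores)[:k]

query = rng.normal(size=32).astype("float32")
print(semantic_search(query, index_vectors))
```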
Dealing with massive unlabeled/unsupervised data:
Another challenge, apart from the volume and velocity of data, is the ability to deal with massive unlabeled, unsupervised data. This data contains complicated nonlinear features. As explained earlier, deep learning can help in feature engineering by providing multilayer abstraction, but the task becomes even more daunting when the underlying data is unsupervised. It therefore becomes essential to decode these complex nonlinear features and use their simpler forms in the algorithms. Discovering these complex nonlinear features and putting them to use for classification or tagging is called a discriminative task; a sketch of this setup follows.
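One way to set up a discriminative task on mostly unlabeled data, sketched here under the assumption that TensorFlow/Keras is available (the dimensions and random data are placeholders): an encoder is first pretrained on the unlabeled pool as an autoencoder, then its learned features are reused by a small classifier head trained on the few available labels:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

dim, n_unlabeled, n_labeled = 100, 5000, 200
X_u = np.random.rand(n_unlabeled, dim).astype("float32")   # unlabeled pool
X_l = np.random.rand(n_labeled, dim).astype("float32")     # tiny labeled set
y_l = np.random.randint(0, 2, size=n_labeled)

inputs = tf.keras.Input(shape=(dim,))
code = layers.Dense(16, activation="relu")(inputs)
decoded = layers.Dense(dim, activation="sigmoid")(code)

# 1) Unsupervised pretraining: learn nonlinear features from unlabeled data.
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_u, X_u, epochs=3, batch_size=64, verbose=0)

# 2) Discriminative fine-tuning: reuse the pretrained encoder weights and
#    train a small classification head on the scarce labels.
head = layers.Dense(2, activation="softmax")(code)
classifier = Model(inputs, head)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
classifier.fit(X_l, y_l, epochs=5, batch_size=32, verbose=0)
```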
Discriminative tasks not only help in discriminative analysis but can also be used for data tagging to improve search algorithms. For example, MAVIS, the Microsoft Research Audio Video Indexing System, uses deep learning to enable search with speech. Discriminative tasks have become increasingly important with the growth of digital media collections. Most of this digital media comes from social networks, GPS, medical imaging and image sharing systems. It is highly important to organize and store these images so that they can be browsed and retrieved more efficiently. This huge collection of images is an example of unlabeled data, because in technical terms a picture is only a collection of pixels. We need efficient methods to store and organize this unsupervised, unlabeled data. Text-based searches are no longer capable of providing the right solution for such a huge collection. One solution is to use automated tagging and to extract semantic information from these images. Deep learning provides useful techniques for constructing useful representations of image and video data in real time, which can be used for image indexing and retrieval.
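As a sketch of automated tagging, a CNN pretrained on ImageNet (here MobileNetV2, which ships with Keras) can assign semantic labels to a raw image, and those labels can then serve as tags for indexing; the file name example.jpg is a placeholder:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)

model = MobileNetV2(weights="imagenet")   # pretrained weights are downloaded

# Load a placeholder image and prepare it for the network.
img = tf.keras.utils.load_img("example.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(tf.keras.utils.img_to_array(img), 0))

# The top predicted labels can serve as semantic tags for indexing the image.
preds = model.predict(x, verbose=0)
for _, label, score in decode_predictions(preds, top=3)[0]:
    print(label, round(float(score), 3))
```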
CONCLUSION:
With growing business opportunities in the field of data, the dependence on data is only going to increase in the future. We have witnessed exponential growth in this vertical in recent years. Companies have been trying to leverage every possible opportunity provided by data, storing each bit and byte in its raw format for later use. Even though it is true that data is king in today's market, the industry also needs to discover efficient ways of storing only the relevant data. This runs contrary to a fundamental premise of big data, but dumping massive data into data lakes can lead to difficulties in retrieval, where the most important piece of data is buried deep underneath.
Some more challenges presented by big data are:
- Inefficiency in dealing with high-dimensional data
- Large-scale models
- Problems with incremental learning
Models based on these huge data sets are not only computationally very expensive but also create difficulties in terms of interpretation. Moreover, model creation and evaluation is an iterative process, and it takes many iterations to discover the right set of parameters and the right algorithm for a given business problem. Given these huge data sets, the process of model creation and evaluation becomes very time consuming. Even though techniques such as principal component analysis and deep learning can help discover the most relevant features of a given data set, the process is intensive and requires a deep understanding of the business domain and of statistical methods.
We have witnessed a huge number of business cases leveraging the power of data in recent years, but the industry is still trying to come up with optimal solutions for mining data in real time. Much of the useful information in these huge data sets is time sensitive; for example, if a company is trying to predict stock prices, then although it is important to observe the trend line based on historical events, it is even more important to be able to predict the future in real time.
With the increasing use of deep learning techniques in the data mining and artificial intelligence fields, we can expect solutions for incremental learning with real-time analytics. Deep learning can provide solutions to many of these problems, but a lot of research work is still required to make deep learning more useful. Open source tools, technologies and frameworks such as R, Python, F# and scikit-learn have played a significant role in bringing the data industry to its current level. We have seen major companies such as Google, Facebook and Yahoo share their proprietary frameworks to help revolutionize the industry. The release of the TensorFlow library by Google last year, to aid deep learning, should help improve the business use cases of deep learning in solving real-world problems such as image recognition and natural language processing with even higher accuracy. Nevertheless, it will be interesting to witness this journey, and even more interesting to be part of it.
References: http://journalofbigdata.springeropen.com/articles/10.1186/s40537-014-0007-7