Wednesday, April 23, 2014

Challenges of Big Data: Is Hadoop Meeting the Big Data Challenge?

Are we living in the era of "Big Data"? Yes, of course. In today's technology-fuelled world, computing power has increased significantly, electronic devices are more commonplace, Internet access has improved, and users can transmit and collect more data than ever before. Organizations are producing data at an astounding rate: Facebook alone is reported to collect about 250 terabytes a day.

According to Thomson Reuters News Analytics, digital data production has grown roughly eightfold, from almost 1 million petabytes (equal to about 1 billion terabytes) in 2009 to a projected 7.9 zettabytes (a zettabyte is equal to 1 million petabytes) in 2015, with an estimated 35-40 zettabytes by 2020. Other research organizations offer even higher estimates!

As organizations have begun to collect and produce massive amounts of data, they have recognized the advantages of data analysis, but they have also struggled to manage the sheer volume of information they hold. This has led to new challenges.


Businesses realize that tremendous benefits can be gained from analyzing Big Data related to business competition, situational awareness, productivity, science, and innovation. 


Apache Hadoop meets the challenges of Big Data by simplifying the implementation of data-intensive, highly parallel distributed applications. It allows analytical tasks to be divided into fragments of work and distributed over thousands of computers, providing fast turnaround for analytics and distributed storage for massive amounts of data. 

Hadoop provides a cost-effective way to store huge quantities of data. It provides a scalable and reliable mechanism for processing large amounts of data over a cluster of commodity hardware. And it provides new and improved analysis techniques that enable sophisticated analytical processing of multi-structured data.
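
To make the "fragments of work" idea concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. It is illustrative only: the class names and the input/output paths (taken from command-line arguments) are assumptions for the example, not part of any particular production setup.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count job: the framework splits the input across the cluster,
// runs the map function close to the data, and aggregates results in the reducers.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);        // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();                // sum the counts for this word
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same pattern scales unchanged from a single test machine to a cluster of thousands of nodes; only the input size and the cluster configuration differ.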

Hadoop differs from previous distributed approaches in that it provides a simple programming model that abstracts away the complexity evident in earlier distributed implementations. As a result, Hadoop offers a powerful mechanism for data analytics, which consists of the following:
  • Vast amount of storage — Hadoop enables applications to work with thousands of computers and petabytes of data. Over the past decade, computer professionals have realized that low-cost "commodity" systems can be used together for high-performance computing applications that once could be handled only by supercomputers. Hundreds of "small" computers may be configured in a cluster to obtain aggregate computing power that far exceeds that of a single supercomputer, at a much lower price. Hadoop can leverage clusters of thousands of machines, providing huge storage and processing power at a price that an enterprise can afford.
  • Distributed processing with fast data access — Hadoop clusters provide the capability to efficiently store vast amounts of data while providing fast data access. Prior to Hadoop, parallel computation applications had difficulty distributing execution among the machines available on a cluster, because the cluster execution model demands shared data storage with very high I/O performance. Hadoop instead moves execution toward the data; moving the applications to the data alleviates many of these performance challenges. In addition, Hadoop applications are typically organized so that they process data sequentially, which avoids random data access (disk seek operations) and further decreases I/O load. (A small sketch of working with HDFS directly follows this list.)
  • Reliability, failover, and scalability — In the past, implementers of parallel applications struggled with reliability when moving to a cluster of machines. Although the reliability of an individual machine is fairly high, the probability of failure grows as the size of the cluster grows; it is not uncommon to see daily failures in a large (thousands of machines) cluster. Because of the way Hadoop was designed and implemented, a failure (or set of failures) will not produce inconsistent results: Hadoop detects failures and retries execution (by utilizing different nodes). Moreover, the scalability support built into Hadoop's implementation allows additional (or repaired) servers to be brought into a cluster seamlessly and leveraged for both data storage and execution. For most users, the most important feature of Hadoop is the clean separation between business programming and infrastructure support. For users who want to concentrate on business logic, Hadoop hides infrastructure complexity and provides an easy-to-use platform for running complex, distributed computations on difficult problems.
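
As a small illustration of the storage side of this picture, the following sketch uses Hadoop's FileSystem API to write a file into HDFS and read it back; replication and block placement happen behind the scenes. The file path and the message are made up for the example, and the cluster configuration is assumed to be supplied by the usual core-site.xml/hdfs-site.xml files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes a small file into HDFS and reads it back. Replication and block
// placement are handled by the framework; the application only sees a path.
public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hdfs-demo.txt");    // illustrative path

    // Write: HDFS replicates the blocks across the cluster behind this call.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("stored once, replicated across the cluster");
    }

    // Read: the client is directed to an available replica of each block.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.delete(file, false);                        // clean up the demo file
  }
}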

Monday, April 21, 2014

Big Data Landscape


To understand Big Data's present landscape, the industries/companies involved can be categorized into six major segments, as below -
  1. Infrastructure
    • NoSQL Databases
    • Hadoop Related
    • NewSQL Databases
    • MPP Databases
    • Management Monitoring
    • Cluster Services
    • Storage
    • Crowd-sourcing
    • Security
    • Collection/ Transport
  2. Analytics
    • Analytics Solutions
    • Data Visualization
    • Statistical Computing
    • Social Media
    • Sentiment Analysis
    • Analytics Services
    • Location/People/Events
    • Big Data Search
    • IT Analytics
    • Real Time
    • Crowdsourced Analytics
    • SMB Analytics
  3. Applications
    • Ad Optimization
    • Publisher Tools
    • Marketing
    • Industry Applications
    • Application Service Providers
  4.  Data Sources
    • Data Market Places
    • Data Sources
    • Personal Data
  5. Cross Infrastructure / Analytics
  6. Open Source Projects
    • Framework
    • Query/ Data Flow
    • Data Access
    • Coordination / Workflow
    • Real Time
    • Statistical Tools
    • Machine Learning
    • Cloud Deployment
Courtesy: Matt Turck & Shivon Zilis

Monday, April 7, 2014

What is Big Data?


As per Wikipedia's definition, Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

There’s no clear-cut definition of ‘big data’ - it is a very subjective term. Most people would consider a data set of terabytes or more to be ‘big data’. One reasonable definition is data that cannot easily be processed on a single machine.

The big data challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. 

To understand big data more precisely, we should look at its defining characteristics, the 3 V's - 


[Figure: Big Data's 3 V's - Volume, Velocity, Variety]

1.  Volume:

As the name suggests, big data technology depends on massive amounts of data in order to extract better intelligence and more valuable information. The technology is only a few years old, and according to IBM, in 2012 the data gathered every day already amounted to 2.5 exabytes (2.5 quintillion bytes). Such an enormous amount of data requires very advanced computational power as well as storage resources to be handled, stored, and analyzed in a reasonable amount of time. Moreover, the gathered information is rapidly increasing in detail, and thus in size. 

According to the Harvard Business Review article "Big Data: The Management Revolution" by Andrew McAfee and Erik Brynjolfsson, the volume of data is expected to double roughly every 40 months, driven largely by the high penetration of wireless technology. 

2.  Velocity:

Big data technology requires very high computational resources as well as storage in order to handle large and complex sets of unstructured data. The data can be generated and stored in many ways, yet a company's ability to store, retrieve, and process these data sets is what determines its agility. 

A famous example was demonstrated by a group of researchers from the MIT Media Lab on Black Friday (the start of the Christmas shopping season in the United States). The group collected information from location-based services on smartphones to detect how many cars entered Macy's parking lots. Using this information, they were able to estimate the size of Macy's sales before Macy's itself could. 

3.  Variety:

Unlike the inputs to traditional analytics, big data can theoretically take an infinite number of forms. The data are collected in a tremendous number of ways, and every single operation or action represents value to the business. No one can count the number of operations carried out over the web and on electronic devices every single moment all over the globe. For instance, every Facebook post and interaction, tweet, shared image, text message, GPS signal, and many other forms of electronic interaction count and add valuable information. 

This variety of data in most cases produces large amounts of unstructured data sets. The biggest issue with such an enormous, unstructured collection is noise: to extract the proper information and superior value, big data requires much more mining.

Digital data is growing like a tsunami. According to an IDC forecast, it is projected to reach about 40 zettabytes by 2020.

Courtesy: Hadoop Summit April 2-3, 2014, Amsterdam, Netherlands
              

The key Big Data technologies are as follows -
  • Hadoop - the MapReduce processing framework and the Hadoop Distributed File System (HDFS)
  • NoSQL (Not Only SQL) data stores
  • MPP (Massively Parallel Processing) databases
  • In-memory database processing
Typical Big Data Problems -
  • Perform sentiment analysis on 12 terabytes of daily Tweets (a simple mapper sketch follows this list)
  • Predict power consumption from 350 billion annual meter readings
  • Identify potential fraud in a business's 5 million daily transactions
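
As a rough illustration of how the first of these problems maps onto Hadoop, here is a toy mapper that tags each tweet (one per input line) as positive, negative, or neutral using a hard-coded keyword list; a summing reducer like the word-count reducer shown earlier would then produce the daily totals. The class name and the keyword lists are invented for the example; a real job would use a proper sentiment lexicon or model.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Naive keyword-based sentiment tagging of tweets, one tweet per input line.
// Emits ("positive", 1), ("negative", 1), or ("neutral", 1) per tweet.
public class TweetSentimentMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  // Toy lexicons: placeholders for a real sentiment dictionary.
  private static final Set<String> POSITIVE =
      new HashSet<>(Arrays.asList("love", "great", "awesome", "happy"));
  private static final Set<String> NEGATIVE =
      new HashSet<>(Arrays.asList("hate", "awful", "terrible", "sad"));

  private final Text label = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    int score = 0;
    for (String token : value.toString().toLowerCase().split("\\s+")) {
      if (POSITIVE.contains(token)) score++;   // each positive word adds one
      if (NEGATIVE.contains(token)) score--;   // each negative word subtracts one
    }
    label.set(score > 0 ? "positive" : score < 0 ? "negative" : "neutral");
    context.write(label, ONE);
  }
}

Because the input is split across the cluster and each mapper works on its own slice, the same code handles a sample file on a laptop or 12 terabytes of tweets on a large cluster.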