Big Data Management
Managing and analyzing customer data has been a great benefit for industry, but also a great challenge. As long as companies had only a handful of customers, things were quite simple; as soon as the numbers grew, the complications of managing information grew along with them. To survive or gain a competitive advantage, companies added more product varieties on discovering the vast range of customers. The problems were not confined to industry either: they kept arising in medical science, research and development, 3D simulation and design, and so on. Large volumes of medical data are deleted every week simply because of insufficient storage management. To face these challenges, the term Big Data came onto the scene.
“Big Data is the next big thing in computing and generates value from very large datasets, but cannot be analyzed with traditional computing techniques”.
The Big Data Explosion:
The quantity of computer data generated on planet Earth is growing for several reasons. Retailers are building vast databases of recorded customer activity; organizations in financial services and health care are also capturing more data; and public social media are creating vast quantities of digital material.
The cycle of big data management:
Characteristics of Big Data
Big Data is often characterized using the 3 V’s (Volume, Velocity and Variety), with Veracity sometimes added as a fourth:
- Volume: it poses the greatest challenge and the greatest opportunity, as Big Data could help many organizations understand people better and allocate resources more effectively. However, traditional computing solutions such as relational databases are incapable of handling this huge magnitude of data.
- Velocity: Big Data velocity also raises a number of issues, with the rate at which data flows into many organizations now exceeding the capacity of their IT systems. In addition, users increasingly demand data streamed to them in real time, and delivering this can prove quite a challenge.
- Variety: finally, the variety of data types being produced is becoming increasingly diverse. Gone are the days when data centers only had to deal with documents, financial transactions, stock records and personal files. Today photographs, audio, video, 3D models, complex simulations and location data are being pulled into many corporate systems. Many such Big Data sources are also unstructured and not easy to categorize, let alone process, with traditional computing techniques.
- Veracity: refers to the degree to which a decision-maker trusts the information used to take a decision. Getting the right correlations in Big Data is therefore very important for the future of the business.
Some Real Facts:
Ø The New York Stock Exchange generates about 1 TB of data per day.
Ø Google processes 700 PB of data per month.
Ø Facebook hosts 10 billion photos, taking up 1 PB of storage.
‘One clear example of Big Data is the Square Kilometre Array (SKA) (www.skatelescope.org), planned to be constructed in South Africa and Australia. When the SKA is completed in 2024 it will produce in excess of one exabyte of raw data per day (1 exabyte = 10^18 bytes), which is more than the entire daily internet traffic at present.’
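To put these volumes in perspective, the decimal storage units each differ by a factor of 1,000. A quick sketch in plain Python arithmetic (using only the figures quoted above; the per-day averaging of Google's monthly figure is our own) relates the quantities to one another:

```python
# Decimal (SI) storage units, as used in the figures above.
TB = 10**12
PB = 10**15
EB = 10**18

ska_daily = 1 * EB              # SKA: ~1 exabyte of raw data per day
nyse_daily = 1 * TB             # NYSE: ~1 TB of data per day
google_daily = 700 * PB // 30   # Google: 700 PB/month, averaged per day

# One SKA day equals a million NYSE days of data.
print(ska_daily // nyse_daily)  # 1000000
```

In other words, a single day of SKA output would dwarf the combined daily output of the other examples in the list.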
Managing data on this scale definitely requires very large data centers for storage. The idea that cloud computing means data isn’t stored on computer hardware is inaccurate: your data may not be on your local machine, but it has to be housed on physical drives somewhere, in a data center.
Some words about data centers from the experts are given below:
- Data centers are the brains of the Internet.
- The engine of the internet, it is a giant building with a lot of power, a lot of cooling and a lot of computers.
- With row upon row of machines, all working together to provide the services that make everything function.
But a question should arise in our minds: how is Big Data stored in a data center, and why should we trust that the data is secure there? For example, Google and Facebook manage mammoth amounts of data in their data centers. Doesn’t it make us wonder how they manage that quantity of data? Not only are those organizations’ own datasets stored there; they also lend their resources to customers, which means customer data has to be managed as well. Below is a brief description of how data centers manage this:
Today the leading Big Data technology is Hadoop. It is open-source software for reliable, scalable, distributed computing, and it provides the first viable platform for Big Data analytics. Hadoop is already used by most Big Data pioneers: LinkedIn, for example, uses it to generate over 100 billion personalized recommendations every week.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures (Source: http://hadoop.apache.org/). Hadoop is designed to process large amounts of structured and unstructured data (terabytes to petabytes) and is implemented on racks of commodity servers as a Hadoop cluster. It is designed to parallelize data processing across computing nodes to speed computation and hide latency.
Hadoop distributes the storage and processing of large data sets across groups, or clusters, of server computers, whereas traditional large-scale computing solutions rely on expensive server hardware with high fault tolerance. Hadoop instead detects and compensates for hardware failures at the application level. This allows a high level of service continuity to be delivered by clusters of individual computers, each of which may be prone to failure. Technically, Hadoop consists of two key components:
- The first is the Hadoop Distributed File System (HDFS), which provides high-bandwidth, cluster-based storage.
- The second is a data processing framework called MapReduce, based on Google’s search technology.
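The application-level fault tolerance mentioned above rests on replication: each block of a file is stored on several machines, so losing one machine loses no data. The toy sketch below uses a replication factor of three, which is HDFS's default (the function names and node labels are invented for illustration; this is not the real HDFS API):

```python
import random

REPLICATION = 3  # HDFS's default replication factor

def replicate(blocks, nodes, factor=REPLICATION):
    """Place each block on `factor` distinct nodes, as an HDFS NameNode would."""
    return {block: random.sample(nodes, factor) for block in blocks}

def recoverable(placement, failed_node):
    """A block survives a node failure if at least one replica is on a live node."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())

nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["blk_001", "blk_002", "blk_003"]
placement = replicate(blocks, nodes)

# With 3 distinct replicas per block, no single node failure can lose data.
print(recoverable(placement, "node3"))  # True
```

This is why Hadoop can run on cheap commodity servers: reliability is engineered in software, not bought as expensive fault-tolerant hardware.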
MapReduce distributes, or ‘maps’, large data sets across multiple servers. Each server then produces a summary of the data it has been allocated. These summaries are then aggregated in the so-called ‘reduce’ stage. MapReduce therefore allows extremely large raw datasets to be rapidly distilled before more traditional data analysis tools are applied.
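The classic illustration of these map and reduce stages is word counting. Here is a minimal single-machine sketch in plain Python (it mimics the MapReduce idea only and does not use the real Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: group the pairs by key and sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "server" maps its own split; the mapped pairs are then reduced together.
splits = ["big data is big", "data is everywhere"]
mapped = [pair for doc in splits for pair in map_phase(doc)]
print(reduce_phase(mapped))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, each split would live on a different machine, the map phase would run on all of them in parallel, and a shuffle step would route each word's pairs to the reducer responsible for it.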
Differences between Hadoop MapReduce and a traditional RDBMS
|S.no|Feature|Traditional RDBMS|Hadoop MapReduce|
|---|---|---|---|
|3.|Updates|Read and write many times|Write once, read many times|
|4.|Structure|Static schema|Dynamic schema|
|5.|Integrity|High|Low in a simple setup; can be improved with additional servers|
|6.|Scaling|Non-linear|Linear (up to 10,000 machines as of Dec 2012)|
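The “static schema vs dynamic schema” row is the key practical difference: an RDBMS demands a schema before data is written, while Hadoop imposes a schema only when data is read. A hypothetical schema-on-read sketch in plain Python (the log format and field names are invented for illustration):

```python
# Raw, unstructured lines are stored exactly as they arrive ("write once")...
raw_log = [
    "2012-12-01 login alice",
    "2012-12-01 purchase bob",
    "2012-12-02 login bob",
]

# ...and a schema is imposed only at read time, per query ("read many times").
def read_as_events(lines):
    for line in lines:
        fields = line.split()
        yield {"date": fields[0], "action": fields[1], "user": fields[2]}

logins = [e["user"] for e in read_as_events(raw_log) if e["action"] == "login"]
print(logins)  # ['alice', 'bob']
```

A different query could parse the same raw lines with a completely different schema, with no upfront table design and no migration, which is exactly what makes Hadoop attractive for varied, unstructured Big Data.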
For organizations that cannot afford an internal Big Data infrastructure, cloud-based Big Data solutions are also available, as are public Big Data sets that can be downloaded. For example, Amazon Web Services already hosts many public datasets containing government and medical information.
Book: BIG DATA FOR DUMMIES by Judith Hurwitz, Alan Nugent, Dr. Fern Halper, and Marcia Kaufman
All rights reserved. No part of this Post may be copied, distributed, or transmitted in any form or by any means, without the prior written permission of the website admin, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the owner, addressed “Attention: Permissions Coordinator,” to the admin @coderinme