July 8, 2019

We are living in the information age, thus with the advent of the Digital revolution, we have a large number corporations to serve our needs, to make our lives easier and to take us to the next level of humanity. To achieve this task these corporations need a base on which they could achieve this task. This “base” is the data which they have generated over the course of time.

Data is ever generating. Each second piles of data are being generated. Let’s have a look at an example : Facebook has 1.79 billion active monthly users. These users are uploading images, videos, posting texts, comments, likes, etc. which is being done every second which accounts for all types of data – structured, semi-structured and unstructured. Now let’s say, if each user hits at least one like a day, it would account to 1.79 billion likes to handle in a day. But we all know that is surely not the case, the real figures are much more humongous than that. Likewise, did you ever pondered that the thousands of Apps on the Google play store having having millions of download with each user generating data, What happens to that data ?

This data is processed and analyzed by respective companies to add competitive advantage to their Corporation and come up with future solution for them to serve the customers better, learning from the scenarios of the present. For example, people comment their reviews about specific products on Amazon, these reviews are then processed by Amazon to understand the plight of the customer and provide better service in the future.

According to the prevailing Database management system, handling this amount of data generating at this pace is a challenge as it required high memory(RAM), scaling beyond a capacity often involved downtime and came with an upper limit. Also, it would not have been cost friendly. Further, RDBMS was unable to categorize unstructured data.

So the question arises, how to process these data sets ? The answer to this is HADOOP.

Hadoop is an open source framework that allows distributed processing of large data sets across clusters of commodity hardware.

Open Source: In contrast, to the traditional RDBMS systems which required purchasing a license, Hadoop is readily available without any cost, maintained by Apache foundation.

Distributed processing: Hadoop framework splits the data into chunks and processes them in parallel. This makes it time efficient in handling big data sets. e.g. If you have to write 10,000 pages. What would you prefer, hiring the world’s fastest writer or hiring 100 writers a day ? Definitely, the latter is much faster and cheaper.

Large Data Sets: Used to process big data sets with ease. All the processing happens on the data present in HDFS (Hadoop File System). Since the data is divided into different machines and processed in parallel, it allows to process massive data.

Commodity Hardware: Hadoop uses cheap and simple hardware rather than high enterprise computers that cost too much. “China mobile”, a Telecom Company based in China, which was generating 5-8 TB data of records daily. Hadoop enabled them to use 10 times data than their older system at ⅕ cost.

Related Blog

Hadoop Ecosystem

Hadoop is a framework which comprised of set of tools and technologies. They combine together to make a Eco System. Different tools can be used at different parts of projects based on its implementati

Advantages of Hadoop

Hadoop being open-sourced and using commodity hardware provides a cost effective model, unlike traditional RDBMS, which requires expensive hardware and high-end processors.

What are Dimension Tables

Data Warehouse Tutorial