What is Big Data & Hadoop?
Big Data
Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time.
In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
The concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:
Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media, and more. In the past, storing it would have been a problem, but cheaper storage on platforms like data lakes and Hadoop has eased the burden.
Velocity: With the growth in the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors, and smart meters are driving the need to deal with these torrents of data in near-real-time.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data, and financial transactions.
Big Data Case Studies:
1. Facebook:
Facebook generates about 4 petabytes of data per day, that is, roughly 4 million gigabytes. This enormous amount of content generation is closely tied to the fact that Facebook users spend more time on the site than on any other social network, about an hour a day.
2. Airlines:
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Hadoop
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results were returned by humans.
But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
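As a concrete illustration of the MapReduce programming model mentioned above, here is a minimal sketch of the classic word-count job written against the standard org.apache.hadoop.mapreduce API: the map phase emits a (word, 1) pair for every word it sees, and the reduce phase sums those counts per word. The class names follow the conventional tutorial example rather than any particular production setup, and the input and output paths are supplied as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine map output locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is typically submitted with the `hadoop jar` command against input and output directories in HDFS; the framework splits the input, schedules map and reduce tasks across the cluster, and transparently re-runs tasks on nodes that fail, which is exactly the hardware-failure assumption described above.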
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common — contains libraries and utilities needed by other Hadoop modules;
- Hadoop Distributed File System (HDFS) — a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster (a brief usage sketch follows this list);
- Hadoop YARN — (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
- Hadoop MapReduce — an implementation of the MapReduce programming model for large-scale data processing.
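To make the HDFS module a bit more tangible, the sketch below writes and reads a small file through the org.apache.hadoop.fs.FileSystem API. It assumes a reachable cluster whose address (fs.defaultFS) is provided by a core-site.xml on the classpath; the path /tmp/hello.txt is purely illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings (e.g. fs.defaultFS) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path, used here for illustration only.
    Path path = new Path("/tmp/hello.txt");

    // Write a small file; HDFS replicates its blocks across DataNodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}
```

From the command line, the same operations correspond to HDFS shell commands such as `hdfs dfs -put` and `hdfs dfs -cat`.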