Introduction to Big Data, Hadoop, the Hadoop Distributed File System (HDFS), and MapReduce
What is Big Data?
Big Data refers to data sets so large that they are difficult to store, process, and analyze with traditional tools. It is commonly characterized by four Vs: Volume, Velocity, Variety, and Veracity. Volume refers to the amount of data, Variety to the number of types of data, Velocity to the speed at which data is generated and processed, and Veracity to the noise and abnormality in the data (i.e., its accuracy).
What is Hadoop?
Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of computers. It is designed to solve problems that involve analyzing very large amounts of data (e.g., petabytes). Hadoop has become the low-cost, industry-standard ecosystem for securely analyzing high-volume data from a variety of enterprise sources.
Hadoop is not a database. The core components of the Hadoop framework are HDFS and MapReduce: HDFS stores large data sets, and MapReduce processes them.
What is HDFS?
HDFS is a distributed, scalable file system used for storing very large data sets, running on clusters of commodity hardware. Files in HDFS are broken down into block-sized chunks, which are stored as independent units; a ‘block’ is the minimum amount of data that can be read or written (128 MB by default in recent Hadoop versions). HDFS follows a Master/Slave architecture, where a cluster comprises a single NameNode (the master node) and a number of DataNodes (slave nodes). The NameNode holds the file system metadata, while the DataNodes store the actual blocks.
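To make this concrete, here is a minimal sketch of a Java client inspecting a file through Hadoop's FileSystem API. It assumes a reachable cluster whose NameNode address is configured via fs.defaultFS, and the path /user/demo/input.txt is purely hypothetical; the point is that the client asks the NameNode about the file's blocks, while the block data itself lives on the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS must point at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used only for illustration.
        Path file = new Path("/user/demo/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode tracks how the file is split into blocks
        // and how many replicas of each block the DataNodes hold.
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
        fs.close();
    }
}
```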
What is MapReduce?
MapReduce is the ‘heart’ of Hadoop and consists of two phases, ‘map’ and ‘reduce’. The map phase processes the input data first and produces intermediate key/value output, which is then processed by the reduce phase to generate the final output.
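The classic illustration of these two phases is word counting, shown in the sketch below (adapted from the standard Apache Hadoop tutorial example): the map phase emits a (word, 1) pair for every word it sees, and the reduce phase sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the intermediate counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```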
What are the four modules that make up the Apache Hadoop framework?
- Hadoop Common, which contains the common utilities and libraries necessary for Hadoop’s other modules.
- Hadoop YARN, the framework’s platform for resource management.
- Hadoop Distributed File System (HDFS), which stores data on commodity machines.
- Hadoop MapReduce, a programming model used to process large sets of data.
As a next step, see how to Setup Multi Node Hadoop Cluster.
Wish you the best and Happy Reading!