shubham mishra

Posted on • Originally published at developerindian.com

Types of data in Hadoop

First developed by Doug Cutting and Mike Cafarella in 2005, Hadoop is licensed under the Apache License 2.0.
Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. The Hadoop Distributed File System (HDFS) is Hadoop's storage layer. Data is divided into blocks based on file size, and these blocks are then distributed and stored across the slave machines in the cluster.
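To make the block mechanics concrete, here is a minimal Java sketch using the Hadoop FileSystem API. It assumes a running HDFS cluster and a Hadoop client on the classpath; the `hdfs://namenode:9000` address and the `/user/demo` path are just placeholders. It writes a small file and then asks the NameNode which hosts store each block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsBlocksExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write a file; HDFS splits it into blocks (dfs.blocksize, 128 MB by default)
        // and replicates each block across the slave machines (DataNodes).
        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Ask the NameNode where each block of the file is stored.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```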
Cloudera Hadoop, Cloudera's open-source platform, is the most popular distribution of Hadoop and related projects in the world.

**What is big data?**

Big Data is a collection of data that is huge in volume and keeps growing exponentially over time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently. In short, big data is still data, just at an enormous scale. To handle it we can use Apache Hadoop and Cloudera Hadoop.

Types of data used in big data (Apache Hadoop / Spark)

**Structured** – data that can be stored in rows and columns, like relational data sets.

**Semi-structured** – data such as XML that can be read by both machines and humans.

**Unstructured** – data that cannot be stored in rows and columns, like video, images, etc.
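As a rough illustration of how each category is handled in practice, here is a small Spark (Java API) sketch; the HDFS paths are hypothetical and only stand in for a CSV export, a JSON event feed, and a plain-text log.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataCategoriesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("data-categories")
                .getOrCreate();

        // Structured: fixed rows and columns, like a relational table exported to CSV.
        Dataset<Row> structured = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/customers.csv");

        // Semi-structured: self-describing records such as JSON; the schema is inferred per field.
        Dataset<Row> semiStructured = spark.read()
                .json("hdfs:///data/events.json");

        // Unstructured: free text (or images/video as binary); there are no columns to infer,
        // so it is read line by line and any structure must be extracted later.
        Dataset<String> unstructured = spark.read()
                .textFile("hdfs:///data/server.log");

        structured.printSchema();
        semiStructured.printSchema();
        System.out.println("log lines: " + unstructured.count());

        spark.stop();
    }
}
```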

About Apache Hadoop

Hadoop is the most important framework for working with Big Data. Its biggest strength is scalability: it can scale seamlessly from a single node to thousands of nodes.
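A condensed sketch of the classic MapReduce word-count job illustrates this: the same code runs unchanged whether the cluster has one node or thousands, because each mapper simply processes one input split (typically one HDFS block). The input and output paths below are placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Each mapper works on one input split (one HDFS block) in parallel.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducers aggregate the counts for each word shuffled from all mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```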
Advantages of Hadoop

Apache Hadoop stores data in a distributed fashion, which allows the data to be processed in parallel on a cluster of nodes.

In short, Apache Hadoop is an open-source framework best known for its fault tolerance and high-availability features.
Apache Hadoop clusters are scalable.
The Apache Hadoop framework is easy to use.
In HDFS, fault tolerance refers to the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically takes over and serves it.
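As a small illustration of the replication that underlies this fault tolerance, the following sketch (with a placeholder NameNode address and file path) reads a file's current replication factor and raises it; the exact values that make sense depend on your cluster's configuration.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        Path file = new Path("/user/demo/sample.txt");

        // Each block of the file is stored on this many DataNodes (dfs.replication is 3 by default);
        // if one node fails, a surviving replica serves the data.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());

        // Raise the replication factor; the NameNode schedules the extra copies in the background.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```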
