A cluster is a group of processes (generally one machine per process) called nodes, where you will find two kinds of nodes:

Additionally, you will almost certainly have edge nodes that host the client applications (no services).

A minimal Hadoop cluster needs HDFS and YARN.

## 3 - Management

### 3.2 - Capacity

See: Computer - Capacity Planning (Sizing) for a cluster

The required capacity depends mostly on how Hadoop is used.

• Storage (disk): with the default file system and the default block size (128 MB). Example for 10 TB of data:

$$\begin{aligned} \text{Permanent Storage} &= 2 \times \text{Data Size} = 2 \times 10\,\text{TB} = 20\,\text{TB} \\ \text{Blocks} &= 2 \times \frac{\text{Data Size}}{\text{Block Size}} = 2 \times \frac{10{,}000{,}000\,\text{MB}}{128\,\text{MB}} = 156{,}250 \end{aligned}$$

• Temporary storage: the output size of the Map function (deleted at the end of the shuffle) plus the output size of the Reduce function. If the Map output is not much larger than the input …

$$\text{Temporary Storage} = 2 \times \text{Data Size} = 2 \times 10\,\text{TB} = 20\,\text{TB}$$
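
The storage formulas in this section can be sketched as a small calculator. This is only an illustration under stated assumptions: the names `size_cluster`, `ClusterSizing`, `storage_factor` and `temp_factor` are hypothetical, the factor of 2 is taken as-is from the formulas, and sizes use decimal units (1 TB = 1,000,000 MB) as in the worked example. If the factor of 2 stands for the HDFS replication factor, note that the HDFS default (`dfs.replication`) is 3, so you may want to pass `storage_factor=3`.

```python
import math
from dataclasses import dataclass

# Decimal units, as in the worked example (10 TB = 10,000,000 MB).
MB_PER_TB = 1_000_000


@dataclass
class ClusterSizing:
    """Result of the sizing estimate (hypothetical helper type)."""
    permanent_storage_tb: float
    temporary_storage_tb: float
    blocks: int


def size_cluster(data_size_tb, storage_factor=2, temp_factor=2, block_size_mb=128):
    """Estimate storage needs from the raw data size.

    storage_factor and temp_factor default to the factor 2 used in the
    formulas of this section; they are assumptions, not Hadoop settings.
    """
    data_size_mb = data_size_tb * MB_PER_TB
    # Permanent Storage = factor x Data Size
    permanent_tb = storage_factor * data_size_tb
    # Blocks = factor x Data Size / Block Size (rounded up to whole blocks)
    blocks = math.ceil(storage_factor * data_size_mb / block_size_mb)
    # Temporary Storage = factor x Data Size (Map output + Reduce output)
    temporary_tb = temp_factor * data_size_tb
    return ClusterSizing(permanent_tb, temporary_tb, blocks)


sizing = size_cluster(10)
print(sizing)
```

For 10 TB of data, this reproduces the figures of the example: 20 TB of permanent storage, 20 TB of temporary storage and 156,250 blocks.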