HDFS - DistCp (distributed inter/intra-cluster copy)


1 - About

DistCp (distributed copy) is a tool for copying large amounts of data between clusters (inter-cluster) or within a cluster (intra-cluster).


3 - Concept

DistCp is a MapReduce application and therefore runs in parallel. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
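Because each map task copies a share of the file list, the degree of parallelism can be capped with the `-m` option of DistCp (maximum number of simultaneous copies). The following is a minimal dry-run sketch, not a definitive recipe: `build_distcp_cmd` is a hypothetical helper, and the value 20 and the paths are placeholders. It only prints the command so that it can be reviewed before being run on a machine with a Hadoop client.

```shell
#!/bin/sh
# Hypothetical helper: assemble a distcp command line with a cap on map tasks.
# $1 = maximum number of map tasks, $2 = source URI, $3 = destination URI
build_distcp_cmd() {
  printf 'hadoop distcp -m %s %s %s' "$1" "$2" "$3"
}

# Placeholder values (assumptions, adapt them to your clusters)
cmd=$(build_distcp_cmd 20 hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo)

# Dry run: print the command instead of launching the MapReduce job
echo "$cmd"
```

A lower value reduces the load on both clusters. A higher value only helps while there are enough files to share among the map tasks.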

4 - Management

4.1 - Inter-cluster copy

DistCp is launched with the hadoop client utility:

hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

where nn1 and nn2 are the HDFS NameNodes of the source and destination clusters.
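When the same copy is repeated, the `-update` option of DistCp can be used so that only files that are missing from the destination, or that differ from it, are copied. The sketch below reuses the NameNode addresses of the example above as placeholders and only prints the command (dry run). Note that `-update` also changes how source paths are mapped onto the destination, so the result should be checked on a small directory first.

```shell
#!/bin/sh
# Placeholder URIs taken from the example above (assumptions)
SRC="hdfs://nn1:8020/foo/bar"
DST="hdfs://nn2:8020/bar/foo"

# -update: copy only what is missing or different at the destination
cmd="hadoop distcp -update $SRC $DST"

# Dry run: print the command instead of launching the copy
echo "$cmd"
```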

5 - Example

5.1 - Between S3 and HDFS

hadoop distcp s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@blaze-data/enron-email hdfs:///tmp/enron
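As an alternative to embedding the keys in the URI, the credentials can be passed as Hadoop configuration properties with the generic `-D` option. The sketch below assumes the `s3n` connector and its `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` properties. The key values are placeholders, and the command is only printed (dry run). On recent Hadoop versions, where `s3n` has been replaced by `s3a`, the scheme and property names differ.

```shell
#!/bin/sh
# Placeholder credentials (assumptions, never commit real keys)
ACCESS_KEY_ID="MY_ACCESS_KEY_ID"
SECRET_ACCESS_KEY="MY_SECRET_ACCESS_KEY"

# Bucket and paths from the example above
SRC="s3n://blaze-data/enron-email"
DST="hdfs:///tmp/enron"

# Credentials are passed as configuration properties, not in the URI
cmd="hadoop distcp -Dfs.s3n.awsAccessKeyId=$ACCESS_KEY_ID -Dfs.s3n.awsSecretAccessKey=$SECRET_ACCESS_KEY $SRC $DST"

# Dry run: print the command instead of launching the copy
echo "$cmd"
```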
