About

Windows Azure Storage Blob (WASB) is a file system implemented as an extension on top of the HDFS APIs, and it behaves in many ways like HDFS.

The WASB variation uses:

  • SSL certificates for improved security (via the wasbs scheme)
  • Azure storage accounts to store data, instead of the local disks used by HDFS.

WASB is built into HDInsight (Microsoft's Hadoop-on-Azure service), where it is the default file system.

Azure storage stores files as a flat key/value store without formal support for folders. The hadoop-azure file system layer simulates folders on top of Azure storage. By default, folder rename in the hadoop-azure file system layer is not atomic. That means that a failure during a folder rename could, for example, leave some folders in the original directory and some in the new one. See the parameter fs.azure.atomic.rename.dir if you want to make the operations atomic.
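A minimal sketch of declaring such a directory, assuming /data is a folder you want atomic renames for (the property normally lives in core-site.xml; hadoop fs also accepts it as a generic -D option):

# declare /data (in addition to the default /hbase) as an atomic-rename directory
hadoop fs -D fs.azure.atomic.rename.dir=/hbase,/data \
  -mv wasb://[email protected]/data/in wasb://[email protected]/data/archive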

In Azure, you store blobs in containers within Azure storage accounts.

  • You grant access at the storage account level,
  • you create collections at the container level,
  • and you place blobs (files of any format) inside the containers.

(Figure: Azure Storage Structure)

Limitations

Only the commands that are specific to the native HDFS implementation (referred to as DFS), such as fsck and dfsadmin, behave differently against Azure storage.

Structure

  • Multiple Hadoop clusters can point to one storage account.
  • One Hadoop cluster can point to multiple storage accounts (see the sketch after this list).
  • A Hadoop cluster (the machines) may be removed while the data persists in WASB. You can add, remove, and modify files in the Azure blob store regardless of whether a Hadoop cluster exists.
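A sketch of the multiple-accounts case: one client can reach two storage accounts by supplying a key for each (account names and keys are placeholders; the property name is the same one shown in the core-site.xml example further down this page):

# list a container in a second account by passing the account keys at the command line
hadoop fs \
  -D fs.azure.account.key.account1.blob.core.windows.net=<key1> \
  -D fs.azure.account.key.account2.blob.core.windows.net=<key2> \
  -ls wasb://[email protected]/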

Configuration

Chunk size

  • The data is chunked and distributed to the nodes when a job runs. If you need to change the chunk size at run time for memory-related performance, that is still an option. See Data Processing - Buffer (Batch concept in code).
  • You can pass any Hadoop configuration parameter when you create the cluster, or use the SET command for a given job (see the sketch after this list).
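A sketch of the per-job case, assuming a ToolRunner-based job so that the generic -D options are honored (the class name and split size are illustrative):

# cap input splits at 128 MB for this job only
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
  wasb://[email protected]/input \
  wasb://[email protected]/output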

Replication factor

  • Each blob (file) is replicated 3x within the data center.
  • If you enable geo-replication on your account, you also get 3 copies of the data in a second data center in the paired region of the same geography (see the sketch after this list).
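For example, with the classic azure CLI used elsewhere on this page, switching an account to geo-redundant storage might look like this (the --type option is an assumption; check your CLI version):

# change the account's redundancy setting to geo-redundant storage (GRS)
azure storage account set <storageaccountname> --type GRS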

Management

File location

Azure Blob storage doesn't have the notion of a directory. However, parsing the blob names yields a tree structure because Hadoop treats a slash "/" in a name as a directory separator, as illustrated below.
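A short illustration: uploading a blob whose name contains slashes makes the intermediate segments appear as directories (paths are illustrative):

# creates a single blob named "SomeDirectory/ASubDirectory/AFile.txt"
hadoop fs -put AFile.txt wasb://[email protected]/SomeDirectory/ASubDirectory/AFile.txt
# ...yet the listing shows ASubDirectory as if it were a directory
hadoop fs -ls wasb://[email protected]/SomeDirectory/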

Blob address:

# Fully qualified name (local HDFS)
hdfs://<namenodehost>/<path>

# HDInsight syntax (global)
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
# Example
wasb://[email protected]/SomeDirectory/ASubDirectory/AFile.txt

Scheme

The schemes wasb and wasbs identify a URL on a file system backed by Azure Blob Storage.

Driver: org.apache.hadoop.fs.azure.Wasb

Use blob storage

  • Local fully qualified name:
hdfs://<namenodehost>/<path>
  • Global name in Azure storage:
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

Make a directory

hadoop fs -mkdir wasb://[email protected]/testDir

Upload

hadoop fs -put testFile wasb://[email protected]/testDir/testFile
azure storage blob upload <sourcefilename> <containername> <blobname> --account-name <storageaccountname> --account-key <storageaccountkey>

See also: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upload-data

Download

  • CLI
azure storage blob download <containername> <blobname> <destinationfilename> --account-name <storageaccountname> --account-key <storageaccountkey>

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#download-files

Cat the content

hadoop fs -cat wasbs://[email protected]/testDir/testFile
test file content

Delete

  • PowerShell
Remove-AzureStorageBlob -Container $containerName -Context $storageContext -blob $blob
  • CLI
azure storage blob delete <containername> <blobname> --account-name <storageaccountname> --account-key <storageaccountkey>

List

  • PowerShell cmdlet
Get-AzureStorageBlob -Container $containerName -Context $storageContext -prefix "example/data/"
  • CLI
azure storage blob list <containername> <blobname|prefix> --account-name <storageaccountname> --account-key <storageaccountkey>

Hadoop Configuration

The WASB configuration (i.e. the file system configuration) is in the core-site.xml file.

Example:

<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

<property>
  <name>fs.AbstractFileSystem.wasbs.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasbs</value>
</property>

<property>
  <name>fs.azure.account.key.hiinformaticasawe.blob.core.windows.net</name>
  <value>MIIB/QYJKoZIhvcNAQcDoIIB7jCCAeo....</value>
</property>

<property>
  <name>fs.azure.account.keyprovider.hiinformaticasawe.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>

<property>
  <name>fs.azure.io.copyblob.retry.max.retries</name>
  <value>60</value>
</property>

<property>
  <name>fs.azure.io.read.tolerate.concurrent.append</name>
  <value>true</value>
</property>

<property>
  <name>fs.azure.page.blob.dir</name>
  <value>/mapreducestaging,/atshistory,/tezstaging,/ams/hbase/WALs,/ams/hbase/oldWALs,/ams/hbase/MasterProcWALs</value>
</property>

<property>
  <name>fs.azure.shellkeyprovider.script</name>
  <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
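To check which value the client actually resolves for one of these properties, the hdfs getconf command can be used (the property name is taken from the example above):

hdfs getconf -confKey fs.azure.io.copyblob.retry.max.retries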

Code

WASB is also available in the Apache Hadoop source code. Therefore, when you install a Hadoop distribution such as Hortonworks HDP or Cloudera EDH/CDH on Azure VMs, you can use WASB with some configuration changes to the cluster.

Jars needed in a client installation:

  • hadoop-azure-X.X.X.jar: the HDFS implementation (normally comes with the Hadoop distribution and depends on the Azure storage jar below)
  • azure-storage-X.X.X.jar: the Azure-storage jar

Example: https://docs.microsoft.com/en-us/java/api/overview/azure/storage
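A minimal sketch of wiring the two jars into a client installation (paths and versions are illustrative):

# make the WASB driver and its Azure storage dependency visible to the Hadoop client
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/lib/hadoop-azure-2.7.3.jar:/opt/lib/azure-storage-4.2.0.jar"
hadoop fs -ls wasb://[email protected]/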

Support

Invalid URI for NameNode address (check fs.defaultFS): wasb is not of scheme hdfs

When using a Hadoop command-line client such as hdfs, you may get the following error:

hdfs groups hdfs
Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS): wasb://[email protected] is not of scheme 'hdfs'.
        at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:530)
        at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
        at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:147)
        at org.apache.hadoop.hdfs.tools.GetGroups.getUgmProtocol(GetGroups.java:87)
        at org.apache.hadoop.tools.GetGroupsBase.run(GetGroupsBase.java:71)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.hadoop.hdfs.tools.GetGroups.main(GetGroups.java:96)

Pass a URI with the hdfs scheme to resolve this problem:

hdfs groups -D "fs.default.name=hdfs://namenode/" hdfs
hdfs : hadoop
