NiFi


1 - About

NiFi (Apache NiFi) was built to automate the flow of data between systems.

2 - Concept

2.1 - Data Flow

2.2 - FlowFile

As data moves through NiFi, the data itself is not passed around. What is passed is a pointer to the data, together with its attributes. This object is referred to as a FlowFile.

FlowFiles are made up of two parts:

  • FlowFile Content
  • and FlowFile Attributes (the attributes, not the content, are what move from processor to processor in your dataflow)

Storage:

  • FlowFile Content is written to NiFi's Content Repository, while
  • FlowFile Attributes live in JVM heap memory and in the NiFi FlowFile Repository.
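
The split between content and attributes can be pictured with a small model. This is only an illustrative sketch, not the NiFi API: the names `FlowFile`, `content_repository`, `create_flowfile`, `read_content` and `route` are hypothetical, and a plain dictionary stands in for the Content Repository.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stand-in for the Content Repository: claim id -> raw bytes.
content_repository: Dict[str, bytes] = {}


@dataclass
class FlowFile:
    """Illustrative model only: attributes plus a pointer to the content."""
    attributes: Dict[str, str] = field(default_factory=dict)
    content_claim: str = ""  # key into content_repository, not the data itself


def create_flowfile(data: bytes, filename: str) -> FlowFile:
    """Store the content once and return the lightweight FlowFile."""
    claim = str(uuid.uuid4())
    content_repository[claim] = data
    attributes = {"uuid": str(uuid.uuid4()), "filename": filename}
    return FlowFile(attributes=attributes, content_claim=claim)


def read_content(flowfile: FlowFile) -> bytes:
    """Follow the pointer to fetch the content when a processor needs it."""
    return content_repository[flowfile.content_claim]


def route(flowfile: FlowFile, queue: List[FlowFile]) -> None:
    """Moving a FlowFile to a queue moves the small object, not the bytes."""
    queue.append(flowfile)
```

In this sketch, routing a FlowFile through several queues never copies the content: only the attributes and the claim pointer travel.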

2.3 - Repository

There are three key repositories:

  • the FlowFile Repository (contains metadata for all the current FlowFiles in the flow, including where in the flow each FlowFile is)
  • the Content Repository (holds the content for current and past FlowFiles)
  • and the Provenance Repository (holds the history of FlowFiles)

https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#repositories

2.4 - Commit

As a Processor writes data to a FlowFile, the data is streamed directly to the Content Repository. When the Processor finishes, it commits the session, which essentially marks a transaction as complete. The commit triggers an update of the Provenance Repository with the events that occurred for that Processor, and then an update of the FlowFile Repository to record where in the flow the FlowFile is.

Finally, the FlowFile can be moved to the next queue in the flow. This way, if power is lost at any point, NiFi is able to resume where it left off.

This, however, glosses over one detail: by default, repository updates are written to disk, but the write is often cached by the operating system. With a true, complete loss of power, those updates to the repository can be lost. This can be avoided by configuring the repositories in the nifi.properties file to always sync to disk (the nifi.flowfile.repository.always.sync, nifi.content.repository.always.sync and nifi.provenance.repository.always.sync properties). Doing so, however, can be a significant hindrance to performance.

Simply killing NiFi, though, will not be problematic, as the operating system will still be responsible for flushing that data to the disk.
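
The commit order and the effect of syncing can be sketched as follows. This is a simplified model under stated assumptions, not NiFi's implementation: the file names (`content_<id>`, `provenance.log`, `flowfile.log`), the functions `write_content`, `append_record` and `commit_session`, and the `always_sync` flag are all hypothetical. The only real mechanism shown is `os.fsync`, which asks the operating system to push its cached writes to the physical disk.

```python
import json
import os
from typing import List


def _sync_if_needed(f, always_sync: bool) -> None:
    f.flush()                 # program buffer -> operating system cache
    if always_sync:
        os.fsync(f.fileno())  # operating system cache -> physical disk


def write_content(repo_dir: str, flowfile_id: str, data: bytes,
                  always_sync: bool = False) -> None:
    """Before the commit: content is streamed to the (mock) content repository."""
    path = os.path.join(repo_dir, "content_" + flowfile_id)
    with open(path, "wb") as f:
        f.write(data)
        _sync_if_needed(f, always_sync)


def append_record(path: str, record: dict, always_sync: bool) -> None:
    """Append one JSON line to a (mock) repository log."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        _sync_if_needed(f, always_sync)


def commit_session(repo_dir: str, flowfile_id: str, next_queue_name: str,
                   next_queue: List[str], always_sync: bool = False) -> None:
    """Commit in the order described above."""
    # 1. The provenance repository records the events for this processor.
    append_record(os.path.join(repo_dir, "provenance.log"),
                  {"id": flowfile_id, "event": "CONTENT_MODIFIED"}, always_sync)
    # 2. The FlowFile repository records where in the flow the FlowFile is.
    append_record(os.path.join(repo_dir, "flowfile.log"),
                  {"id": flowfile_id, "queue": next_queue_name}, always_sync)
    # 3. Only then is the FlowFile handed to the next queue.
    next_queue.append(flowfile_id)
```

With `always_sync=False`, a killed process loses nothing because the operating system still holds the cached writes and flushes them later; only a power loss can drop them. With `always_sync=True`, every write waits for the disk, which is where the performance cost comes from.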

2.5 - Monitoring


2.6 - Deployment

The CLI tool of the NiFi Toolkit enables administrators to interact with NiFi and NiFi Registry instances in order to automate tasks such as deploying versioned flows and managing process groups and cluster nodes. See https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html

3 - Start

  • NiFi 1.9.2 on the local port 8081 (Windows cmd syntax, where ^ continues the line)
docker run --name nifi ^
  -p 8081:8080 ^
  -d ^
  apache/nifi:1.9.2
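
NiFi takes a while to start inside the container, so it can help to poll the web UI before using it. The following is a minimal sketch that assumes the port mapping above (local port 8081) and the UI being served under the /nifi/ path over plain HTTP. The function names `nifi_ui_url` and `wait_until_up` are hypothetical helpers, not part of NiFi.

```python
import time
import urllib.request


def nifi_ui_url(host: str = "localhost", port: int = 8081) -> str:
    """Build the UI address for the container started above (assumed path)."""
    return f"http://{host}:{port}/nifi/"


def wait_until_up(url: str, attempts: int = 30, delay: float = 2.0,
                  opener=urllib.request.urlopen) -> bool:
    """Poll the URL until it answers with HTTP 200 or the attempts run out."""
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout or HTTP error: not ready yet.
            pass
        time.sleep(delay)
    return False
```

Calling `wait_until_up(nifi_ui_url())` after the docker command returns True once the UI responds, and False if it never does within the allotted attempts.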

4 - API

4.1 - Rest

5 - Documentation / Reference