Apache MiNiFi-CPP v0.2 – An Early Analysis

Working in the Big Data and streaming space leads to many types of jagged edge use cases for stream processing, routing, and storage. That is to say, you may not have any connectivity and will require the edge itself to be able to take actions that best prioritize its needs before it's able to 'phone home' again. Yet we need to go further than that: we have to move beyond being simple connected devices and start becoming smart devices that operate without connectivity. As distance from the data center increases, fewer technological resources are available across the spectrum of needs for most use cases. Controlling the edge well requires a balance of native features, flexibility, and low resource impact. In this post I'll focus on the resource impact a 'standard' MiNiFi dataflow has on differing hardware, and how changing that flow causes the hardware to react differently.

WARN – The version of MiNiFi-CPP used is 0.2; this is very early for this product, and if you're reading this blog post from a historical perspective it may be useful, but it's highly likely that if a year has passed the product has matured and would change a lot of the following designs.

In order to evaluate system requirements for MiNiFi-CPP on differing hardware, a simple test template was created that is reused for all the metrics. It went through some simple revisions as we learned more about the behavior of the system. After the template testing was done, it was time to build a more realistic use case and attempt to implement the bare-bones requirements of most edge experiment-collection needs from a connectivity standpoint.

Final notes/assumptions:

  • I had to build the 0.2 MiNiFi-CPP tag from source to run on my Raspberry Pi.
  • MiNiFi exposes no metrics of its own at this release, so we will be focused on system (Pi) performance metrics.
  • System metrics are collected using 'collectl', which is a constant across these tests, though we don't know its real impact on the hardware.

The testing template was kept simple due to the number of processors available in the CPP version today. For easy replication, please find both the full NiFi and MiNiFi templates at the Pastebin links below, along with the basic test process.

Benchmark V1b
NiFi XML    -> https://pastebin.com/5b3tGykP

MiNiFi YAML -> https://pastebin.com/wmBETsr4

Test Process-> https://pastebin.com/fHtj74rf

The template at heart is simple and is designed to strain the system it runs on; later we will find just how hard that is to do when you're not sure of any initial requirements ;). It simulates sensor probe input by using the built-in FlowFile generator and by making system calls to devices (/dev/zero) to generate both small and large bulky events; it then simulates sending these over a data link and receives them back onto itself for additional load generation. A sketch of the flow's YAML appears after the breakdown below. The template is broken into the following:

  • Data Generation – in essence, our 'Sensor Probes'
    • FlowFile Generation
      • 1 file of 250 bytes
      • 2 concurrent tasks, running continuously
    • Execute Process
      • dd if=/dev/zero count=250 bs=60K
        • ~15 MB per run
      • 1 concurrent task, every 15 seconds
  • Back-pressure Mechanics – Mass Storage Media
    • 4 connections with slightly differing back-pressure settings, named using the pattern below
    • {Expire Seconds}-{FlowFileCount}-{StorageBackPressure}-{Prioritize}
  • Sending and Receiving over a Data Link
    • Invoking and Listening over HTTP
    • Retrying failures
    • Running continuously
Test Flow
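
For reference, here is an abridged sketch of what this flow looks like in the MiNiFi 0.x YAML schema. The processor classes and property names follow the standard NiFi processors, but the processor names and exact values here are illustrative; the full working template is at the Pastebin link above.

Processors:
    - name: GenerateSmallEvents            # the 'sensor probe' FlowFile generator
      class: org.apache.nifi.processors.standard.GenerateFlowFile
      max concurrent tasks: 2
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 0 sec             # run continuously
      Properties:
          File Size: 250 B
    - name: GenerateBulkyEvents            # large events via a system call
      class: org.apache.nifi.processors.standard.ExecuteProcess
      max concurrent tasks: 1
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 15 sec
      Properties:
          Command: dd
          Command Arguments: if=/dev/zero count=250 bs=60K
    - name: SendOverDataLink               # simulated data-link transmit
      class: org.apache.nifi.processors.standard.InvokeHTTP
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 0 sec
      Properties:
          HTTP Method: POST
          Remote URL: http://localhost:8777/contentListener
    - name: ReceiveDataLink                # receives back onto itself for extra load
      class: org.apache.nifi.processors.standard.ListenHTTP
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 0 sec
      Properties:
          Listening Port: 8777
          Base Path: contentListener

Connections:
    - name: 60-1000-1GB-None               # {Expire}-{Count}-{Storage}-{Prioritize}
      source name: GenerateSmallEvents
      source relationship name: success
      destination name: SendOverDataLink
      flowfile expiration: 60 sec
      max work queue size: 1000            # FlowFile-count back-pressure
      max work queue data size: 1 GB       # storage back-pressure (Run 1 sizing)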

Hardware selection was based on availability, with easy access to 2 Pis and 2 Intel NUCs. Special call-out to David for running the alpha test on his NUC mini-PCs to get more comparisons! And to Tommy & David for letting me borrow a Pi3.

In all fairness these systems (Pi vs. NUC) are almost not comparable given the dramatically different memory availability, chipset performance, and I/O. For example, the memory on the Pis is 0.5 GB and 1 GB while the NUCs have 16 GB and 32 GB. From the core perspective we find the same: the ARM Pis have 1 core @ 1 GHz or 4 cores @ 1.3 GHz, while the x86_64 NUCs are utilizing 4 cores at 1.8 GHz+; let's not even talk about processor caches. The first run exposed the night-and-day differences between the hardware, and the NUCs took home the trophy for most boring charts but best performance… I excluded the memory from the NUC graphs below.

DC53427HYE – Run 1
NUC6i7KYK – Run 1

The NUC systems manhandle the requirements of this test, pushing CPU usage down below 5% for almost the entire duration. Wmerge and wait remain low and unchanging against the FlowFile generation the test writes into the content repository; otherwise all metrics, including network rx, seem to find a home, staying within a given range and remaining stable until shutdown.

The first test was surprising during the hour-long run on the Pis: the Zero-W would continue to run at full speed the entire time, but the Pi3 would slowly degrade in performance until it used all swap on the system and hit a page-in event that caused the kernel to OOM-kill the process. Comparing the hardware makes me feel that the multi-threaded process was able to place itself into a bad spot where it needed to load more than it had memory, while the Zero-W could not due to its single-threaded nature. It goes to show that the old adage 'the dullest knife in the box is the most dangerous' is at times true.

Pi Zero-W – Run 1 (Large Connection Storage)

The Zero-W metrics stabilize well, but the CPU is thrashing near 100% the entire time; the swap event seen here does not occur until shutdown of the MiNiFi process.

Pi3 – Run 1 (Large Connection Storage)

In the first of the Pi3 runs you can see that the system never finds a stable point: it starts out strong but continues to slowly erode. The CPU average slowly declines from its starting point of ~45% over all 4 cores to 30%, the wmerge numbers go from ~900 write-merge operations to under 500, and the disk wait grows increasingly erratic. About halfway into the run the swap used begins to grow until it maxes out, followed finally by a page-in event that seals the fate of the process.

Pi3 – Run 1 [Zoom for Page-In Event]
Dmesg error

[ 3418.137699] Out of memory: Kill process 1098 (minifi) score 910 or sacrifice child

[ 3418.137755] Killed process 1098 (minifi) total-vm:1085044kB, anon-rss:926704kB, file-rss:0kB, shmem-rss:0kB

After some investigation it was found that the initial version of the template contained a connection with 1 GB of back-pressure storage configured for it. Combined with all the other connections, this meant that our sum was greater than the available system memory. After re-configuring the template so that each connection had no more than 90 MB of total storage, placing the aggregate well under even the Zero-W's memory, we ran them both again. (The NUCs were not run on this, as they are just beasts in comparison.)
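
As a hedged illustration of the change, a single re-configured connection might look like the following in the YAML; the connection and processor names here are hypothetical, but the back-pressure fields are the ones from the MiNiFi 0.x schema:

Connections:
    - name: 60-0-90MB-None                 # hypothetical name following the test's pattern
      source name: GenerateSmallEvents
      source relationship name: success
      destination name: SendOverDataLink
      flowfile expiration: 60 sec
      max work queue size: 0               # no FlowFile-count limit
      max work queue data size: 90 MB      # capped so the aggregate stays under RAM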

Pi Zero-W – Run 2 Limited Connection Storage Template

The write merges dropped significantly once the connection storage limits were in place, falling by ~25–50%. Each of the two Limited Connection Storage runs had wait anomalies well over 400*.

Pi Zero-W – Run 3 Limited Connection Storage Template
Pi Zero-W – Run 4 (Large Connection Storage)

Upon re-running the test, the same anomalies of high disk wait persisted, so we re-tested the large connection storage too, where they were also found. Comparing the first large-connection test shown above to this one, you can see repeatedly erratic disk wait showing up, but wmerge also increased significantly compared to the limited-storage variant.

*It is very possible that the flash storage card is starting to go after being beaten down by multiple aggressive tests including some that consumed almost all available inodes and storage space on the disk.

Pi3 – Run 2 Limited Connection Storage Template

After tuning the connection storage, the Pi3 managed to live past the 1-hour marker for the test, and what had shown itself before in the charts as a slow erosion of CPU performance and disk write merges was not there at the same magnitude, until swap usage starts to grow again. It does appear there may still be an issue over the long term, but a run longer than 1 hour will be required to investigate. In the future some long-running examples of the template will be performed to test longer system stability.

Testing and benchmarks are interesting and great for starting to build baselines for hardware configurations, but what about something more real-world and purpose-driven? In my personal time and career I have gotten to work around automotive a significant amount, and those projects almost always have some type of telematics experimentation program that needs to collect data and smartly send it on its way. OBD-II dongle collection over Bluetooth (ELM327) is yet to be set up, but it will make its way into another blog post with more review of system performance, along with more explanation of the script operations and how to get it up and running yourself.

Prototype Dongle
driver1e.yml -> https://pastebin.com/E52LvF2b

The driver1e test performs the following (a configuration sketch follows the list):

  • Manages the WiFi Connection
    • Scans for SSIDs and reconnects when it finds 'home'
  • Manages the LTE Connection
    • Handles configuring the USB LTE 4G modem and connecting it to the correct APN
    • Periodically evaluates whether it is currently connected to the network and attempts to reconnect
  • Manages GPS Device Data Collection
    • Starts and scrapes the cgps API calls
    • Collects at a given interval
  • Transmits over Site-2-Site
    • Sends data over any available data link to the target endpoint
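
Since MiNiFi-CPP 0.2 ships a small processor set, connectivity management like the above is plausibly done by wrapping shell scripts in ExecuteProcess and shipping results over Site-2-Site via a Remote Processing Group. The sketch below is my own paraphrase under that assumption; the script paths, names, URL, and schedules are hypothetical, and the real flow is in the driver1e.yml link above.

Processors:
    - name: ManageWiFi                     # hypothetical script: scan SSIDs, rejoin 'home'
      class: org.apache.nifi.processors.standard.ExecuteProcess
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 30 sec
      Properties:
          Command: /opt/minifi/scripts/wifi-reconnect.sh
    - name: ManageLTE                      # hypothetical script: configure USB modem/APN, reconnect
      class: org.apache.nifi.processors.standard.ExecuteProcess
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 60 sec
      Properties:
          Command: /opt/minifi/scripts/lte-check.sh
    - name: CollectGPS                     # hypothetical script scraping cgps output
      class: org.apache.nifi.processors.standard.ExecuteProcess
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 10 sec
      Properties:
          Command: /opt/minifi/scripts/gps-poll.sh

Remote Processing Groups:
    - name: Target NiFi                    # Site-2-Site endpoint for collected data
      url: http://nifi.example.com:8080/nifi
      timeout: 30 secs
      yield period: 10 sec
      Input Ports:
          - id: <input-port-uuid>          # UUID of the input port on the receiving NiFi
            name: telemetry
            max concurrent tasks: 1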
Pi Zero-W – DriverConfig 1e (Stationary)

In a more realistic applied application there is some spare CPU, and wmerge/wait are lower than the system capacity that the capped and uncapped tests showed for the Pi Zero-W. While our wmerge and CPU total are lower, likely due to significantly less FlowFile generation, the wait increased just as it did in the Pi Zero-W's runs with limited connection storage. It should be noted, though, that even the largest disk wait anomaly here is many factors lower, measuring in at a max of 28, compared to values of 400 or higher on the earlier tests' anomalies*.

Update

I had been asked to provide a more idle baseline of MiNiFi running, so I created a template which simply starts and continuously runs an InvokeHTTP connected to another InvokeHTTP, each scheduled for once a second, without generating or receiving any data beyond the simple scheduling of these two processors. I only had the Zero-W for this test.

Invoke 2 Invoke YML -> https://pastebin.com/CLKwLTJ0
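
For a sense of what this near-idle flow looks like, here is a minimal sketch in the same YAML schema; the processor names and URL are illustrative, and the real template is at the link above.

Processors:
    - name: Invoke1                        # scheduled once a second, generates no data
      class: org.apache.nifi.processors.standard.InvokeHTTP
      max concurrent tasks: 1
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 1 sec
      Properties:
          HTTP Method: GET
          Remote URL: http://localhost:9999/noop
    - name: Invoke2                        # identical schedule, fed by Invoke1
      class: org.apache.nifi.processors.standard.InvokeHTTP
      max concurrent tasks: 1
      scheduling strategy: TIMER_DRIVEN
      scheduling period: 1 sec
      Properties:
          HTTP Method: GET
          Remote URL: http://localhost:9999/noop

Connections:
    - name: Invoke1ToInvoke2
      source name: Invoke1
      source relationship name: response
      destination name: Invoke2
      flowfile expiration: 60 sec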
Pi Zero-W – Idle Invoke 2 Invoke

The Invoke 2 Invoke test shows just how little MiNiFi uses when no tasks or continuous processing are required: less than 0.5 wmerge and wait on average, and CPU ~7% the entire time. MiNiFi was started at T-14 and stopped at T-44 in this graph (the 2 tiny CPU spikes), showing just how little impact the idle system really has compared to the bare Pi Zero-W.

Lessons Learned and Take Away

  • On a near-idle Pi Zero-W running the Invoke 2 Invoke HTTP processors test, it is hard to tell when MiNiFi starts and stops versus system background noise.
  • Real-world application of MiNiFi-CPP shows spare capacity on the Pi Zero-W and ample compute on the Pi3
  • Keep your total connection storage volume under your system's total RAM
    • This one seems odd to me, but many improvements are yet to come in the next MiNiFi release.
  • Watch for signs of a bottleneck in the slow erosion of metrics over time
    • CPU Total % and wmerge metrics declined over the hour before the process died
  • It's no contest if you have high-powered hardware to meet your capacity needs
    • Increased memory, faster cores, more IOPS = win
  • Disk wmerge/wait performance may be impacted by the size of connection storage

Things to follow up on

  • Total connection storage aggregate relative to system memory
  • Long running tests for stability
  • Deeper investigation into disk wait produced by connections with lower storage amounts
  • Methods for Flash Disk ‘failure’ validation