The Human Part of Hadoop

All too often I see clients who have begun implementing Hadoop company-wide as if the Data Lake were some impossible-to-break sandbox simply because it is horizontally scalable, letting anyone who can figure out how to make the platform work do whatever they desire and leaving application development teams to create their own unique standards rather than platform standards. The real issue with this sandbox approach is usually less about the technology chosen to solve the use case (though that can be a problem too) and more about what I will describe as 'The Human Quandary': skills, education, organization, standards, staffing, and the current market demand for subject matter experts.

With the first user-facing installation of the platform there are likely little to no existing in-house skills, making it very important to target the correct resources to skill up. When staffing your 'platform teams' (engineering/architecture/operations/support/development), a focus on Linux and Java skills will serve you best if you do not already have Hadoop resources available. Placing a DBA with no Linux or Java knowledge onto one of these teams is a recipe for long-term pain, as they will quickly discover that the majority of their skills do not apply to operations or engineering work. An additional skill that benefits every team is Kerberos: installing it, operating it, and integrating with it. Developers are in the same boat unless the application will be 100% SQL focused, and even then an understanding of Java and Linux makes development teams better able to support themselves and debug their own problems.
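
To make that last point concrete, the sketch below shows roughly what "using Kerberos from Java" looks like for an application team: authenticating with a keytab before touching HDFS. The principal and keytab path are hypothetical placeholders, and the example assumes a cluster already configured for Kerberos; the shell-side equivalent is simply running kinit with the keytab before launching a job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHdfsClient {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml/hdfs-site.xml on the classpath already set
        // hadoop.security.authentication=kerberos for the target cluster.
        Configuration conf = new Configuration();
        UserGroupInformation.setConfiguration(conf);

        // Hypothetical principal and keytab path -- substitute your own.
        UserGroupInformation.loginUserFromKeytab(
                "appuser@EXAMPLE.COM",
                "/etc/security/keytabs/appuser.keytab");

        // Every Hadoop client call made after the login uses the Kerberos credentials.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Authenticated home directory: " + fs.getHomeDirectory());
    }
}
```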

One capability often missing from Hadoop platforms is a CI/CD pipeline that all application teams can use. It is important to have this standard from day one to prevent each application team from developing its own variant of the process (commonizing it also helps with security and auditing of production deployments). Without this fundamental piece of software engineering, the reactive result is excessive staffing of QA testers to manually validate behavior the teams do not fully understand; in almost all cases the unit and integration tests these QA resources perform can be fully automated and run automatically whenever developers check in code that needs promotion to another environment. Without an automated CI/CD pipeline, each new application project adds technical debt that must be managed by hand. This directly drives up the number of QA resources required to keep supporting projects, and it eventually becomes the root cause of a frozen platform: because the existing projects cannot be tested automatically for regressions, no one can safely validate a newer platform version. By freezing the platform we stall innovation and prevent net new opportunities from being enabled by the newer features the Architecture team has identified to fill gaps.
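
As a sketch of what "automatable" means here: the kind of check a QA resource might run by hand can usually be expressed as an ordinary JUnit test that the CI server executes on every check-in. The example below uses the MRUnit test harness against a hypothetical WordCountMapper class; the class name and expected output are illustrative, not a prescription, and any equivalent in-memory test framework would serve the same purpose.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    // Drives the hypothetical WordCountMapper entirely in memory, no cluster
    // required, so the CI server can run it on every commit before promotion.
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerToken() throws IOException {
        mapDriver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .runTest();
    }
}
```

Whatever CI server you standardize on then simply runs the project's test suite on check-in and blocks promotion to the next environment when it fails.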

Skillset Focus by Role (skills listed in descending order of importance)

  • Operations
    • Linux – Automation, Monitoring, Operations & Configuration (Admin Skills)
    • Java – Tuning & Monitoring (Java Garbage Collection Focus is a plus)
    • Kerberos – Operation & Debugging
  • Engineering
    • Linux – System Engineering
    • Hardware Design – Server & Network
    • Java – Debugging & Tuning
    • Kerberos – Integration & Debugging
  • Architecture
    • Java –  Programming & Debugging
    • Linux – Shell Debugging and Configuration
    • Hardware Design – Server & Network
    • Kerberos – Integration & Debugging
  • Development
    • Java – Programming, Testing, Profiling
    • Continuous Integration & Deployment Skills
    • SQL – Only if your project is really using SQL.
    • Linux – Shell,  Basic Debugging
    • Kerberos – Use in Java and Linux Shell
  • Support/Partner
    • Java – Programming, Debugging & Profiling
    • Linux – Shell Debugging

Unfortunately, most organizations start their Hadoop investment (even while claiming it is a strategic enterprise objective) severely underfunded. After much thought I believe some of this is due to management's lack of understanding of the complexity of Hadoop itself. Most management believes Hadoop to be a single technology, or perhaps a few, ignoring the fact that most vendor distributions contain well over 20 components, each complex in its own right and requiring its own unique effort to use correctly; or they focus on the small initial size of their deployment, likely under 50 nodes. The result is teams that can no longer attend to their strategic Data Lake duties because their time is consumed by reactive, day-to-day support of the platform's end users. In extreme cases the Operations, Engineering, and even Architecture teams are pulled away from their duties and suddenly find themselves acting as the Support team, forsaking all strategic work just to keep the gears turning for application delivery and day-to-day platform stability, instead of the platform being properly staffed as a long-term investment.

At the time of writing, quality Hadoop resources are expensive due to their rarity. This forces clients to staff with their own people and augment with expert hands as required until they mature. Skilling up your own resources can be tricky because what may be seen as over-staffing initially is actually the best long-term development strategy, which is harder to justify. When staffing teams it is important from day one to create redundancy within each team, so that when the platform has aged N years you have multiple people with the same experience. What normally happens instead is that each team is run by a solo individual (application development teams excepted), creating a single point of subject-matter-expert failure required to keep the platform operating.

As an example, one client had major stability and performance issues platform-wide. The root cause was that established platform best practices were never implemented, largely due to an inability to prioritize tasks correctly amid the daily, overwhelming tactical problems of 'partially skilled' end users (the client never invested in a support team). Over time more extreme platform problems emerged until the platform finally became unusable. When it came time to perform emergency maintenance over the weekend, the entire stability plan had to be changed because the platform admin (who is amazing) would be on vacation, and he was the only admin on the team with the experience to be trusted with work of this depth. While this client did have other admins, they had all been brought on board only recently; for over two years the platform was in the hands of a single skilled admin with no one to back him up. The more important lesson here is not that we needed more staff just to perform maintenance, but a continuation of the ongoing theme: building these resources is an investment over time, and had the teams been adequately staffed from the start, the tasks that would have prevented all of this would very likely have been prioritized, scheduled, and implemented, and the platform emergency never seen.

Staffing does not need to scale linearly with deployment size, which lets the initial investment be reused as the platform grows. Growing to a 1,000-node cluster does not require a massive army and can be staffed up over time, so long as the initial investment is made to produce skilled redundancy. Think of it as growing your own SMEs early and saving yourself from keeping expert hands on board long term, by letting your initially over-staffed teams learn first-hand.

Core Platform Team Staffing 

  • Operations 3-8 people
  • Engineering 3-5 people
  • Architecture 2-5 people
  • Support 2+ (Based on Apps and User needs)

Above, the term 'partially skilled user' was used to describe users who are unable to self-support. This can be any user, from application development teams to analysts, even members of the core platform team, who is missing critical parts of the recommended skill set. It cannot be over-emphasized that training all users in the basic behaviors of the platform (admin training for developers and vice versa, developer training for data scientists, even developer training for BI dashboard creators who only interact through SQL) changes how these people design their implementations. For example, most developers understand the basic concepts of HDFS files and blocks from a data-processing perspective but have no understanding of the operational impact of file and block sizes, resulting in large-scale small-file problems and NameNode RPC load. BI developers may not know where logs are stored when debugging failed or slow queries, but with developer training they gain the insight not just to collect their logs but to troubleshoot basic issues themselves. Data scientists may discover better ways to make their applications perform, shaving hours of processing time off their jobs, for example by knowing when to use Spark's .map versus .mapPartitions.
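
As a sketch of that last point, using the Spark 2.x Java API and a hypothetical ExpensiveLookupClient standing in for anything costly to construct (a database connection, an HBase client, a large lookup table), mapPartitions lets the expensive setup happen once per partition instead of once per record:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MapVsMapPartitions {

    // Hypothetical stand-in for something costly to build per record.
    static class ExpensiveLookupClient {
        String lookup(String id) { return id + ":enriched"; }
        void close() { }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("map-vs-mapPartitions").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> ids = sc.parallelize(Arrays.asList("a", "b", "c", "d"), 2);

            // With map(), the client would be built once per record;
            // with mapPartitions(), it is built once per partition and reused.
            JavaRDD<String> enriched = ids.mapPartitions(partition -> {
                ExpensiveLookupClient client = new ExpensiveLookupClient();
                List<String> out = new ArrayList<>();
                while (partition.hasNext()) {
                    out.add(client.lookup(partition.next()));
                }
                client.close();
                return out.iterator();
            });

            enriched.collect().forEach(System.out::println);
        }
    }
}
```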

A clean, healthy lake

Many clients have the end vision of creating an Enterprise Data Lake with their Hadoop platform, which is a fine goal to aspire to: building a place where many can come, much like the photo above, with people setting up tents along the water, fishing, boating, and swimming, all enabled by the lake itself as its cleanliness attracts more visitors each day. But a lack of standards and anti-pattern guidance from an Architecture team that is busy performing support tasks (because there is no Support/Partner team) leads to a slow but sure degradation in the cleanliness of the platform. Many would like to believe that a Support/DevOps team will identify bad actors in the system and prevent the Data Lake's water quality from degrading, but this assumes the degrading action is enough on its own to trip an alert that causes the team to investigate, and that is not always the case.

An ecological disaster with health hazards is still a lake

I prefer the example of plastic trash in your Data Lake. On a multi-acre lake a single pop bottle is negligible at worst and may not be detectable short of seeing someone throw it in yourself; no level of water testing will identify the 'single' pop bottle polluting the lake. But if every person who visits leaves behind some plastic, we suddenly have a situation that never occurred when any one application was deployed on its own: a perfect storm of smaller, almost undetectable contributions becoming a full-blown ecological disaster that may take significant time and money to recover from.

This is not to discourage innovation entirely; innovation is required to create the level of use that drives the platform forward. But standards cannot be forsaken on a distributed platform unless you want a platform disaster later. From a technological perspective I am reminded of the sheer volume of clients using Hive-on-HBase only to discover they had no understanding of how HBase even works. Each small project did not affect HBase much on its own, but once enough teams were using Hive-on-HBase everyone experienced a degraded quality of service; it was not any single application but the accumulation of all of them performing anti-pattern scans against their data sets that made the problem materialize. Another example is the typical small-files problem on HDFS, which limits the scalability of the platform for all users and can even prevent jobs from starting because they run out of memory attempting to calculate all the input paths. All of this requires that we understand what keeps the platform healthy and standardize on it; after all, is littering not an illegal activity you are fined for?
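
Standardizing also means being able to measure drift before it becomes a disaster. As a minimal sketch of the kind of periodic check a support or operations team could run (the 16 MB threshold and /data root are hypothetical choices; the HDFS FileSystem client API used is standard), a small-file report over a directory tree looks roughly like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SmallFileReport {
    public static void main(String[] args) throws Exception {
        // Hypothetical threshold: flag files well under a typical HDFS block size.
        final long smallThresholdBytes = 16L * 1024 * 1024;
        Path root = new Path(args.length > 0 ? args[0] : "/data");

        FileSystem fs = FileSystem.get(new Configuration());
        long totalFiles = 0;
        long smallFiles = 0;

        // Recursive listing; every file returned is an object the NameNode must track.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            totalFiles++;
            if (status.getLen() < smallThresholdBytes) {
                smallFiles++;
            }
        }
        System.out.printf("%d of %d files under %s are smaller than %d bytes%n",
                smallFiles, totalFiles, root, smallThresholdBytes);
    }
}
```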

The sheer number of components in most vendor distributions makes producing standards for a complete deployment a burdensome task, especially when you are also trying to learn the best patterns for those components. In the absence of initial standards, the simplest defense is to not enable components that are not needed until an architectural gap analysis shows value in performing the engineering effort and operational deployment tasks. Look at the initial needs: storage (HDFS), processing (YARN), processing languages/engines (MapReduce, Hive, Pig, Spark), NoSQL storage (HBase), governance and security (Atlas and Ranger), streaming (Storm, Kafka), data flow (NiFi), scheduling (Oozie), and so on. The first use cases most clients pursue are focused on batch or streaming. Normally clients find they only need a few items from the entire list, letting them become experts in those before adding more complexity to the platform; namely HDFS, YARN, Oozie, MapReduce, Hive, and NiFi.

This matters because installing everything leads to everything being used by developers, even the newest of new features, which should be avoided in production. The most successful development team I have seen focused on a limited subset of components and did not try to use every new feature, concentrating on the stable ones instead. Today they are the only application team on that client's platform that can go through an upgrade without issues; most other teams' applications break on a platform upgrade because the bleeding-edge features they elected to use were not yet stable or production hardened. Additionally, as the team continued to mature in their use of these specific components, they became some of the most advanced users of the scheduling features and developed intelligent ways to scale up cleanly as they ingested more data over time. Now these team members are being spread across other teams to bring SME-level skill sets company-wide. It is unlikely this team would ever have reached this SME level had they been allowed to wander across the entire vendor distribution, chasing only the novel and shiny enhancements rather than the core stable parts of the components they required.

In summary, a number of things have been discussed. First, Hadoop is not just one technology but many, and this increases the complexity and the amount of initial staff you need to manage it even if your physical deployment is rather small. Staffing correctly enables each team to perform its proper strategic duties rather than being sucked into the tactical task of day-to-day support for partially skilled users, which normally happens because of the lack of a dedicated platform support team. Because it is unlikely that skilled Hadoop resources will be available initially, it is very important to build the team out of people with the correct foundational skills, focused on Linux and Java. Cross-training developers with admin skills, BI users with developer skills, and so on leads users toward self-supporting their own issues and toward better design patterns that holistically account for platform behavior. Standards keep the Data Lake clean and viable as a strategic resource; individual bad applications may not hurt the platform, but in aggregate many things can: small files, HBase scans, and the like. Keeping both the platform and development teams focused on the core features of the components builds stable applications and in-house SMEs rather than unstable bleeding-edge applications with massive technical challenges. And if you want to keep your platform on the most current versions, proper investment in CI/CD is required from the start to automate around technical project debt.