Managing the Manager-less Process

Fred George has quite a resume – he’s been in the software industry since the 70’s and is still active. His 2017 talk @ the GOTO conference is pure gold.

His breakdown of the role of the Business Analyst at 19:20 is spot on. The role of the manager is even saltier (23:12) – “I am the God here”.

Well worth an hour of your life (mostly for coders).

As a side note, there are two characters in the Harry Potter movies called “Fred and George”, making searches for “Fred George” a pain.

Monolith

20 years ago, the NVR we wrote was a monolith. It was a single executable, and the UI ran directly on the console. Rendering the UI, doing (primitive) motion detection and storing the video was all done within the same executable. From a performance standpoint, it made sense; to do motion detection we needed to decode the video, and we needed to decode the video to render it on the screen, so decoding the video just once made sense. We’d support up to a mind-blowing 5 cameras per recorder. As hardware improved, we upped the limit to 25. In Roman numerals, 25 is XXV, hence the name XProtect XXV (people also loved X’s back then – fortunately, we did not support 30 cameras).

I’m guessing that the old monolith would be pretty fast on today’s PC, but it’s hard/impossible to scale beyond a single machine. Supporting 1000 cameras is just not feasible with the monolithic design. That said, if your system is < 50 cameras, a monolith may actually be simpler, faster and just better, and I guess that’s why cheap IP recorders are so popular.

You can do a distributed monolith design too; that’s where you “glue” several monoliths together. The OnSSI Ocularis system does this; it allows you to bring in many autonomous monoliths and let the user interact with them via one unified interface. This is a fairly common approach. Instead of completely re-designing the monolith, you basically allow remote control of the monolith via a single interface. This allows a monolithic design to scale to several thousand cameras spread across many monoliths.

One of the issues of the monolithic design is that the bigger the monolith, the more errors/bugs/flaws you’ll have. As bugs are fixed, all the monoliths must be updated. If the monolith consists of a million lines, chances are that the monolith will have a lot of issues, and fixes for these issues introduce new issues and so on. Eventually, you’re in a situation where every day you have a new release that must be deployed to every machine running the code.

The alternative to the monolith is the service based architecture. You could argue that the distributed monolith is service based; except the “service” does everything. Ideally, a service based design ties together many different services that have a tightly defined responsibility.

For example; you could have the following services: configuration, recorder, privileges, alarms, maps, health. The idea being that each of these services simply has to adhere to an interface contract. How the team actually implements the functionality is irrelevant. If a faster, lighter or more feature rich recorder service comes along, it can be added to the service infrastructure as long as it adheres to the interface. Kinda like ONVIF?
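
To make the contract idea concrete, here is a minimal sketch in Python. The `Recorder` interface, its two methods and both implementations are hypothetical names invented for illustration; none of this is an actual XProtect, Ocularis or ONVIF API. The point is only that the caller depends on the contract, so one recorder can be swapped for another.

```python
from __future__ import annotations

from typing import Protocol


class Recorder(Protocol):
    """Hypothetical interface contract that every recorder service must honor."""

    def store(self, camera_id: str, frame: bytes) -> None: ...

    def frame_count(self, camera_id: str) -> int: ...


class InMemoryRecorder:
    """One team's implementation: keeps every frame in a dict."""

    def __init__(self) -> None:
        self._frames: dict[str, list[bytes]] = {}

    def store(self, camera_id: str, frame: bytes) -> None:
        self._frames.setdefault(camera_id, []).append(frame)

    def frame_count(self, camera_id: str) -> int:
        return len(self._frames.get(camera_id, []))


class CountingRecorder:
    """A lighter replacement: only counts frames and discards the data."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = {}

    def store(self, camera_id: str, frame: bytes) -> None:
        self._counts[camera_id] = self._counts.get(camera_id, 0) + 1

    def frame_count(self, camera_id: str) -> int:
        return self._counts.get(camera_id, 0)


def record_burst(recorder: Recorder, camera_id: str, frames: list[bytes]) -> int:
    """Caller code depends on the contract only, never on a concrete class."""
    for frame in frames:
        recorder.store(camera_id, frame)
    return recorder.frame_count(camera_id)
```

Calling `record_burst` with either recorder gives the same answer, which is all the rest of the system is allowed to care about.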

This allows for a two-tiered architectural approach. The “city planner” who plans out what services are needed and how they communicate, and the “building architect” who designs/plans what goes into the service. Smaller services are easier to manage, and thus, hopefully, do not require constant updates. To the end user though, the experience may actually be the same (or even worse). Perhaps patch 221 just updates a single service, but the user has to take some action. Whether patch 221 updates a monolith or a service doesn’t make much difference to the end-user.

Just like cities evolve over time, so do code and features. 100 years ago when this neighborhood was built, a sewer pipe was installed with the house. Later, electricity was added; it required digging a trench and plugging it into the grid. Naturally, it required planning and it was a lot of work, but it was done once, and it very rarely fails. Services are added to the city, one by one, but they all have to adhere to an interface contract. Electricity comes in at 50 Hz, and 220V at the socket, and the sockets are all compatible. It would be a giant mess if some providers used 25 Hz, some 100 Hz, some gave 110V, some 360V etc. There’s not a lot of room for interpretation here; 220V 50 Hz is 220V 50 Hz. If the spec just said “AC” it’d be a mess. Kinda like ONVIF?

In software, the work to define the service responsibilities, and actually validate that services adhere to the interface contract, is often overlooked. One team does a proprietary interface, another uses WCF, a third uses HTTPS/JSON, and all teams think that they’re doing it right, and everyone else is wrong. 3rd parties have to juggle proprietary libraries that abstract the communication with the service or deal with several different interface protocols (never mind the actual data). So imagine a product that has 20 different 3rd party libraries, each with bugs and issues, and each of those 3rd parties issues patches every 6 months. That’s 40 times a year that someone has to decide to update or not; “Is there anything in patch 221 that pertains to my installation? Am I using a service that is dependent on any of those libraries?” and so on.
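
Validating adherence does not have to be elaborate. As a rough sketch, assuming services exchange JSON, a contract can be written down as data and checked mechanically. The `ALARM_CONTRACT` fields and the `contract_violations` helper below are made up for this example and are not taken from any real product.

```python
import json

# Hypothetical contract for an "alarm" message: field name -> required JSON type.
ALARM_CONTRACT = {"id": str, "camera": str, "severity": int}


def contract_violations(payload: str, contract: dict) -> list:
    """Return human-readable violations; an empty list means the payload conforms."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for field, expected in contract.items():
        if field not in data:
            problems.append(f"missing field '{field}'")
        elif type(data[field]) is not expected:
            # Strict type match, so that e.g. true is not accepted as an int.
            problems.append(f"field '{field}' should be {expected.__name__}")
    return problems
```

Run against every service on every build, a check like this catches the team that quietly decided that `severity` ought to be a string.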

This just deals with the wiring of the application. Often the UI/UX language differs radically between teams. Do we drag/drop things, or hit a “transfer” button? Can we always filter lists? Once again, a “city planner” is needed. Someone willing to be the a-hole when a team decides that deviating from the UX language is just fine, because this new design is so much better.

I suppose the problem, in many cases, is that many people think this is the fun part of the job, and everyone has an opinion about it. If you’re afraid of “stepping on toes”, then you might end up with a myriad of monoliths glued together with duct-tape communicating via a cacophony of protocols.

OK, this post is already too long;

Monoliths can be fine, but you probably should try to do something service based. You’re not Netflix or Dell, but service architecture means a more clearly defined purpose of your code, and that’s a good thing. But above all, define and stick to one means of communication, and it should not be via a library.

 

Conway’s Law

I was re-watching this video about the (initially failed) conversion from a monolithic design of an online store, into a microservice based architecture. During the talk, Conway’s Law is mentioned. It’s one of those laws that you really should keep in mind when building software.

“organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.”
— M. Conway

The concept was beautifully illustrated by a conversation I had recently; I was explaining why I disliked proprietary protocols, and hated the idea of having to rely on a binary library as the interface to a server. If a server uses HTTPS/JSON as its external interface, it allows me to use a large number of libraries – of my choice, for different platforms (*nix, windows) – to talk to the server. I can trivially test things using a common web browser. If there is a bug in any of those libraries, I can use another library, I can fix the error in the library myself (if it is OSS) etc. Basically I become the master of my own destiny.
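
As an illustration of how little is needed on the client side, here is a sketch that talks HTTPS/JSON using nothing but the Python standard library. The `/cameras` path, the shape of the JSON reply and the `nvr.example` host used below are assumptions made up for the example, not the API of any real server.

```python
import json
import urllib.request


def build_request(base_url: str, path: str) -> urllib.request.Request:
    """Build a plain GET request asking for JSON; no vendor library involved."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def parse_camera_names(body: bytes) -> list:
    """Extract names from a reply assumed to look like {"cameras": [{"name": ...}]}."""
    document = json.loads(body.decode("utf-8"))
    return [camera["name"] for camera in document.get("cameras", [])]


def fetch_camera_names(base_url: str) -> list:
    """Send the request over the network and parse the reply."""
    request = build_request(base_url, "cameras")
    with urllib.request.urlopen(request, timeout=10) as reply:
        return parse_camera_names(reply.read())
```

If `urllib` turns out to have a bug, every one of these functions can be rewritten against another HTTP client in a few minutes, and the same URL can be pasted into a browser to see what the server really returns.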

If, on the other hand, there is a bug in the library provided to me, required to speak some bizarre proprietary protocol, then I have to wait for the vendor/organizational unit to provide a bug-fixed version of the library. In the meantime, I just have to wait. It’s also much harder to determine if the issue is in the server or the library because I may not have transparency to what’s inside the library, and I can’t trivially use a different means of testing the server.

But here’s the issue; the bug in the communication library that is affecting my module might not be seen as a high priority issue by the unit in charge of said library. It might be that the author left, and it takes considerable time to fix the issue, etc. This dramatically slows down progress and the time it takes to deliver a solution to a problem.

The strange thing is this; the idea that all communication has to pass through a single library, making the library critically important (but slowing things down) was actually 100% mirrored in the way the company communicated internally. Instead of encouraging cross team communication, there was an insistence that all communication pass through a single point of contact.

Basically, the crux is this: if the product is weird, take a look at the organization first. It might just be the case that the product is the result of a sub-optimal organizational structure.

Crashing a Plane

Ethiopian Airlines Flight 961 crashed into the Indian Ocean. It had been hijacked en route from Addis Ababa to Nairobi. The hijackers wanted to go to Australia. The captain warned that the plane only had enough fuel for the scheduled flight and would never make it to Australia. The hijackers disagreed. The 767-200ER had a max. flight capacity of 11 hours, enough to make it to Australia, they argued. 125 people died when the plane finally ran out of fuel and the pilots had to attempt an emergency landing on water.

Korean Air Flight 801 was under the command of the very experienced Captain Park Yong-chul. During heavy rain, the Captain erroneously thought that the glideslope instrument landing system was operational, when in fact it wasn’t. The Captain sent the plane into the ground about 5 km from the airport, killing 228 people.

In the case of Ethiopian Airlines, there’s no question that the people in charge of the plane (the hijackers), had no idea what they were doing. Their ignorance, and distrust of the crew, ultimately caused their demise. I am certain that up until the last minute, the hijackers believed they knew what they were doing.

For Korean Air 801, the crew was undoubtedly competent. The Captain had 9000 hours logged, and during the failed approach, we can safely assume that he felt that he knew what he was doing. In fact, he might have been so good that everyone else stopped second guessing Captain Park even though their instruments were giving them a reading that told them something was seriously wrong. Only the 57 year old flight engineer Nam Suk-hoon with 13000 hours logged dared speak up.

I think there’s an analogy here; we see companies crash due to gross incompetence, inexperience and failure to listen to experienced people, but we also see companies die (or become zombies) because they have become so experienced that they felt that they couldn’t make any fatal mistakes. Anyone suggesting they are short on approach is ignored. The “naysayers” can then leave the plane on their own, get thrown out for not being on board with the plan, or meet their maker when the plane hits the ground.

Yahoo comes to mind; witness this horror-show:

[Image: Yahoo’s bad decisions]

The people making these mistakes were not crazed hijackers with an insane plan. These were people in expensive suits, with many many years of experience. They all had a large swarm of people doing their bidding and showing them excel sheets and power-point presentations from early morning to late evening. Yet, they managed to crash the plane into the ground.

So, I guess the moral is this: if you’re reading the instruments, and they all say that you’re going to crash into the ground, then maybe, just maybe the instruments are showing the true state of things. If the Captain refuses to acknowledge the readings and dismisses the reports, then the choices are pretty clear.

The analogy’s weakness is that in most cases, no-one dies when the Captain sends the product in the wrong direction. The “passengers” (customers) will just get up from their seats and step into another plane or mode of transportation, and (strangely) in many cases the Captain and crew will move on and take over the controls of another plane. We can just hope that the new plane will stay in the air until it reaches its intended destination safely.

Do Managers in Software Companies Need to Code?

I think so.

The horrible truth is that there are good and bad coders, there are good and bad managers and there are easy and hard projects.

A project taken on by good coders and good managers can fail simply because the project was too complex and was too intertwined with systems that the team had no control over. You could argue that the team never should have taken on the task, but that’s why you warn the customer of the risk of non-completion and bill by the hour.

When doing research on the skills needed to be a good software project manager, there seems to be an implied truth that the coders simply do what they are told, and that coding/design errors are always the manager’s fault. At the same time, you’ll find that people complain about micromanagement, and not letting the coders find their own solution. I find these two statements at odds with one another.

Coders will sometimes do things that are just wrong, yet it still “works”. How do you handle these situations? Do you, as a manager, insist that the work is done “correctly”, which the coder may think is just a matter of taste, and not correct vs incorrect? Or do you leave the smelly code in there, and keep the peace?

If you don’t know how to code, and you’re the manager, you won’t even notice that the code is bad. You’ll be happy that it “works”. Over time, though, the cost of bad code will weigh down on productivity, the errors start piling up, good coders leave as there is no reward for good quality and they’re fed up with refactoring shitty code. If you have great coders, you might not run into that situation, but how do you know if you have great coders if you can’t code?

Maybe you’re the best coder in the world, and you’re in a managerial position facing some smelly code. You might consider two approaches: scold the coder(s), and demand that they do it the “correct” way (which is then interpreted as micromanagement), or alternatively, if you’re exhausted from the discussions, just do a refactor yourself on a Sunday, while the kids are in the park.

In the real world, though, the best solution is for the manager to have decent coding skills, and possess that rare ability to argue convincingly. The latter is very hard to do if you do not understand the art of coding. Furthermore I don’t think coders are uniquely handicapped in being persuasive and certainly not when dealing with other coders (n00b managers wearing a tie are universally despised in the coding world).

Every coder is different, and acts differently depending on the time of day, week or year. Some coders have not fully matured, some are a little too ripe, and some just like to do things the way they always did (or “at my old job we…”). Different approaches are needed to persuade different people.

I must confess that this is what I have observed; the few times I have been wearing anything with any resemblance to a managerial hat, I have walked away being universally despised and feared as some sort of “Eye of Sauron” who picks up on the smallest error with no mercy when dishing out insults. But in theory at least, I think I know how things ought to be.

So, if you are managing software projects and interacting with coders, you need to know how to code.

Debtors Prison

There’s a wonderful term called “technical debt”. It’s what you accrue when you make dumb mistakes, and instead of correcting the mistake, and taking the hit up front, you take out a small loan, patch up the crap with spittle and cardboard, and ship the product.

kid_credit
Yay! Free money!!!

Outside R&D, technical debt doesn’t seem to matter. It’s like taking your family to a restaurant and racking up more debt; the kids don’t care. To them, the little credit card is a magical piece of plastic, and the kids are wondering why you don’t use it more often. If they had the card, it would be new PlayStations and drones every day.

Technical debt is a product killer; as the competition heats up, the company wants to “rev the engine”, but all the hacks and quick fixes mean that as soon as you step on the gas, the damn thing falls apart. The gunk and duct tape gave you a small lead out of the gate, but in the long run, the weight of all that debt will catch up. It’s like a car that does 0-60 in 3 seconds but then dies after 1 mile of racing. Sure it might enter the race again, limp along for a few rounds, then back to the garage, until it eventually gives up and drops out.

Duct Tape Car Fix - 03
Might get you home, but you won’t win the race with this fix

Why does this happen?

A company may masquerade as a software company and simply pile more and more resources into “just fix it” and “we need” tasks that ignore the real need to properly replace the intake pipe shown above. “If it works, why are you replacing it”, the suit will ask, “my customer needs a sunroof, and you’re wasting time on fixing something that already works!”.

So, it’s probably wise to look at the circumstances that caused the company to take on the debt in the first place. An actual software company might take technical debt very seriously, and very early on they will schedule time for 3 distinct tasks:

  1. Ongoing development of the existing product (warts and all),
  2. Continued re-architecting and refactoring of modules,
  3. Development of the next generation product/platform

Any given team (dependent on size, competency, motivation, and guidance) will be able to deliver some amount of work X. The company sells a solution that requires the work Y. Given that Y < X, the difference can be spent on #2 and #3. The bigger the difference, the better the quality of subsequent releases of the product. If the difference is small, then (absent team changes), the product will stagnate. If Y > X then the product will not fulfill the expectations of the customer. To bridge the gap until the team can deliver an X > Y, you might take on some “bridge debt”. But if the bridge debt is perpetual (Y always grows as fast or faster than X), then you’re in trouble. If Y > X for too long, then X might actually shrink as well, which is a really bad sign.
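
The X and Y bookkeeping above can be turned into a toy model. This is a sketch under loud assumptions: the unit of “work” is arbitrary, and the `drag` factor (how much each unit of debt shrinks X in later releases) is a number invented for the example, not something measured.

```python
def release_outlook(capacity_x: float, demand_y: float) -> dict:
    """Compare what the team can deliver (X) with what the sold solution requires (Y)."""
    slack = capacity_x - demand_y
    return {
        # Work left over for refactoring (#2) and next-generation work (#3).
        "investment": max(slack, 0.0),
        # Shortfall papered over with quick fixes, i.e. new technical debt.
        "debt_taken": max(-slack, 0.0),
    }


def simulate(capacity_x: float, demands: list, drag: float = 0.1) -> list:
    """Walk through successive releases, accumulating debt.

    Assumption: every unit of debt taken in a release reduces X by `drag`
    units from then on, which is one crude way to model "X might shrink".
    Returns one (capacity after release, total debt) pair per release.
    """
    total_debt = 0.0
    history = []
    for demand_y in demands:
        outlook = release_outlook(capacity_x, demand_y)
        total_debt += outlook["debt_taken"]
        capacity_x = max(capacity_x - drag * outlook["debt_taken"], 0.0)
        history.append((round(capacity_x, 2), round(total_debt, 2)))
    return history
```

With a starting X of 100 and demands of 90, 110 and 120, the first release leaves room for investment, while the next two take on 10 and then 21 units of debt, and X has dropped to 96.9 by the end. The exact numbers mean nothing; the shape of the curve is the point.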

Proper software architecture is designed so that when more (competent) manpower is added, X grows. Poor architecture can lead to the opposite result. And naturally, incompetent maintenance of the architecture itself (an inevitable result of a quick-fix culture), will eventually lead to the problematic situation where adding people leads to lower throughput.

A different kind of “debt” is the inability to properly value the IP you’ve developed. The cost of development is very different from the value of the outcome. E.g. a company may spend thousands of hours developing a custom log handler, but the value of such a thing is probably very low. This is hard to accept for the people involved, and it often leads to friction when someone points out that the outcome of 1000 hours of work is actually worthless (or possibly even provides a net negative value for the product). A lot of (additional) time may be spent trying to persuade ourselves that we didn’t just flush 1000 hours down the drain, as we’re more inclined to believe a soothing lie than the painful truth.

Solutions?

A company that wants to solve the debt problem must first take a good look at its core values. Not the values it pretends to have, but the actual values; what makes management smile and how it handles the information given to it. Does management frown when a scalability issue is discovered? Do they yell and slam doors, and point out 20 times that “we will lose the customer if we don’t fix this now!”? The team lead hurries down the hallway, and the team pulls out cans of Pringles and they start ripping off pieces of tape.

The behavior might make the manager feel good. The chest-beating alpha-manager put those damn developers in their place, and got this shit done! However, over the long run, it will lead to 3 things: 1) developers will do a “quick fix”, because management wants this fixed quickly, rather than correctly, 2) developers will stop providing “bad news”, and 3) developers that value correctness and quality will leave.

To the manager, the “quality developer” is not an asset at all. It’s just someone who wants to delay everything to fix an intake that is already working “perfectly”. So over time, the company will get more and more duct-tapers and hacks, and fewer craftsmen and artisans.

The only good thing about technical debt (for a coder) is that it belongs to the company, and not to the employees. Once they’re gone, they don’t have to worry about it anymore. Those that remain do, and they now have to work even harder to pay it back.

Why Products Go Bad

The simpleton will equate commercial success with quality.

I don’t.

A product can be well made, even if it is not commercially successful and vice versa. The Microsoft Zune HD, for example, was a great product. Hell, Microsoft’s Phone OS is/was good too. In contrast, Kinect is/was a terrible product. It promised the world, and it was shit. Johnny Lee proved that Nintendo’s controllers were fucking awesome, and Microsoft wanted some of that goodness. Most people at Microsoft knew how piss poor Kinect was, most devs knew too, but management did not want to be upstaged by Nintendo, so they released this fine piece of junk. Molyneux flat out lied about the capabilities of the thing (and he was not the only one I’m sure).

Sometimes, and perhaps too often, we see products that have the potential to be “good”, and perhaps they are already good, but then, gradually as time passes and new generations of the product are released, they turn to utter crap. Why does this happen? You would expect the opposite to be true. You’d expect that the next generation of a product improved on the old.

My own experience is that I am generally considered “an overthinker”. Instead of just shutting up and doing what “the customer asks”, I think about the ramifications over the longer term. I try to interpret what the real problem is, and I spend a long time thinking about a good solution. I spend a lot of time talking about the problem with my peers, drawing on whiteboards. I think about the issues as I drove to the office, while I flew across the Atlantic. And sometimes, I change my mind. Sometimes, after long discussion, after “everyone agrees”, I see things in a new light and change my mind. And it pisses people off.

In the general population, I believe that there is a large percentage who just want to be told what to do, do what they are told and then at 5.15 pm drive home and watch TV, happy and content that they did what they were told all day. To the majority, “a good day” is doing as much of what you’re being told as possible, regardless of what the task is. They do not want to be interrupted by assholes that can’t offer them a promotion or a raise, who critique the “what” or the “how” – regardless of merit. The “customer” to them, is not the user of the product, the “customer” is their immediate supervisor. Make that guy happy, and you move upwards.

Telling people that unchecking “always” does not mean “never” makes people angry. They can understand the logic (not always = sometimes), but they are angry that you can’t understand that their career is jeopardized if they point that out when their supervisor tells them to make that change. They will correct the problem if a supervisor tells them to – even if it screams in their face that this is useless to the end user. Doesn’t matter. The end user does not dish out promotions or raise their salary.

As these non-thinkers move up, they get to supervise people like me (JH: No, this has not happened at OnSSI). And that’s where it gets really bad. Now they are in a position where they are told what to do, and they are telling someone else to do that thing (nirvana), and then they learn that the asshole doesn’t want to listen and do what he is told, like “everyone else” does, so eventually the “overthinker” is replaced with a non-thinker, and this continues until all the thinkers are gone, and the company or branch then does exactly what the customer asks.

When you see features that flat out do not work and never did work, and there’s no motivation to fix that issue, then you have to pause, and consider if you have enough thinkers among the non-thinkers.

Because you need both.

You need lying sales and marketing people (that know just how far the truth can be stretched, or who can make a reality distortion field), you need asshole genius programmers who know iOS, gstreamer, ffmpeg and Qt, you need vain and arrogant designers who can draw the best damn icons and keep everything consistent across the apps, you need dried up, mummified sysops to run IT.

But most of all, you need to make sure that these people think, and care about the end user, instead of just the title on their business card.