This is a sequel to Getting Out of Quicksand, With DevOps!.
This story was told before at a number of conferences, e.g. at ADDO 2020:
More slides and videos are available here.
I added additional links where appropriate and try to attribute sources as well as possible. If you find an error or have a comment, please contact me (see the bottom of the page).
TL;DR: Just give me the code!
You can find the code of the dashboards that I developed here: https://github.com/rompic/Smashing-Flowboard
After putting in countless hours improving the deployment pipeline, investing in automation and deploying new technologies, it is time to ask this fundamental question: “Are we really moving faster?”
This is a story of how we made work visible by applying DevOps and Flow Metrics to discover bottlenecks and improve flow. We did this using dashboards, which are great cultural change tools, as they visualize problems and spark discussion.
I provide concrete steps to implement key metrics, automatically collect and visualize them on an open source dashboard and find an answer to this important question.
Key Takeaways are:
- A brief Intro to Value Stream Mapping
- Actionable DevOps and Flow Metrics
- An Implementation Example using an Open Source Solution
- References and pointers to advanced material
How did I end up here?
My name is Roman Pickl and for the last two years I’ve been a technical project manager at Elektrobit, an automotive software supplier. Before that, I was CTO of a medium-sized company called Fluidtime, and earlier a process manager at the Austrian parcel service, which also deals with some kind of continuous delivery, I guess…
I have a background in software engineering, business administration and computer & electronics engineering. CI/CD/DevOps is the sweet spot for me, as I really love how the things I learned in my Production Management and Operations Research courses are nowadays applied in the IT domain.
One aspect that I really liked about my job in the operations department of the Austrian parcel service back in 2009 was the fast physical feedback and visibility of problems.
There were more subtle and hard-to-find process errors as well, of course, but if one of the main systems or processes did not work as expected, boxes started to pile up at the bottleneck, providing a hard-to-ignore indicator of the problem.
I moved on after about 1.5 years, but since then I have always missed this clear feedback signal in my IT jobs.
I was missing “Ambient Awareness”. I think I first read about this concept in Michael Nygard’s 2007 book Release It! (there is also a 2018 second edition, but I haven’t looked into it yet). The idea is to create an Ambient Display, “an Interface between People and Digital Information”, which represents data, e.g. the health of a system, with the help of sound, visuals, movement or other cues. I had the honor to work with Michael Kieslinger, who published several papers on this topic at the Interaction Design Institute Ivrea and later founded Fluidtime around this concept.
These kinds of “information radiators”, which should be put in a highly visible location to promote responsibility in the team (nothing to hide) and provoke conversation, can be traced back to, you may have already guessed it, the Toyota Production System (https://www.agilealliance.org/glossary/information-radiators).
When I spoke at DevDays Europe 2019, Steve Poole gave an inspiring keynote about Dashboards and Culture: How openness changes your behavior (newer recording). He told a story about how sharing insights on dashboards closes communication gaps, forces discussion on how to generate accurate data / metrics and changes your culture. Putting data on a dashboard made the problem “real”; before that it was just data in a spreadsheet. He also talks about measuring end-to-end times. When I asked Gene Kim about tools which revolutionized IT work, he also mentioned dashboards and Steve Poole’s talk.
I had already collected the data that I wanted to show for our weekly status meetings on a Wiki page by hand for a few months. However, I wanted to collect it automatically and have up-to-date data all the time. So I wanted to visualize our work on an automated dashboard. I had previous experience putting Redmine Agile’s Ajax board, JIRA wallboards, a Jenkins Build Monitor, and Graylog dashboards on the wall and cycling through them in browser tabs using extensions like Revolver, but this time I was looking for something more integrated. What is more, I remembered a quote from Winston Churchill:
We shape our buildings and afterwards our buildings shape us.
— Winston Churchill
It also reminded me of the Skoda plant and the BMW Project House discussed in Thomas J. Allen and Gunter Henn’s book The Organization and Architecture of Innovation - Managing the Flow of Technology, where every employee has to pass certain points of the assembly line or up-to-date prototypes before arriving at their workplace.
So, working in a distributed team, I wanted to have the data available on our intranet, but also on a highly visible screen in the entrance area / hallway where everyone passes by a few times a day.
Creating a Dashboard
I already had a Raspberry Pi 3 at hand, but quickly learned that getting my private device on the company network is more or less impossible (getting it on a whitelist to get an IP, etc.). What was even more startling was that it was very difficult to get an additional monitor to show a dashboard. I asked every other month, but, given that we were growing, there was always a scarcity of time (also my own) and resources.
In retrospect, I was very successful last year by piggybacking on a planned change: part of our company moved to a new office, down the street from the existing office.
The new office was renovated and a wish list / things-to-do list was created. So I asked IT for the stuff I needed for my dashboard (Raspberry Pi and a monitor).
I think that because they were in “change mode”, solving problems, buying hardware and setting up the network, it was easier to get things approved, and I got these devices. I may also have been lucky: there was a plan to use a Raspberry Pi 3 Model B Plus Rev 1.3 (Broadcom BCM2837B0, Cortex-A53 (ARMv8) 64-bit SoC @ 1.4GHz, 1GB LPDDR2 SDRAM) incl. case, SD card, power supply, WiFi dongle and HDMI cable for something else, and that plan got canceled. In any case, I got this official device and a monitor from IT.
I would get a Raspberry Pi 4 with at least 2 GB of RAM now, as I ran into some memory issues.
There were also some logistical problems (I guess that’s normal) with setting up the new office space, and I had some time to play around with Dashing/Smashing. As I had some previous experience with it, I gave it a go and was happy enough to keep it. There are other options like Tipboard or Mozaik, but unfortunately none of them seems to be very active.
Smashing is a Sinatra-based dashboard framework. It comes with a number of pre-installed widgets based on SCSS, HTML, and CoffeeScript. There is also a large number of user-submitted widgets that are easy to adapt, and you can get started easily with writing your own widget by following their workshop. You can either use jobs written in Ruby to update your widgets or push data to the API (see https://github.com/Smashing/smashing/wiki/How-To%3A-update-dashboard-in-Django). As Smashing runs best on Linux (https://github.com/Smashing/smashing/wiki/Installation), I used the following docker image for testing and development: https://hub.docker.com/r/visibilityspots/smashing. Note that some people have also been using it on Windows lately.
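To illustrate the push approach, here is a minimal Ruby sketch of posting data to a Smashing widget over HTTP. The widget id `open_prs` is a placeholder, and the auth token must match the one configured in your dashboard's `config.ru`:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the JSON body Smashing expects: the widget's data fields
# plus the dashboard's auth_token.
def widget_payload(data, auth_token)
  data.merge(auth_token: auth_token).to_json
end

# POST a data point to a running Smashing instance.
# base_url e.g. "http://localhost:8080"; widget_id e.g. the
# placeholder "open_prs".
def push_to_widget(base_url, widget_id, data, auth_token)
  uri = URI("#{base_url}/widgets/#{widget_id}")
  Net::HTTP.post(uri, widget_payload(data, auth_token),
                 'Content-Type' => 'application/json')
end

# Example call (assumes a Smashing instance is running locally):
# push_to_widget('http://localhost:8080', 'open_prs', { current: 7 }, 'YOUR_AUTH_TOKEN')
```

A cron job or CI hook calling a script like this is an easy way to feed a widget without writing a Ruby job inside the dashboard itself.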
A word of caution: I’m neither a Ruby nor a CoffeeScript dev, so feel free to improve the code and setup.
I was ready to put some (private) time into implementing a first version of the dashboard and set it up in the hallway, ready for the official opening of the office.
So I ran:
docker run -d -p 8080:3030 visibilityspots/smashing
and pointed my browser to http://localhost:8080/.
It was up and running with a first dashboard:
The docker image allows you to map the dashboards, jobs and widgets folders from your local disk, which I used extensively to speed up development.
Based on the metrics I had collected by hand for a few months, I wanted to visualize the following things:
- Next milestones, important dates and releases
- Open Pull Requests
- Open Support Tickets
- Status of Jenkins Jobs, including build time and failing tests
- Jira tickets per status, as also visible on our Kanban board
This is what I came up with (Data has been changed to protect the innocent, code can be found here https://github.com/rompic/Smashing-Flowboard ):
All widgets are clickable and lead to the data source.
Putting it on the Raspberry Pi
Running the Dashboard on a Raspberry Pi and connecting it to an external monitor was the next step.
- The Raspberry Pi already had Raspbian installed. If not, you can use NOOBS to install it.
- The first thing I did on the Raspberry Pi was to install log2ram as described in https://github.com/azlux/log2ram#with-apt-recommended. It reduces the number of writes to the SD card and hence prolongs its life. I had run out of write cycles on an earlier setup, so I thought that was a good idea.
- I then installed Ruby and Bundler on the Raspberry Pi:
sudo apt-get install ruby2.5-dev
sudo gem install bundler
- I then followed the getting-started chapter in the documentation. As indicated by an error message I also had to run
bundle update --bundler
I also did a few other things, but I won’t go into too much detail, as you will find a lot of information online. If you have any problems, feel free to contact me (see bottom of the page):
- Enabled ssh remote access: https://www.raspberrypi.org/documentation/remote-access/ssh/
- Set up the init.d script: https://github.com/Smashing/smashing/wiki/Init.d-script (change the paths and install the daemon package first!) and added it via update-rc.d
- Wrote a cron job which turns the monitor on and off: https://www.screenly.io/blog/2017/07/02/how-to-automatically-turn-off-and-on-your-monitor-from-your-raspberry-pi/
- Enabled automatic security updates: https://www.elektronik-kompendium.de/sites/raspberry-pi/2002101.htm / https://www.zealfortechnology.com/2018/08/configure-unattended-upgrades-on-raspberry-pi.html
- Set Chromium to start in kiosk mode on autostart: https://raspberrypi.stackexchange.com/a/40745
- Disabled screen-sleep https://raspberry-projects.com/pi/pi-operating-systems/raspbian/gui/disable-screen-sleep ([SeatDefaults] is now called [Seat:*])
- You can find some thoughts about reducing burn-in on your screen here: https://www.reddit.com/r/sysadmin/comments/67ty0p/dashboard_screens_without_burnin_issue/ . In the end (we will come to that later) I implemented two dashboards and installed https://github.com/vrish88/sinatra_cyclist to be able to cycle through them automatically.
- Added ssh keys to be able to checkout from git without a password
- Added a cron job to pull my dashboard code from our repository’s master branch at 6:00 in the morning in case there are any updates
- Added a cron job to restart at 6:15 in the morning due to running into memory problems. I might have to revisit this setup.
- Increased swap from 100 MB to 1 GB (I did not want to do this, as it can decrease the lifespan of the SD card, but I ran into out-of-memory situations with Smashing). This seems to be a known unsolved problem in Dashing/Smashing. I have created a pull request to update sprockets on ruby >= 2.5, which seems to help a little (see details in https://github.com/Smashing/smashing/pull/156), and updated all the gems (https://github.com/Smashing/smashing/pull/157). Using the techniques described in https://samsaffron.com/archive/2019/10/08/debugging-unmanaged-and-hidden-memory-leaks-in-ruby I couldn’t find a memory leak after that, but I might give it another try in the future.
The dashboard sparked a lot of interesting discussion during the opening party, and we also got some great feedback about our innovative ways of working. Ever since, the dashboard has been part of the new office, evolving into an important indicator of the current status and a source of new change initiatives.
I had succeeded in bringing back ambient awareness. That’s when I noticed a problem.
Applying the Three Ways of DevOps, especially by experimentation and by identifying bottlenecks in the build and test run, we were able to cut the full build/test cycle by a factor of 3 in the first few months of 2018. Moving our code to a git mono repo and containerizing our build environment in 2019 allowed us to provide feedback to our developers on every commit within minutes, not hours. Furthermore, automating our delivery allowed us to provide a new version of our software with the click of a button. This was great and we felt happier and so freaking agile.
Alex, the leading character of Eliyahu Goldratt’s The Goal, is very proud of the increased “productivity” achieved in the plant by deploying robots, when Jonah, the management guru, asks him a few questions. In summary, the dialog evolves like this:
- Is the company now making more money? No.
- Did you ship even one more product? No.
- Are plant inventories down? No.
- Are employee expenses down? No.

Then you didn’t really increase productivity; your inventories are going through the roof, aren’t they?
Looking at the dashboard, the inventory was staring me in the face:
Imagine all these tickets were boxes lying around in the hallway; they would have been much harder to ignore. They don’t have any value as long as they are not released. Furthermore, it doesn’t really make sense to add more.
Also see the discussion of Done vs. Done Done in Dominica DeGrandis’s Making Work Visible - How to Unmask Capacity Killing WIP, page 122 ff.:
Think of a box of cereal sitting on a grocery store shelf. Corn flakes don’t provide any value to Kellogg’s until a customer buys them. Like inventory sitting on a shelf, a newly developed feature or bug fix doesn’t provide much value to the requestor until they can get their hands on it. — Dominica DeGrandis in Making Work Visible - How to Unmask Capacity Killing WIP
The bottleneck in development had shifted to testing and we were creating a lot of inventory.
In the Beyond the Phoenix Project audio book, Gene Kim and John Willis discuss these shifting bottlenecks specifically in the IT domain. They are also discussed in the DevOps Handbook (pages 22-23 in my copy) and by Gene Kim in an AMA on The Unicorn Project: https://youtu.be/ReROx9-68V8?t=818
They talk about five progressions:
- Environment creation: A common first bottleneck is getting a deployment environment. A potential solution is to provide environments on demand and self-service, e.g. through automation / virtualization / infrastructure as code.
- Code deployment: The bottleneck then often moves to code deployment, where the solution is automation, reducing hand-offs, and moving towards self-service, single-piece flow and continuous delivery.
- Test setup and run: It then often progresses to testing (tests take too long for faster deployments, manual tests, etc.): massively automate the test process, move from integration tests to unit tests and parallelize.
- Overly tightly coupled architecture: It then often moves to architecture, where small changes need a lot of approval from other teams etc.: move to loosely coupled architectures / components that can be deployed independently.
- If these constraints have been resolved, the bottleneck moves to development or product managers, running out of great ideas or deciding which ideas to validate with real, live customers. Effort should then shift to improving the flow from idea to delivery (“aha to ka-ching!”).
When I heard and read this, I was reassured that our efforts were going in the right direction. It also reminded me of the J-curve of automation mentioned in the 2018 State of DevOps report which states that you really need relentless improvement, refactoring and innovation to reach a state of excellence.
DevOps metrics and measuring flow
As explained by Jez Humble in this Twitter thread, the DORA metrics deal with the product delivery domain (Build, Testing, Deployment), while flow metrics deal with the Product Design and Development domain. Your team is in both domains simultaneously if it is cross-functional. A predictable, low-variability delivery process enables working in smaller batches and taking an experimental approach to product development and process improvement.
In the DORA - Accelerate: State of DevOps 2019 report, the authors have identified four key metrics to differentiate low, medium and high performance:
- lead time of code changes from check-in to release
- deployment frequency
- time to restore: from detecting a user-impacting incident to having it remediated
- change fail rate as a measure of the quality of the release process: what percentage of changes degrade the service and require remediation.
While availability is shown in the corresponding figure in the report, they do not include it in their cluster analysis, as it does not apply in the same way to different software products.
The authors show that these metrics do not represent trade-offs between throughput and stability, but rather that high performers succeed in improving all these four metrics at the same time and stability and speed enable each other.
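To make these four metrics concrete, here is a small Ruby sketch of how they could be derived from deployment records. The data model is my own simplification for illustration, not something taken from the report:

```ruby
# Sketch: deriving DORA-style metrics from deployment records.
# lead_time_hours: check-in to release; failed: the deployment degraded
# the service; restore_hours: time to remediate such an incident.
Deploy = Struct.new(:lead_time_hours, :failed, :restore_hours)

def dora_metrics(deploys, period_days)
  sorted = deploys.map(&:lead_time_hours).sort
  failed = deploys.select(&:failed)
  {
    median_lead_time_hours: sorted[sorted.size / 2],
    deploys_per_day: deploys.size / period_days.to_f,
    change_fail_rate: failed.size / deploys.size.to_f,
    mean_time_to_restore_hours:
      failed.empty? ? 0 : failed.sum(&:restore_hours) / failed.size.to_f
  }
end
```

In practice you would populate the records from your CI/CD and incident-tracking tools rather than by hand.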
Based on these insights, and looking at our current state, I was especially interested in the throughput part and aimed to measure flow.
The Flow Framework, introduced in Mik Kersten’s book Project to Product, defines four different Flow Items (features, defects, risks and technical debt), which describe all the work in a value stream (mutually exclusive and comprehensively exhaustive), and proposes to track the following metrics:
- Flow Load: The number of Flow Items being actively worked on in a value stream, denoting the amount of WIP (work in progress). Monitors over- and underutilization, which can lead to reduced productivity.
- Flow Time: The duration that it takes for a Flow Item to go from being accepted for work into the value stream to completion, including both active and wait time. Monitors if time to value is getting longer.
- Flow Velocity: The number of Flow Items done in a given time. Also referred to as throughput. Gauges whether value delivery is accelerating.
- Flow Efficiency: The proportion of time Flow Items are actively worked on relative to the total time elapsed. Identifies when waste is increasing or decreasing in the process.
It also encourages having a look at Flow Distribution, the allocation of Flow Items in a particular flow state across a measure of time, which helps to prioritize specific types of work during specific time frames in order to meet a desired business outcome or to see trade-offs.
It is business-outcome driven, as it also recommends tracking business value, cost, quality and team happiness (with a survey) and correlating them with the Flow Metrics.
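These definitions translate directly into code. A minimal Ruby sketch, assuming a simplified work-item model of my own (dates from the ticket system, `active_days` the time actually spent working on the item):

```ruby
require 'date'

# Simplified Flow Item: done_on stays nil while the item is in progress.
FlowItem = Struct.new(:type, :accepted_on, :done_on, :active_days)

def flow_metrics(items)
  done = items.select(&:done_on)
  # Flow Time: accepted-for-work to completion, active plus wait time
  flow_times = done.map { |i| (i.done_on - i.accepted_on).to_i }
  {
    flow_load: items.count { |i| i.done_on.nil? },  # current WIP
    flow_velocity: done.size,                       # items completed
    avg_flow_time_days: flow_times.sum / [done.size, 1].max.to_f,
    # Flow Efficiency: active time over total elapsed time
    flow_efficiency: done.sum(&:active_days) / [flow_times.sum, 1].max.to_f
  }
end
```

Flow Distribution falls out of the same data, e.g. `items.group_by(&:type)` counted per time window.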
Carmen DeArdo gives a great overview:
If you want to know more about flow metrics you might be interested in watching two videos from last year’s All Day DevOps event:
- Vlatko Ivanovski - DevOps Metrics - Measuring What Matters
- Dominica DeGrandis - Making Better Business Decisions With Flow Metrics
and this one
So I created another dashboard (Data has been changed to protect the innocent, code can be found here https://github.com/rompic/Smashing-Flowboard ):
Notice that while it provides some insights into features and defects, we currently do not track risks and technical debt (some are in the improvement category) as explicitly. Every 60 seconds we rotate through this flow metrics dashboard and the status dashboard using https://github.com/vrish88/sinatra_cyclist.
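For reference, enabling this rotation takes only a couple of lines in the dashboard's `config.ru` (the dashboard names below are placeholders; see the sinatra_cyclist README for details):

```ruby
# config.ru (excerpt): cycle through the two dashboards automatically
require 'sinatra_cyclist'

# placeholder dashboard names; point the kiosk browser at /_cycle
set :routes_to_cycle_through, [:status_board, :flow_board]
```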
After the next deployment, I stood in front of the dashboard.
It dawned on me. We were shipping more often, but as we didn’t deploy from master, but rather patches from a release branch, on average we got slower. We had a fast lane for fixes, which were fixed on master and backported to the release branch (which is the way to go if you use branches at all; see trunk-based development), but it still took us too long to ship features, which were waiting to be released.
It may look like a crisis, but it’s only the end of an illusion
So we looked into cutting our release cycle for major releases from every half year to every quarter or even more often.
Still, it seemed as if we were always late, with priorities / requirements changing in between these cycles.
I felt like we were improving our development process, constantly running but remaining in the same spot, as in the Red Queen’s race:
“Well, in our country,” said Alice, still panting a little, “you’d generally get to somewhere else—if you run very fast for a long time, as we’ve been doing.”
“A slow sort of country!” said the Queen. “Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!”
— Lewis Carroll - Through the Looking-Glass, and What Alice Found There
Valuable ideas sit in 12 to 18 months of big up-front planning with no sense of urgency and no questioning of whether they might add more value than what has already been locked into the plans for the year. As soon as they reach the product development team, they are urgent.
I had already heard of a similar phenomenon called Water-Scrum-Fall from Jez Humble in his GOTO 2015 presentation (also see Dave West’s article from 2011). Similarly, Lean Enterprise (pos 2959) states that “making one process block more efficient will have a minimal effect on the overall value stream. Adopting agile in a single function (such as development) has little impact on the value stream and customer outcomes”. So this was not the first time I heard about this difference between agile development and business agility, but the “we are so freaking AGILE, yay!” picture from Klaus Leopold in the above-mentioned article really drove it home for me.
We had a limited system focus and had to turn to a powerful tool: Value Stream Mapping.
A brief Intro to Value Stream Mapping
As stated in the DevOps Handbook (page 60 ff.), work typically starts with the product manager/owner gathering requirements based on a customer request or a business need. A development team adds this feature to their backlog, plans it for an iteration and implements the feature. The code is then built, integrated and tested. Finally, it gets deployed and released to the customer where, if everything works well, it creates the desired value.
Value Stream Mapping is a technique to visualize and understand the relevant and critical steps necessary to create and deliver value. Of course, it can be traced back to Toyota. In a workshop with representatives involved in each step, as well as people able to authorize the required changes, the current state is documented and lean metrics are identified; based on this, the future state map (one to three years out) is derived and an action plan (typically three to twelve months) is created. The Improvement Kata can be used to move towards the future state. This approach helps uncover bottlenecks and long wait times and eliminate waste/rework, as in value streams of any complexity no single person knows all the steps necessary to create value for customers.
For a draft agenda of a value stream mapping workshop see the dojo consortium’s website.
According to Chapter 7 in Jez Humble, Joanne Molesky and Barry O’Reilly’s book Lean Enterprise, the goal is not to map every single step in detail but to get an overview with 5-15 process blocks. For each process block, the team which performs it, the activity and its name are recorded. Real data is gathered about the current state: the people involved, barriers to flow, the amount of work in each process block, as well as queues / inventory between processes. Additionally, three key metrics are recorded:
| Metric | What it measures |
| --- | --- |
| Lead Time (LT) | The time from the point work is made available to a process to the point it hands that work off to the next downstream process |
| Process Time (PT) | The time spent executing a particular process (with all necessary information and resources, working uninterrupted) |
| Percent complete and accurate (%C/A) | The proportion of times a process receives something from an upstream process that it can use without requiring rework |
Note that some authors also use cycle time as a metric. Karen Martin and Mike Osterling avoid it entirely in Value Stream Mapping: How to Visualize Work and Align Leadership for Organizational Transformation, as it has several definitions and is used synonymously for different things.
Based on these metrics, summary metrics like total lead time, total process time, activity ratio (total process time divided by total lead time) and accumulated/rolled %C/A are calculated.
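The calculation of these summary metrics is straightforward. A small Ruby sketch, with block names and numbers invented purely for illustration:

```ruby
# Sketch: summary metrics for a value stream of process blocks.
# lead_time / process_time in days; pca as a fraction (0.9 = 90% C/A).
ProcessBlock = Struct.new(:name, :lead_time, :process_time, :pca)

def vsm_summary(blocks)
  total_lt = blocks.sum(&:lead_time)
  total_pt = blocks.sum(&:process_time)
  {
    total_lead_time: total_lt,
    total_process_time: total_pt,
    # share of the total lead time spent actually working
    activity_ratio: total_pt / total_lt.to_f,
    # rolled %C/A: product of the individual %C/A values
    rolled_pca: blocks.map(&:pca).reduce(:*)
  }
end
```

Even two hypothetical blocks, e.g. development with 90% C/A feeding testing with 50% C/A, roll up to a sobering 45%, which is exactly the kind of number that sparks discussion in a mapping workshop.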
Value stream mapping is used in various industries. See https://cloud.google.com/solutions/devops/devops-process-work-visibility-in-value-stream for an IT development related example and Karen Martin providing an overview of her Value Stream Mapping: How to Visualize Work and Align Leadership for Organizational Transformation book here.
Outcome and current status
While process improvements focus on where value is added, Value Stream Analysis focuses on identifying bottlenecks and eliminating waste. It turns out that this approach often has much higher leverage. As described, we found that we had a limited system focus and needed buy-in to influence the process up- and downstream.
Luckily, at the same time the organization identified focus programs to improve flow and, based on these, started a Continuous Improvement initiative which is being rolled out in 2020. We were able to connect to that program and harness what we learned to drive further change.
Due to the COVID-19 pandemic, the timeline has shifted a bit and we are still in the middle of the analysis.
However, since the beginning of the year:
- We moved to a light-weight quarterly planning cycle, as discussed in Gary Gruver’s book A Practical Approach to Large-Scale Agile Development, to address changing priorities and the Urgency Paradox.
- Our planning and status meetings were scattered over the week. Inspired by Hotjar, we moved most of our regular meetings to Monday to provide more focus during the week.
- We track Work in Progress closely, plan to work in smaller batches and double our release frequency in 2020, providing monthly patch releases and quarterly minor releases.
- We have started collecting data on employee engagement and psychological safety per value stream using the Westrum typology in a quarterly survey.
- We are heavily investing in test automation and built a new test track as well as simulation/emulation capabilities to test more of our use cases automatically.
While we made considerable progress in our journey, challenges still remain.
- We are discussing reorganizing our teams based on cognitive capacity, as discussed in Matthew Skelton and Manuel Pais' book Team Topologies, because the cognitive load of our team was too high (software, hardware, firmware, etc.). We plan to have a platform team, a complicated-subsystem team, value stream teams as well as enabling teams.
- We want to establish a common language using the stories from e.g. The Phoenix Project and The Unicorn Project
- We were planning customer visits (gemba walks) to better understand new use cases and the need for improvement, but due to the COVID-19 pandemic we had to put them on hold and find other ways to accomplish this goal.
Further references and information
If you want to know more, you should really read these books, follow these links or watch the talks of the authors:
- Mik Kersten - Project to Product
- Dominica DeGrandis - Making Work Visible - How to Unmask Capacity Killing WIP
- Eliyahu M. Goldratt - The Goal
- Gene Kim, John Willis - Beyond the Phoenix Project
- Jez Humble, Joanne Molesky, Barry O’Reilly - Lean Enterprise
- Gene Kim - The Unicorn Project
- Donald Reinertsen - The Principles of Product Development Flow: Second Generation Lean Product Development
- Gene Kim, George Spafford, Kevin Behr - The Phoenix Project
- Gene Kim, Jez Humble, John Willis, Patrick Debois - DevOps Handbook
- Nicole Forsgren, Jez Humble and Gene Kim - Accelerate: The Science of Lean Software and Devops: Building and Scaling High Performing Technology Organizations
- Karen Martin, Mike Osterling - Value Stream Mapping: How to Visualize Work and Align Leadership for Organizational Transformation - Video
- Vlatko Ivanovski - DevOps Metrics - Measuring What Matters
- Dominica DeGrandis - Making Better Business Decisions With Flow Metrics
- Carmen DeArdo - Use Flow Metrics to drive Business Results NOW
- Klaus Leopold - Rethinking Agile: Why Agile Teams Have Nothing To Do With Business Agility, Video
- Gary Gruver - A Practical Approach to Large-Scale Agile Development
- Matthew Skelton and Manuel Pais - Team Topologies
After I implemented the dashboards, I found several professional / open source solutions that address the same or a similar problem. So if you have a more complex setup or want to do something more serious, you might want to look into these:
- Tasktop Viz - announced October, 29, 2019
- HCL Accelerate, previously called Urban Code Velocity - version 1.0 announced June 19, 2018
- Hygieia - open sourced by Capital One in 2015.
Forrester has recently published an updated report called Elevate Agile-plus-DevOps with VSM, which describes the benefits of the tools available in the emerging Value Stream Management market. They have also published a report named The Forrester Wave™: Value Stream Management Solutions, Q3 2020, which lists 11 leading providers of such tools. At the time of writing, copies of both reports were available from digital.ai, one of the companies listed.
Summary and Outlook
The main business problem we faced was delivering value in a flexible way, at speed and with high quality, to our internal and external customers. We were hindered by long development cycles, 6-month budgeting periods, high workloads and priorities that were often changing. A full cycle of building and testing one of our software and hardware products took more than 24 hours. So when you did something in the afternoon, you sometimes didn’t get feedback the following day, but only the day after. This had a negative impact on developer morale and felt like quicksand: the more we fought it, the more it pulled us in. We knew there must be a better way.
Shortly after moving to a new office in 2019, knowing about the importance of making work visible and having learned about the Flow Framework, I implemented a dashboard using an open source solution (Smashing) which automatically gathered and visualized, among other things, Flow Metrics (Flow Load, Flow Time, Flow Efficiency, Flow Distribution, Flow Velocity) for our value stream. After putting in countless hours eliminating waste, improving the deployment pipeline, investing in automation and deploying new technologies, I wanted to answer a fundamental question: “Are we really moving faster?” It took me a while, plus listening to Beyond the Phoenix Project and reading The Goal, to understand:
- We were creating a lot of inventory.
- We had a fast lane for fixes, but it still took us too long to ship features.
- We delivered more often, but the new bottleneck shifted to testing.
It became clear that we were trapped in local optimization (now described by Jonathan Smart as Local Optimisation & the Urgency Paradox); we had a limited system focus and needed buy-in to influence the process up- and downstream. We were able to connect our efforts to the Continuous Improvement initiative that had just started in the company. While it is nice that a top-down program fits so well with a bottom-up effort, we still have to be aware of and thaw the frozen middle, i.e. middle managers who seem to resist transformation because the way they are incentivised has not changed.
While we are still in the middle of the analysis, we were already able to soften the pain, which is also visible in the flow metrics that we track. At the same time, we have to be aware that metrics can also do harm if they become a surrogate for strategy, as described in the recent HBR article Don’t Let Metrics Undermine Your Business (find a nice sketchnote by Kate Rutter here: https://twitter.com/katerutter/status/1234276317249425408).
Another thing that I’m interested in is looking into better tracking of risks and having a more detailed look at the productivity part of the model described in the 2019 Accelerate State of DevOps report.
Thanks for reading this article.