How not to use OpenStack Community Metrics tools

I noticed this morning yet another article/blog post mistakenly trying to extrapolate hard facts about a company’s involvement in OpenStack using one of the reporting tools built for the community. The reporter went to Stackalytics (they could have gone to Activity Board instead; the result would have been the same) to check whether Oracle had made any contribution to the OpenStack codebase. That’s the wrong way to use these tools: the numbers about companies contributing to OpenStack that these sites publish daily cannot be taken at face value.

Both Stackalytics and Activity Board depend on data entered voluntarily by the contributors to OpenStack, and that is exactly why they cannot be fully trusted. Stackalytics has a mapping file in its repository that is kept up to date by developers themselves (those who know of its existence). Activity Board pulls data straight from the OpenStack Foundation Members database: when you sign up as a Member of the Foundation (a precondition to becoming a developer), you’re asked to enter data about who pays for your contribution. The bylaws of the Foundation also require that you keep that information up to date, but we know for a fact that few people log back into their member profiles, and even fewer update their affiliation. Therefore the affiliation data in all reporting tools is never 100% reliable at any point in time.

It’s good enough if you’re looking at the top contributing companies, where the volumes are high enough to remain fairly valid despite small percentage errors. But when a reporter goes to check whether a total newcomer to the community has submitted any code, that number is very likely to be wrong (and close to zero): new developers may not have understood what the Affiliation field is and left it blank (I see a lot of those on a weekly basis), and they’re very unlikely to know about the mapping file in git for Stackalytics.
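The asymmetry is worth spelling out. A toy calculation with made-up numbers (none of these figures come from the actual databases) shows why the same handful of misattributed commits is noise for a large contributor but completely distorts the count for a newcomer:

```python
# Toy illustration with invented numbers: a fixed absolute error in
# misattributed commits barely moves a top contributor's total, but
# overwhelms a newcomer whose real count is near zero.
top_contributor_commits = 2000   # hypothetical established company
newcomer_commits = 3             # hypothetical newcomer
misattributed = 5                # commits with a wrong or missing affiliation

print(f"relative error, top contributor: {misattributed / top_contributor_commits:.1%}")  # 0.2%
print(f"relative error, newcomer:        {misattributed / newcomer_commits:.0%}")         # 167%
```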

The data I trust most (though still not 100%, especially for ‘long tail’ contributors) are the reports published with Bitergia at release time: at every OpenStack release we do a lot of manual cleanup of the data in the Foundation database, ask people to update their affiliation, normalize the names of companies and circulate the reports for comments before making them public. Still, those may contain errors, which we track on Launchpad.
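To give an idea of what ‘normalize the names of companies’ means in practice, here is a minimal sketch of that kind of cleanup step; the alias table and function are invented for illustration, not taken from the actual Bitergia tooling:

```python
# Hypothetical alias table: many raw affiliation strings map to one
# canonical company name before per-company counts are aggregated.
ALIASES = {
    "ibm corp.": "IBM",
    "international business machines": "IBM",
    "redhat": "Red Hat",
    "red hat, inc.": "Red Hat",
}

def normalize_company(raw: str) -> str:
    """Return the canonical company name for a raw affiliation string."""
    return ALIASES.get(raw.strip().lower(), raw.strip())

assert normalize_company("  IBM Corp. ") == "IBM"
assert normalize_company("RedHat") == "Red Hat"
```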

As far as I know, the reporter didn’t ask either the Foundation or Oracle whether anybody could point at actual commits done by Oracle employees, and that’s what he should have done.

OpenStack prides itself on being an open community, and I was the first to propose having a public way to see the various activities inside the project, in real time and including information about companies, not just individuals. I think we need to discuss how we can provide better data and avoid giving a false illusion of precision to casual visitors to these sites.

11 thoughts on “How not to use OpenStack Community Metrics tools”

  1. I think this is a reasonable use of the tools. If a company doesn’t have its affiliation information up to date, then it needs to get that fixed. The company can then post information showing that it is indeed contributing.

    The onus of keeping an organization’s metrics properly updated needs to be on the organization, not on the person using the metrics. Untrustworthy metrics aren’t useful metrics.

    Maybe there’s something we can do to make the process less error-prone, so that contributors get their affiliations right?

    1. I think the main problem with delegating responsibility to companies is that they have very few tools to fix the issue themselves. Nor do I think it’s fair to ask newcomers to do yet another thing (set the mapping file in Stackalytics, or something else) before they can commit code or file a bug: it’s already hard enough to start interacting with OpenStack. I think we need to find a better way.

      One line of thought I have is to create the concept of ‘groups’ and ‘sub-groups’ in our members database and tie that concept to the Corporate Contributor License Agreement. I think it deserves a blog post of its own. The basic idea is that the person who manages a development team at, say, IBM, instead of signing the Corporate CLA on Echosign and providing the list of authorized committers on the same platform, creates a group on the ID.openstack.org service and assigns the individual members of his/her development team to that group. This would also simplify the management of the Corporate CLA. It won’t solve 100% of the issues for people and companies who participate in other activities that are not managed via Gerrit, though. A rough sketch of the idea follows.
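      Purely to make that concrete, here is a minimal sketch of what such a group could look like as a data structure; every name in it (Group, manager, authorize and so on) is hypothetical, not an existing OpenStack API:

      ```python
      from dataclasses import dataclass, field

      @dataclass
      class Group:
          """A company's development team, managed by the Corporate CLA signer."""
          company: str
          manager: str                                     # Foundation member who signed the CLA
          members: set[str] = field(default_factory=set)   # member IDs of authorized committers

          def authorize(self, member_id: str) -> None:
              """The manager adds a team member; no Echosign round-trip needed."""
              self.members.add(member_id)

          def is_authorized_committer(self, member_id: str) -> bool:
              return member_id in self.members

      ibm_team = Group(company="IBM", manager="manager-member-id")
      ibm_team.authorize("developer-member-id")
      ```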

  2. Just a couple of notes on Stackalytics affiliation rules.
    1. Here is a description of the algorithm: https://wiki.openstack.org/wiki/Stackalytics#Company_affiliation
    It is not only about the manually maintained mapping file (which is maintained directly by contributors). The main source of affiliation data is the email address in the commit info: if a commit is signed with hnarkaytis@mirantis.com, then the affiliation is evident.
    2. There are 701 contributor profiles for Icehouse at the moment. 97 (13.8%) are classified as *independent; the remaining 600+ contributors are affiliated according to their email domains or the mapping file. *independent contributors made only 7% of all commits. There is a regular mail campaign for *independent contributors that ensures all of them are aware of their affiliation, and the top 10 *independent contributors (50%+ of those commits) confirmed their affiliation via email. This means the maximum measurement error is about 3%.

    As a core contributor to Stackalytics I can confirm that affiliations are not 100% correct; the confidence interval for all measurements is about 3%.
    The issue you raised is valid, and we are going to put an official disclaimer on the landing page.
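    To illustrate the domain-first heuristic described above (and the resulting error bound), here is a minimal sketch; the lookup tables are invented stand-ins for the real Stackalytics data, not its actual code:

    ```python
    # Hypothetical stand-ins: a domain-to-company table and the
    # contributor-maintained mapping file discussed in the post.
    DOMAIN_MAP = {"mirantis.com": "Mirantis", "redhat.com": "Red Hat"}
    MAPPING_FILE = {"jane.doe@gmail.com": "Acme"}

    def affiliation(email: str) -> str:
        """Domain first, mapping file second, *independent as a last resort."""
        domain = email.split("@")[-1].lower()
        if domain in DOMAIN_MAP:
            return DOMAIN_MAP[domain]        # e.g. hnarkaytis@mirantis.com -> Mirantis
        return MAPPING_FILE.get(email, "*independent")

    # Rough error bound from the figures above: *independent authors wrote
    # 7% of all commits; the confirmed top 10 cover 50%+ of that, so the
    # unverified remainder is below 3.5% -- hence the ~3% figure.
    assert affiliation("hnarkaytis@mirantis.com") == "Mirantis"
    assert affiliation("someone@example.org") == "*independent"
    ```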

    1. You’re right, Herman, I forgot to mention that there are heuristics that help ‘guess’ the affiliation for the highest possible percentage of commits.
