Wednesday, May 6, 2009

Counting the number of Ubuntu users

There have been a few articles recently trying to estimate the number of Linux users, which is apparently a challenging problem. However, I have to wonder why it can't be figured out, at least at the distro level, by simply storing hashes of the IP addresses that hit Canonical's update site and looking at the number of unique ones each week or month.

There are going to be people using mirrors, but this is a small percentage to lose in exchange for a number of the right magnitude, and the most popular mirrors could probably do a similar thing and contribute their numbers anyway. The only other main drawback would be multiple Ubuntu machines under the same IP, which again seems like it would only result in a slight inaccuracy. You'd also miss a small percentage of users who use their computers so infrequently that they don't update every month, but yearly results would pull back in any of them who use their computers often enough to warrant counting.
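A minimal sketch of the counting idea above, assuming the update server's log can be reduced to (date, IP) pairs. The salt, the function names, and the sample addresses are all made up for illustration; this is not how Canonical or any mirror actually processes its logs.

```python
import hashlib
from collections import defaultdict
from datetime import date

# Hypothetical secret salt, so stored hashes can't be reversed simply by
# hashing every possible IPv4 address.
SALT = b"example-salt"

def hash_ip(ip: str) -> str:
    """Return a salted SHA-256 digest of an IP address."""
    return hashlib.sha256(SALT + ip.encode("ascii")).hexdigest()

def unique_per_month(hits):
    """Count distinct hashed IPs per (year, month) from (date, ip) pairs."""
    seen = defaultdict(set)
    for day, ip in hits:
        seen[(day.year, day.month)].add(hash_ip(ip))
    return {month: len(hashes) for month, hashes in seen.items()}

# Made-up sample hits using documentation-only addresses.
hits = [
    (date(2009, 4, 3), "192.0.2.10"),
    (date(2009, 4, 17), "192.0.2.10"),   # same address again, counted once
    (date(2009, 4, 20), "198.51.100.7"),
    (date(2009, 5, 1), "192.0.2.10"),
]
monthly = unique_per_month(hits)
print(monthly)
```

Only the hashes are kept in memory here, so the raw addresses never need to be stored; the NAT and roaming-laptop inaccuracies mentioned above still apply.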

Alternatively, as others have also suggested, if Google would just release their numbers for browsers hitting google.com, we'd probably have a solid idea.

Are there already accurate numbers for Ubuntu and if not, am I missing something with my proposal?

UPDATE: Jef pointed out that Fedora is already doing this at http://fedoraproject.org/wiki/Statistics#Yum_Data, which is pretty awesome! That shows about 14 million unique repository connections, so making a VERY rough, not remotely scientific estimate, we could use distrowatch to estimate that Ubuntu has 1.68 times as many users as Fedora, and get something on the order of 24 million users that have connected.
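For anyone who wants to check the multiplication, here it is spelled out. Both inputs are the rough figures quoted above, and the variable names are just for illustration.

```python
fedora_connections = 14_000_000  # approximate figure from the Fedora statistics page
ubuntu_to_fedora = 1.68          # ratio taken from distrowatch, as described above

ubuntu_estimate = fedora_connections * ubuntu_to_fedora
print(f"{ubuntu_estimate / 1e6:.1f} million")
```

The result is only as good as the 1.68 ratio, which is the weakest input by far.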

21 comments:

Drew Stephens said...

This sounds like a good idea to me - each distribution's package repositories (well, except for Red Hat/Fedora with its decidedly lack-luster and ill-used repos) have a good handle on the number of actively used machines. There will be some that won't get counted, because they update from someone's local mirror, and a few that will get counted multiple times when they check from different IPs, but it should be a close-enough number.

Anonymous said...

I think you're greatly underestimating the percentage of users using mirrors. I don't think we'd be anywhere near the order of magnitude of users that way.

Unknown said...

Yes,
counting unique IPs is the first thing that can be done, but clearly this is not accurate due to NAT.

Behind a single IP you might have N users, and you don't know how many they are.

Michael said...

soren, it doesn't quite sound like you understand orders of magnitude if I understand your comment correctly. Unless people using the default mirror are in the MINORITY (I highly, highly doubt this is the case), it would easily be on the same order. And second, as I mentioned in the post, the most popular mirrors would likely contribute their counts to Canonical as well, making it irrelevant.

Stefano, I agree, but again this number would be better than throwing around download numbers and random guesses. And conversely to your point, as Drew mentions, there are going to be users on netbooks/laptops updating from many different IPs which will partially offset that issue.

Anonymous said...

You know that http://popcon.ubuntu.com/ exists?

PriceChild said...

Perhaps those hitting the security repo (central, not mirrored and used by default unless sa chooses not to) could be used... in fact I'm pretty sure they've looked at that before.

Jef Spaleta said...

What, you mean you don't trust the statement Canonical has been using every year since 2006 that Ubuntu has 8 million users? I don't trust it either.

Fedora already has a way to estimate users via IP. We have the dynamic MirrorManager service..and it has logs...

http://fedoraproject.org/wiki/Statistics#Yum_Data

People have put a lot of thought into what is actually achievable and what is not with regard to Fedora metrics:
http://fedoraproject.org/wiki/Infrastructure/Metrics

There's absolutely no reason Canonical couldn't take the MirrorManager codebase and adapt it for Ubuntu's needs. Unlike Canonical, which spends a lot of time building closed web services codebases it is reluctant to share, all of Fedora's infrastructure is done in the open..including the MirrorManager service.


MirrorManager is important enough to talk more about. Every Fedora client by default contacts the MirrorManager service asking for which mirrors to use. The MirrorManager service even lets admins on large private networks redirect Fedora clients in their network block to a local private mirror..without client reconfiguration. We still count those clients because they contact MirrorManager instead of having to be manually reconfigured to point to the local mirror. Our MirrorManager service is a benefit to both the user and the local network admin who is trying to conserve bandwidth....and it's enabled by default.


-jef
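A rough sketch of the idea described in this comment: a central service records every client that asks for a mirror, then hands back a private mirror if the client's address falls in a registered network block. This is not MirrorManager's actual code; the table, URLs, and function name are hypothetical, and the addresses are documentation-only ranges.

```python
import ipaddress

# Hypothetical table: network block -> private mirror registered by its admin.
PRIVATE_MIRRORS = {
    ipaddress.ip_network("198.51.100.0/24"): "http://mirror.campus.example/fedora",
}
PUBLIC_MIRROR = "http://mirror.public.example/fedora"

# Every client that asked, whichever mirror it was sent to.
seen_clients = set()

def pick_mirror(client_ip: str) -> str:
    """Record the client, then return a private mirror if its block has one."""
    addr = ipaddress.ip_address(client_ip)
    seen_clients.add(addr)
    for network, mirror in PRIVATE_MIRRORS.items():
        if addr in network:
            return mirror
    return PUBLIC_MIRROR

campus_choice = pick_mirror("198.51.100.25")
public_choice = pick_mirror("203.0.113.9")
print(campus_choice, public_choice, len(seen_clients))
```

The point of the design is that counting happens at the lookup step, so a client redirected to a private mirror is still seen once by the central service.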

pochu said...

IIRC the installer sets apt to use mirrors by default, so people using the canonical repo are likely the minority

Michael said...

Anonymous, I am familiar with popcon but that is a purely opt-in system so as such is only useful for relative comparisons between packages, not for getting a good count.

PriceChild, using the security repo seems like a good idea, although it appears that for me I am hitting my mirror for security too. Some users/admins probably disable everything but that anyway.

Jef, that is awesome, thanks for the resource! Do you know if the total across versions is unique across them? That is, if I have Fedora 9 installed and then upgrade to Fedora 10, do I get counted twice? That seems like a common place where people are going to get double counted over the course of a year.

pochu, hopefully getting the main and US repository picks up a large percentage, and making it easy for the other mirrors to participate would make it fairly accurate.

Anonymous said...

I do look at total uniqueness once in a while. As with all the measurement scenarios, there are some flaws -- the office I used to work in had many users on different versions, but they would be counted as one IP address since they were behind a firewall. And if the object is to capture number of *machine* installations, then you're also missing users who have more than one version on separate machines at home (as I do).

Nevertheless, the count I just found from the already collected IP lists was over 12.5 million totally unique IP addresses, out of around 14 million current IP addresses found through simple summing. Obviously this count doesn't include Fedora derivatives such as CentOS, Scientific Linux, Red Hat Enterprise Linux, and so on.
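The gap between "simple summing" and "totally unique" can be illustrated with a toy example. The per-release address sets below are invented and have nothing to do with the real Fedora data.

```python
# Hypothetical per-release sets of client IPs (documentation-only addresses).
ips_by_release = {
    "Fedora 9": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    "Fedora 10": {"192.0.2.2", "192.0.2.3", "192.0.2.4"},
}

# Simple summing counts an address once per release it shows up in.
simple_sum = sum(len(ips) for ips in ips_by_release.values())

# Taking the union first counts each address once overall.
totally_unique = len(set().union(*ips_by_release.values()))

print(simple_sum, totally_unique)
```

An address that appears under two releases, for example after an upgrade, inflates the simple sum but not the union.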

Jef Spaleta said...

Your distrowatch metric is absolute crap. I refuse to let that stand without comment.

Xandros is ranked pretty low.. and yet it has a significant number of pre-installs via being the Linux Asus uses on its Eee netbooks...for like what is it now 2 whole years. In fact from netbook sales estimates Xandros is crushing Ubuntu netbook pre-installs.

In no way whatsoever can you reasonably argue that the distrowatch metric correctly places Xandros compared to the 30 or so other distros in front of it. No way.

The distrowatch scaling metric does not stand up to scrutiny.

To understand how to use the distrowatch metric you have to understand why people are going to distrowatch. You also have to understand that distrowatch's own ranking system has a nonlinear effect on the ranking. Higher ranked distros are going to get more interest from new distrowatch visitors..because they are highly ranked. It's a feedback loop in the methodology. And it makes for an absolutely crap metric of anything at all.

If you are serious about this you need to find a metric that actually measures what you are interested in.

What you need to do is demand Shuttleworth or any other Canonical employee who has so far been quoting userbase numbers in the press for the last 3 years..that they actually describe how they get those numbers.

http://www.theregister.co.uk/2008/10/27/shuttleworth_ubuntu_commitment/
"Precise Ubuntu installed base numbers are impossible to obtain, but Shuttleworth said the most recent estimate is about 8 million users for the Linux variant. Ubuntu does not have any call-home features to help Canonical count installations. That's because Shuttleworth does not want to violate users' privacy or put up any barriers to adoption for the software. "We actually have no idea," Shuttleworth admitted."

Numbers have contexts... methodology has meaning. You can't just make up numbers and scaling factors just because they seem to fit the argument you are making. You have to test them for sanity. The distrowatch scaling factor is not a sane metric.

Michael said...

Jef, I completely agree! That's why I said it was "a VERY rough, not remotely scientific estimate". It was just meant as a fun number. I am not at all serious about it; I thought it would be obvious from my disclaimer. Obviously the ratio obtained on distrowatch is flawed in numerous ways. I'm sorry if I have offended you by multiplying some numbers together :) If you really don't think my disclaimer was sufficient, feel free to let me know why it isn't VERY rough or not remotely scientific, and I can adjust it.

pfrields, thanks for leaving a comment! It is good to hear that you guys are somewhat serious about metrics like this, and are certainly quite ahead of Ubuntu (as far as I can tell) in terms of collecting the data and being open about it. Like Jef said, it is certainly an aspect that is missing from Canonical/Ubuntu which is unfortunate considering their other marketing efforts.

Anyway I definitely appreciate some Fedora folks chiming in here and didn't mean to offend anyone with my wild extrapolations.

Jef Spaleta said...

I'm offended.. because those are exactly the sort of numbers which the lay press picks up on. By doing the calculation and publishing a calculation you know has no merit..you are not helping make the case for solid methodology for counting that can be reused by all distros..so we can get good solid numbers out to the press. Because god knows the press doesn't care about accuracy.

Do you want solid numbers or not for total linux usage? If you do, then don't publish goofball numbers yourself.

distrowatch and google trends...while "fun" to look at..have no meaning..have no value..in any well understood sense. You might as well just generate random numbers between 1 and 10 million for all distros and call them a rough estimate with a +/- 9 million errorbar on all the numbers.

MirrorManager and the statistics it generates are a methodological approach that everyone can use..we could get solid consistent numbers across pretty much all Linux distributions if they adopted the MirrorManager approach to handing out mirror information to clients dynamically. There's real value for everyone in this tech. Users, network admins, and distributors. We don't have to rely on CEOs making up deployment numbers in press interviews.

-jef

Anonymous said...

Couldn't apt send the MAC-address (or some hash of it) to some kind of MirrorManager-like service? You would then get unique machines and I don't think people would feel it would violate their privacy (I don't feel my MAC-address is something private)...
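A small sketch of what hashing a MAC address might look like on the client side. Nothing here is part of apt; the function name and salt are hypothetical, and `uuid.getnode()` is used only as a convenient way to read a hardware address.

```python
import hashlib
import uuid

def machine_id(mac: int, salt: bytes = b"example-salt") -> str:
    """Hash a 48-bit MAC address (given as an integer) into an opaque ID."""
    mac_bytes = mac.to_bytes(6, "big")
    return hashlib.sha256(salt + mac_bytes).hexdigest()

# uuid.getnode() returns the hardware address as a 48-bit integer; the
# standard library may fall back to a random number if none can be found.
print(machine_id(uuid.getnode()))
```

The same MAC always produces the same ID, so a machine would be counted once no matter how many IPs it connects from.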

Anonymous said...

Michael, I know what "order of magnitude" means, and I'd be deeply surprised if people using archive.ubuntu.com as their mirror weren't the minority. We have an extensive mirror network, and we configure Ubuntu systems by default to use country specific mirrors. Hence, metrics based on web requests to the primary archive mirror at archive.ubuntu.com would be completely useless to even come close to the order of magnitude of users. The only thing you can use it for is to get a rough idea about trends, but even then, it's a stretch.

Jef, you're missing the fact that Ubuntu is freely available from ShipIt, no matter where in the world you are, and no matter what sort of network connectivity you have (as long as you can actually get to ShipIt, of course). Hence, users who can only get Linux by these means will never be counted by a service such as MirrorManager. Since a similar service to ShipIt does not (to my knowledge) exist for any other Linux distro, numbers from MirrorManager-like services will be biased in favour of non-Ubuntu distros. Let's face it: There's no way to get hard numbers for number of Linux installs.

I'm sure there are also organisations that for whatever reason do not want to publish their use of Linux at all, even to an (alleged) anonymous service like MirrorManager. I don't say "alleged" because I don't believe it to be anonymous, but because there's no way for me to know whether that's the case or not, and for some organisations, that's simply not good enough.

Jef Spaleta said...

Are you suggesting that ShipIt is uncountable? It most certainly is countable. If anything it's the most accurate statistic you have available to you.

Canonical could tell us tomorrow exactly the number of ShipIt disks they have paid for AND the number of disks purchased directly from the Canonical shop. Have they ever done that? Have they ever put any hard numbers out with regard to how active ShipIt is? I haven't found them. If they haven't that's a pretty remarkable lack of transparency.

How about you press your leadership to publish the no-guesswork numbers associated with the amount of media sent via ShipIt for Intrepid on a monthly basis since the release of Intrepid. What is it maybe 1% of the total number of Ubuntu users Canonical employees have publicly claimed exist?

You want to haggle over a statistic that is below the noise floor of any overall estimate? Fine..go right ahead...noise seems to be pretty important for Ubuntu supporters...much more than accuracy.

Michael said...

Anonymous, hashing the MAC addresses instead is a pretty good idea! I just suggested IPs because that is something the mirrors already have access to so it wouldn't involve any extra data sending or controversy. It is a great idea though, it would fix the cases of multiple users behind one IP, and also one user on multiple IPs.

soren, it sounds like you are right, sorry :) The more I look into this, the more I see that mirrors are more common than I thought. Thanks for enlightening me and sharing your knowledge! Though, I still think it would be feasible if enough mirrors participated. Combined with hashed MAC addresses, it has a decent accuracy potential.

Jef, ShipIt is certainly countable but I am not convinced how useful it would be. Surely some people order CDs and never use them, and someone in a shop might order one and install it a hundred times. And each LoCo could have hundreds on hand that never get used. Also, I'm not sure I could agree that something which is fun can have no meaning; fun IS meaning!

Lots of knowledgeable people have shared great stuff here, so that's awesome! Let's just try to keep it friendly and healthy :)

Jef Spaleta said...

Michael:
At no point did I say that all "fun" things can't be "meaningful."

What I said was very specific. The distrowatch metric is not meaningful. The google trends metric is not meaningful. I am making absolutely no claim about the meaningfulness of any other "fun" activity. I will say that making global maps of client connections to MirrorManager using GeoIP is both "fun for me" and "meaningful" as it gives Fedora an easy to understand snapshot of how globally used Fedora is.

You've extrapolated what I said and attempted to apply it beyond the bounds of the original context. Is that "fun" for you as well..making gross generalizations about what other people say? That's neither friendly nor healthy. You want to keep this friendly..you want to keep this constructive? Then take more care and rein in your tendency to generalize.

I find it really amazing that you can so easily discount accurate ShipIt numbers as a useful rough metric and yet... you reached for distrowatch as a scaling metric. Stop putting the cart before the horse. Make accuracy the primary importance.. then worry about interpretation. Don't waste your time trying to interpret the meaning of numbers that aren't even an accurate measure.

If LoCos have hundreds of CDs collecting dust every release...that's also something you could get accurate stats on...you just have to survey LoCos and ask them. If they are requesting CDs and not giving them out..that is a drain on Canonical resources. It benefits everyone to make sure that's not happening too much.

-jef

Anonymous said...

If you want a rough estimate for the proportion of Ubuntu vs Fedora users, a better back-of-the-envelope would be the Desktop Linux Survey: http://www.desktoplinux.com/cgi-bin/survey/survey.cgi?view=archive&id=0813200712407

Now, it's 2 years out of date, but back then Fedora had 6% and Ubuntu 30%; if that's still the case today this would put Ubuntu at 5 times Fedora rather than 1.68 times Fedora.

Another metric might be Google Trends: http://www.google.com/trends?q=fedora%2C+ubuntu -- suggests that the Ubuntu:Fedora ratio of searches has been increasing; so either the ratio of usage has been changing or Ubuntu users are becoming even more likely to search for Ubuntu for some reason.

Now, Google Trends and Desktop Linux surveys can be inaccurate for a variety of reasons. I'll note, however, that the google trends data is consistent with the desktop Linux survey data; both imply roughly 5x - and both are doubtlessly better proxies for usage share than distro watch.

Tommi said...

I would not count the Distrowatch site at all either. Even the site's admin says that it should not be used to calculate the number of users.

I too would like to see a hashed MAC address sent with upgrades, but Fedora already uses smolt, which generates a unique ID from the system specs and sends it to Fedora. Sadly (or perhaps it's a good thing) it is not enabled by default, so not everyone turns it on. But those figures say that Fedora has more installs than Ubuntu. Sorry about that, you ain't special, only gaining the media attention ;-)

I don't think that market share figures mean anything for us. (OK, maybe some like to extend their virtual penis).

We all - whatever distribution we use - are using the same OS. It does not matter what your opinion is about the system, package manager, brand or even freedom. We all use the Linux OS to power all the other software. Without Linux, we would not be here. So kudos to Linus Torvalds for coding the OS in the first place, and kudos to Richard Stallman for starting the GNU project, which gave us the GPL and so free software, and we got the Linux OS licensed under it (GPLv2).

I see more of other distributions than Ubuntu in Finland (the homeland of the Linux OS). Same thing around Europe and Asia, where I have traveled.
Ubuntu has a smaller share among technically oriented users (science labs, computer stores, universities etc.) but that does not matter at all. We are all using the same OS, Linux (the kernel)!

Unknown said...

What about universities, schools and companies using Ubuntu all under the same IP address? This could have a pretty big influence on the numbers!