Proposal and Request For Feedback: Implement `dnf countme`

Hello, I am Jonathan Wright, Infrastructure Team Lead for AlmaLinux. I manage most of the plumbing that keeps things humming along, and I’ve been working on improvements to parts of it to make things more user-friendly for our community.

AlmaLinux values transparency and communal decision making; it’s one of the reasons I decided to become a contributor. As part of the work I’m doing, I’d like to request feedback from the community on a proposal to enable dnf countme, similar to the way the Fedora project does.

countme is a core feature of DNF implemented upstream in Fedora 32 (dnf 4.2.9). The DNF documentation describes it as follows:

Determines whether a special flag should be added to a single, randomly chosen metalink/mirrorlist query each week. This allows the repository owner to estimate the number of systems consuming it, by counting such queries over a week’s time, which is much more accurate than just counting unique IP addresses (which is subject to both overcounting and undercounting due to short DHCP leases and NAT, respectively).

The flag is a simple “countme=N” parameter appended to the metalink and mirrorlist URL, where N is an integer representing the “longevity” bucket this system belongs to. The following 4 buckets are defined, based on how many full weeks have passed since the beginning of the week when this system was installed: 1 = first week, 2 = first month (2-4 weeks), 3 = six months (5-24 weeks) and 4 = more than six months (> 24 weeks). This information is meant to help distinguish short-lived installs from long-term ones, and to gather other statistics about system lifecycle.
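To make the bucket boundaries concrete, here is a minimal Python sketch of the mapping described in the documentation quoted above. The function name `countme_bucket` and its 1-based week argument are assumptions made for illustration; this is one reading of the documented ranges, not the libdnf implementation.

```python
def countme_bucket(week: int) -> int:
    """Map a 1-based week-of-life number to a countme bucket (1-4).

    week=1 means the system is still in the week it was installed.
    The boundaries follow the DNF documentation quoted above; the
    function itself is a hypothetical illustration, not libdnf code.
    """
    if week < 1:
        raise ValueError("week must be 1 or greater")
    if week == 1:
        return 1  # first week
    if week <= 4:
        return 2  # first month (weeks 2-4)
    if week <= 24:
        return 3  # first six months (weeks 5-24)
    return 4      # more than six months (> 24 weeks)
```

Under this reading, a system in its tenth week would send countme=3, and anything older than 24 weeks would send countme=4.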

countme was designed with privacy in mind and does not add any identifying or unique information to requests, so there is no tracking involved. It is just a simple “hello” to the repository.

Currently, AlmaLinux does not track any usage statistics for our distribution. We could try to aggregate basic metrics from the HTTP logs on our mirrorlist servers, but the data would be unreliable because counting unique IPs is undermined by things like NAT and dynamic addressing. So, I’d like to propose we implement “countme=1” in our repository configs just as Fedora and EPEL have done. I’d also like to propose that the aggregated data be made available publicly, similar to https://data-analysis.fedoraproject.org/, for the community to see.

I’ve set up a form for feedback at https://forms.gle/BShXoxJmsjNbMXCk6 in case you’d like to give any input on this proposal. We will keep this form open for about a week.

FAQ:

Q: When are “countme” requests sent? A: Once a week at random during normal dnf activity. If you do not use dnf calls that would otherwise trigger mirrorlist requests (makecache, install, update) this flag will NOT cause dnf to go out of its way and make special requests.

Q: What extra data will be sent that is not currently collected? A: “countme=X” will be added to one random mirrorlist request from DNF each week, where X is a number from 1 to 4 identifying the “longevity” bucket your system falls into, that is, roughly how long it has been installed. It is not an exact count of weeks. See above for the explanation of the buckets from the DNF documentation.

Q: Will aggregated data be made publicly available? A: Yes

Q: What data do you use? A: The only data we look at is in the HTTP request itself. Our log lines are in the standard Combined Log Format. Example:

172.30.61.81 - - [15/Dec/2021:17:02:12 +0000] "GET /mirrorlist/8/baseos?countme=4 HTTP/1.1" 200 629 "-" "libdnf (AlmaLinux 8.3; generic; Linux.x86_64)"

We only look at log lines where the request is “GET”, the query string includes “countme=N”, the result is 200 or 302, and the User-Agent string matches the libdnf User-Agent header.

The only data we use are the timestamp, the query parameters (repo, arch, countme), and the libdnf User-Agent data.
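As an illustration of the filtering rules above, here is a minimal Python sketch that parses one Combined Log Format line and keeps it only if it is a GET with a countme value of 1-4, a 200 or 302 status, and a libdnf User-Agent. The regular expression, the function name `parse_countme`, and the shape of the returned dict are assumptions for illustration; this is not the actual AlmaLinux processing code.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Combined Log Format: ip ident user [time] "request" status size "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_countme(line):
    """Return the fields of interest for a countme request, or None."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    if m["method"] != "GET" or m["status"] not in ("200", "302"):
        return None
    if not m["agent"].startswith("libdnf"):
        return None
    url = urlsplit(m["target"])
    values = parse_qs(url.query).get("countme", [])
    if len(values) != 1 or values[0] not in {"1", "2", "3", "4"}:
        return None
    # The client IP is deliberately left out of the result.
    return {
        "time": m["time"],
        "path": url.path,
        "countme": int(values[0]),
        "agent": m["agent"],
    }

SAMPLE = (
    '172.30.61.81 - - [15/Dec/2021:17:02:12 +0000] '
    '"GET /mirrorlist/8/baseos?countme=4 HTTP/1.1" 200 629 "-" '
    '"libdnf (AlmaLinux 8.3; generic; Linux.x86_64)"'
)
record = parse_countme(SAMPLE)
```

For the sample line, `record` holds the timestamp, the path `/mirrorlist/8/baseos`, the bucket 4, and the User-Agent string; requests without a countme parameter or from other clients return `None`.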

In the future we will also aggregate data by country using GeoIP. Our processing and aggregation does not care about IPs themselves or their uniqueness. When we implement the aggregation of geographic data it will use MaxMind’s GeoIP database locally to turn the IP into a region which will be used for tallying generalized metrics for that region.
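A rough Python sketch of how such a tally could avoid retaining IPs: each address is used once to look up a region and is then discarded. The `lookup_region` callable stands in for a local GeoIP database lookup and is purely hypothetical, as are the function name and the input shape; the addresses below come from documentation-reserved ranges.

```python
from collections import Counter

def tally_by_region(requests, lookup_region):
    """Count countme requests per (region, bucket).

    requests is an iterable of (ip, bucket) pairs; lookup_region is a
    hypothetical stand-in for a local GeoIP lookup that returns a region
    code or None. The IP is never stored in the result.
    """
    totals = Counter()
    for ip, bucket in requests:
        region = lookup_region(ip) or "unknown"
        totals[(region, bucket)] += 1
    return totals

# Toy lookup table in place of a real GeoIP database.
FAKE_GEO = {"192.0.2.10": "US", "198.51.100.7": "DE"}
totals = tally_by_region(
    [("192.0.2.10", 4), ("192.0.2.10", 4), ("198.51.100.7", 1), ("203.0.113.5", 2)],
    FAKE_GEO.get,
)
```

Only the region and bucket survive into `totals`, so the published numbers carry no per-address information.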

Raw access logs are archived so that, if we find major issues in any of our processing, we can re-parse the data and correct the published statistics.

Q: Can I opt out? A: Yes, but we’d prefer you didn’t, since the data is very helpful. The only extra data you’ll be submitting is “countme=X” in one request per week.

If you’d like to opt out you can comment out the “countme=1” line in the repository config files in /etc/yum.repos.d/
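If you want to check which repositories would send the flag, a small Python sketch like the following can report the countme setting per repository section. The repo names and URLs in the sample are placeholders, not real AlmaLinux repository definitions, and the function name `countme_status` is an assumption; a missing or commented-out line is treated as disabled, matching DNF's default for this option.

```python
import configparser

# Hypothetical repo file contents; section names and URLs are placeholders.
SAMPLE_REPO = """
[example-baseos]
name=Example BaseOS
mirrorlist=https://mirrors.example.invalid/mirrorlist/8/baseos
enabled=1
countme=1

[example-extras]
name=Example Extras
mirrorlist=https://mirrors.example.invalid/mirrorlist/8/extras
enabled=1
# countme=1
"""

def countme_status(repo_text):
    """Return {repo_id: bool} telling whether countme is enabled per repo."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    return {
        section: parser.getboolean(section, "countme", fallback=False)
        for section in parser.sections()
    }

status = countme_status(SAMPLE_REPO)
```

Here the first repository reports `True` and the second, with the line commented out, reports `False`. To inspect a real file, its contents could be read from /etc/yum.repos.d/ and passed to the same function.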

Discussion for this should be directed to the AlmaLinux Infrastructure mailing list (infra@lists.almalinux.org). You can join the list from its info page in the AlmaLinux List Archives.

https://lists.almalinux.org/archives/list/infra@lists.almalinux.org/thread/3HCVC6IJ5SY6HNW5NF3ES4B7SGG6JZN2/