
::

  Copyright 2020 OpenStack Foundation

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

======================
Website Activity Stats
======================

https://storyboard.openstack.org/#!/story/2007387

Basic website activity stats, such as which pages are hit most often, which
pages return 404s, and the total number of visitors, aid in properly running
a site. With this info you can correct broken links or redirect users to
appropriate locations. Popular pages can be given more attention as they
are read most often. Visitor numbers help you learn whether the changes
being made are effective.

Unfortunately, we have not published any of this useful data for a long
time.

Problem Description
===================

One of the major reasons we have not published this data historically is
that many tools that work with this data overshare. We are particularly
concerned about publishing information that might be attributed to specific
users. The ideal here is to publish the bare minimum of information that
allows web admins to properly manage sites without leaking personal
information.

In particular, we don't want to leak IP addresses or subnets: IPs are
considered PII, and without significant traffic, subnets typically identify
specific users. We also want to avoid publishing referrer information, as
this can be used to infer who users are as well. This can happen if users
follow links from internal company wikis, bug trackers, or code hosting
systems.

Out of an abundance of caution we will avoid publishing operating system,
web browser, and Google search terms as well. This data is likely safe to
share, particularly if we avoid making it cross-referenceable with other
fields. For this reason we may add these stats in the future.

Proposed Change
===============

We can use goaccess, a GPL tool, to produce conservative website stats
reports from Apache access logs. The key here is that newer goaccess
versions (as packaged since Ubuntu Bionic) allow you to remove data from the
resulting report files. This allows us to tell goaccess to produce reports
containing only the data we feel is safe for public consumption.

We would run periodic Zuul jobs that connect to static.opendev.org,
uncompress Apache log files as necessary, then feed them through goaccess.
The resulting report.html output file could then be written into AFS as well
as hosted directly from the Zuul logs system. This would give us reports
that update roughly daily and cover the period of time for which logs are
available.

To make this possible we will use Zuul's per-project SSH keys. This will
allow the jobs to add static.opendev.org to the running Ansible inventory
then run Ansible to perform the above steps.

If publishing into AFS, we would write the report to a known location for
each site::

  https://example.website.org/goaccess.html

To do this we need a configuration file that excludes the panels we do not
want::

  log-format COMBINED

  ignore-panel VISITORS
  ignore-panel REQUESTS
  ignore-panel REQUESTS_STATIC
  ignore-panel NOT_FOUND
  ignore-panel HOSTS
  ignore-panel OS
  ignore-panel BROWSERS
  ignore-panel VISIT_TIMES
  ignore-panel VIRTUAL_HOSTS
  ignore-panel REFERRERS
  ignore-panel REFERRING_SITES
  ignore-panel KEYPHRASES
  ignore-panel STATUS_CODES
  ignore-panel REMOTE_USER
  ignore-panel GEO_LOCATION

  enable-panel VISITORS
  enable-panel REQUESTS
  enable-panel REQUESTS_STATIC
  enable-panel NOT_FOUND
  enable-panel STATUS_CODES

Then we can run (roughly) this command in the Zuul jobs::

  goaccess /var/log/apache2/example.site.org_access.log* -o example-site-report.html -p ./goaccess.conf
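
Rotated Apache logs are usually a mix of plain and gzip-compressed files, so
the uncompress step can be folded into the same pipeline. A minimal sketch,
assuming the log naming from the command above, that the installed goaccess
reads log data piped to standard input (as the examples in its man page do),
and two hypothetical helper names, ``cat_logs`` and ``build_report``:

.. code-block:: bash

    # Hypothetical helper: concatenate plain and gzip-compressed logs.
    # "gzip -dcf" (equivalent to "zcat -f") decompresses .gz files and
    # copies files that are not compressed through unchanged.
    cat_logs() {
        gzip -dcf "$@"
    }

    # Hypothetical helper: build one report from all rotated logs of a site.
    # $1 = log file prefix, $2 = output report, $3 = goaccess config file.
    build_report() {
        cat_logs "$1"* | goaccess -o "$2" -p "$3"
    }

    # Example invocation (paths taken from the command above):
    # build_report /var/log/apache2/example.site.org_access.log \
    #     example-site-report.html ./goaccess.conf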

Alternatives
------------

We could use trackers that run in the browser, such as goatcounter. One
downside to this approach is that we would need to run custom 404 pages in
order to collect data on 404s. This is more complicated than the web server
logs approach. One upside to this approach is that we could track referrers
to 404s, enabling us to more easily fix our own broken links.

If we were collecting a rich set of data, browser-based trackers would
provide much more info, but because we've decided that we do not want to
collect that information, the server logs should be sufficient.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Clark Boylan (clarkb)

Gerrit Topic
------------

Use Gerrit topic "website-stats" for all patches related to this spec.

.. code-block:: bash

    git-review -t website-stats

Work Items
----------

* Write Zuul jobs to produce and publish the goaccess reports.
* Document goaccess tooling for web admins.

Repositories
------------

None

Servers
-------

static.opendev.org would be updated to implement this for the sites it
hosts.

DNS Entries
-----------

None

Documentation
-------------

We will need to document where the stats can be retrieved once available.
We should also document the choices we made around which data is collected.

Security
--------

We could potentially leak sensitive client information unintentionally.
The example config file above is intended to avoid that by explicitly
disabling all available goaccess panels, then enabling only the few we know
are safe.

Testing
-------

We can run the new job against test data to ensure it works as expected
without disclosing unwanted info.
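
One way to check for leaks is to seed the test logs with known client
addresses and referrer URLs, then confirm that none of them appear in the
generated report. A minimal sketch, assuming a hypothetical helper name
``report_leaks`` and seed values chosen by whoever writes the test data:

.. code-block:: bash

    # Hypothetical helper: return 0 if any of the given strings appears in
    # the report file, 1 if none do.
    # $1 = report file, remaining args = strings that must never be
    # published (seeded test client IPs, referrer URLs, and so on).
    report_leaks() {
        local report=$1
        shift
        local needle
        for needle in "$@"; do
            # -F: match as a fixed string, -q: quiet, stop at first match.
            if grep -Fq -- "$needle" "$report"; then
                echo "leaked: $needle" >&2
                return 0
            fi
        done
        return 1
    }

    # Example check in a test job (values are illustrative only):
    # if report_leaks example-site-report.html 203.0.113.7; then exit 1; fi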

Dependencies
============

None