infra-specs/specs/website-stats.rst
Clark Boylan 5cf38c85fb Add story for website stats spec
Additionally add clarkb as volunteer.

Change-Id: Id8a3e434519263316ca09b66d0a7d5b708e72e6d
2020-03-05 16:47:50 -08:00


::

  Copyright 2020 OpenStack Foundation

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

======================
Website Activity Stats
======================
https://storyboard.openstack.org/#!/story/2007387

Basic website activity stats, such as which pages are hit most often, which
pages return 404s, and the total number of visitors, aid in properly running
a site. With this info you can correct broken links or redirect users to
appropriate locations. Popular pages can be given more attention as they
are read most often. Visitor numbers help you learn whether the changes
being made are effective.

Unfortunately, we have not published any of this useful data for a long
time.

Problem Description
===================
One of the major reasons we have not published this data historically is
that many tools that work with this data overshare. We are particularly
concerned about publishing information that might be attributed to specific
users. Ideally we would publish the bare minimum of information that allows
web admins to properly manage sites without leaking personal information.

In particular we don't want to leak IP addresses or subnets, as IPs are
considered PII and, without significant traffic, subnets typically identify
specific users. We also want to avoid publishing referrer information, as
this can also be used to infer who users are. This can happen if users
follow links from internal company wikis, bug trackers, or code hosting
systems.

Out of an abundance of caution we will also avoid publishing operating
system, web browser, and Google search term data. This data is likely safe
to share, particularly if we avoid making it cross-referenceable with other
fields, so we may add these stats in the future.

Proposed Change
===============
We can use goaccess, a GPL tool, to produce conservative website stats
reports from Apache access logs. The key here is that newer goaccess
versions (those packaged since Ubuntu Bionic) allow you to remove data from
the resulting report files. This allows us to tell goaccess to produce
reports containing only the data we feel is safe for public consumption.

We would run periodic Zuul jobs that connect to static.opendev.org,
uncompress Apache log files as necessary, then feed them through goaccess.
The resulting report.html output file could then be written into AFS as well
as hosted directly from the Zuul logs system. This would give us reports
that update roughly daily and cover the period of time for which logs are
available.

To make this possible we will use Zuul's per-project SSH keys. This will
allow the jobs to add static.opendev.org to the running Ansible inventory
and then run Ansible to perform the above steps.

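
The "uncompress as necessary" step could be sketched roughly as follows.
This is only an illustration under assumptions: the helper name
``iter_log_lines`` is hypothetical, and it assumes rotated logs are either
plain text or gzip files with a ``.gz`` suffix.

.. code-block:: python

    import gzip
    from pathlib import Path
    from typing import Iterable, Iterator


    def iter_log_lines(paths: Iterable[Path]) -> Iterator[str]:
        """Yield access log lines, decompressing ``.gz`` files on the fly."""
        for path in paths:
            # Assumption: compressed rotations end in ".gz"; the rest are text.
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
                yield from handle

The combined lines could then be written to a temporary file for goaccess to
read, so that the original logs on the server are never modified.
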
If publishing into AFS, we would write each report to a known location for
its site::

  https://example.website.org/goaccess.html

To do this we need a configuration file that excludes the panels we do not
want::

  log-format COMBINED

  ignore-panel VISITORS
  ignore-panel REQUESTS
  ignore-panel REQUESTS_STATIC
  ignore-panel NOT_FOUND
  ignore-panel HOSTS
  ignore-panel OS
  ignore-panel BROWSERS
  ignore-panel VISIT_TIMES
  ignore-panel VIRTUAL_HOSTS
  ignore-panel REFERRERS
  ignore-panel REFERRING_SITES
  ignore-panel KEYPHRASES
  ignore-panel STATUS_CODES
  ignore-panel REMOTE_USER
  ignore-panel GEO_LOCATION

  enable-panel VISITORS
  enable-panel REQUESTS
  enable-panel REQUESTS_STATIC
  enable-panel NOT_FOUND
  enable-panel STATUS_CODES

Then we can run (roughly) this command in the Zuul jobs::

  goaccess /var/log/apache2/example.site.org_access.log* -o example-site-report.html -p ./goaccess.conf

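
As a rough sketch, a job could assemble the same command in Python instead of
relying on shell globbing. The helper names ``build_goaccess_command`` and
``run_goaccess`` are illustrative assumptions; only the ``-o`` and ``-p``
options already shown in the command above are used.

.. code-block:: python

    import glob
    import subprocess


    def build_goaccess_command(log_glob, report_path, config_path):
        """Return the goaccess argument list for all logs matching log_glob."""
        # Expand the glob ourselves because no shell is involved.
        log_files = sorted(glob.glob(log_glob))
        if not log_files:
            raise FileNotFoundError(f"no log files match {log_glob!r}")
        return ["goaccess", *log_files, "-o", report_path, "-p", config_path]


    def run_goaccess(log_glob, report_path, config_path):
        """Run goaccess; a non-zero exit status raises CalledProcessError."""
        command = build_goaccess_command(log_glob, report_path, config_path)
        subprocess.run(command, check=True)

Failing loudly when no logs match keeps an empty report from silently
replacing a previously published one.
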
Alternatives
------------
We could use a tracker that runs in the browser, such as goatcounter. One
downside to this approach is that we would need to run custom 404 pages in
order to collect data on 404s. This is more complicated than the web server
logs approach. One upside is that we could track referrers to 404s, enabling
us to more easily fix our own broken links.

If we were collecting a rich set of data, browser-based trackers would
provide much more info, but because we have decided that we do not want to
collect that information, the server logs should be sufficient.

Implementation
==============
Assignee(s)
-----------
Primary assignee:
  Clark Boylan (clarkb)

Gerrit Topic
------------
Use Gerrit topic "website-stats" for all patches related to this spec.

.. code-block:: bash

    git-review -t website-stats

Work Items
----------

* Write zuul jobs to produce and publish the goaccess reports.
* Document goaccess tooling for web admins.

Repositories
------------

None

Servers
-------

static.opendev.org would be updated to implement this for the sites it
hosts.

DNS Entries
-----------

None

Documentation
-------------

We will need to document where the stats can be retrieved once available.
We should also document the choices we made around which data is collected.

Security
--------
We could unintentionally leak sensitive client information. The example
config file above tries to avoid that by explicitly disabling every
available goaccess panel and then enabling only the few we know are safe.

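
The disable-everything-then-enable approach could also be generated rather
than maintained by hand. The following is a minimal sketch, not a settled
design: the panel names are copied from the example config above, while
``render_config`` and the two list names are hypothetical.

.. code-block:: python

    # Every panel named in the example config; all are ignored first.
    ALL_PANELS = [
        "VISITORS", "REQUESTS", "REQUESTS_STATIC", "NOT_FOUND", "HOSTS",
        "OS", "BROWSERS", "VISIT_TIMES", "VIRTUAL_HOSTS", "REFERRERS",
        "REFERRING_SITES", "KEYPHRASES", "STATUS_CODES", "REMOTE_USER",
        "GEO_LOCATION",
    ]

    # The panels the spec considers safe to publish.
    SAFE_PANELS = [
        "VISITORS", "REQUESTS", "REQUESTS_STATIC", "NOT_FOUND", "STATUS_CODES",
    ]


    def render_config(all_panels=ALL_PANELS, safe_panels=SAFE_PANELS):
        """Render a goaccess config that ignores all panels, then enables the safe ones."""
        unknown = set(safe_panels) - set(all_panels)
        if unknown:
            # Refuse to enable a panel that was never explicitly ignored.
            raise ValueError(f"unknown panels: {sorted(unknown)}")
        lines = ["log-format COMBINED"]
        lines += [f"ignore-panel {panel}" for panel in all_panels]
        lines += [f"enable-panel {panel}" for panel in safe_panels]
        return "\n".join(lines) + "\n"

Keeping the safe list separate from the full list means publishing a new
panel is a deliberate one-line change that is easy to spot in review.
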
Testing
-------
We can run the new job against test data to ensure it works as expected
without disclosing unwanted info.
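
One such check could be sketched as below, purely as an illustration. It
assumes the test data uses the combined log format, where the client address
is the first field, and the function names are hypothetical. It only looks
for addresses that appear verbatim in the report, so it complements rather
than replaces a manual review.

.. code-block:: python

    def client_addresses(log_lines):
        """Collect the client address (first field) of each non-blank log line."""
        return {line.split()[0] for line in log_lines if line.strip()}


    def leaked_addresses(report_text, log_lines):
        """Return the client addresses from the log that appear in the report."""
        return {
            address
            for address in client_addresses(log_lines)
            if address in report_text
        }

A test job could fail whenever ``leaked_addresses`` returns a non-empty set
for the generated report.
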
Dependencies
============
None