::

  Copyright 2020 OpenStack Foundation

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

======================
Website Activity Stats
======================

https://storyboard.openstack.org/#!/story/2007387

Basic website activity stats, such as which pages are hit most often, which
pages return 404s, and the total number of visitors, aid in properly running
a site. With this information you can correct broken links or redirect users
to appropriate locations. Popular pages can be given more attention because
they are read most often. Visitor numbers help you learn whether the changes
being made are effective.

Unfortunately, we have not published any of this useful data for a long
time.

Problem Description
===================

One of the major reasons we have not published this data historically is
that many tools that work with it overshare. We are particularly concerned
about publishing information that might be attributed to specific users.
Ideally we would publish the bare minimum of information that allows web
admins to properly manage sites without leaking personal information.

In particular, we don't want to leak IP addresses or subnets: IPs are
considered PII, and without significant traffic a subnet typically
identifies specific users. We also want to avoid publishing referrer
information, because it can be used to infer who users are as well. This
can happen if users follow links from internal company wikis, bug trackers,
or code hosting systems.

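To make the concern concrete, the following Python sketch splits one Apache
combined-format log line into its fields and separates the ones this spec
treats as sensitive from the ones that could feed a report. The regular
expression, the ``SENSITIVE`` field mapping, and the sample line are
illustrative assumptions and not part of the proposed tooling.

```python
import re

# Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Assumed mapping of the spec's privacy concerns onto log fields.
SENSITIVE = {"host", "ident", "user", "referer", "user_agent"}


def split_fields(line):
    """Return (publishable, sensitive) dicts for one combined-format line."""
    match = COMBINED.match(line)
    if match is None:
        raise ValueError("not a combined-format log line")
    fields = match.groupdict()
    publishable = {k: v for k, v in fields.items() if k not in SENSITIVE}
    sensitive = {k: v for k, v in fields.items() if k in SENSITIVE}
    return publishable, sensitive


# Made-up sample line; 203.0.113.0/24 is a documentation-only address range.
SAMPLE = (
    '203.0.113.7 - - [10/Mar/2020:13:55:36 +0000] '
    '"GET /missing.html HTTP/1.1" 404 512 '
    '"https://wiki.internal.example/page" "Mozilla/5.0 (X11; Linux x86_64)"'
)

if __name__ == "__main__":
    publishable, sensitive = split_fields(SAMPLE)
    print("publishable:", publishable)
    print("sensitive fields:", sorted(sensitive))
```

In this sample the referrer alone reveals that the visitor came from an
internal wiki, which is the kind of inference the spec wants to prevent.
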
Out of an abundance of caution we will also avoid publishing operating
system, web browser, and Google search term data. This data is likely safe
to share, particularly if we avoid making it cross-referenceable with other
fields, so we may add these stats in the future.

Proposed Change
===============

We can use goaccess, an open source tool, to produce conservative website
stats reports from Apache access logs. The key here is that newer goaccess
releases (those packaged since Ubuntu Bionic) allow you to remove data from
the resulting report files. This lets us tell goaccess to produce reports
containing only the data we feel is safe for public consumption.

We would run periodic Zuul jobs that connect to static.opendev.org,
decompress Apache log files as necessary, and feed them through goaccess.
The resulting report.html output file could then be written into AFS as well
as hosted directly from the Zuul logs system. This would give us reports
that update roughly daily and cover the period of time for which logs are
available.

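The "decompress as necessary" step can be sketched in Python as follows. It
is a minimal illustration that assumes rotated logs share a common prefix and
that compressed rotations end in ``.gz``. The helper name ``iter_log_lines``
is hypothetical, and the actual jobs would more likely do this with Ansible
tasks.

```python
import glob
import gzip


def iter_log_lines(pattern):
    """Yield lines from every file matching pattern, decompressing *.gz.

    Files are visited in sorted name order purely for determinism; that is
    not necessarily chronological order.
    """
    for path in sorted(glob.glob(pattern)):
        # gzip.open and open accept the same text-mode arguments here.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

For example, ``iter_log_lines("/var/log/apache2/example.site.org_access.log*")``
would walk the current log and its rotations without the caller needing to
know which of them are compressed.
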
To make this possible we will use Zuul's per-project SSH keys. This will
allow the jobs to add static.opendev.org to the running Ansible inventory
and then run Ansible to perform the above steps.

If publishing into AFS, we would write the reports to a known location for
each site::

  https://example.website.org/goaccess.html

To do this we need a configuration file that excludes the panels we do not
want::

  log-format COMBINED

  ignore-panel VISITORS
  ignore-panel REQUESTS
  ignore-panel REQUESTS_STATIC
  ignore-panel NOT_FOUND
  ignore-panel HOSTS
  ignore-panel OS
  ignore-panel BROWSERS
  ignore-panel VISIT_TIMES
  ignore-panel VIRTUAL_HOSTS
  ignore-panel REFERRERS
  ignore-panel REFERRING_SITES
  ignore-panel KEYPHRASES
  ignore-panel STATUS_CODES
  ignore-panel REMOTE_USER
  ignore-panel GEO_LOCATION

  enable-panel VISITORS
  enable-panel REQUESTS
  enable-panel REQUESTS_STATIC
  enable-panel NOT_FOUND
  enable-panel STATUS_CODES

Then we can run (roughly) this command in the Zuul jobs::

  goaccess /var/log/apache2/example.site.org_access.log* -o example-site-report.html -p ./goaccess.conf

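If the job drove goaccess from a script rather than a shell one-liner, the
command above could be assembled roughly as in this Python sketch. The helper
names are hypothetical, and the only goaccess options used are the ``-o`` and
``-p`` flags already shown in the command above. Compressed rotations matched
by the glob may still need to be decompressed first.

```python
import glob
import subprocess


def build_goaccess_command(log_glob, report_path, config_path):
    """Build the argv list for the goaccess run described above."""
    # Expand the glob here because no shell is involved in the call below.
    log_files = sorted(glob.glob(log_glob))
    if not log_files:
        raise FileNotFoundError(f"no log files match {log_glob!r}")
    return ["goaccess", *log_files, "-o", report_path, "-p", config_path]


def run_goaccess(log_glob, report_path, config_path):
    """Run goaccess and raise CalledProcessError on a non-zero exit."""
    command = build_goaccess_command(log_glob, report_path, config_path)
    subprocess.run(command, check=True)
```

Passing an argument list rather than a shell string avoids quoting problems
with unusual file names, and failing when no logs match prevents the job from
publishing an empty report without anyone noticing.
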
Alternatives
------------

We could use a tracker that runs in the browser, such as goatcounter. One
downside to this approach is that we would need to serve custom 404 pages
in order to collect data on 404s, which is more complicated than the web
server logs approach. One upside is that we could track referrers to 404s,
enabling us to more easily fix our own broken links.

If we were collecting a rich set of data, browser-based trackers would
provide much more information, but because we have decided that we do not
want to collect that information, the server logs should be sufficient.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Clark Boylan (clarkb)

Gerrit Topic
------------

Use Gerrit topic "website-stats" for all patches related to this spec.

.. code-block:: bash

    git-review -t website-stats

Work Items
----------

* Write Zuul jobs to produce and publish the goaccess reports.
* Document goaccess tooling for web admins.

Repositories
------------

None

Servers
-------

static.opendev.org would be updated to implement this for the sites it
hosts.

DNS Entries
-----------

None

Documentation
-------------

We will need to document where the stats can be retrieved once available.
We should also document the choices we made around which data is collected.

Security
--------

We could unintentionally leak sensitive client information. The example
configuration above is intended to avoid that by explicitly disabling every
available goaccess panel and then enabling only the few we consider safe.

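One way to guard against a configuration change quietly re-enabling a
sensitive panel is a small check like the Python sketch below. It assumes,
as this spec does, that ``enable-panel`` lines re-enable panels disabled by
earlier ``ignore-panel`` lines. That mirrors the behaviour the spec relies on
and is not a verified description of goaccess internals. The panel lists are
copied from the example configuration, and the function names are
hypothetical.

```python
# Panel names taken from the example configuration in this spec.
ALL_PANELS = {
    "VISITORS", "REQUESTS", "REQUESTS_STATIC", "NOT_FOUND", "HOSTS", "OS",
    "BROWSERS", "VISIT_TIMES", "VIRTUAL_HOSTS", "REFERRERS",
    "REFERRING_SITES", "KEYPHRASES", "STATUS_CODES", "REMOTE_USER",
    "GEO_LOCATION",
}
# Panels the spec considers safe to publish.
SAFE_PANELS = {
    "VISITORS", "REQUESTS", "REQUESTS_STATIC", "NOT_FOUND", "STATUS_CODES",
}


def effective_panels(config_text):
    """Return the panels left enabled after applying the lines in order."""
    ignored = set()
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key == "ignore-panel":
            ignored.add(value)
        elif key == "enable-panel":
            ignored.discard(value)
    return ALL_PANELS - ignored


def unexpected_panels(config_text):
    """Return enabled panels that are not on the safe list."""
    return effective_panels(config_text) - SAFE_PANELS


# Rebuild the spec's example configuration from the two sets above.
EXAMPLE_CONFIG = "\n".join(
    ["log-format COMBINED"]
    + [f"ignore-panel {name}" for name in sorted(ALL_PANELS)]
    + [f"enable-panel {name}" for name in sorted(SAFE_PANELS)]
)
```

A job could refuse to publish a report whenever ``unexpected_panels`` returns
a non-empty set for the configuration it is about to use.
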
Testing
-------

We can run the new job against test data to ensure it works as expected
without disclosing unwanted information.

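A test along these lines could be sketched in Python as shown below. It
assumes combined-format test logs and performs a plain substring search of
the generated report for every client address and referrer found in those
logs. The function names are hypothetical, and a literal match could miss
values that the report stores in an escaped form.

```python
def sensitive_values(log_lines):
    """Collect client addresses and referrers from combined-format lines."""
    values = set()
    for line in log_lines:
        if not line.strip():
            continue
        # The client address is the first space-separated field.
        values.add(line.split(" ", 1)[0])
        # Splitting on double quotes puts the referrer at index 3.
        quoted = line.split('"')
        if len(quoted) >= 4 and quoted[3] not in ("", "-"):
            values.add(quoted[3])
    return values


def leaked_values(report_text, log_lines):
    """Return the sensitive log values that appear in the report text."""
    return {v for v in sensitive_values(log_lines) if v in report_text}
```

The test would pass only if ``leaked_values`` returns an empty set for the
report produced from the test logs. Substring matching errs on the side of
flagging too much, which seems the safer failure mode for this purpose.
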
Dependencies
============

None