After looking at analytics plugins on the WordPress plugin directory, the label “GDPR compliant” is my newest bugbear. Data is not produce; the General Data Protection Regulation is not a recipe for growing it organically. A data controller can be in compliance with the GDPR; software can help them be that; it cannot be it for them.
In this post I will implement a solution that tries to accomplish the goals I set out in the first post: Getting site statistics while respecting the law and visitor privacy, without resorting to any of the usual manipulation and deceit.
I found lots of opinions on how the GDPR and website analytics intersect. Many of them feel arbitrary and fall into one of two camps: do what you want but overwhelm the visitor with consent annoyances just to be on the safe side (from people trying to sell me solutions and not get their customers sued), or don’t do anything at all (from government officials trying to make sure that they haven’t opened up a hole in the legislation). So in case there was any doubt, the following is just my own attempt at applying common sense to the issue. [Insert joke about common sense and law.]
What I’m saying is that this is not the be-all and end-all guide to the issue. Unlike big companies, I am not trying to keep myself from getting fined; I am trying to abide by the law because it’s the done thing and I’m a geek and stickler for rules. It’s not going to be perfect, but I think that applying sound principles, a basic understanding of the law and common sense is a huge improvement over ignorance and doing nothing.
Features and requirements
Based on my exploration of the ePrivacy directives and the General Data Protection Regulation, I went looking for an alternative to JetPack stats that featured:
- Personal data specification: I want clear information about what kind of data is processed and ideally help in estimating if it qualifies as personal data under the GDPR.
- Anonymization options: What options does the solution have for anonymization sufficient for any personal data to no longer be considered personal data under the GDPR?
- Rights management: If I cannot sufficiently anonymize data (and still consider it useful) what tools does it offer to help me manage consent and rights?
In addition, I considered the following requirements sine qua non because any kind of off-site processing would make me liable for the processor’s actions:
- Data processing: The data gathered should be processed only by my own server. Note that I use ‘processing’ here in the sense defined in article 4(2) of the GDPR, which includes: “any operation […] such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction”. I am the data controller and my own data processor.
- Data sharing: No sharing of data with third parties for any purpose. The suggestion I have seen put forth recently by advertising people, that using Google Analytics is not sharing with third parties because GA has a button saying “do not share with Google” honestly just makes me angry. Anyway, GA and the like are ruled out by the first requirement.
- Free-software license: In order for me to trust any promise of no data sharing, I would only want to use a solution with a free-software license.
This is not going to be a comparison of offerings. I went with the first option that I felt satisfied all of those requirements, which was matomo, formerly Piwik. If anybody has suggestions for alternatives that can tick the same boxes, I would be happy to hear about them.
Matomo is available as a plugin from the directory. It works with multisite though it is a bit confusing how some options are network wide and some site-specific. At least I could not see exactly why some were in one place and some in another. It is a dedicated analytics package, i.e. unlike JetPack it doesn’t offer any other functionality. While it does feature a lot of ecommerce and behaviour tracking options, it also very much caters to privacy conscious data nerds like me.
It is licensed under the GPL v3 (requirement 3). According to the plugin directory entry: “100% data ownership, no one else can see your data. All data is stored in your WordPress and not sent to any third party or different country.” So that should satisfy requirements 1 and 2. The one non-homegrown resource that it uses appears to be a geoip database that is downloaded to the site. In other words, geoip location is not established by API calls out of the server but resolved locally. Also: the matomo team know better than to label their product “GDPR compliant”.
At the time of writing it had been updated within the past month, had 30,000 active installations, and 63 out of 69 reviewers had given it 5 stars out of 5. Out of the 71 MB in the wp-content/plugins folder, matomo accounted for 65 MB. A chunky boy, in other words. I don’t know how much weight it has added to my database.
The ePrivacy directive: Cookie cutter
As I mentioned in the last post, compliance with the eprivacy directives can be laughably simple: Just don’t set any cookies.
As far as I have been able to ascertain, WordPress does not set any cookies by default if all you do is visit and read. Go to a login page and it will immediately set a test cookie, supposedly so it can better debug, should login fail for non-obvious reasons. This I consider entirely covered by “as strictly necessary in order to provide an information society service explicitly requested by the subscriber.” Also, for what it’s worth, the login page is not intended for the public, i.e. I am not trying to provide visitors with an information society service here even if bruteforcers think I am. Commenting does not set cookies and I have opted to not include the “save my information when I comment” cookie opt-in option as it isn’t particularly useful here. While I’m deeply appreciative of anybody who cares to comment on a post, I am aware that Reddit, this ain’t.
Matomo, on the other hand, does set cookies by default. The main purpose is to generate a visitor profile for a session and to detect return visitors. These things are obviously useful to know if I run a business: How do people navigate my site and do I have return customers? I don’t, so it’s an easy decision to make. By ticking “Force tracking without cookies” in the Privacy settings (under the “Anonymize data” submenu option), matomo will no longer set cookies.
With this option checked, nothing on the site sets unbidden cookies in visitors’ browsers. So at least as far as the eprivacy directives are concerned, I do not need visitors’ consent. I will talk about the implications of this choice later in the post. That’s feature 1 checked off.
The General Data Protection Regulation and what counts as personal data
As hinted at in the last post, I want to see what it takes to avoid processing anything the GDPR considers “personal information”. This is not an attempt at skirting the regulation, au contraire. Reducing the amount of data and reducing reliance on personal data is very much in the spirit of the GDPR and encouraged by the Commission:
Personal data should only be processed where it isn’t reasonably feasible to carry out the processing in another manner. Where possible, it is preferable to use anonymous data. Where personal data is needed, it should be adequate, relevant, and limited to what is necessary for the purpose (‘data minimisation’).
“How much data can be collected?”
So the overriding goal in the following is to identify which of the data that matomo gathers is considered personal and how I can anonymize it. As mentioned in the previous post, I did not think pseudonymization-by-hashing was a viable way forward in this case, but matomo has some neat tricks up its sleeve and pseudonymization is in its toolkit.
IP addresses and geoip as personal data
Matomo has a page on what information the plugin collects from visitors and which parts of it might be considered personal data with respect to the GDPR (feature 2, check). Top of that list? IP addresses, naturally. IP addresses are also listed as an example of personal data on the Commission’s website, so there is little doubt that matomo, by registering IP addresses, is collecting personal data.
Does matomo have any options for not processing IP addresses? Yes. Under “Anonymization”, I can set matomo to discard the last 1-3 octets (or not register IP addresses at all). The advice for a balance of privacy and usefulness is 2 octets. By my calculation that means that a visit could be from any of ~65,000 IP addresses (256 × 256 = 65,536). I think it’s a reasonable leap to say that is no longer personal data. When I directly asked Datatilsynet, the Danish governmental agency overseeing GDPR implementation, they agreed that if the octets were – to paraphrase munchkins rather than bureaucrats – positively, absolutely, undeniably and reliably deleted, the data could be considered to be anonymized. They are also on record as saying that the information that you can see a major landmark building from your office does not identify you. By that logic, living in a town of 65,000 IP addresses does not identify you either.
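To make the octet arithmetic concrete, here is a minimal Python sketch of what discarding trailing octets amounts to. It is not matomo's code; the function names are made up for illustration and the example address comes from a documentation-only range.

```python
import ipaddress


def anonymize_ipv4(address: str, octets_to_drop: int = 2) -> str:
    """Zero out the last `octets_to_drop` octets of an IPv4 address.

    Hypothetical helper for illustration; matomo's own masking code
    is not reproduced here.
    """
    if octets_to_drop not in (1, 2, 3):
        raise ValueError("octets_to_drop must be 1, 2 or 3")
    prefix_length = 32 - 8 * octets_to_drop
    # strict=False lets ip_network accept an address with host bits set
    # and simply masks them away.
    network = ipaddress.ip_network(f"{address}/{prefix_length}", strict=False)
    return str(network.network_address)


def candidate_addresses(octets_to_drop: int) -> int:
    """Number of distinct addresses that share the same masked value."""
    return 256 ** octets_to_drop


print(anonymize_ipv4("203.0.113.77", 2))
print(candidate_addresses(2))
```

With two octets dropped, every address in the same /16 block collapses to the same stored value, which is where the figure of roughly 65,000 comes from.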
The shortened IP address is then used to generate a geoip location. The geoip guess missed by 130 km in my tests. This is good. Even if we had access to ISP logs it would be virtually impossible to identify the user; without them it is literally impossible, as we would be searching an area of thousands of square kilometers. The geoip data, being based on what we have determined to be anonymous data, is obviously also considered anonymous. So, feature 3 is a check? Not quite yet.
[…] Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. […]
Excerpt from Recital 26, General Data Protection Regulation
Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.
Recital 30, General Data Protection Regulation
What happens when I start adding additional information to my anonymous two-bytes-IP? Is the data still not merely anonymized but most sincerely anonymized?
Matomo may discard the last two bytes of the address but in conjunction with the remainder, it registers visitor OS, browser, and resolution to make a visitor pseudonym called config_id. Didn’t I say in the last post that pseudonymization wouldn’t work here? Yes, except if I could somehow keep secrets from myself. By using 24 hour salts that are discarded at the end of the 24 hours, matomo enables semi-anonymous tracking in the very short term. Within those 24 hours it would be practically impossible for me to unmask the visitor. Outside of the 24 hour window it is impossible. The config_id pseudonym is then used to present the website owner with visitor profiles and a log of their recent page views. The matomo developers are at pains to point out that this is not “fingerprinting” and I have to agree; it’s more like a vague outline. To analogize: I just saw a black haired man in red trousers walk past my window. It could have been Jack White but it most certainly could also have been a lot of other people.
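The idea of a pseudonym built on a short-lived salt can be sketched roughly as follows. This is an illustration of the general technique only, not matomo's actual config_id algorithm; the function names, the choice of HMAC-SHA256 and the 16-character truncation are all my own assumptions.

```python
import hashlib
import hmac
import secrets


def new_daily_salt() -> bytes:
    """Generate a random salt, meant to be thrown away after 24 hours."""
    return secrets.token_bytes(32)


def visitor_pseudonym(salt: bytes, masked_ip: str, os_name: str,
                      browser: str, resolution: str) -> str:
    """Derive a short pseudonym from coarse visitor attributes.

    Assumption: a keyed hash over the joined attributes. Once the salt
    is discarded, the pseudonym can no longer be recomputed from the
    attributes, so yesterday's pseudonyms cannot be linked to today's.
    """
    material = "|".join((masked_ip, os_name, browser, resolution)).encode("utf-8")
    return hmac.new(salt, material, hashlib.sha256).hexdigest()[:16]


salt_today = new_daily_salt()
first = visitor_pseudonym(salt_today, "203.0.0.0", "Ubuntu", "Firefox", "2560x1440")
again = visitor_pseudonym(salt_today, "203.0.0.0", "Ubuntu", "Firefox", "2560x1440")

salt_tomorrow = new_daily_salt()
later = visitor_pseudonym(salt_tomorrow, "203.0.0.0", "Ubuntu", "Firefox", "2560x1440")

print(first == again)  # same salt, same attributes: same pseudonym
print(first == later)  # new salt: the link to yesterday is gone
```

Note that two different visitors with the same masked IP, OS, browser and resolution get the same pseudonym, which is the "vague outline" rather than a fingerprint.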
I think you can read the two regulation quotes above in two ways. One says that it’s about a situation with a unique identifier (e.g. an un-cropped IP address) that has been encrypted and is in danger of being unencrypted using the encryption key (the additional information). The other says that it’s about exactly the kind of situation that I have just described: Bits and pieces of inexact information all coming together to form a whole that with a bit of Mastermind™ logic can identify a person.
To make it less abstract, let’s say that there’s only one person in Gothenburg, Sweden, that runs Firefox on Ubuntu 18.04 with a desktop resolution of 1440p? No? Okay, then let’s say there are three but two of them were out of town at the time of the request. And that we know for sure that the town is entirely devoid of VPNs or Tor nodes or other kinds of proxies. Does that mean that the request sender can be definitively identified and so the package as a whole should be considered personal information?
I think there is an argument in favour of no. If you apply the “objective factors, such as the costs of and the amount of time required for identification” (also from Recital 26) to the job of identifying the Swede above, it’s clear that the amount of access and detective work required would be excessive.
However, if you ask various government agencies, another picture emerges. A spokesperson for Datatilsynet has suggested that “if your mother could identify you from the data”, it’s personal data. I don’t know about yours, but my mother does not know what browser she uses herself, let alone what I use. However, the meaning is clear: For mother, read something along the lines of “the person most ideally situated to identify the data subject given the data”. And for person, we should probably read organizations and your local secret police and internet watchdogs that could lay their hands on the data. And ironically CNIL, the French data authority, seems to agree. In their view, for my matomo install to not collect personal data, I would need to “Disable [the] Live Plugin”. What does this mean?
Tracking and profiling
In effect, the “live plugin” is what allows for the kind of granularity I gave a (hypothetical) example of with the Swedish visitor above. It means that I can see that this one visitor from a specific part of the world using a specific OS and platform and a specific browser read a specific article at a specific time – and then went on to read another one in the same series and checked out the front page. It maintains the link between various bits of information rather than registering and counting it separately. And that is the crux of the matter. Keeping separate tallies of how many Windows/Linux/other visitors and how many visitors from various countries I get is not personal data, because the OS data and the country data exist in isolation. A user is not identifiable based purely on knowledge of the make of the user’s operating system. In contrast, using the live plugin could be argued to be profiling and tracking visitors.
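The difference between separate tallies and linked records can be shown with a toy example. The page views, field names and functions below are entirely hypothetical and say nothing about how matomo stores its data; they only illustrate the distinction.

```python
from collections import Counter

# Hypothetical page views; "visitor" stands in for a pseudonym.
page_views = [
    {"visitor": "a1", "country": "SE", "os": "Linux", "page": "/email-part-1"},
    {"visitor": "a1", "country": "SE", "os": "Linux", "page": "/email-part-2"},
    {"visitor": "b2", "country": "BE", "os": "Windows", "page": "/email-part-1"},
]


def separate_tallies(views):
    """Count each attribute in isolation; the visitor key is never read."""
    return {
        "country": Counter(view["country"] for view in views),
        "os": Counter(view["os"] for view in views),
        "page": Counter(view["page"] for view in views),
    }


def linked_profiles(views):
    """Keep the attributes tied together per visitor (what the log view shows)."""
    profiles = {}
    for view in views:
        profile = profiles.setdefault(
            view["visitor"],
            {"country": view["country"], "os": view["os"], "pages": []},
        )
        profile["pages"].append(view["page"])
    return profiles


tallies = separate_tallies(page_views)
profiles = linked_profiles(page_views)
print(tallies["os"])
print(profiles["a1"])
```

From the tallies alone there is no way to tell that the Linux visits and the Swedish visits belong to the same reader; the linked profiles say exactly that.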
The way I see it, there are legitimate reasons for both “tracking” and “profiling” here but also reasons to steer clear. I use scare quotes because I don’t think either word really represents what I would be doing but I can also see why they might apply.
By profiling I mean connecting the available information about a user – country, OS, browser, etc. – with what kind of posts they are requesting. As I have made clear, matomo does make this kind of information available on a very atomic level: User A from country B using browser C on OS D read this post at time E. That level of information is fascinating but of absolutely no use to me. In matomo, however, I cannot have segmentation without it, and there are valid reasons to be interested in segmentation. A quarter of all visits on this site are from Windows users. Do their interests differ from those visiting from Linux desktops? Or are they just Linux users at home, surfing from their Windows work computers? The downside is that – at least in theory – visitor profiles increase the risk of making the data subjects identifiable. I am hesitant to embrace this as “profiling” because we are used to hearing that word in much more data-heavy contexts. Google’s FLoC initiative for instance assigns visitors a much more detailed profile than “Windows user from Belgium”.
By tracking I mean connecting multiple page views over time into a “visit”. I find long-term tracking across the web abhorrent. But as detailed at the start of this section, without cookies I can do no such thing. All I can do is see if someone viewed another post within 30 minutes of their first registered page view. I have written a 15-post series on hosting your own email. Knowing that even a single individual made it at least some way through that series would go a long way toward making the herculean effort feel somewhat worthwhile. Now, I actually thought that this would be entirely dependent on visitor profiles: If I don’t allow visitor profiles as a means of distinguishing between visitors, I cannot track visits. And using visitor profiles would bring me back to the issues with profiling and the risks of reidentification of “anonymous data”. I thought so because matomo removes the “visitor log” from the dashboard when I disable the “Live plugin”. So looking at a snapshot of individual visitors moving across the site is no longer an option. However, I can still see aggregate behaviour patterns on the “Behaviour” page of the dashboard, like entry and exit pages, average number of page views per visitor and “transitions” from one page to another. So matomo still uses the config_id hashes to distinguish visitors, just not to show individual profiles to me.
I was on the fence for a while but ultimately it was not that difficult a decision to make. I have disabled the live plugin.
I think the principle of proportionality (recital 4) is helpful here. Is the added risk of identification of data subjects – however hypothetical, however seemingly trivial and not-sensitive – proportional to my goals? Is it necessary? The answer is no. While I must do without visitor segmentation, I can get a sense of short-term visitor behaviour on the site. In effect, I can have almost everything I can ask for without enabling the live plugin. When you add that the GDPR advocates for data minimisation, it becomes overwhelmingly clear that I should not use it.
To turn off visitor profiling, I go to the “Live” section of the “General settings” in the admin panel and check “Disable visits log & visitor profile”.
There is one small adjustment I will allow myself, seeing as I don’t profile users. With the cookie-less setup matomo can attempt to identify returning visitors within the 24 hour window where visitor profile hashes are generated using the same salt. Should somebody read one post, work on something and come back later that same day (CET) to look at other stuff on the site, I would now be able to spot that pattern. As mentioned, by default matomo would only see that return as a continuation if it happened within 30 minutes of the first page view. This change is accomplished by indicating the length of the window in seconds with the following setting in the matomo global.ini.php file (according to matomo documentation it should be in a file called config.ini.php – I’m guessing either out-of-date docs or differences between the WordPress plugin and the standalone application):
; The amount of time in the past to match the current visitor to a known visitor via fingerprint. Defaults to visit_standard_length.
; If you are looking for higher accuracy of "returning visitors" metrics, you may set this value to 86400 or more.
; This is especially useful when you use the Tracking API where tracking Returning Visitors often depends on this setting.
; The value window_look_back_for_visitor is used only if it is set to greater than visit_standard_length.
; Note: visitors with visitor IDs will be matched by visitor ID from any point in time, this is only for recognizing visitors
; by device fingerprint.
window_look_back_for_visitor = 86400
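To illustrate what a longer look-back window changes, here is a simplified Python model. It is not matomo's matching logic: the function name and the rule that gaps are measured from the most recent page view are my assumptions, and only the two setting names and the "used only if greater than visit_standard_length" rule come from the comments above.

```python
from datetime import datetime


def split_into_visits(view_times, visit_standard_length=1800,
                      window_look_back_for_visitor=1800):
    """Group one pseudonym's page views into visits and flag returns.

    Simplified model: a view within visit_standard_length seconds of the
    previous view continues the visit; otherwise a new visit starts, and
    it counts as "returning" if the previous view is inside the look-back
    window.
    """
    # Per the config comment, the look-back only applies when it exceeds
    # the standard visit length.
    look_back = max(window_look_back_for_visitor, visit_standard_length)
    visits = []
    last_seen = None
    for moment in sorted(view_times):
        gap = None if last_seen is None else (moment - last_seen).total_seconds()
        if gap is not None and gap <= visit_standard_length:
            visits[-1]["views"] += 1
        else:
            returning = gap is not None and gap <= look_back
            visits.append({"start": moment, "views": 1, "returning": returning})
        last_seen = moment
    return visits


views = [
    datetime(2021, 5, 3, 9, 0),
    datetime(2021, 5, 3, 9, 20),
    datetime(2021, 5, 3, 14, 0),
]

default_visits = split_into_visits(views)
same_day_visits = split_into_visits(views, window_look_back_for_visitor=86400)
print([visit["returning"] for visit in default_visits])
print([visit["returning"] for visit in same_day_visits])
```

In both cases the afternoon page view starts a second visit; only with the 86400-second window is that second visit recognized as the same reader coming back.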
And that is feature 3 checked off. Matomo does have tools to manage GDPR requests for deletion, rectification and insight (feature 4). It’s hard to judge their fitness for purpose when I am not using them but they seem fairly competent. For anyone actually storing IP addresses as personal information and getting requests for deletion or insight, it will obviously be a difficult challenge to actually match the rights request with the logged visit, because the person requesting deletion/insight/etc. may not be able to say what IP address they had at the time of the visit. So matomo seems to do its best at helping you narrow down the list of potential candidates.
As for the impact on analytics… While I lose the granular, individual level data mentioned earlier, matomo still keeps aggregate data (and keeps on aggregating them) about operating systems and browsers and post popularity. I just cannot cross-reference them any more. So it’s a bit basic but really quite commensurate with my needs.
Count me out
As far as compliance goes, I believe I am done. There are however a few things I wanted to address before closing out.
As mentioned previously, even if visitors are not identifiable, there is an element of tracking going on, however slight. Should I offer visitors a chance to opt out nonetheless? I think it is only courteous and golden-rule-ish, so it is available as an explicit option on the privacy page. Note that matomo now has to resort to cookies in order to enforce the opt-out.
There are two other ways that users can avoid participating in the data collection. I think they’re both bad, and I’m absolutely guilty of using them both myself. I guess it’s true what they say about walking a mile in someone else’s shoes.
I have configured matomo to respect the Do-Not-Track signal. As Christopher Soghoian notes in the link, it is obviously a technically superior option to the one-cookie-per-domain solution that I implemented above. However, it treats all tracking the same. I have tried to explain why I think my extremely confined and limited interests are legitimate where those of ad networks tend to drift into the illegitimate, to put it mildly. Do-Not-Track does not care. I don’t think that’s ideal but I have in the past used DNT myself, so who am I to refuse it now?
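For reference, the Do-Not-Track signal is just an HTTP request header, `DNT: 1`. A server-side check can be as small as the sketch below; the function name and the plain-dict representation of headers are my own assumptions and this is not how matomo implements it.

```python
def tracking_allowed(headers: dict) -> bool:
    """Return False when the request carries the Do-Not-Track signal.

    Header names are case-insensitive in HTTP, so they are normalized
    before the lookup. Only the value "1" expresses the opt-out.
    """
    normalized = {name.lower(): str(value).strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"


print(tracking_allowed({"User-Agent": "ExampleBrowser", "DNT": "1"}))
print(tracking_allowed({"User-Agent": "ExampleBrowser"}))
```

The signal carries no information about who is asking or why, which is exactly the lack of nuance described above: a self-hosted counter and an ad network get the same answer.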
There is nothing to say to this other than that I hope the coming years will present better and more nuanced solutions both on a technical and legal level. The ePrivacy regulation that is set to replace the “cookie law” eprivacy directives, much like the GDPR replaced a 1995 directive, will have more explicit support for technical solutions. I am curious to see what they will be.
I am also working on a sort-of part 3 in this series. In the course of investigating how analytics collects data, I have become aware of some incidental data collection in other parts of the system. Might as well be consistent, so in part 3 I will pick up on those odds and ends. Things like server logs, Google fonts, and more.