Server logs and GDPR compliance

When running analytics on my website I am deliberately gathering and processing data. When I’m just running Nginx as a means to serve web pages, the data processing is more inadvertent because Nginx keeps logs of IP addresses unless I turn it off.

I have already talked at length about how the GDPR classifies IP addresses as personal information. But surely, this is different, right? Because we have always done it! And I didn’t do anything to make it happen, it was part of the default config!

I am sorry to tell you that neither of those arguments is going to hold up in court.

Legitimate interests? Like, socially acceptable hobbies?

Because we have been conditioned by bad solutions on the internet, there’s a decent chance you are now asking yourself: “Do I need consent to have IP addresses appear in my server logs?” The answer is no.

In article 6(1) of the GDPR, “legitimate interest” is established as a lawful basis. If I can claim “legitimate interest”, I don’t need any other lawful basis, including consent. In other words: Consent is irrelevant. If this sounds like a giant loophole, worry not, because it’s not unrestricted:

processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.
General Data Protection Regulation, Article 6(1)(f)

Three things stand out to me:

The legitimate interests of the data controller are weighed against the interest of the data subject in not having their data processed willy-nilly.
However, processing personal data must be “necessary for the purposes” of my legitimate interests if I am to claim it as lawful basis.
“Legitimate interests” are defined, or at least exemplified, in other parts of the regulation.

I will start with the last one as that will make it clear why I am talking about legitimate interest. Legitimate interest is vaguely described in recital 47, including mention of prevention of fraud:

The processing of personal data strictly necessary for the purposes of preventing fraud also constitutes a legitimate interest of the data controller concerned.
General Data Protection Regulation, Recital 47 (excerpt)

Add to that that network security is specifically called out in recital 49:

The processing of personal data to the extent strictly necessary and proportionate for the purposes of ensuring network and information security, i.e. the ability of a network or an information system to resist, at a given level of confidence, accidental events or unlawful or malicious actions that compromise the availability, authenticity, integrity and confidentiality of stored or transmitted personal data, and the security of the related services offered by, or accessible via, those networks and systems, by public authorities, by computer emergency response teams (CERTs), computer security incident response teams (CSIRTs), by providers of electronic communications networks and services and by providers of security technologies and services, constitutes a legitimate interest of the data controller concerned. This could, for example, include preventing unauthorised access to electronic communications networks and malicious code distribution and stopping ‘denial of service’ attacks and damage to computer and electronic communication systems.
General Data Protection Regulation, Recital 49

I don’t know exactly how DDoS mitigation works – maybe it’s more about patterns than perpetrators – but I think I can find some justification here in processing IP addresses.

It is a little unclear to me a) whether it is applicable to me – am I a provider of “electronic communications […] services”? – and b) whether I can only call upon this in defense of other personal data (“compromise the availability, authenticity, integrity and confidentiality of stored or transmitted personal data”). However, for the time being I am going to assume that the answer to the first is yes and that the answer to the second is either no or that I can claim defense of my own personal data residing on my web server, e.g. email address, unpublished drafts, etc.

(I don’t want to think about the idea that collecting IP addresses in itself triggers my right to collect IP addresses (now I do have personal data to protect!) because that makes my head hurt, and I am sure lawyers have a Latin term to explain why this is bad reasoning.)

However, both recitals also touch on the first bullet I drew up, that of weighing interests: “to the extent strictly necessary and proportionate for the purposes of ensuring network and information security”. This does not infringe on my right to process personal data, just on the extent. In relation to server logs, I suspect “extent” would be about what information I am collecting along with IP addresses and for how long I am keeping that information. Is it “strictly necessary and proportionate” to combatting DoS or intrusion or malware distribution to keep IP addresses on file for years on end? Obviously not. By restraining the personal data processing to the extent strictly necessary etc. I abide by the principle of data minimisation as well as respecting the interests of the data subjects.

What about my second bullet from 6(1)(f) which also talks about necessity? I think it goes deeper than the recital’s “extent strictly necessary” because it says that I only have lawful basis if I actually need personal data to pursue my legitimate interest. In other words: Legitimate interest that can be pursued using personal data is not in itself enough. There must be no other way.

That may be overstating it a bit but I think the point stands. Take the analytics case in my previous posts. Had I claimed “legitimate interests” as my lawful basis, saying I need personal data to know what I must know about visitors, you would be able to point to the part of the second post where I say that the stats I get without personal data are plenty good and sufficient. Clearly, personal data are not “strictly necessary” to me. Even supposing that “curiosity” and “SEO” come under the umbrella of legitimate interests.

Finally, it is worth making explicit that this legitimate interest in network security is not a backdoor to solving the issue of the lawful basis of personal data in analytics.

The processing of personal data for purposes other than those for which the personal data were initially collected should be allowed only where the processing is compatible with the purposes for which the personal data were initially collected
General Data Protection Regulation, Recital 50 (excerpt)

I don’t know what “compatible” means in the law but I’m guessing it doesn’t mean “completely unrelated to”. So don’t import your server logs into your analytics solution, assuming that it’s legal that way.

To conclude this section, I am less certain of my right to collect IP addresses based on legitimate interests than I was before I began it. To reiterate:

It is uncertain (to me at least) if recital 49 makes my interest in defending my web server a “legitimate interest”
Even if it does, IP addresses (and other personal data) have to be “necessary” (not just useful or utilitarian) for the defense if my legitimate interest is to provide me with a lawful basis
And even then I have to restrain my data processing to the extent “strictly necessary and proportionate” for the purpose.

Consent, schmontent

I feel like I need to add one more thing on the topic to explain my simple “no” to the question of whether consent might be needed. It should be clear from the above that there is at least one other lawful basis on which to process (collect, store, analyse) user IPs. But since it seems shaky, why not just ask for consent?

Now, I have read the otherwise excellent Ctrl.blog post on the subject and while it does a good job of explaining the subject, and is technically correct in almost every way, it hews very close to saying that you should consider asking for consent, and that consent is king, so to speak. Again, just referring to my national agency, Datatilsynet, they are at pains to point out that consent does not trump, and is not better than, other lawful bases. In fact, they encourage their audience to consider the limitations of consent: is the data subject truly informed (e.g. when clicking on a web site button), and is the data subject truly free to give consent (e.g. is an employee at liberty to say no to an employer’s request for consent?). We all know that asking for consent on the web doesn’t make sense to anybody. I would say even more so for inclusion in server logs.

While you could still ask for consent, you would then be sent down the rabbit hole of figuring out how to provide the user with options to withdraw consent, and what service if any you can provide non-consenting users with (please don’t imitate the NPR and give them a plaintext view, just to be a dick about it). I don’t want to do that – especially now that I have gotten rid of any need for consent to analytics – and so I’m going to rule it out as an option.

Alternatives to personal data

I wrote under legitimate interests that to claim it as lawful basis, the personal data has to be necessary for the purpose. I am sure that lawyers for big tech don’t find that troubling in the least. Of course, it’s necessary, they will say. We can always work backwards toward a reason.

Me, I am not so sure that I really need logging of IP addresses for anything that I would actually attempt as part of a security setup. While I haven’t dug deep into security practices, I am going to hazard a guess that 99.9 % of everything I would do could be accomplished equally well using a pseudonym and some deliberately neutered GeoIP information thrown in for good measure, similar to the matomo solution I adopted in the analytics post. The last 1 ‰ would be the case where someone was trying to upload illegal materials to my site and I would want to hand the logs over to the police.

As for DDoS: The reality of my situation (a single VPS with four cores and 8 Gb of RAM) is that my only response to a DDoS situation would be to turn off the server and hope it blows over soon. Or outsource the operation to one of the mitigation big boys. In which case it’s their logging operation and their compliance that I have to worry about.

So: Is it a real possibility to use pseudonyms/hashed IPs in server logs, similar to what matomo does? I am not sure if any servers/proxies offer that since they mostly cater to professionals who can justify the need for personal data – at least to themselves – but for now I am going to list it as an option. It would obviously need to take the same precautions as matomo does (hashing with salts that are discarded and forgotten on a regular basis).
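To make the idea of salted pseudonyms a bit more concrete, here is a minimal Python sketch. It is not how matomo or any Nginx module actually implements it; the class name `SaltedPseudonymizer`, the method names and the 16-character output length are all my own assumptions for illustration.

```python
import hashlib
import hmac
import secrets


class SaltedPseudonymizer:
    """Maps IP addresses to short pseudonyms using a secret, disposable salt."""

    def __init__(self) -> None:
        # The salt lives only in memory and is never written to disk.
        self._salt = secrets.token_bytes(32)

    def rotate_salt(self) -> None:
        # Replace the old salt. Pseudonyms issued before the rotation can
        # no longer be recomputed, so they cannot be linked to new ones.
        self._salt = secrets.token_bytes(32)

    def pseudonym(self, ip: str) -> str:
        # Keyed hash (HMAC-SHA256): without the salt, hashing every possible
        # address does not reveal which IP a pseudonym belongs to.
        digest = hmac.new(self._salt, ip.encode("ascii"), hashlib.sha256)
        return digest.hexdigest()[:16]
```

Within one salt period the same address always gets the same pseudonym, which is enough to follow a session through the log. Once `rotate_salt()` has been called, yesterday's pseudonyms are just opaque strings.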

Options, options

Much like in the analytics scenario, I think it’s worth stressing that simply because I (might) have a lawful basis for keeping personal data, that does not mean I have to do it. Keeping personal data under the GDPR comes with a lot of headaches and it is often easier to just… not. This might be by design because it pushes potential data controllers towards the very weighing of interests that the GDPR encourages.

To conclude, here are the options that I think are worth considering:

Claim legitimate interest, claim that IP addresses are necessary -> process IP addresses “to the extent necessary” (i.e. for a limited time period)
Pseudonymous/semi-anonymized logging
Either stop logging altogether or exclude IP addresses from logs

Now that I know what compliant logging could look like, it’s time to investigate what is actually happening with logs and what Nginx et al. can do.

My choice

I am going to make pseudonyms or anonymization my preferred method for a couple of simple reasons.

One of them is an issue inadvertently brought up by the ctrl.blog post: Use shred, it says, after you no longer wish to keep the IP addresses around. Only thing is, shred works wonders for hard drives, but not so much for SSDs. And as far as I know there is no reliable means to securely delete files on SSDs. Which means that I can never really be sure if I have indeed deleted the data or not.

The other is the issue I brought up earlier and which I now know falls under the catch phrase/bon mot You Aren’t Gonna Need It, a programming principle that ties in very neatly with the GDPR principle of data minimization. You can easily get caught up in trying to prepare for extensions and expansions but if you don’t need it now, maybe not build up the edifice around a hypothetical need in a hypothetical future?

I realize that this is not the end of the discussion and I may well revisit this. There are undoubtedly plenty of log analyzers and tools that expect and depend upon full, valid IP addresses to work, and these tools can genuinely be necessary to operate securely. All I am saying is that if I cannot justify it from my current needs, I think the only right thing from both a legal and technical perspective is to log without personal data registration.

Pseudonyms and anonymization

Is there a built-in pseudonymized client identifier in Nginx? No, sadly there is not. However, someone has coded just the module I’m looking for. If ipscrub is enabled as a module in Nginx it will allow me to log all sessions reliably without ever logging personal data. It also features the same salt recycling concept as matomo. The main obstacle to using it is that I would have to build it myself – and, I suspect, rebuild it whenever Nginx updates. That is a tall order for a minor piece of functionality. I’m still scarred from having to rebuild my wifi driver modules on Arch Linux with every kernel update (I had a post-it note on my screen telling me to please remember to download the needed files and run the AUR scripts after running pacman -Syu but before rebooting; otherwise I would lose the wifi connection).

I also briefly considered using various connection and session identifier variables built into Nginx but essentially there was no guarantee with any of them that they would reliably be constant for a visit from one IP address.

There are, however, techniques to anonymize IP addresses sufficiently for them not to be considered personal data. Yes, it’s the exact same trick I enabled in matomo: Only keep the first two octets, throw the others in the bin. Nginx can generate new variables by essentially running sed on existing ones. This is called mapping. Stack Overflow user Mike Bretz suggests a way to use it for older versions of Nginx and Stack Overflow user Michael Gorianskyi has updated it for Nginx versions ≥ 1.11. Both of them however include three octets in their solutions. Here’s my adapted and simplified version of Bretz’s solution:

    map $remote_addr $ip_anonym1 {
     default 0.0;
     "~(?P<ip>(\d+)\.(\d+))\.\d+\.\d+" $ip;
     "~(?P<ip>[^:]+:[^:]+):" $ip;
    }

While I don’t know the entirety of the syntax, it is clear that the IPv4 regex matches four octets and captures the first two. The IPv6 regex captures two sets of “anything but a colon” separated by a colon and ended by a colon. I have skipped the somewhat convoluted part where Bretz attaches a ‘.0.0’ to the end of the shortened address. Here’s my two octet version of Gorianskyi’s solution:

    map $remote_addr $remote_addr_anon {
    ~(?P<ip>\d+\.\d+)\.         $ip.0.0;
    ~(?P<ip>[^:]+:[^:]+):       $ip::;
    default                     0.0.0.0;
    }

This version adds the ‘.0.0’ without tripling the line count which is nice if not all that essential. The maps go in the http context, same as log_format. Possibly best to put it before log_format? Just guessing.
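Since I don’t fully trust my reading of the regexes, here is a small Python sketch that mimics the second map so I can try a few addresses without touching the server. It is a model under assumptions, not a test of Nginx itself: I am assuming that Python’s re module and Nginx’s PCRE behave the same for these simple patterns, and the function name `anonymize` is mine.

```python
import re

# Same patterns as in the map block, tried in the same order.
IPV4_PREFIX = re.compile(r"(?P<ip>\d+\.\d+)\.")
IPV6_PREFIX = re.compile(r"(?P<ip>[^:]+:[^:]+):")


def anonymize(remote_addr: str) -> str:
    """Return what the map is expected to produce for remote_addr."""
    match = IPV4_PREFIX.search(remote_addr)  # unanchored, like "~" in Nginx
    if match:
        return match.group("ip") + ".0.0"
    match = IPV6_PREFIX.search(remote_addr)
    if match:
        return match.group("ip") + "::"
    return "0.0.0.0"  # the map's default


if __name__ == "__main__":
    for addr in ("203.0.113.7", "2001:db8:85a3::8a2e:370:7334", "::1"):
        print(addr, "->", anonymize(addr))
```

One thing the model brings out is that the patterns are unanchored: an address that starts with a colon, such as ::1, matches neither and falls through to the default, while an IPv4-mapped address like ::ffff:192.0.2.1 would be caught by the IPv4 pattern, should one ever show up in $remote_addr.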

Testing that it works as intended is as easy as writing a new log_format that incorporates both the original IP address (remote_addr) and the new variable:

log_format gdprtest '$remote_addr - $remote_addr_anon [$time_local] '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent"';

And then indicate in the proper context (usually server) that Nginx should use the format that I just defined:

access_log    /var/log/nginx/my.access.log gdprtest;

The log output matches the expectations:

123.456.1.93 - 123.456.0.0 [16/Dec/2021:05:37:07 +0100]  "GET / HTTP/1.1" 200 12340 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"

All I need to do now is to rewrite my log_format from “gdprtest” to … let’s call it “anoncombined” as it is an anonymous version of the default “combined” format.

log_format anoncombined '$remote_addr_anon [$time_local] '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent"';
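Once the new format is live, I would like some reassurance that no full address sneaks into the access log. The following Python sketch checks only the first field of each line, because the user agent can contain version strings that look a lot like IP addresses. The function names are hypothetical, and the check assumes the anoncombined layout above, where the anonymized address comes first.

```python
def client_field_is_anonymized(log_line: str) -> bool:
    """True if the first field of an anoncombined line is a truncated address."""
    fields = log_line.split(maxsplit=1)
    if not fields:
        return False
    client = fields[0]
    # The map emits "a.b.0.0" for IPv4, "x:y::" for IPv6, "0.0.0.0" otherwise.
    return client.endswith(".0.0") or client.endswith("::")


def find_leaks(lines):
    """Return 1-based numbers of non-empty lines whose client field looks full."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.strip() and not client_field_is_anonymized(line)
    ]
```

Running `find_leaks(open("/var/log/nginx/my.access.log"))` should then return an empty list. A real address that happens to end in .0.0 would slip through, so this is a sanity check rather than proof.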

Error logs

You might have spotted the huge glaring hole here which is that Nginx error logs do not support custom logging formats. You enable the error log for a server (or location or…), set the severity level and that’s it. From the examples I have found in my logs, I can say that Nginx can register client IP addresses as part of the error log – and probably does so as a matter of course. Can this be considered part of legitimate interests etc.? In all probability, yes, but again, if I am not actually going to use it, I would rather not have it for reasons already stated.

This is, however, where we run out of good, easy solutions. If all I care about is compliance, I can set Nginx to log errors to /dev/null. That is a bit extreme, though. I do still need to know about and be able to investigate errors on my server. So, here I am going to cave and accept processing of personal data. I don’t think it invalidates the previous efforts: logging personal data only for errors strengthens the argument that personal data are indeed needed here, and a large reduction in the processing is still a win, even if it’s not a 100% elimination.

Now that I am processing personal data, however, I need to have a plan for how to get rid of it again – and to do so conclusively. That’s surprisingly difficult to ensure these days. If I log to my VPS’s SSD, there aren’t any easy ways to decisively delete the logs beyond recovery. The simple solution to this problem would be to add hard drive space to the VPS, but SSDs have become so ubiquitous that that is actually quite expensive (almost doubling my monthly rent for a measly 100 GB of storage that isn’t even local). Even then, I cannot actually be sure that it is indeed a mechanical hard drive, only presume so since it should be cheaper for the host. And can you even run shred on a drive you can only access through FTP?

Here’s another idea: Nginx can log to stderr. If the service is run by systemd, that output will get caught and can be inspected using journalctl. With some caveats that information will only get logged to volatile memory that can be purged on a regular basis, either by reboots or using journalctl features.

I found two problems with this approach: First, Debian/Ubuntu runs Nginx as a daemon, even from the systemd service (compare the service file in the nginx-common package with the one suggested by the Nginx developers). Ordinarily, systemd would just use its own powers of “daemonization” rather than use some inbuilt ability in the application to daemonize. In other words: The process thinks it is running in the foreground while actually it’s in the background, reporting to systemd. This peculiar Debian-specific Nginx setup means that any output from Nginx is not captured by the systemd journal. Secondly, even if it did work it would mean that all applications using the same Nginx service (whether as a web server or reverse proxy) would be reporting errors into the same stream.

So, lacking either of these options, I settled on using tmpfs. Essentially, 1 GB of RAM will be set aside and mounted as a partition. As RAM is volatile, there should be no traces left of the logs, though whether that holds for a VPS as well as physical machines, I don’t know. But it’s the best bet I’ve got.

There is not that much to using a ramdisk. Mount it, start using the new mount directory, and check that the setup survives reboots. Here’s my fstab line mounting the ramdisk to a /mnt/logtmp directory, taken almost verbatim from the Arch wiki:

tmpfs   /mnt/logtmp   tmpfs rw,size=1G,nr_inodes=5k,noexec,nodev,nosuid,uid=33,gid=4,mode=1755 0 0

The crucial details are that I set aside 1G for the directory, that it is owned by www-data/adm (the uid=33 and gid=4 above), just like the regular /var/log/nginx directory, and that it has a standard 755 permission set (the leading 1 in mode=1755 adds the sticky bit on top). I set nginx to use it by changing the error_log settings in all relevant locations:

error_log               /mnt/logtmp/nginx_wp_brokkr.error.log error;

Note that I don’t use an nginx subdirectory, as it would have to be recreated on each boot – or else I would have to create a separate tmpfs mount for each service logging to /mnt/logtmp. If the subdirectory somehow failed to be created, starting Nginx would fail. While it is obviously possible to create pre-launch scripts, I would rather not jeopardise the launch of the most important service on the server. So, I just use nginx_ as part of the filename instead.
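To check after a reboot that the error logs really end up in RAM and not on the SSD underneath an unmounted directory, I can look the mount point up in /proc/mounts. Here is a minimal Python sketch of that check. It assumes a Linux system where /proc/mounts uses the usual whitespace-separated layout, and the function name `is_tmpfs_mount` is made up for the occasion.

```python
def is_tmpfs_mount(mount_point: str, mounts_text: str) -> bool:
    """True if mounts_text (in /proc/mounts format) lists mount_point as tmpfs."""
    result = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # Layout per line: device mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[1] == mount_point:
            # Later entries shadow earlier ones, so keep the last match.
            result = fields[2] == "tmpfs"
    return result


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        print(is_tmpfs_mount("/mnt/logtmp", mounts.read()))
```

If this prints False after a reboot, the fstab line did not take effect and Nginx would be writing error logs with IP addresses straight to the regular disk.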

Rotate and reboot

So, I’ve got anonymized access logs and non-anonymized error logs. In both cases – though technically only necessary in the error logs case – I figure it still makes sense to adhere to the GDPR’s “to the extent necessary” clause. This means that I will not keep the logs indefinitely, rather deleting them after two weeks.

The main way to do this with the tmpfs setup – I think – is to just set up cron to schedule a reboot at regular intervals as this will help let go of any traces of the deleted files. However, to decrease chances of accidentally keeping logs and to keep them in an orderly one-day-one-file system, I am going to add a routine to logrotate especially for these logs.

Running Ubuntu on my server, log rotation is already taken care of by the logrotate cron job. Nginx installs its own custom logrotate config in the folder /etc/logrotate.d/:

/var/log/nginx/*.log {
	daily
	missingok
	rotate 14
	compress
	delaycompress
	notifempty
	create 0640 www-data adm
	sharedscripts
	prerotate
		if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
			run-parts /etc/logrotate.d/httpd-prerotate; \
		fi \
	endscript
	postrotate
		invoke-rc.d nginx rotate >/dev/null 2>&1
	endscript
}

The most relevant parts here are the two lines saying daily and rotate 14. “daily” obviously tells you how often the Nginx logs are rotated when logrotate’s cron job is run. “rotate X” tells you how many archived logs should be kept. So “rotate 14” combined with “daily” means that once per day, logrotate will delete the oldest archived .log(.gz) file in /var/log/nginx/ once a 15th one would otherwise exist. That all seems fine. Note that logrotate seems capable of distinguishing between log series, so server_a.access.log.* gets 14 archived logs and server_b.access.log.* gets 14 archived logs etc.

Finally, it’s worth paying attention to the notifempty line. If the current, active log file (the one without a number after .log) is empty (i.e. has a length of 0 bytes) when logrotate comes calling, logrotate will just pass it by because after all, it’s not taking up any space. Only once the length exceeds 0 will logrotate actually do something with the series. This default shows that the guiding principle of log rotation has historically been – and still is – disk space conservation and “decluttering”, not personal data protection. The net effect of notifempty can be seen in one application’s error logs:

-rw-r----- 1 A B   0 Sep 13 06:25 app.error.log
-rw-r----- 1 A B 538 Sep 12 15:43 app.error.log.1
-rw-r----- 1 A B 235 Aug 29 13:50 app.error.log.2.gz
-rw-r----- 1 A B 237 Aug 28 16:45 app.error.log.3.gz
-rw-r----- 1 A B 228 Mar 27  2021 app.error.log.4.gz
-rw-r----- 1 A B 494 Feb  4  2021 app.error.log.5.gz
-rw-r----- 1 A B 226 Aug 23  2020 app.error.log.6.gz
-rw-r----- 1 A B 440 Aug 11  2020 app.error.log.7.gz

Yes, the logs are rotated daily and only 7 archives are kept… but they stretch back more than a year, clearly clumping around dates where changes have produced issues. Obviously, there can be value in being able to investigate “historical” issues. Has this happened before? How often does it come up? Does it resolve itself magically? However, I don’t think that’s the spirit of “to the extent necessary”. So I’m going to change notifempty to ifempty. This will delete older log archives – that may have actual errors/access data in them – and replace them with newer ones that will often just be a zero length, empty placeholder file (when no errors/access occurred on that day).
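The difference between notifempty and ifempty is easy to underestimate, so here is a toy Python model of it. It is only a simulation of my understanding of the two options, not of logrotate’s actual code, and the function name and parameters are invented for the example: each day the log is rotated if it has content or if empty logs are rotated too, and at most `rotate` archives are kept.

```python
def oldest_archive_age(activity_days, total_days, rotate, rotate_empty):
    """Age in days of the oldest archive left after total_days of daily runs.

    activity_days: set of day numbers on which something was logged
    rotate:        number of archives to keep (the "rotate N" setting)
    rotate_empty:  True models ifempty, False models notifempty
    """
    archives = []  # day numbers on which each archive was written, newest first
    for day in range(total_days):
        if day in activity_days or rotate_empty:
            archives.insert(0, day)
            del archives[rotate:]  # drop anything beyond the newest N
    if not archives:
        return None
    return (total_days - 1) - archives[-1]


if __name__ == "__main__":
    errors_once_a_month = set(range(0, 365, 30))
    print(oldest_archive_age(errors_once_a_month, 365, 7, rotate_empty=False))
    print(oldest_archive_age(errors_once_a_month, 365, 7, rotate_empty=True))
```

With errors on roughly one day a month and 7 archives kept, the model leaves data that is about half a year old under notifempty, while under ifempty nothing is older than a week.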

Notice how I used the singular “file” a couple of paragraphs ago? As in: logrotate will remove a single file (per identified log series) when run. At least this is the behaviour I have observed and wondered at. Wondered at because surely a “loop-delete until there is no more than X” files would make more sense – that would automatically remove all old files when the count is decreased. As it is, it seems I would have to manually delete older files, if I was decreasing the count.

To take care of the special case of my error logs in the tmpfs folder, I copy the nginx config file in /etc/logrotate.d to an nginxtmp config file and change the location to match my new tmpfs setup:

/mnt/logtmp/nginx_*.log {
        daily
        missingok
        rotate 14
        compress
        delaycompress
        ifempty

        ...

And before I forget: I also need to add a reboot to the crontab. Run sudo crontab -e and add the line:

# m h  dom mon dow     command
30 6  1  * *           /sbin/shutdown --reboot now

Taking care, of course, not to rely on shutdown being in the PATH but to provide the absolute path.

Conclusion

It’s not a perfect solution and I’m sure that it can be improved upon. However, the most surprising thing to me is just how seemingly unprepared we still are for the GDPR 3 years (!) after it came into effect. How are there not a few standard settings/solutions in Ubuntu, Nginx and/or logrotate that sysadmins can use to quickly and easily be compliant with it? It should be clear from the above that there is not going to be a one-size-fits-all compliant solution but there could at least be a couple of ready-to-wear, off-the-shelf ones as built-ins.