Alphanumericalize: Munging music file names in ABCDE

Rip-roaring and rigorous

Ripping CDs feels a lot more 2005 than 2015. And it was probably already starting to feel old back then. However, I never got round to doing a systematic, lossless rip of my entire CD collection. So before it’s too late I have bought a DVD-RW drive that will never get to write a single disc and armed myself with a 1 TB USB drive to hold all the data.

My weapon of choice is ABCDE, A Better CD Encoder, a Bash-based CLI tool for ripping that has been in continuous development since before 2002 (!). ABCDE has many advantages over GUI CD rippers, the most obvious being that you don’t need to fiddle with the mouse; you just hit enter. This is important if you’ve got a case of hundreds of CDs to get through.

Another advantage is that the settings file allows you to string a piped line of stream editors together when forming the file name. What this means is that once you have determined how the file names should be formed, e.g. “Artist/Album/Track no. – Track title”, you can manipulate them further, e.g. by making them entirely lower- or uppercase, removing slashes, control characters and so on. The genius of ABCDE is that rather than implement these features itself, it simply allows you to use tried and tested Unix utilities like sed and tr, and whatever else you care to throw at it, to accomplish this.

Here I’m going to detail generally how this ‘filename munging’ works and specifically I’m going to show how to get file names that are universally OS-, URL- and command line friendly by only using numerical digits, the letters a-z, underscores (_) and hyphens (-).

The relevant part of the ABCDE settings file looks like this:

# Custom filename munging:
# By default, abcde will do the following to CDDB data to get a useful
# filename:
# * Translate colons to a space and a dash for Windows compatibility
# * Eat control characters, single quotes, and question marks
# * Translate spaces and forward slashes to underscores
# To change that, redefine the mungefilename function.
# mungefilename receives the CDDB data (artist, track, title, whatever)
# as $1 and outputs it on stdout.
#mungefilename ()
#{
#       echo "$@" | sed s,:,\ -,g | tr \ / __ | tr -d \'\"\?\[:cntrl:\]
#}

To use the functionality, you uncomment the function – the last four lines – and edit the line beginning with ‘echo’ so it “reflect[s] your inner desire to organize things differently than everyone else” to quote another part of the ABCDE settings file.
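To get a feel for what the default does before changing it, here is a small sketch that simply runs the uncommented function on a track title. The title is a hypothetical one that I made up to hit every rule at once; nothing else is added to the function from the settings file.

```shell
# The default mungefilename from the settings file, uncommented.
mungefilename ()
{
        echo "$@" | sed s,:,\ -,g | tr \ / __ | tr -d \'\"\?\[:cntrl:\]
}

# Hypothetical track title: contains a colon, an apostrophe, spaces and a question mark
mungefilename "What's Up: Live?"
echo    # the final tr also deletes the newline, so add one for readability
```

On my reading of the pipeline this prints Whats_Up_-_Live: the colon becomes a space and a dash, the spaces become underscores, and the apostrophe and question mark are dropped.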

I don’t want to make it halfway through my CD collection before realizing that I wanted all the file names some other way. So I have sat down and thought through how I want things differently than everybody else. You’ll probably want something different but hearing my preferences should at least help you to realise what that something else is. Here are my criteria for the new file names:

  1. All lowercase. This is partly aesthetics, partly CLI functionality: It’s easier on my eyes and it’s quicker to type.
  2. Transliteration. When there are ‘foreign’ characters (I’m adopting the Anglo-Saxon perspective on foreign here) in a file or directory name, you can do one of three things: Keep them, discard them or transliterate them. The first will make them harder to type on the command line, and the second will risk completely eliminating all characters from, say, a directory name, which will cause problems. Neither will look particularly ‘orderly’ and even though Unicode is slowly becoming the norm on the internet, it’s still safer to avoid ‘foreign’ characters in file names. The third option, transliteration, will attempt to turn non-ASCII characters into ASCII, e.g. é becomes e, Ø becomes Oe etc.
  3. Spaces are represented by underscores. Some people love them, some people hate them. Underscores make for much easier command line manipulation of files and are traditionally more URL-friendly. If you’ve ever downloaded a file with lots of “%20”s in the file name, you’ve seen what can happen when files with spaces in the name cross the internet. This can be worked around by clever coding of servers, CMSes and browsers. Or you can just learn to love the underscore.
  4. Not all of ASCII, only a subset. Specifically [a-z], [0-9] and underscores and hyphens/minus signs. Some of the excluded signs are special characters, i.e. they serve some special purpose in various operating systems such as delimiting directories; others just don’t look ‘orderly’ to my eye, like ‘&’. I like keeping it simple and this subset is a nice minimalist choice: You have letters for titles, numbers for track numbering, underscores to separate words and hyphens to separate categories, e.g. artist and album if both appear in the file name.

All right, now how do we accomplish this? Let’s look at the filename munging function again.

echo "$@" | sed s,:,\ -,g | tr \ / __ | tr -d \'\"\?\[:cntrl:\]

If you’re unfamiliar with the syntax, here are the basics. The pipes ‘|’ are the main partitioners of the line. Think pipes with water running through them, not the other kind. A pipe takes the output of the command on its left and feeds it as input to the command on its right, and so on through every pipe on the line. The first command, echo "$@", prints the piece of CDDB data that the function was handed (artist, album or track title). Each subsequent block (the commands in between pipes) manipulates the output of the previous one in some way. The results are then slotted into the file name structure that you have defined in OUTPUTFORMAT (e.g. OUTPUTFORMAT='${ARTISTFILE}-${ALBUMFILE}/${TRACKNUM}.${TRACKFILE}'). You should note, though, that only the contents of the variables in that line are being manipulated. So for instance discarding all forward slashes will not make directories disappear if you have mandated them in OUTPUTFORMAT. Only forward slashes in the artist, album or track name will be deleted.
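The point about directories surviving can be sketched in a few lines. This is only an illustration of the idea, not ABCDE’s actual internals: munge is a hypothetical stand-in for mungefilename, and the variables are filled in by hand rather than by ABCDE.

```shell
# Hypothetical stand-in for mungefilename: spaces and slashes become underscores
munge ()
{
        echo "$1" | tr ' /' '__'
}

# Each piece of metadata is munged on its own ...
ARTISTFILE=$(munge "AC/DC")
ALBUMFILE=$(munge "Back in Black")
TRACKFILE=$(munge "Hells Bells")
TRACKNUM=01

# ... and only afterwards slotted into a layout in the style of OUTPUTFORMAT
path="${ARTISTFILE}-${ALBUMFILE}/${TRACKNUM}.${TRACKFILE}"
echo "$path"
```

The slash inside the artist name is replaced, while the slash that belongs to the layout is never seen by the munging function and so still separates directory from file.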

Well, first off, the default line goes some way toward accomplishing what I want but it doesn’t get me all the way there. Take the final tr command which deletes (-d) all the characters mentioned and outputs what’s left of the input. (Note that the characters apostrophe, double quote and question mark are all escaped using a backslash so that the shell will not interpret them. [:cntrl:] is a class of control characters – tr has a number of such classes, see the man page.) This command gets rid of a lot of unwanted characters but it still leaves a lot that I’d rather not see in the file name. So I’ll start from scratch in building up the line rather than rework the default.

Transliteration

First up is transliteration. I’m going to use Unicode, specifically UTF-8, as the format of the metadata in the following. We put transliteration first in line for two reasons: Firstly, we don’t want to risk tr or sed brutalising delicate Unicode before we get to brutalise it ourselves. Secondly, transliterating international characters can itself output strings that we would want to manipulate in the same way as the rest of the string.

You should note that what we are transliterating is the filename, not the metadata. I want my metadata to be as representative of linguistic diversity as possible, keeping every little weird squiggle from every corner of the UTF-8 table (I’m working with FLAC so UTF-8 is a given unlike with ID3 tags). Therefore I also make a point of avoiding getting transliterated metadata when ABCDE presents a choice of metadata sets, e.g. a choice between “Með blóðnasir” and “Med blodnasir” from Sigur Rós; I will transliterate when passing from metadata to filenames and not before.

There are a number of CLI tools available for transliteration, most prominently iconv. I’m Danish, and seeing iconv fail to recognise the Danish-Norwegian letter ‘Ø’ (known to English speakers as ‘slashed O’ and to Monty Python fans as the correct spelling of the word moose) and just transform it into a question mark makes me suspect that it’s not all that great a choice, even for non-Scandinavians. The Python Unidecode library looks more promising. As an example, unidecode turns the sentence “Åse bor i Ærøskøbing” into “Ase bor i AEroskobing”. It’s not perfect; only the first letter of the ‘AE’ should have been capitalised and I would prefer the ø to be turned into ‘oe’ and the å to transliterate to double, not single a. But anything’s better than question marks.

The Ubuntu package with unidecode for Python 2 (python-unidecode) does not contain a command-line implementation so we have to create one ourselves:

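#! /usr/bin/env python
# Python 2 script: transliterate one line of UTF-8 from stdin to ASCII.

import sys
import unidecode

# Read the piped line and decode the UTF-8 bytes into a unicode object
line_input = sys.stdin.readline()
utf_object = unicode(line_input,'utf8')
# Replace every non-ASCII character with an ASCII approximation
line_output = unidecode.unidecode(utf_object)

print line_output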

We get the library for reading piped input and the unidecode library. Then we read what’s being piped to the script and turn it into a unicode ‘object’ so that we can transliterate or ‘unidecode’ it. In the final line we simply output the result so that it can be used by the next command block.

I mentioned that we start out with transliteration for various reasons. Let’s see if that makes sense with our new tool. The Japanese word for Japan is 日本国. Let’s see what our script makes of that:

[~] echo -n "日本国" | transliterate.py
Ri Ben Guo

We get spaces, and those spaces will later be turned into underscores. We also note that transliterate does not mean translate – that would have output State of Japan – or even romanize well: the romanization scheme that Wikipedia uses would have given us Nihon-koku. What this does is turn an input into an alphanumerical ASCII output. Sometimes the output makes sense, sometimes it doesn’t. We have to take what we can get.

Just to hammer the point home here’s the Hebrew word for Hebrew:

[~] echo -n "עברית" | transliterate.py
`bryt

As you can see, transliterate is not above outputting accents. Fortunately, that will not survive the commands to come. And in case you’re wondering: No, the grave accent in the output does not correspond to the apostrophe-like thing in the input (which is not an apostrophe).

So here’s the function that we start out with:

mungefilename ()
{
        echo "$@" | transliterate.py
}

Special characters to words

Before we start discarding all the messy characters, there are some worth saving. The ampersand (&) should be turned into an ‘and’ (assuming the language to be English…). And in some cases a dollar sign ($) can be turned into the word ‘dollars’ (but in others it will just confuse, e.g. Dr. Dre’s “The $20 Sack Pyramid”). There are probably other cases as well. I will stick to the basics and just go with the ampersand replacement, using sed for substitution:

sed s,\&,'and',g

If I get a non-English disc with an ampersand in it, I can always tell ABCDE that I want to edit the information before it starts and manually replace the sign with ‘et’, ‘und’, ‘og’, ‘e’ etc. depending on the language. Our function now looks like this:

mungefilename ()
{
        echo "$@" | transliterate.py | sed s,\&,'and',g
}
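A note on why the ampersand has to be hidden from the shell: an unprotected & tells the shell to run the preceding command in the background, so sed would never see it. Inside the search half of a sed substitution the ampersand is an ordinary character (it is only special in the replacement half). The sketch below uses single quotes around the whole expression instead of a backslash, which amounts to the same thing; ampersand_to_and is a hypothetical name used only for this demonstration.

```shell
# Quoting the whole sed expression keeps the shell away from the ampersand
ampersand_to_and ()
{
        echo "$@" | sed 's,&,and,g'
}

ampersand_to_and "Simon & Garfunkel"
```

This should print Simon and Garfunkel.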

Underscores

One-for-one sign substitutions are best left to the wonderfully simple and effective tr. Here we tell it to substitute spaces (escaped with a backslash) as well as forward slashes (set 1) with underscores (set 2). Set 2 holds only one character where set 1 holds two, but GNU tr pads the shorter set by repeating its last character, so a single underscore covers both. If you’re having difficulty interpreting the line and can only see the outline of an angry mouse, then notice that the non-escaped space (the one before the underscore) is separating the two sets.

tr \ / _

This is our function now:

mungefilename ()
{
        echo "$@" | transliterate.py | sed s,\&,'and',g | tr \ / _
}
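If the backslash-escaped space is hard on the eyes, the two sets can also be written in quotes. The sketch below does that and spells out both underscores in set 2, as the default function does, so that it does not depend on how a particular tr pads a short second set. spaces_to_underscores is a hypothetical name for this demonstration only.

```shell
# Set 1 is "space and slash", set 2 is "underscore and underscore"
spaces_to_underscores ()
{
        echo "$@" | tr ' /' '__'
}

spaces_to_underscores "13th Floor/Growing Old"
```

This should print 13th_Floor_Growing_Old.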

Lowercase

Again, one-for-one substitutions are the domain of tr:

tr "[:upper:]" "[:lower:]"

Here the substitutee and the substituter are the aforementioned tr classes, namely one designating uppercase characters and one designating lowercase characters. If tr encounters a ‘D’, the fourth member of the [:upper:] set, it will replace it with the fourth member of the [:lower:] set – which happens to be ‘d’. Perfect.

mungefilename ()
{
        echo "$@" | transliterate.py | sed s,\&,'and',g | tr \ / _ | tr "[:upper:]" "[:lower:]"
}

Delete everything but

Okay, so all that remains is, well, getting rid of whatever remains of unwanted characters. We could do as the default function does and list every single one of the characters to delete. Or we could use tr’s complementary (-c) flag and tell it what we want to keep, rather than what we want to shed. I like the second option best. Let’s do that.

tr -dc "0-9a-z_-"

We delete (-d) the complement (-c) of the set we list, i.e. everything that is not in the set. The set is made up of the digits 0 to 9 AND the lowercase letters a to z AND the underscore AND the hyphen (which is placed last in the set so that tr reads it as a literal hyphen rather than as part of a range like 0-9).
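Here is a quick sketch of the complement trick on its own, run on a string that has already been lowercased and underscored. The input is a hypothetical intermediate result and keep_safe_characters is a name I made up for the demonstration.

```shell
# Keep only digits, lowercase letters, underscores and hyphens; delete the rest
keep_safe_characters ()
{
        echo "$@" | tr -dc '0-9a-z_-'
}

keep_safe_characters "ac_dc_-_t.n.t._(live!)"
echo    # tr -dc deletes the newline too, so add one for readability
```

The dots, brackets and exclamation mark disappear and the hyphen survives, which should leave ac_dc_-_tnt_live.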

And so we finalise the function like this:

mungefilename ()
{
        echo "$@" | transliterate.py | sed s,\&,'and',g | tr \ / _ | tr "[:upper:]" "[:lower:]" | tr -dc "0-9a-z_-"
}

No repeat spaces

Wait, this solution isn’t complete. Take the song “It Overtakes Me / The Stars Are So Big… I Am So Small… Do I Stand a Chance?” by The Flaming Lips. When run through our mungefilename function it becomes:

it_overtakes_me___the_stars_are_so_big_i_am_so_small_do_i_stand_a_chance

The spaces and the forward slash are all replaced by underscores – which means we now have three underscores in a row. Not pretty. And if we delete forward slashes instead of replacing them with underscores, we’ll get problems where you don’t have spaces and slashes but only slashes (e.g. OutKast’s “13th Floor/Growing Old”). Oh tr, is there anything you can’t do? Well, killing repeating signs is perfectly within its powers, it seems:

tr -s _

-s is for ‘squeezing’. If there is more than one underscore in a row they get reduced to a single instance. Exactly what we wanted.

mungefilename ()
{
        echo "$@" | transliterate.py | sed s,\&,'and',g | tr \ / _ | tr "[:upper:]" "[:lower:]" | tr -dc "0-9a-z_-" | tr -s _
}
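To try the whole chain without having the Python script at hand, here is a sketch that leaves out the transliterate.py stage and therefore assumes plain ASCII input (three full stops instead of the ellipsis character). mungefilename_ascii is a hypothetical name; the sets are written in quotes rather than with backslashes, but the steps are the same as in the function.

```shell
# The pipeline minus transliteration: and-ify, underscore, lowercase, filter, squeeze
mungefilename_ascii ()
{
        echo "$@" | sed 's,&,and,g' | tr ' /' '__' | tr '[:upper:]' '[:lower:]' | tr -dc '0-9a-z_-' | tr -s '_'
}

mungefilename_ascii "It Overtakes Me / The Stars Are So Big... I Am So Small... Do I Stand a Chance?"
echo    # the output carries no newline of its own
mungefilename_ascii "13th Floor/Growing Old"
echo
```

The first call should give it_overtakes_me_the_stars_are_so_big_i_am_so_small_do_i_stand_a_chance with single underscores throughout, and the second 13th_floor_growing_old.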

I’m sure the function can be improved upon further but we’ll leave it for now. ABCDE is a great CD ripper for the control freak and tools like tr, sed and Python complement it beautifully.

P.S. As a bonus to the reader getting all the way down here, here’s a tip for the sister function to mungefilename, mungegenre. The genre metadata that ABCDE gets from CDDB tends to be BS. For one thing the number of genres is so small that it’s useless for anything but telling Classical from Rock. For another the uploaders tend to use them in strange ways, calling LCD Soundsystem ‘data’ and the like. So here’s what I do with the genre manipulation in metadata:

mungegenre ()
{
        echo -n
}

Yup, that function outputs nothing at all. This overwrites whatever nonsense ABCDE got from CDDB and so your metadata doesn’t get contaminated.

Compact disk © Jan Huber, Unsplash License
