Renaming music files with beets – and avoiding troublesome characters

Avoiding troublesome characters? Story of my life, mate

The joke about music collection rearranging in the movie High Fidelity (and/or the novel – it’s been a long time since I saw and read them respectively) is that it’s a ritual undertaken in times of emotional stress, asserting power over something, anything, just to feel in control. I guess June 2020 should automatically qualify as emotional stress, no matter who you are. My real impetus, however, was just the need to embed album art into the files because my new favourite Subsonic substitute, Navidrome, does not as yet support the folder.jpg/cover.jpg convention.

This led me back to a GitHub project starred long ago (was it “beans”? “radishes”? Something to do with root vegetables, right?) and so I spent most of a Sunday ploughing through beets’ excellent documentation. The deepest dive was reserved, however, for working out what set of characters I think should be used for filenames and why. There is very little of practical use in this – beets’ default settings should be sufficient to avoid most file naming issues – but I found it weirdly fascinating reading about reserved characters.

Beets’ file naming comes from three sources: the asciify_paths setting; path_format, i.e. the structure of the path (what elements to use, what functions to apply to them); and the replace setting. The path_format setting is not the place to determine which characters are allowed and which are banned, so I’m going to leave that one out of the discussion here.

asciify_paths uses the unidecode library to turn Unicode characters into something that can be contained within the ASCII set. For instance a Danish ‘æ’ becomes ‘ae’, a French ‘é’ becomes just plain ‘e’, etc. In the Linux world of today there is very little standing in the way of Unicode file names, but for backwards and keyboard compatibility – entering Japanese kanji on a non-Japanese keyboard is tricky – I do enable asciify_paths. This also narrows the allowed character set to the point where enumerating the allowed characters, as opposed to listing those banned, actually becomes a viable strategy.

This leaves the replace setting which is where the magic happens. This is a series of lines of matches and replacements:

replace:
    'regex_of_match' : character_to_replace_match_with

Whenever any part of the path_format matches the regular expression on the left hand side, it is replaced by the right hand side. The default setting is this:

replace:
    '[\\/]': _
    '^\.': _
    '[\x00-\x1f]': _
    '[<>:"\?\*\|]': _
    '\.$': _
    '\s+$': ''
    '^\s+': ''
    '^-': _

These rules ensure that the file names beets produces avoid characters prohibited by common file systems.
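To see what the defaults do in practice, here is a minimal Python sketch that applies the same eight patterns, in order, to a single path component. It is not beets’ own code: the helper name sanitize_component and the sample names are made up for illustration, and it assumes the rules are simply applied one after another with re.sub.

```python
import re

# The default rules quoted above, in the same order, as (pattern, replacement).
DEFAULT_REPLACE = [
    (r'[\\/]', '_'),         # back- and forward slashes
    (r'^\.', '_'),           # leading period
    (r'[\x00-\x1f]', '_'),   # control characters 0-31
    (r'[<>:"\?\*\|]', '_'),  # characters NTFS/Windows reject
    (r'\.$', '_'),           # trailing period
    (r'\s+$', ''),           # trailing whitespace
    (r'^\s+', ''),           # leading whitespace
    (r'^-', '_'),            # leading hyphen
]


def sanitize_component(name, rules=DEFAULT_REPLACE):
    """Apply every rule in turn to one path component (hypothetical helper)."""
    for pattern, replacement in rules:
        name = re.sub(pattern, replacement, name)
    return name


print(sanitize_component('AC/DC'))           # AC_DC
print(sanitize_component('.hidden: what?'))  # _hidden_ what_
print(sanitize_component('  -Intro-  '))     # _Intro-
```

Because the rules run in sequence, order matters: in the last example the whitespace is stripped first, which is what exposes the leading hyphen to the final rule.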

OS and file systems rules

[\x00-\x1f] matches the first 32 (nos. 0-31) characters in the ASCII set, known as the control characters. These do not produce visible output, so it’s hard to see what value they would provide, and they are explicitly disallowed in file names by both FAT* and NTFS file systems. According to Wikipedia, most Linux file systems are more permissive, only banning the first control character \x00, also known as null. Strangely, beets’ default does not ban the solitary control character at the end of the ASCII set, \x7f (127 in decimal), though this too would make an illegal file name on FAT/NTFS.
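The gap around \x7f is easy to check with Python’s re module. The sketch below compares the default class with a hypothetical extended one that also covers DEL; the names DEFAULT_CONTROL, EXTENDED_CONTROL and strip_control are invented for the example, and it assumes the pattern behaves the same way inside beets.

```python
import re

DEFAULT_CONTROL = re.compile(r'[\x00-\x1f]')       # beets' default class
EXTENDED_CONTROL = re.compile(r'[\x00-\x1f\x7f]')  # same class plus DEL (127)


def strip_control(name, pattern):
    """Replace every character matched by `pattern` with an underscore."""
    return pattern.sub('_', name)


name = 'Track\x7fOne'
print(repr(strip_control(name, DEFAULT_CONTROL)))   # DEL survives
print(repr(strip_control(name, EXTENDED_CONTROL)))  # DEL is replaced
```

In the beets config the extended class would presumably be written the same way as the default, as '[\x00-\x1f\x7f]': _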

Beets’ default incorporates some further restriction from NTFS:

'[\\/]': _
...
'[<>:"\?\*\|]': _

So no back- or forward slashes. In case you’re wondering: The first class consists of an escaped backslash ‘\\’ and a forward slash. It can be quite tricky keeping track of what YAML (beets’ config format of choice) and regular expressions require for referencing various reserved characters in addition to the file naming issues. For now, just keep in mind that inside regex classes, the following characters need escaping:

^-]\

We will get to issues with YAML later on. The second handful of characters – pointy brackets, colons etc. – are also specifically those deemed unacceptable by NTFS, though there is obviously some overlap with other file systems.

Three minor points:

  • Technically, the three escaped characters (?*|) should not need escaping inside a class, at least according to regular-expressions.info, but I can say that the class does work as intended in beets.
  • Keep in mind that the characters that are illegal on FAT32 are a superset of those mentioned here. E.g. a song with a plus sign in its name, if not explicitly replaced by the user, should produce an error when written to a FAT32 partition.
  • OS and file system rules may not be entirely independent of each other. A quick test creating empty files (using touch) called *, :, and <> on an NTFS file system mounted in Linux using ntfs-3g produced perfectly fine and usable files… on Linux. Booting into Windows and trying to access the folder they were located in resulted in the warning “The file or directory is corrupted and unreadable”.
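The first of those points can be checked outside beets. This small sketch, using Python’s re module (which beets itself is built on), compares the class with and without the backslashes against every ASCII character; the constant names and sample string are made up for the example.

```python
import re

ESCAPED = re.compile(r'[<>:"\?\*\|]')  # as written in the beets default
UNESCAPED = re.compile(r'[<>:"?*|]')   # same class without the backslashes

# Collect the ASCII characters each class matches.
escaped_hits = {chr(c) for c in range(128) if ESCAPED.fullmatch(chr(c))}
unescaped_hits = {chr(c) for c in range(128) if UNESCAPED.fullmatch(chr(c))}

print(sorted(escaped_hits))
print(escaped_hits == unescaped_hits)

sample = 'What? <Who>: "Me*" | You'
print(ESCAPED.sub('_', sample) == UNESCAPED.sub('_', sample))
```

Both classes match the same eight characters, so the extra backslashes are harmless but not needed.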

Regardless of the file system, Windows as an OS also has rules about what is permissible/creatable.

In Windows utilities, the space and the period are not allowed as the final character of a filename. The period is allowed as the first character, but some Windows applications, such as Windows Explorer, forbid creating or renaming such files (despite this convention being used in Unix-like systems to describe hidden files and directories).

https://en.wikipedia.org/wiki/Filename#In_Windows

This explains these lines from the defaults:

'\.$': _
'\s+$': ''
'^\.': _

But not these lines prohibiting whitespace and hyphens at the beginning of the filename, for which I have found no reference:

'^\s+': ''
'^-': _

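Probably quite sensible, anyway. I was unable to create a file on ext4/Linux with a starting hyphen no matter how many quotes or escapes I used, though that was a matter of the programs (touch and vim) misunderstanding my intentions rather than of the file system: they read a leading hyphen as the start of a command-line option. Prefixing the name with ./ or putting -- before it (touch -- -name) gets around that.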
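That the file system itself has no objection to a leading hyphen can be shown by creating the file without going through a command-line tool. A minimal sketch using only the Python standard library; the function name and the file name -intro.mp3 are made up, and the file is created in a throwaway temporary directory.

```python
import tempfile
from pathlib import Path


def create_hyphen_file(name='-intro.mp3'):
    """Create `name` in a temporary directory; report its name and existence."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / name
        target.touch()  # no option parsing involved, unlike the touch command
        return target.name, target.exists()


print(create_hyphen_file())
```

The file is created without complaint, which suggests the default '^-' rule is there to protect you from command-line tools rather than from the file system.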

In case there was any doubt, the rules all prescribe replacing the matched characters with either an underscore (_) or nothing (''). Linux as an operating system does not seem to have any restriction that is not already included in those imposed by Windows or NTFS. It even allows characters such as the tilde, although that has special meaning when standing alone (commonly used to designate a backup of an open file). As pointed out by the beets docs, even if you don’t use Windows, many of those restrictions will also apply to Windows file sharing protocols and the like, and you never know when you need to copy your collection onto an NTFS formatted USB drive. Better to adhere to MS’ conventions from the start than having to do manual renaming after a transfer breaks down.

Percent-encoding for URIs

What other possible reasons could we have to avoid characters than them being illegal by filesystem or OS? If you have ever tried downloading a file with lots and lots of %20’s, you know the answer. Any URI, like an FTP or HTTP address, requires the following set of characters to be percent-encoded because they are reserved:

!  *  '  (  )  ;  :  @  &  =  +  $  ,  /  ?  #  [  ]

The list doesn’t explicitly mention spaces but clearly a URI has to be one uninterrupted string. I don’t use FTP for file sharing and my days as a music blogger with… questionable sharing of files to boot are behind me. But I really do hate those percentage signs and you never know. So I am going to incorporate these into my blocklist.
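For a feel of what percent-encoding does to a file name, here is a short sketch using urllib.parse.quote from the Python standard library. The helper name percent_encode and the sample names are invented; passing safe='' is a choice made here so that the forward slash gets encoded too.

```python
from urllib.parse import quote


def percent_encode(name):
    """Percent-encode everything except letters, digits and '_', '.', '-', '~'."""
    return quote(name, safe='')


print(percent_encode('Hello, World!.mp3'))       # Hello%2C%20World%21.mp3
print(percent_encode('Daft_Punk-Homework.mp3'))  # unchanged
```

A name built only from letters, digits, underscores, periods and hyphens passes through untouched, which is the whole point of the blocklist.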

Escaped characters in Bash

My final criterion is probably the most idiosyncratic and the least useful. When operating on files in the shell, Bash requires certain characters to be escaped in order for them to be understood as literal characters. Similar to percent-encoding, this is most commonly seen with spaces (unless you use quotes around the filename). What are the downsides of having to escape characters on the command line? None, really. A bit more typing. A slightly more unsightly command line. But hey, with all the characters in previous lists, the escaped characters in Bash that haven’t already been exiled are few and far between.

The best resource I have found is this Stack Overflow answer that goes through all ASCII characters and notes whether they require escaping (with a capital E) or not (with a hyphen):

00 E ''         1A E $'\032'    34 - 4          4E - N          68 - h      
01 E $'\001'    1B E $'\E'      35 - 5          4F - O          69 - i      
02 E $'\002'    1C E $'\034'    36 - 6          50 - P          6A - j      
03 E $'\003'    1D E $'\035'    37 - 7          51 - Q          6B - k      
04 E $'\004'    1E E $'\036'    38 - 8          52 - R          6C - l      
05 E $'\005'    1F E $'\037'    39 - 9          53 - S          6D - m      
06 E $'\006'    20 E \          3A - :          54 - T          6E - n      
07 E $'\a'      21 E \!         3B E \;         55 - U          6F - o      
08 E $'\b'      22 E \"         3C E \<         56 - V          70 - p      
09 E $'\t'      23 E \#         3D - =          57 - W          71 - q      
0A E $'\n'      24 E \$         3E E \>         58 - X          72 - r      
0B E $'\v'      25 - %          3F E \?         59 - Y          73 - s      
0C E $'\f'      26 E \&         40 - @          5A - Z          74 - t      
0D E $'\r'      27 E \'         41 - A          5B E \[         75 - u      
0E E $'\016'    28 E \(         42 - B          5C E \\         76 - v      
0F E $'\017'    29 E \)         43 - C          5D E \]         77 - w      
10 E $'\020'    2A E \*         44 - D          5E E \^         78 - x      
11 E $'\021'    2B - +          45 - E          5F - _          79 - y      
12 E $'\022'    2C E \,         46 - F          60 E \`         7A - z      
13 E $'\023'    2D - -          47 - G          61 - a          7B E \{     
14 E $'\024'    2E - .          48 - H          62 - b          7C E \|     
15 E $'\025'    2F - /          49 - I          63 - c          7D E \}     
16 E $'\026'    30 - 0          4A - J          64 - d          7E E \~     
17 E $'\027'    31 - 1          4B - K          65 - e          7F E $'\177'
18 E $'\030'    32 - 2          4C - L          66 - f      
19 E $'\031'    33 - 3          4D - M          67 - g      
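The printable part of that table can be turned into a small helper. The sketch below hard-codes the characters marked E between 0x20 and 0x7E and backslash-escapes them; NEEDS_ESCAPE and backslash_escape are names made up for the example. For comparison it also calls shlex.quote from the standard library, which wraps a name in single quotes instead of adding backslashes and uses its own, slightly different idea of which characters are safe.

```python
import shlex

# Printable characters marked "E" in the table above (0x20-0x7E).
NEEDS_ESCAPE = set(' !"#$&\'()*,;<>?[\\]^`{|}~')


def backslash_escape(name):
    """Prefix every character from the table's E set with a backslash."""
    return ''.join('\\' + ch if ch in NEEDS_ESCAPE else ch for ch in name)


print(backslash_escape('Hello World (Live).mp3'))
print(backslash_escape('02-track_name.mp3'))  # nothing to escape
print(shlex.quote('!!!'))                     # quoted rather than escaped
print(shlex.quote('02-track_name.mp3'))       # returned as-is
```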

What remains

Right, so what’s left of the poor, decimated ASCII set? Can I even still use letters and numbers? Well, yes, fortunately. But truth be told, not much else. Here is a table of the ASCII set with a column for each criterion, showing which characters must be purged to comply with all of them. A 1 signifies that the character in question is not permitted by the criterion. A checkmark in the Allowed column signifies that a character hasn’t been banned by any of the three criteria. I have left out all the control characters, including 127, for ease of reading. Note that this table does not account for the positional rules for periods and hyphens. Windows and Linux both allow periods in filenames, obviously, but Linux does not allow the specific filenames ‘.’ and ‘..’, and Windows does not allow filenames ending in ‘.’.

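Dec  Hex  Char   NTFS  URI-reserved  BASH esc.  Allowed
32   20   Space        1             1
33   21   !            1             1
34   22   "      1                   1
35   23   #            1             1
36   24   $            1             1
37   25   %            1
38   26   &            1             1
39   27   '            1             1
40   28   (            1             1
41   29   )            1             1
42   2A   *      1     1             1
43   2B   +            1
44   2C   ,            1             1
45   2D   -                                     ✓
46   2E   .                                     ✓
47   2F   /      1     1
48   30   0                                     ✓
49   31   1                                     ✓
50   32   2                                     ✓
51   33   3                                     ✓
52   34   4                                     ✓
53   35   5                                     ✓
54   36   6                                     ✓
55   37   7                                     ✓
56   38   8                                     ✓
57   39   9                                     ✓
58   3A   :      1     1
59   3B   ;            1             1
60   3C   <      1                   1
61   3D   =            1
62   3E   >      1                   1
63   3F   ?      1     1             1
64   40   @            1
65   41   A                                     ✓
66   42   B                                     ✓
67   43   C                                     ✓
68   44   D                                     ✓
69   45   E                                     ✓
70   46   F                                     ✓
71   47   G                                     ✓
72   48   H                                     ✓
73   49   I                                     ✓
74   4A   J                                     ✓
75   4B   K                                     ✓
76   4C   L                                     ✓
77   4D   M                                     ✓
78   4E   N                                     ✓
79   4F   O                                     ✓
80   50   P                                     ✓
81   51   Q                                     ✓
82   52   R                                     ✓
83   53   S                                     ✓
84   54   T                                     ✓
85   55   U                                     ✓
86   56   V                                     ✓
87   57   W                                     ✓
88   58   X                                     ✓
89   59   Y                                     ✓
90   5A   Z                                     ✓
91   5B   [            1             1
92   5C   \      1                   1
93   5D   ]            1             1
94   5E   ^                          1
95   5F   _                                     ✓
96   60   `                          1
97   61   a                                     ✓
98   62   b                                     ✓
99   63   c                                     ✓
100  64   d                                     ✓
101  65   e                                     ✓
102  66   f                                     ✓
103  67   g                                     ✓
104  68   h                                     ✓
105  69   i                                     ✓
106  6A   j                                     ✓
107  6B   k                                     ✓
108  6C   l                                     ✓
109  6D   m                                     ✓
110  6E   n                                     ✓
111  6F   o                                     ✓
112  70   p                                     ✓
113  71   q                                     ✓
114  72   r                                     ✓
115  73   s                                     ✓
116  74   t                                     ✓
117  75   u                                     ✓
118  76   v                                     ✓
119  77   w                                     ✓
120  78   x                                     ✓
121  79   y                                     ✓
122  7A   z                                     ✓
123  7B   {                          1
124  7C   |      1                   1
125  7D   }                          1
126  7E   ~                          1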
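The table can be cross-checked by computing the survivors directly. This sketch encodes the three criteria as Python sets (the NTFS characters from the beets defaults, the URI list from above plus the space and the percent sign, and the printable E characters from the Bash table) and subtracts them from printable ASCII. The variable names are made up for the example.

```python
import string

NTFS = set('"*/:<>?\\|')
URI = set(" !*'();:@&=+$,/?#[]%")           # reserved list plus space and %
BASH = set(' !"#$&\'()*,;<>?[\\]^`{|}~')    # printable "E" characters

banned = NTFS | URI | BASH

# Printable ASCII is 32..126; control characters are left out, as in the table.
allowed = ''.join(chr(c) for c in range(32, 127) if chr(c) not in banned)

print(allowed)
print(len(allowed))  # 65
```

What survives is the hyphen, the period, the underscore, the ten digits and the 52 letters: 65 characters in all.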

How to write a ‘replace’ setting

There are a lot of different ways to write combinations of these restrictions in the beets configuration. Just to illustrate, I’m going to use three different ways to show three differing levels of restrictions. Obviously it also depends on how you would like beets to handle the forbidden characters (i.e. should they all be converted into the same character or treated differently). There are plenty of good cases to be made for special treatment, e.g. converting “&” into “and”, but in the following I’m just going to convert everything into underscores.

For the first and simplest restriction – don’t use NTFS-illegal characters – we can really just leave it to the defaults but to be precise the following should suffice:

replace:
    '["*/:<>?\\|]': _

Note that only the backslash itself is escaped.

If we combine the NTFS restrictions with disallowing URI reserved characters, we can see from the table that it would be easier replacing ranges of characters. We do this by designating them with the hexadecimal designations, similar to the way that control characters are removed by default:

replace:
    '[\x20-\x2c]': _
    '[\x3a-\x40]': _
    '[\x5b-\x5d]': _
    '[/|]': _

The forward slash and the pipe don’t fall into any of the ranges so they need to be individually mentioned.
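Ranges are easy to get subtly wrong, so here is a quick Python check that the four patterns above ban exactly the NTFS and URI characters and nothing else. The two sets repeat the criteria used earlier in this post (with space and % counted as URI-problematic); all names are invented for the example.

```python
import re

RANGE_RULES = [r'[\x20-\x2c]', r'[\x3a-\x40]', r'[\x5b-\x5d]', r'[/|]']
combined = re.compile('|'.join(RANGE_RULES))

NTFS = set('"*/:<>?\\|')
URI = set(" !*'();:@&=+$,/?#[]%")

# Every printable ASCII character that at least one rule would replace.
banned = {chr(c) for c in range(32, 127) if combined.fullmatch(chr(c))}

print(sorted(banned - (NTFS | URI)))  # caught by the ranges but not required
print(sorted((NTFS | URI) - banned))  # required but missed by the ranges
```

Both lists come out empty, so the ranges neither miss a character nor ban an innocent one.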

As I mentioned at the beginning, in some cases it might be easier to tell beets what characters to allow, rather than list those banned. But how would you do that in the replace setting? Well, strictly speaking you can’t. What you can do is tell it to replace everything but a group of characters with one specific character. The way to designate an everything-but group in regex is of course starting your class with a circonflexe (that’s the “hat” character to barbarians, sorry, non-French speakers). To turn everything but the “Allowed” set from the table into underscores would require:

replace:
    '[^A-Za-z0-9._\-]' : _

One of the advantages of this approach is that you don’t need to worry about control characters because they are easily included in the ‘everything but’ set. With the other replace setting you do need to explicitly add the control characters as the default ‘replace’ setting is overwritten by your own. The main disadvantage is that there is no easy way to treat characters differently, e.g. turning single quotes into no character at all.
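A quick way to try the allow-list pattern outside beets is with Python’s re module. In this sketch keep_allowed is a made-up helper and the sample names are invented; it assumes beets applies the pattern with a plain substitution.

```python
import re

ALLOW_LIST = re.compile(r'[^A-Za-z0-9._\-]')  # anything NOT in the allowed set


def keep_allowed(name):
    """Replace every character outside the allowed set with an underscore."""
    return ALLOW_LIST.sub('_', name)


print(keep_allowed('Tom & Jerry (Live!).mp3'))  # Tom___Jerry__Live__.mp3
print(repr(keep_allowed('Bell\x07Track')))      # control characters go too
```

Note how every banned character becomes its own underscore, so runs of them pile up. That is the price of treating all characters alike.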

As mentioned, regex classes usually don’t need as much escaping as people assume. I did come across one particular issue with regex-classes-inside-of-YAML that I thought merited a mention. Regex does not require escaping a single quote inside of a character class, but inside a single-quoted YAML string an unescaped single quote ends the string and the YAML parser is going to get lost. However, the correct way of escaping the quote is not a backslash but:

replace:
     '''': _

Yeah, a preceding quote. Which looks weird but I do believe it works as intended.

Note that I have tested all of the above without any errors but with a limited set of some 1000 mainly English language songs from bands that usually don’t stray too far into mad naming ideas (looking at you, Out Hud). YMMV.

Conclusion

When combining character restrictions of file systems, the internet, the shell, regex and YAML, it’s easy to get confused and I’m not at all sure that I have covered or spotted the many interesting ways that it can go wrong. Please leave a comment if you come across any and I’ll update the post.

In this post I have tried to lay out how to implement restrictions and have completely avoided the question of whether or not it makes sense to do so. The harmonizing effect of asciify makes files easier to handle but it does also reduce information and leads to a duller and more homogeneous looking collection. Maybe the delight of kana and kanji or some accents is worth the trouble they cause, I’m not sure. The only thing I am sure of is that the noughties dance punk band “!!!” really should have considered that “!!” on the shell recalls your last command. Thanks, guys! /s

Close-Up Photo of Beetroots on Black Background © Eva Bronzini, Pexels License
