Renaming music files with beets – and avoiding troublesome characters

The joke about music collection rearranging in the movie High Fidelity (and/or the novel – it’s been a long time since I saw and read them respectively) is that it’s a ritual undertaken in times of emotional stress, asserting power over something, anything, just to feel in control. I guess June 2020 should automatically qualify as emotional stress, no matter who you are. My real impetus, however, was just the need to embed album art into the files because my new favourite Subsonic substitute, Navidrome, does not as yet support the folder.jpg/cover.jpg convention.

This led me back to a GitHub project starred long ago (was it “beans”? “radishes”? Something to do with root vegetables, right?) and so I spent most of a Sunday ploughing through beets’ excellent documentation. The deepest dive was reserved, however, for working out what set of characters I think should be used for filenames and why. There is very little of practical use in this – beets’ default settings should be sufficient to avoid most file naming issues – but I found it weirdly fascinating reading about reserved characters.

Beets’ file naming comes from three sources: the asciify_paths setting, the path_format, i.e. the structure of the path (what elements to use, what functions to apply to them), and the replace setting. The path_format setting is not the place to determine which characters are allowed and which are banned, so I’m going to leave that one out of the discussion here.

asciify_paths uses the unidecode library to turn Unicode characters into something that can be contained within the ASCII set. For instance, a Danish ‘ø’ becomes ‘o’ and a French ‘é’ becomes just plain ‘e’. In the Linux world of today there is very little standing in the way of Unicode file names, but for backwards and keyboard compatibility – entering Japanese kanji on a non-Japanese keyboard is tricky – I do enable asciify_paths. This also narrows down the character set to the point where enumerating the allowed characters, as opposed to listing those banned, actually becomes a viable strategy.

This leaves the replace setting which is where the magic happens. This is a series of lines of matches and replacements:

replace:
    'regex_of_match' : character_to_replace_match_with

Whenever any part of the path_format matches the regular expression on the left hand side, it is replaced by the right hand side. The default setting is this:

replace:
    '[\\/]': _
    '^\.': _
    '[\x00-\x1f]': _
    '[<>:"\?\*\|]': _
    '\.$': _
    '\s+$': ''
    '^\s+': ''
    '^-': _

These rules ensure that the file names that beets produces avoid characters prohibited by common filesystems.
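To get a feel for what these rules do, here is a minimal Python sketch that applies them one after another to a single path component. The helper name sanitize_component is made up for illustration, and the assumption that every rule is applied in order to each component is my reading of how beets behaves, not a copy of its code.

```python
import re

# The default rules quoted above, as (pattern, replacement) pairs.
# Each rule sees the output of the previous one.
DEFAULT_REPLACE = [
    (r'[\\/]', '_'),
    (r'^\.', '_'),
    (r'[\x00-\x1f]', '_'),
    (r'[<>:"\?\*\|]', '_'),
    (r'\.$', '_'),
    (r'\s+$', ''),
    (r'^\s+', ''),
    (r'^-', '_'),
]


def sanitize_component(name, rules=DEFAULT_REPLACE):
    """Hypothetical helper: apply every rule in turn to one path component."""
    for pattern, replacement in rules:
        name = re.sub(pattern, replacement, name)
    return name


print(sanitize_component('AC/DC'))                  # AC_DC
print(sanitize_component('...Baby One More Time'))  # _..Baby One More Time
print(sanitize_component('Who Are You?'))           # Who Are You_
```

Note that '^\.' only touches the very first period, so a title starting with an ellipsis keeps two of its three dots.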

OS and file systems rules

[\x00-\x1f] are the first 32 characters (nos. 0-31) of the ASCII set, known as the control characters. These do not produce visible output, so it’s hard to see what value they would provide, and they are explicitly disallowed in file names by both the FAT* and NTFS file systems. According to Wikipedia, most Linux file systems are more permissive, banning only the first control character, \x00, also known as null. Strangely, beets’ default does not ban the solitary control character at the end of the ASCII set, \x7f (127 in decimal), though this too would make an illegal file name on FAT/NTFS.
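A quick check in Python (whose regex engine beets uses) confirms that the default class lets \x7f slip through. The extended class in the second pattern is my own suggestion, not something beets ships with.

```python
import re

default_ctrl = re.compile(r'[\x00-\x1f]')        # the beets default
extended_ctrl = re.compile(r'[\x00-\x1f\x7f]')   # my addition, not a beets default

name = 'Track\x7fOne'
slips_through = default_ctrl.search(name) is None
cleaned = extended_ctrl.sub('_', name)

print(slips_through)   # True
print(cleaned)         # Track_One
```

If you want the same behaviour in beets, the corresponding replace line would presumably be '[\x00-\x1f\x7f]': _ but I have not tested that beyond the regex itself.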

Beets’ default incorporates some further restriction from NTFS:

'[\\/]': _
...
'[<>:"\?\*\|]': _

So no back- or forward slashes. In case you’re wondering: The first class consists of an escaped backslash ‘\\’ and a forward slash. It can be quite tricky keeping track of what YAML (beets’ config format of choice) and regular expressions require for referencing various reserved characters in addition to the file naming issues. For now, just keep in mind that inside regex classes, the following characters need escaping:

^-]\

We will get to issues with YAML later on. The second handful of characters – pointy brackets, colons etc. – are also specifically those deemed unacceptable by NTFS, though there is obviously some overlap with other file systems.

Three minor points:

Technically, the three escaped characters (?*|) should not need escaping, at least according to regular-expressions.info, but I can say that the class does work as intended in beets.
Keep in mind that the set of characters that are illegal on FAT32 is a superset of those mentioned here. E.g. a song with a plus sign, if not explicitly replaced by the user, should produce an error when written to a FAT32 partition.
OS and file system rules may not be entirely independent of each other. A quick test creating empty files (using touch) called *, :, and <> on an NTFS file system mounted in Linux using ntfs-3g produced perfectly fine and usable files… on Linux. Booting into Windows and trying to access the folder they were located in resulted in the warning “The file or directory is corrupted and unreadable”.
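The first of these points is easy to verify. The sketch below compares the class as written in the beets default with the same class stripped of its backslashes, checking each against every printable ASCII character.

```python
import re

escaped = re.compile(r'[<>:"\?\*\|]')   # as written in the beets default
bare = re.compile(r'[<>:"?*|]')         # same class without the backslashes

printable = [chr(c) for c in range(32, 127)]
matched_escaped = {ch for ch in printable if escaped.fullmatch(ch)}
matched_bare = {ch for ch in printable if bare.fullmatch(ch)}

print(matched_escaped == matched_bare)     # True
print(''.join(sorted(matched_escaped)))    # "*:<>?|
```

So, in Python’s regex flavour at least, the extra backslashes are harmless but redundant.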

Regardless of the file system, Windows as an OS also has rules about what is permissible/creatable.

In Windows utilities, the space and the period are not allowed as the final character of a filename. The period is allowed as the first character, but some Windows applications, such as Windows Explorer, forbid creating or renaming such files (despite this convention being used in Unix-like systems to describe hidden files and directories)
https://en.wikipedia.org/wiki/Filename#In_Windows

This explains these lines from the defaults:

'\.$': _
'\s+$': ''
'^\.': _

But not these lines prohibiting whitespace and hyphens at the beginning of the filename, for which I have found no reference:

'^\s+': ''
'^-': _

Probably quite sensible, anyway. I was unable to create a file on ext4/Linux with a leading hyphen no matter how many quotes or escapes I used, though it felt like it was more a matter of the programs (touch and vim) misunderstanding my intentions.
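That hunch seems to hold up: command line tools read a leading hyphen as the start of an option, and quoting does not help because the quotes are gone by the time the program sees its arguments. From the shell, touch -- -intro.mp3 or touch ./-intro.mp3 gets around it. The small sketch below sidesteps the shell altogether and suggests that the file system itself has no objection (the file name is just an example).

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / '-intro.mp3'
    target.touch()   # no shell involved, so nothing is parsed as an option
    created = [p.name for p in Path(tmp).iterdir()]

print(created)   # ['-intro.mp3']
```

Which is a decent argument for keeping the '^-' rule: the file is legal, but nearly every command line tool will trip over it.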

In case there was any doubt, the rules all prescribe replacing the matched characters with either an underscore (_) or nothing (''). Linux as an operating system does not seem to have any restriction that is not already included in those imposed by Windows or NTFS, even allowing characters such as the tilde even though it has special meaning when standing alone (commonly used to designate a backup of an open file). As pointed out by the beets docs, even if you don’t use Windows, many of those restrictions will also apply to Windows file sharing protocols and the like, and you never know when you need to copy your collection onto an NTFS formatted USB drive. Better to adhere to MS’ conventions from the start than having to do manual renaming after a transfer breaks down.

Percent-encoding for URIs

What other possible reasons could we have to avoid characters, other than them being illegal on a filesystem or OS? If you have ever tried downloading a file with lots and lots of %20’s, you know the answer. In any URI, such as an FTP or HTTP address, the following characters are reserved and have to be percent-encoded whenever they appear as literal data, e.g. as part of a file name:

!  *  '  (  )  ;  :  @  &  =  +  $  ,  /  ?  #  [  ]

The list doesn’t explicitly mention spaces but clearly a URI has to be one uninterrupted string. I don’t use FTP for file sharing and my days as a music blogger with… questionable sharing of files to boot are behind me. But I really do hate those percentage signs and you never know. So I am going to incorporate these into my blocklist.
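To see what this does to a file name, here is a small sketch using quote() from Python’s standard urllib.parse module. The file names are made up, and safe='' is passed so that the forward slash gets encoded too (by default quote() leaves it alone).

```python
from urllib.parse import quote

# A name full of reserved characters and spaces ...
encoded = quote('Rock & Roll (Live).mp3', safe='')
print(encoded)   # Rock%20%26%20Roll%20%28Live%29.mp3

# ... versus one that sticks to letters, digits, '_' and '.'
clean = quote('Rock_and_Roll_Live.mp3', safe='')
print(clean)     # Rock_and_Roll_Live.mp3
```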

Escaped characters in Bash

My final criterion is probably the most idiosyncratic and the least useful. When operating on files in the shell, Bash requires certain characters to be escaped in order for them to be understood as literal characters. Similar to percent encoding, this is most commonly seen with spaces (unless you use quotes around the filename). What are the downsides of having to escape characters on the command line? None, really. A bit more typing. A slightly more unsightly command line. But hey, with all the characters in previous lists, the escaped characters in Bash that haven’t already been exiled are few and far between.

The best resource I have found is this StackOverflow answer that goes through all ASCII characters and notes whether they require escaping (with a capital E) or not (with a hyphen):

00 E ''         1A E $'\032'    34 - 4          4E - N          68 - h      
01 E $'\001'    1B E $'\E'      35 - 5          4F - O          69 - i      
02 E $'\002'    1C E $'\034'    36 - 6          50 - P          6A - j      
03 E $'\003'    1D E $'\035'    37 - 7          51 - Q          6B - k      
04 E $'\004'    1E E $'\036'    38 - 8          52 - R          6C - l      
05 E $'\005'    1F E $'\037'    39 - 9          53 - S          6D - m      
06 E $'\006'    20 E \          3A - :          54 - T          6E - n      
07 E $'\a'      21 E \!         3B E \;         55 - U          6F - o      
08 E $'\b'      22 E \"         3C E \<         56 - V          70 - p      
09 E $'\t'      23 E \#         3D - =          57 - W          71 - q      
0A E $'\n'      24 E \$         3E E \>         58 - X          72 - r      
0B E $'\v'      25 - %          3F E \?         59 - Y          73 - s      
0C E $'\f'      26 E \&         40 - @          5A - Z          74 - t      
0D E $'\r'      27 E \'         41 - A          5B E \[         75 - u      
0E E $'\016'    28 E \(         42 - B          5C E \\         76 - v      
0F E $'\017'    29 E \)         43 - C          5D E \]         77 - w      
10 E $'\020'    2A E \*         44 - D          5E E \^         78 - x      
11 E $'\021'    2B - +          45 - E          5F - _          79 - y      
12 E $'\022'    2C E \,         46 - F          60 E \`         7A - z      
13 E $'\023'    2D - -          47 - G          61 - a          7B E \{     
14 E $'\024'    2E - .          48 - H          62 - b          7C E \|     
15 E $'\025'    2F - /          49 - I          63 - c          7D E \}     
16 E $'\026'    30 - 0          4A - J          64 - d          7E E \~     
17 E $'\027'    31 - 1          4B - K          65 - e          7F E $'\177'
18 E $'\030'    32 - 2          4C - L          66 - f      
19 E $'\031'    33 - 3          4D - M          67 - g
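For a rough programmatic check, Python’s standard shlex.quote() returns a string unchanged if it considers it safe for a POSIX shell and wraps it in single quotes otherwise. Keep in mind that it is a general shell helper rather than a Bash-specific one, and that its notion of a safe character is not identical to the table above (it lets a comma through, for instance). The names below are just examples.

```python
import shlex

names = ['Out_Hud-S.T.R.E.E.T._D.A.D', 'Me and Giuliani', '!!!']
quoted = [shlex.quote(name) for name in names]

for q in quoted:
    print(q)
# Out_Hud-S.T.R.E.E.T._D.A.D
# 'Me and Giuliani'
# '!!!'
```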

What remains

Right, so what’s left of the poor, decimated ASCII set? Can I even still use letters and numbers? Well, yes, fortunately. But truth be told, not much else. Here is a table of the ASCII set with a column for each criterion, showing which characters must be purged to comply with all of them. A 1 signifies that the character in question is not permitted by the criterion. A checkmark in the Allowed column signifies that a character hasn’t been banned by any of the three criteria. I have left out all the control characters, including 127, for ease of reading. Note that this table does not account for the rules surrounding the use of literal periods and hyphens. Windows and Linux both allow periods in filenames, obviously, but Linux does not allow the specific filenames ‘.’ and ‘..’ and Windows does not allow filenames ending in ‘.’.

Dec	Hex	Char	NTFS	URI-reserved	BASH esc.	Allowed
32	20	Space		1	1
33	21	!		1	1
34	22	"	1		1
35	23	#		1	1
36	24	$		1	1
37	25	%		1
38	26	&		1	1
39	27	'		1	1
40	28	(		1	1
41	29	)		1	1
42	2A	*	1	1	1
43	2B	+		1
44	2C	,		1	1
45	2D	-				✓
46	2E	.				✓
47	2F	/	1	1
48	30	0				✓
49	31	1				✓
50	32	2				✓
51	33	3				✓
52	34	4				✓
53	35	5				✓
54	36	6				✓
55	37	7				✓
56	38	8				✓
57	39	9				✓
58	3A	:	1	1
59	3B	;		1	1
60	3C	<	1		1
61	3D	=		1
62	3E	>	1		1
63	3F	?	1	1	1
64	40	@		1
65	41	A				✓
66	42	B				✓
67	43	C				✓
68	44	D				✓
69	45	E				✓
70	46	F				✓
71	47	G				✓
72	48	H				✓
73	49	I				✓
74	4A	J				✓
75	4B	K				✓
76	4C	L				✓
77	4D	M				✓
78	4E	N				✓
79	4F	O				✓
80	50	P				✓
81	51	Q				✓
82	52	R				✓
83	53	S				✓
84	54	T				✓
85	55	U				✓
86	56	V				✓
87	57	W				✓
88	58	X				✓
89	59	Y				✓
90	5A	Z				✓
91	5B	[		1	1
92	5C	\	1		1
93	5D	]		1	1
94	5E	^			1
95	5F	_				✓
96	60	`			1
97	61	a				✓
98	62	b				✓
99	63	c				✓
100	64	d				✓
101	65	e				✓
102	66	f				✓
103	67	g				✓
104	68	h				✓
105	69	i				✓
106	6A	j				✓
107	6B	k				✓
108	6C	l				✓
109	6D	m				✓
110	6E	n				✓
111	6F	o				✓
112	70	p				✓
113	71	q				✓
114	72	r				✓
115	73	s				✓
116	74	t				✓
117	75	u				✓
118	76	v				✓
119	77	w				✓
120	78	x				✓
121	79	y				✓
122	7A	z				✓
123	7B	{			1
124	7C	|	1		1
125	7D	}			1
126	7E	~			1
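The Allowed column can also be derived mechanically. The sketch below transcribes the three criteria as Python sets (the space and the percent sign are added to the URI set by hand, as in the table) and subtracts them from the printable ASCII range.

```python
import string

# Character sets transcribed from the lists and tables in this post.
NTFS = set('"*/:<>?\\|')
URI = set("!*'();:@&=+$,/?#[]") | {' ', '%'}   # space and % added, as in the table
BASH = set(' !"#$&\'()*,;<>?[\\]^`{|}~')

printable = {chr(c) for c in range(32, 127)}
allowed = printable - (NTFS | URI | BASH)

# Everything that survives apart from letters and digits.
extras = sorted(allowed - set(string.ascii_letters + string.digits))
print(extras)         # ['-', '.', '_']
print(len(allowed))   # 65
```

So: 62 letters and digits plus the hyphen, the period and the underscore.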

How to write a ‘replace’ setting

There are a lot of different ways to write combinations of these restrictions in the beets configuration. Just to illustrate, I’m going to use three different ways to show three differing levels of restriction. Obviously it also depends on how you would like beets to handle the forbidden characters (i.e. should they all be converted into the same character or treated differently). There are many good cases to be made for special treatment, e.g. converting “&” into “and”, but in the following I’m just going to convert everything into underscores.

For the first and simplest restriction – don’t use NTFS-illegal characters – we can really just leave it to the defaults but to be precise the following should suffice:

replace:
    '["*/:<>?\\|]': _

Note that only the backslash itself is escaped.

If we combine the NTFS restrictions with disallowing URI reserved characters, we can see from the table that it would be easier replacing ranges of characters. We do this by designating them with the hexadecimal designations, similar to the way that control characters are removed by default:

replace:
    '[\x20-\x2c]': _
    '[\x3a-\x40]': _
    '[\x5b-\x5d]': _
    '[/|]': _

The forward slash and the pipe don’t fall into any of the ranges so they need to be individually mentioned.
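As a sanity check, the following sketch runs those four patterns against every character on the NTFS and URI lists (plus the space and the percent sign) and then lists which non-alphanumeric characters survive. It only tests the regular expressions in Python, not beets itself.

```python
import re

rules = [r'[\x20-\x2c]', r'[\x3a-\x40]', r'[\x5b-\x5d]', r'[/|]']
combined = re.compile('|'.join(rules))

must_go = set('"*/:<>?\\|') | set("!*'();:@&=+$,/?#[]") | {' ', '%'}
missed = sorted(ch for ch in must_go if not combined.fullmatch(ch))
print(missed)     # []

# Printable, non-alphanumeric characters that none of the rules touch.
leftover = ''.join(
    chr(c) for c in range(32, 127)
    if not chr(c).isalnum() and not combined.fullmatch(chr(c))
)
print(leftover)   # -.^_`{}~
```

The survivors are the three allowed characters plus the five that only the Bash criterion objects to.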

As I mentioned at the beginning, in some cases it might be easier to tell beets what characters to allow, rather than list those banned. But how would you do that in the replace setting? Well, strictly speaking you can’t. What you can do is tell it to replace everything but a group of characters with one specific character. The way to designate an everything-but group in regex is of course to start your class with a circonflexe (that’s the “hat” character to ~~barbarians~~ non-French speakers). To turn everything but the “Allowed” set from the table into underscores would require:

replace:
    '[^A-Za-z0-9._\-]' : _

One of the advantages of this approach is that you don’t need to worry about control characters because they are easily included in the ‘everything but’ set. With the other replace setting you do need to explicitly add the control characters as the default ‘replace’ setting is overwritten by your own. The main disadvantage is that there is no easy way to treat characters differently, e.g. turning single quotes into no character at all.
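Here is a small sketch of the allow-list pattern at work on a few invented names, including one with control characters and one with a non-ASCII letter (written as the escape \xf6 for ‘ö’).

```python
import re

allow_list = re.compile(r'[^A-Za-z0-9._\-]')

plain = allow_list.sub('_', 'Me and Giuliani (Radio Edit)')
control = allow_list.sub('_', 'Bell\x07 & Whistle\x7f')
accent = allow_list.sub('_', 'Mot\xf6rhead')

print(plain)     # Me_and_Giuliani__Radio_Edit_
print(control)   # Bell____Whistle_
print(accent)    # Mot_rhead
```

The last line suggests that any non-ASCII character still present when the pattern runs is flattened to an underscore as well, which is one more reason to pair this approach with asciify_paths.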

As mentioned, regex character classes usually don’t need as much escaping as people assume. I did come across one particular issue with regex-classes-inside-of-YAML that I thought merited a mention. Regex does not require escaping a single quote inside of a character class, but since the pattern itself is wrapped in single quotes, the YAML parser is going to get lost if you leave it bare. However, the correct way of escaping the quote in YAML is not a backslash but a doubled quote:

replace:
     '''': _

Yeah, a preceding quote. Which looks weird but I do believe it works as intended.

Note that I have tested all of the above without any errors but with a limited set of some 1000 mainly English language songs from bands that usually don’t stray too far into mad naming ideas (looking at you, Out Hud). YMMV.

Conclusion

When combining character restrictions of file systems, the internet, the shell, regex and yaml, it’s easy to get confused and I’m not at all sure that I have covered or spotted the many interesting ways that it can go wrong. Please leave a comment if you come across any and I’ll update the post.

In this post I have tried to lay out how to implement restrictions and have completely avoided the question of whether or not it makes sense to do so. The harmonizing effect of asciify makes files easier to handle but it does also reduce information and leads to a duller and more homogeneous looking collection. Maybe the delight of kana and kanji or some accents is worth the trouble they cause, I’m not sure. The only thing I am sure of is that the noughties dance punk band “!!!” really should have considered that “!!” on the shell recalls your last command. Thanks, guys! /s