Regular Expressions in Bash (and Alternatives)
2010-11-03
While cleaning up some old bash code and preparing tools-osx for release, I happened across a very useful bit of information: bash does support regular expressions! Well, at least bash 3.0 and newer do.
I first learned regular expressions in Perl, so I’ve pined for =~ in other scripting languages ever since. With bash, I, like most others, get by most of the time by piping things through grep for matching and sed for replacements, but the bane of my existence has always been capturing groups (capturing parentheses).
For example, let’s say we want to grab just the volume name out of a path like /Volumes/Macintosh HD/Users/Shared/. The following regular expression would be perfect for that:
^/Volumes/([^/]+)
That says match a string that starts with “/Volumes/” followed by one or more characters that are not “/” (capturing the one or more characters that are not “/”). So, if we were to match that against the aforementioned example path, it would capture:
Macintosh HD
Well, now I know that you can do this using bash 3.0+’s built-in regular expression support:
if [[ "/Volumes/Macintosh HD/Users/Shared/" =~ ^/Volumes/([^/]+) ]]; then
    vol="${BASH_REMATCH[1]}"
fi
Very straightforward for those who are familiar with regular expressions. However, it took me a while to get that to even work. Why? I assumed that I needed to quote the regular expression (in bash, quoting is extremely important). The first tutorial I was going by pulled the regex from command line input and used it from a variable, so that offered little evidence for or against quoting the regular expression, but another that I found clearly was quoting the regular expression. Eventually I read the comments on the latter tutorial: some readers found that the regular expression worked in single quotes, and others found that it had to be left unquoted.
For me, on Mac OS X 10.5 Leopard, bash regular expressions have to be left unquoted.
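The conflicting reports are most likely a version difference: starting with bash 3.2, any quoted part of the pattern on the right of =~ is matched as a literal string, while bash 3.0 and 3.1 still treated a quoted pattern as a regular expression. One way to sidestep the question is to keep the pattern in a variable and expand it unquoted. Here is a minimal sketch of that; the variable names path, re, and full are my own, not from the tutorials mentioned above.

```shell
path="/Volumes/Macintosh HD/Users/Shared/"
re='^/Volumes/([^/]+)'   # single quotes here only protect the assignment

full=""
vol=""
if [[ "$path" =~ $re ]]; then      # $re is deliberately left unquoted
    full="${BASH_REMATCH[0]}"      # the whole matched text
    vol="${BASH_REMATCH[1]}"       # the first capture group
fi
printf 'matched: %s\nvolume: %s\n' "$full" "$vol"
```

BASH_REMATCH[0] holds everything the pattern matched, and the numbered entries after it hold the capture groups in order.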
Note: bash 3.0+’s built-in regular expressions are, like grep -E or egrep, POSIX extended regular expressions, not full Perl-compatible regular expressions, so make sure you understand the differences in syntax.
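As a small illustration of the syntax difference, Perl shorthands such as \d are not part of POSIX extended regular expressions, while bracket classes such as [[:digit:]] are. The sketch below uses a made-up device name (disk2s1) and hypothetical variable names, so treat it as an example of the syntax rather than anything from the original scripts.

```shell
dev="disk2s1"                               # made-up sample input
re='^disk([[:digit:]]+)s([[:digit:]]+)$'    # POSIX class where Perl would use \d

disk=""
slice=""
if [[ "$dev" =~ $re ]]; then
    disk="${BASH_REMATCH[1]}"               # first group: the disk number
    slice="${BASH_REMATCH[2]}"              # second group: the slice number
fi
printf 'disk: %s, slice: %s\n' "$disk" "$slice"
```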
So, now comes the big caveat with all of this newfound power and why it’s taken so long for me to discover it: bash 3.0 and newer have only started becoming common in the last few years, so the feature is not widely supported yet. I looked through the Mac OS X source code and found that only Mac OS X 10.5 Leopard and 10.6 Snow Leopard have included a version of bash newer than version 3.0. Mac OS X 10.4 Tiger (including 10.4.11) and earlier all had bash 2.05 or earlier. So, you should really only use bash’s built-in regular expression support if you know the environment will have version 3.0 or newer.
I know, it certainly dashed my hopes a bit too.
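If a script may land on an older system, one option is to check the running version before relying on =~. This is only a sketch: have_bash_regex is a name I made up, and it relies on the BASH_VERSINFO array that bash sets for itself. Keep in mind that a bash older than 3.0 may still fail to parse a [[ ... =~ ... ]] line it encounters, so the regex code would need to be kept out of its way (for example, in a separate file that is only sourced after the check passes).

```shell
# have_bash_regex is a hypothetical helper name.
# It succeeds when the running bash reports major version 3 or newer.
have_bash_regex() {
    [ -n "${BASH_VERSINFO[0]}" ] && [ "${BASH_VERSINFO[0]}" -ge 3 ]
}

if have_bash_regex; then
    echo "bash ${BASH_VERSION}: =~ is available"
else
    echo "bash is too old for =~; fall back to another approach"
fi
```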
In Which We Come to Understand an Alternative
However, all is not lost: there is a rudimentary alternative in read. It’ll never be as powerful as regular expressions, but it can allow simple captures like the example discussed above. Let me just throw you into the deep end and see if I can then explain how to swim.
Again, here’s that bash regular expression code snippet I came up with to parse the volume name out of a path:
if [[ "/Volumes/Macintosh HD/Users/Shared/" =~ ^/Volumes/([^/]+) ]]; then
    vol="${BASH_REMATCH[1]}"
fi
And here’s that same capture using read:
IFS=/ read -r -d '' _ _ vol _ <<< "/Volumes/Macintosh HD/Users/Shared/"
Wow, it’s certainly more compact, but it doesn’t look like it contains much actual functionality, right? Just a couple switches and some underscores.
Let’s step through it, argument by argument:
IFS=/ – Characters found in $IFS are word delimiters, so we’re setting our delimiter to “/”.
read – Well, that’s the read command we’re calling to pull all this off.
-r – Specify “raw” input (no backslash escaping).
-d '' – Use an empty string as the line delimiter instead of a newline. Bash treats that as “read until a NUL character”, and since our input contains none, read consumes the entire input.
_ _ vol _ – This is the confusing part: it’s actually where we tell read which variable to store each field in. Let’s break it down further:
    _ – The first character of our input string is a “/” (and so is our delimiter), so the first field is going to be an empty string (everything between the start and the first “/”, i.e. nothing), so we’ll just dump that in $_ to discard it.
    _ – The second field is going to be “Volumes” (everything between the first “/” and the second “/”), but we don’t care about that either, so discard it into $_ as well.
    vol – The third field (everything between the second “/” and third “/”) is what we’re actually looking for (the volume name), so we’ll store that in $vol.
    _ – The fourth field and everything after it (the last variable receives the rest of the input) is also nothing we care about, so toss that into $_ too.
<<< – This is a bash “here string” operator; it indicates that the following string be sent as standard input to the command.
"/Volumes/Macintosh HD/Users/Shared/" – This is the string we want to run through read to capture from.
Putting it back together a bit, we’d have something like this:
IFS=/ – Split on the “/”.
read -r -d '' _ _ vol _ – Store the 3rd field in $vol.
<<< "/Volumes/Macintosh HD/Users/Shared/" – From the string “/Volumes/Macintosh HD/Users/Shared/”.
And, just like the regular expression code, we end up with the following match stored in $vol:
Macintosh HD
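To see the field splitting for yourself, here is a small sketch that gives every field a real variable name instead of discarding into $_. The names first, second, and rest are mine, purely for illustration. Because read reaches the end of the input without ever seeing its delimiter, it returns a non-zero status, hence the || true.

```shell
# first, second, and rest are illustrative names; the post discards these into $_.
# read returns non-zero at end of input, so "|| true" keeps "set -e" scripts alive.
IFS=/ read -r -d '' first second vol rest <<< "/Volumes/Macintosh HD/Users/Shared/" || true

printf 'first:  [%s]\n' "$first"    # empty: nothing precedes the leading "/"
printf 'second: [%s]\n' "$second"   # Volumes
printf 'vol:    [%s]\n' "$vol"      # Macintosh HD
printf 'rest:   [%s]\n' "$rest"     # remainder of the input, starting with Users/Shared/
```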
Okay, you may have caught on that that read example was not actually the exact same capture as the regular expression was. Here’s why: the string doesn’t have to start with “/Volumes/”. We could’ve matched against “/Users/Shared/” and it would’ve captured “Shared”. That’s not going to cut it!
Fortunately, we could just wrap the call to read with a string comparison of the first nine characters of the path name (offset 0, length 9) against “/Volumes/”, like so:
path="/Volumes/Macintosh HD/Users/Shared/"
if [ "${path:0:9}" = "/Volumes/" ]; then
    IFS=/ read -r -d '' _ _ vol _ <<< "$path"
fi
Not so scary now, I hope, and far more backwards compatible with older versions of bash.
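If you need this in more than one place, the check and the read can be wrapped in a function. This is a sketch under my own assumptions: volume_name is a hypothetical name, and the newline stripping is there because a here string appends a newline, which would otherwise end up in $vol when the path has nothing after the volume name.

```shell
# volume_name is a hypothetical helper name, not from the original scripts.
# Prints the volume name for paths under /Volumes/, otherwise returns non-zero.
volume_name() {
    local path="$1" vol=""
    # Only paths that start with /Volumes/ qualify (same check as the snippet above).
    if [ "${path:0:9}" != "/Volumes/" ]; then
        return 1
    fi
    # read hits end of input before any delimiter and returns non-zero;
    # "|| true" keeps that from aborting scripts that use "set -e".
    IFS=/ read -r -d '' _ _ vol _ <<< "$path" || true
    # The here string appends a newline; strip it if it landed in $vol.
    vol="${vol%$'\n'}"
    [ -n "$vol" ] && printf '%s\n' "$vol"
}

volume_name "/Volumes/Macintosh HD/Users/Shared/"
volume_name "/Users/Shared/" || echo "not under /Volumes/"
```

The first call prints the volume name, and the second falls through to the message because the prefix check fails.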
If you’re looking to capture from a string that can be reasonably split on a delimiter, like we did with the “/”, read is an excellent alternative to regular expressions (esp. when paired with other string comparisons). That said, if you know you can rely on having bash 3.0+, by all means, use the regular expressions!
Great post, just one remark: all your commands are fine, but you forget to start the volume string with a “/” in some examples.
André Neves
3452 days ago
@André I’m glad you liked this post.
I definitely see the missing leading “/” in a number of the examples. Naturally, it has been a few years since I wrote it, so I’m not entirely sure whether any of them were intentional (they don’t seem to be), but I’ll review & update them.
Morgan Aldridge
3448 days ago