Clojure Regex Tutorial
Summary: With a few functions from the standard library, Clojure lets you do most of what you want with regular expressions with no muss. The functions are also available in ClojureScript.
Clojure regexes are host language regexes
Clojure is designed to be hosted. Clojure defers to the host regex syntax and semantics instead of defining a standard that works on all platforms. On the JVM, you're using Java regexes. In ClojureScript, it's JavaScript regexes.
Refer to the host's own documentation for the regex syntax: the java.util.regex.Pattern Javadoc on the JVM, and a JavaScript RegExp reference for ClojureScript.
And you can use Regex 101 for testing out regexes. Be sure to select the language in the menu in the top left. I also use the REPL.
Of course, this difference means that regexes are not always portable. What Clojure does standardize is the set of regex functions in the core library, which work the same way on every platform. Only the syntax and semantics of the regexes themselves vary by host.
Clojure (and ClojureScript) regex syntax
You construct a regex in Clojure using a literal syntax. Strings with a hash sign in front are interpreted as regexes:
#"regex"
On the JVM, the above line will create an instance of java.util.regex.Pattern. In ClojureScript, it will create a RegExp. Remember, the two regular expression languages are similar but different.
This syntax is the most convenient because you don't need to double escape your special characters. For example, if you want to represent the regex string to match a digit, using a Clojure string you would need to write this:
"\\d" ;; regex string to match one digit
Notice that you have to escape the backslash to get a literal backslash in the string. However, regex literals are smart. They don't need to double escape:
#"\d" ;; match one digit
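As a quick check of the escaping rule, here is a small sketch for a JVM REPL. It uses re-pattern (covered near the end of this tutorial) to turn the string into a regex. The var names digit-string and digit-regex are made up for illustration.

```clojure
;; The string form needs two backslashes to produce one.
(def digit-string "\\d")

;; The regex literal needs only one.
(def digit-regex #"\d")

;; On the JVM, converting the literal back to a string gives the same two characters.
(= digit-string (str digit-regex)) ;;=> true
(count digit-string)               ;;=> 2

;; Both spellings match the same input.
(re-matches (re-pattern digit-string) "7") ;;=> "7"
(re-matches digit-regex "7")               ;;=> "7"
```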
re-matches - Matching a regex to a string with groups
Very often, you want to match an entire string. The function to do that in Clojure is called re-matches. re-matches takes a regex and a string, then returns the result of the match.
(re-matches regex string) ;;=> result
The result it returns is a little complex. There are three things it can return.
1. No match returns nil
If the whole string does not match, re-matches returns nil, which is nice because nil is falsey.
(re-matches #"abc" "xyz") ;;=> nil
(re-matches #"abc" "zzzabcxxx") ;;=> nil
(re-matches #"(a)bc" "hello, world") ;;=> nil
2. Matching with no groups returns the matched string
If the string does match, and there are no groups (parens) in the regex, then it returns the matched string.
(re-matches #"abc" "abc") ;;=> "abc"
(re-matches #"\d+" "3324") ;;=> "3324"
Since all strings are truthy, you can use re-matches as the test of a conditional:
(if (re-matches #"\d+" x)
  (println "x is all digits")
  (println "x is not all digits"))
We'll see a more convenient way to test and use the return value below.
3. Matching with groups returns a vector
If it matches and there are groups, then it returns a vector. The first element in the vector is the entire match. The remaining elements are the group matches.
(re-matches #"abc(.*)" "abcxyz") ;;=> ["abcxyz" "xyz"]
(re-matches #"(a+)(b+)(\d+)" "abb234") ;;=> ["abb234" "a" "bb" "234"]
The three different return types can get tricky. However, I usually have groups, so it's either a vector or nil, which are easy to handle. I tend to use if-some. It evaluates the match, checks for nil, and destructures the groups. You can put a destructuring form right in the binding, so the groups get names as part of the test.
(if-some [[whole-match first-name last-name] ;; destructuring form
          (re-matches #"(\w+)\s(\w+)" full-name)]
  (println first-name last-name) ;; matching case
  (println "Unparsable name")) ;; nil case
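To make the if-some pattern concrete, here is a minimal sketch wrapped in a function. The name parse-name and the shape of the returned map are assumptions for illustration, not part of any library.

```clojure
(defn parse-name
  "Hypothetical helper: returns a map for a two-word name, or nil."
  [full-name]
  (if-some [[_ first-name last-name] (re-matches #"(\w+)\s(\w+)" full-name)]
    {:first first-name :last last-name} ;; matching case
    nil))                               ;; nil case

(parse-name "Ada Lovelace") ;;=> {:first "Ada", :last "Lovelace"}
(parse-name "Cher")         ;;=> nil
```

Because the regex has groups, the result is always a vector or nil, so the destructuring never has to deal with a bare string.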
re-find - Finding a regex substring within a string with groups
Sometimes we want to find a match within a string. re-find returns the first match within the string. The return values are similar to re-matches.
1. No match returns nil
(re-find #"sss" "Loch Ness") ;;=> nil
2. Match without groups returns the matched string
(re-find #"s+" "dress") ;;=> "ss"
3. Match with groups returns a vector
(re-find #"s+(.*)(s+)" "success") ;;=> ["success" "ucces" "s"]
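The difference between the two functions is easiest to see side by side. This is a small sketch with made-up input strings:

```clojure
;; re-matches must consume the whole string; re-find looks for a match anywhere.
(re-matches #"\d+" "order 66 shipped") ;;=> nil
(re-find    #"\d+" "order 66 shipped") ;;=> "66"

;; Anchoring re-find with ^ and $ makes it behave much like re-matches.
(re-find #"^\d+$" "66")               ;;=> "66"
(re-find #"^\d+$" "order 66 shipped") ;;=> nil
```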
re-seq - Finding all substrings that match within a string
The last function from clojure.core I use a lot is re-seq. **re-seq returns a lazy seq of all of the matches.** The elements of the seq are whatever type re-find would have returned.
(re-seq #"s+" "mississippi") ;;=> ("ss" "ss")
(re-seq #"[a-zA-Z](\d+)"
"abc x123 b44 234") ;;=> (["x123" "123"] ["b44" "44"])
clojure.string/replace - Replacing regex matches within a string
Well, matching strings is cool, but often you'd like to replace a substring that matches with some other string. clojure.string/replace will replace all substring matches with a new string.
Do not confuse clojure.string/replace with clojure.core/replace. They are very different. I will often alias clojure.string as str in my ns declaration:
(ns my-app.core
  (:require [clojure.string :as str]))
That lets me refer to clojure.string/replace as str/replace.
Here's a quick example:
(str/replace "mississippi" #"i.." "obb") ;;=> "mobbobbobbi"
This example matches an i followed by any two characters. It replaces all matches with the string "obb".
Notice the argument order. The string you are matching against comes first, followed by the regex. Most functions in clojure.string follow that pattern. Since the functions are about strings, the strings are the first argument.
Referring to groups in the replacement string
clojure.string/replace is actually quite versatile. You can refer directly to the groups in the replacement string using a dollar sign. $0 means the entire match. $1 means the first group. $2 means the second group, etc.:
(str/replace "mississippi" #"(i)" "$1$1") ;;=> "miissiissiippii"
This example doubles all of the i's.
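Group references are also handy for reordering text. Here is a short sketch with made-up inputs, assuming the str alias from above:

```clojure
(require '[clojure.string :as str])

;; $2 and $1 swap the two captured words.
(str/replace "Lovelace, Ada" #"(\w+), (\w+)" "$2 $1") ;;=> "Ada Lovelace"

;; $0 is the whole match, here wrapped in brackets.
(str/replace "a1 b22" #"\d+" "[$0]") ;;=> "a[1] b[22]"
```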
Calculating the replacement with a function
You can replace matches with the return value of a function applied to the match. The function receives what re-find would return for that match: a vector when the regex has groups, the matched string otherwise.
(str/replace "mississippi" #"(.)i(.)"
             (fn [[_ b a]]
               (str (str/upper-case b)
                    "—"
                    (str/upper-case a)))) ;;=> "M—SS—SS—Ppi"
You can replace just the first occurrence with clojure.string/replace-first.
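For comparison, here is a brief sketch of replace-first next to a function replacement on a regex without groups. The inputs reuse the example string from above.

```clojure
(require '[clojure.string :as str])

;; Only the first "i.." is replaced.
(str/replace-first "mississippi" #"i.." "obb") ;;=> "mobbissippi"

;; No groups in the regex, so the function is called with the matched string.
(str/replace "mississippi" #"ss" str/upper-case) ;;=> "miSSiSSippi"
```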
clojure.string/split - Splitting a string by a regex
Let's say you want to split a string on some character pattern, like one or more whitespace characters. You can use clojure.string/split:
(str/split "This is a string that I am splitting." #"\s+")
;;=> ["This" "is" "a" "string" "that" "I" "am" "splitting."]
Again, we see the same argument pattern: The string to match comes first, since the clojure.string functions are about strings.
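A few edge cases are worth trying at the REPL. This sketch shows the optional third argument (a limit on the number of pieces) and how empty strings are handled on the JVM. The input strings are made up.

```clojure
(require '[clojure.string :as str])

;; A limit of 2 stops after the first split.
(str/split "key=value=more" #"=" 2) ;;=> ["key" "value=more"]

;; Without a limit, trailing empty strings are dropped (JVM behavior).
(str/split "a,b,," #",") ;;=> ["a" "b"]

;; Leading whitespace would produce an empty first element, so trim first.
(str/split (str/trim "  two words ") #"\s+") ;;=> ["two" "words"]
```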
Creating a case insensitive regex in Clojure (and other flags)
Some languages have syntax which allows you to put modifiers on the regex, such as the i modifier which makes it a case insensitive match. Here is an example from JavaScript:
/jjj/i;
This regex will match three j's regardless of the case. "jJj" and "JJj" will match. These modifiers are called flags.
Unfortunately, Clojure's syntax does not allow for flags. You have to rely on the native host mechanisms for creating regexes.
1. JVM Clojure
On the JVM, there are two ways to use flags.
JVM Regex Flags Method 1: Special flag syntax
The JVM regexes allow for a special syntax to enable flags within the regex.
;; no flags (case-sensitive)
#"abc" ;;=> #"abc"
;; case-insensitive flag set
#"(?i)abc" ;;=> #"(?i)abc"
These are flags that can be turned on and off along the regex. For instance:
#"ab(?i)cdef(?-i)ghi" ;;=> #"ab(?i)cdef(?-i)ghi"
The flag starts off, so ab is case-sensitive. Then the first (?i) turns it on, so cdef is case-insensitive. Then (?-i) turns it off (due to the -), so ghi is case-sensitive.
You can even selectively turn them on or off in non-capturing groups:
#"ab(?iu:cdef)ghi" ;;=> #"ab(?iu:cdef)ghi"
This turns on the i and u flags for just the cdef part.
The java.util.regex.Pattern Javadoc describes the flags syntax and lists the available flags.
The JVM regex flags syntax is quite powerful, and, if I had to guess, I would say that it's the main reason setting global flags using other syntax is hard.
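To see the inline flags at work, here is a short sketch for JVM Clojure. The input strings are made up, and each pattern is one of the regexes shown above (with only the i flag in the grouped form).

```clojure
;; JVM Clojure only: inline flags are part of Java's regex syntax.
(re-matches #"(?i)abc" "ABC") ;;=> "ABC"
(re-matches #"abc" "ABC")     ;;=> nil

;; Only the cdef part ignores case.
(re-matches #"ab(?i)cdef(?-i)ghi" "abCDEFghi") ;;=> "abCDEFghi"
(re-matches #"ab(?i)cdef(?-i)ghi" "ABcdefghi") ;;=> nil

;; The same scoping with a non-capturing group.
(re-matches #"ab(?i:cdef)ghi" "abCdEfghi") ;;=> "abCdEfghi"
(re-matches #"ab(?i:cdef)ghi" "abcdefGHI") ;;=> nil
```

The flag group does not capture, so the results are plain strings rather than vectors.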
JVM Regex Flags Method 2: Create a regular expression by using the host classes
We will be using the java.util.regex.Pattern class, so we should import it for easier typing:
(ns my-app.core
  (:import (java.util.regex Pattern)))
Now we can use it to compile a regex:
;; These two are equivalent:
#"abc" ;;=> #"abc"
(Pattern/compile "abc") ;;=> #"abc"
To add flags, we have to refer to them by their name. It's not very convenient to type, but here it is:
(Pattern/compile "abc" Pattern/CASE_INSENSITIVE) ;;=> #"abc"
Notes
- This makes a case-insensitive regular expression.
- The regular expressions using flags like this print the same as the regexes without flags.
- The flag applies to the entire regex.
- You can find out the flags on a regex using the .flags method.
- You will need to escape backslashes (\) twice since you're using a string literal, not a regex literal.
You can combine flags using +. The flags are bit masks, so bit-or works as well and stays correct even if the same flag is listed twice:
(Pattern/compile "abc" (+ Pattern/CASE_INSENSITIVE
                          Pattern/UNICODE_CASE)) ;;=> #"abc"
It's not convenient to type, but at least it's explicit. The Pattern Javadoc lists the available flags on the JVM.
There is a trick I've used to make escaping a little easier. You can use a regex literal (#""), then convert it to a string to pass it to Pattern/compile:
;; double escaped
(Pattern/compile "\\d" Pattern/CASE_INSENSITIVE)
;; more ergonomic
(Pattern/compile (str #"\d") Pattern/CASE_INSENSITIVE)
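Putting the pieces of this method together, here is a minimal sketch for JVM Clojure. It swaps in bit-or as an alternative to + for combining the flags. The var name ci-regex is made up for illustration.

```clojure
(import '(java.util.regex Pattern))

;; Regex literal for easy escaping, converted to a string for Pattern/compile.
;; bit-or is used here instead of + (the flags are bit masks).
(def ci-regex
  (Pattern/compile (str #"\d+[a-z]")
                   (bit-or Pattern/CASE_INSENSITIVE Pattern/UNICODE_CASE)))

;; The printed form hides the flags, but matching shows them at work.
(re-matches ci-regex "42X") ;;=> "42X"

;; .flags returns the combined bit mask.
(= (.flags ci-regex)
   (bit-or Pattern/CASE_INSENSITIVE Pattern/UNICODE_CASE)) ;;=> true
```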
2. ClojureScript
In ClojureScript, we will construct a JavaScript RegExp. If you don't need flags, you can construct one like this:
;; These two are equivalent:
#"abc" ;;=> #"abc"
(js/RegExp. "abc") ;;=> #"abc"
To add flags, just add a second argument, a string containing the letter codes:
(js/RegExp. "abc" "iu") ;;=> #"abc"
Unfortunately, regexes with flags print the same as regexes without flags, so be careful.
You can read about the available flags in JavaScript.
Find whether a string contains another
I commonly use regexes to determine if a string contains another string. That's easy to do with re-find:
(re-find #"needle" "Find a needle in a haystack.") ;;=> "needle"
(re-find #"needle" "Empty haystack.") ;;=> nil
Because the return is truthy or falsey, you can use it as the condition of an if.
But if you're just using a substring match (and not using fancy regex features like flags, character classes, and repetition), you can use clojure.string/includes?:
(str/includes? "Find a needle in a haystack." "needle") ;;=> true
(str/includes? "Empty haystack." "needle") ;;=> false
Regexes are nice because you can anchor the match to the beginning or the end of the string (or of each line, if you turn on the multiline flag):
(re-find #"^This string" "This string starts with ...") ;;=> "This string"
(re-find #"end$" "Find a string at the end") ;;=> "end"
There are functions for that (again, only if you don't need regex features), clojure.string/starts-with? and clojure.string/ends-with?:
(str/starts-with? "This string starts with ..." "This string") ;;=> true
(str/ends-with? "Find a string at the end" "end") ;;=> true
Remember, we commonly alias clojure.string to str in the ns declaration:
(ns my-app.core
  (:require [clojure.string :as str]))
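Since re-find returns something truthy or nil, it also works as a predicate for filter. A small sketch for JVM Clojure, with a made-up lines vector:

```clojure
(require '[clojure.string :as str])

(def lines ["Find a needle in a haystack."
            "Empty haystack."
            "NEEDLE in caps."])

;; re-find as a predicate: nil drops the line, a match keeps it.
(filter #(re-find #"(?i)needle" %) lines)
;;=> ("Find a needle in a haystack." "NEEDLE in caps.")

;; The plain-substring version is case-sensitive.
(filter #(str/includes? % "needle") lines)
;;=> ("Find a needle in a haystack.")

;; boolean turns the match into true or false when you need a real boolean.
(boolean (re-find #"needle" "Empty haystack.")) ;;=> false
```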
Escaping regex characters in a string
Sometimes you have a string that contains some special characters that are meaningful as part of a regex.
"(??^$]" ;; A string I want to match literally
However, if you want to match those literally, you'll be in for a world of pain.
#"\(\?\?\^\$\]" ;; you can't escape the escapes!
The java.util.regex.Pattern class has a static method that's useful for quoting such strings:
(Pattern/quote "(??^$]") ;;=> "\\Q(??^$]\\E"
You can then pass it to compile:
(-> "(??^$]" Pattern/quote Pattern/compile) ;;=> #"\Q(??^$]\E"
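Quoting matters most when the text to match comes from somewhere else, such as user input. Here is a minimal sketch for JVM Clojure. The helper name literal-regex is made up for illustration.

```clojure
(import '(java.util.regex Pattern))

(defn literal-regex
  "Hypothetical helper: a regex that matches s exactly as written."
  [s]
  (re-pattern (Pattern/quote s)))

;; Unquoted, + means repetition, so the literal text 1+1 is not found.
(re-find #"1+1" "1+1=2") ;;=> nil

;; Quoted, every character is literal.
(re-find (literal-regex "1+1") "1+1=2")       ;;=> "1+1"
(re-find (literal-regex "(??^$]") "x(??^$]y") ;;=> "(??^$]"
```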
Notable libraries
Regal
Regal is a library from Lambda Island Open Source. It makes regular expressions more readable and translatable between JavaScript, JVM 8, and JVM 9.
It lets you write this regex:
[\w.%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,4}
As this edn:
[:cat
 [:+ [:class :word ".%+-"]]
 "@"
 [:+ [:class ["A" "Z"] ["a" "z"] ["0" "9"] ".-"]]
 "."
 [:repeat [:class ["A" "Z"] ["a" "z"]] "2" "4"]]
Check out this interactive tutorial.
Other rarely-used functions
Those are all of the functions I use routinely. There are some more, which are useful when you need them.
re-pattern
Construct a regex from a String.
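re-pattern earns its keep when the regex is only known at runtime. This is a rough sketch for JVM Clojure. The helper any-of is made up, and it assumes the words contain no regex special characters.

```clojure
(require '[clojure.string :as str])

(defn any-of
  "Hypothetical helper: a case-insensitive regex matching any of the words.
  Assumes the words contain no regex special characters."
  [words]
  (re-pattern (str "(?i)(?:" (str/join "|" words) ")")))

(re-seq (any-of ["cat" "dog"]) "Cat chases DOG") ;;=> ("Cat" "DOG")

;; In a string, backslashes must be doubled.
(re-find (re-pattern "\\d+") "room 101") ;;=> "101"
```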
re-matcher
This one is not available in ClojureScript. On the JVM, it creates a java.util.regex.Matcher, which is used for iterating over subsequent matches. This is not so useful since re-seq exists.
If you find yourself with a Matcher, you can call re-find on it to get the next match (instead of the first). You can also call re-groups to get the groups from the most recent match. You can also use a Matcher to get named capture groups. See this example.
Unless you need a Matcher for some Java API, stick to re-seq. Matchers are mutable and don't work well with threads.
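For the cases where you do end up with a Matcher, here is a short sketch for JVM Clojure of stepping through matches and reading named groups. The helper parse-date and its group names are made up for illustration.

```clojure
;; JVM only: Matchers are stateful, so each re-find call advances.
(def m (re-matcher #"(\w)(\d)" "a1 b2"))

(re-find m)   ;;=> ["a1" "a" "1"]
(re-groups m) ;;=> ["a1" "a" "1"]  (groups of the most recent match)
(re-find m)   ;;=> ["b2" "b" "2"]
(re-find m)   ;;=> nil

;; Named groups are read through the Matcher's .group method.
(defn parse-date
  "Hypothetical helper: pulls year and month out of a string, or returns nil."
  [s]
  (let [matcher (re-matcher #"(?<year>\d{4})-(?<month>\d{2})" s)]
    (when (.find matcher)
      {:year  (.group matcher "year")
       :month (.group matcher "month")})))

(parse-date "Released 2023-07") ;;=> {:year "2023", :month "07"}
```

Creating the Matcher inside the function keeps its mutable state local to a single call.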