Clojure Regex Tutorial

Summary: With a few functions from the standard library, Clojure lets you do most of what you want with regular expressions with no muss.

Clojure regexes are host language regexes

Clojure is designed to be hosted. Clojure defers to the host regex syntax and semantics instead of defining a standard that works on all platforms. On the JVM, you're using Java regexes. In ClojureScript, it's JavaScript regexes.

Refer to the host's own documentation for its regex syntax: the java.util.regex.Pattern Javadoc on the JVM, and a JavaScript regular expression reference for ClojureScript.

And you can use Regex 101 for testing out regexes. Be sure to select the language in the menu in the top left. I also use the REPL.

Of course, this difference means that regexes are not always portable. Apart from the syntax and semantics of the regexes themselves, though, Clojure standardizes many regex functions across all platforms in the core library.

Clojure regex syntax

You construct a regex in Clojure using a literal syntax. Strings with a hash sign in front are interpreted as regexes:

#"regex"

On the JVM, the above line will create an instance of java.util.regex.Pattern. In ClojureScript, it will create a RegExp. Remember, the two regular expression languages are similar but different.

This syntax is the most convenient because you don't need to double escape your special characters. For example, if you want to represent the regex string to match a digit, using a Clojure string you would need to write this:

"\\d" ;; regex string to match one digit

Notice that you have to escape the backslash to get a literal backslash in the string. However, regex literals are smart. They don't need to double escape:

#"\d" ;; match one digit
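To check that the two spellings describe the same regex, here is a small REPL sketch, assuming JVM Clojure. It uses re-pattern (covered near the end of this tutorial) to build a regex from the double-escaped string. The var names are made up for illustration.

```clojure
;; A regex literal, and the same regex built from a double-escaped string.
(def from-literal #"\d")
(def from-string  (re-pattern "\\d"))

;; On the JVM, str returns the pattern's source text: a backslash and a d.
(str from-literal)                          ;;=> "\\d"
(= (str from-literal) (str from-string))    ;;=> true
(re-matches from-string "7")                ;;=> "7"
```

Note that the comparison is done on the source text. On the JVM, two separately created Pattern objects are not equal to each other, even when they were written identically.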

Matching a regex to a string with groups

Very often, you want to match an entire string. The function to do that in Clojure is called re-matches. re-matches takes a regex and a string, then returns the result of the match.

(re-matches regex string) ;;=> result

The result it returns is a little complex. There are three things it can return.

1. No match returns nil

If the whole string does not match, re-matches returns nil, which is nice because nil is falsey.

(re-matches #"abc" "xyz")            ;;=> nil
(re-matches #"abc" "zzzabcxxx")      ;;=> nil
(re-matches #"(a)bc" "hello, world") ;;=> nil

2. Matching with no groups returns the matched string

If the string does match, and there are no groups (parens) in the regex, then it returns the matched string.

(re-matches #"abc" "abc")  ;;=> "abc"
(re-matches #"\d+" "3324") ;;=> "3324"

Since all strings are truthy, you can use re-matches as the test of a conditional:

(if (re-matches #"\d+" x)
  (println "x is all digits")
  (println "x is not all digits"))

We'll see a more convenient way to test and use the return value below, using if-some.

3. Matching with groups returns a vector

If it matches and there are groups, then it returns a vector. The first element in the vector is the entire match. The remaining elements are the group matches.

(re-matches #"abc(.*)" "abcxyz")       ;;=> ["abcxyz" "xyz"]
(re-matches #"(a+)(b+)(\d+)" "abb234") ;;=> ["abb234" "a" "bb" "234"]
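One detail worth knowing about the vector: a group that does not take part in the match still gets a slot, and that slot holds nil. A short sketch, assuming JVM Clojure (the regex is made up for illustration and uses a non-capturing group, which adds no element of its own):

```clojure
;; The second capturing group sits inside an optional non-capturing group.
(re-matches #"(\d+)(?:-(\d+))?" "10-20") ;;=> ["10-20" "10" "20"]
;; When the optional part is absent, its group is nil in the result.
(re-matches #"(\d+)(?:-(\d+))?" "10")    ;;=> ["10" "10" nil]
```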

The three different return types can get tricky. However, I usually have groups, so it's either a vector or nil, which are easy to handle. I tend to use if-some. It evaluates the match, checks for nil, and binds the groups by destructuring, all in one form. The destructuring form is written before the match expression, but it is only applied when the match is not nil.

(if-some [[whole-match first-name last-name]      ;; destructuring form
          (re-matches #"(\w+)\s(\w+)" full-name)]
  (println first-name last-name)                  ;; matching case
  (println "Unparsable name"))                    ;; nil case
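The same pattern can be wrapped in a function so that callers get data back instead of printed output. This is only a sketch: parse-name and its map keys are hypothetical names, and it uses when-some (the one-branch sibling of if-some) so that a failed match simply returns nil.

```clojure
(defn parse-name
  "Hypothetical helper: turns \"First Last\" into a map, or returns nil
  when the string does not have exactly that shape."
  [full-name]
  (when-some [[_ first-name last-name]
              (re-matches #"(\w+)\s(\w+)" full-name)]
    {:first-name first-name
     :last-name  last-name}))

(parse-name "Ada Lovelace") ;;=> {:first-name "Ada", :last-name "Lovelace"}
(parse-name "Cher")         ;;=> nil
```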

Finding a regex substring within a string with groups

Sometimes we want to find a match within a string. re-find returns the first match within the string. The return values are similar to re-matches.

1. No match returns nil

(re-find #"sss" "Loch Ness") ;;=> nil

2. Match without groups returns the matched string

(re-find #"s+" "dress") ;;=> "ss"

3. Match with groups returns a vector

(re-find #"s+(.*)(s+)" "success") ;;=> ["success" "ucces" "s"]
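To make the difference between the two functions concrete, here is a side-by-side sketch on the same input (the input strings are made up for illustration):

```clojure
;; re-matches needs the whole string to match; re-find looks anywhere.
(re-matches #"\d+" "order 66") ;;=> nil
(re-find    #"\d+" "order 66") ;;=> "66"

;; Anchoring with ^ and $ makes re-find behave much like re-matches here.
(re-find #"^\d+$" "order 66")  ;;=> nil
(re-find #"^\d+$" "66")        ;;=> "66"
```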

Finding all substrings that match within a string

The last function from clojure.core I use a lot is re-seq. re-seq returns a lazy seq of all of the matches. The elements of the seq are whatever type re-find would have returned.
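Because the result is an ordinary seq, it composes with the usual sequence functions. A minimal sketch, assuming JVM Clojure (Long/parseLong is a Java method, and the sample text is made up):

```clojure
(def text "order 12, item 7, qty 300")

;; No groups: each element is a string.
(mapv #(Long/parseLong %) (re-seq #"\d+" text))
;;=> [12 7 300]

;; With groups: each element is a vector, so it can be destructured.
(into {}
      (map (fn [[_ label digits]]
             [(keyword label) (Long/parseLong digits)]))
      (re-seq #"(\w+) (\d+)" text))
;;=> {:order 12, :item 7, :qty 300}
```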

(re-seq #"s+" "mississippi") ;;=> ("ss" "ss")
(re-seq #"[a-zA-Z](\d+)"
        "abc x123 b44 234")  ;;=> (["x123" "123"] ["b44" "44"])

Replacing regex matches within a string

Well, matching strings is cool, but often you'd like to replace a substring that matches with some other string. clojure.string/replace will replace all substring matches with a new string.

Do not confuse clojure.string/replace with clojure.core/replace. They are very different. I will often alias clojure.string as str in my ns declaration:

(ns my-app.core
  (:require [clojure.string :as str]))

That lets me refer to clojure.string/replace as str/replace.

Here's a quick example:

(str/replace "mississippi" #"i.." "obb") ;;=> "mobbobbobbi"

This example matches an i followed by any two characters. It replaces all matches with the string "obb".

Notice the argument order. The string you are matching against comes first, followed by the regex. Most functions in clojure.string follow that pattern. Since the functions are about strings, the strings are the first argument.

Referring to groups in the replacement string

clojure.string/replace is actually quite versatile. You can refer directly to the groups in the replacement string using a dollar sign. $0 means the entire match. $1 means the first group. $2 means the second group, etc.:

(str/replace "mississippi" #"(i)" "$1$1") ;;=> "miissiissiippii"

This example doubles all of the i's.
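Group references also let you reorder parts of the match. And because the dollar sign is special in the replacement string, a literal one has to be escaped; clojure.string/re-quote-replacement does that for you. A short sketch, assuming JVM Clojure (the sample strings are made up):

```clojure
(require '[clojure.string :as str])

;; Swap the two captured words.
(str/replace "Lovelace, Ada" #"(\w+), (\w+)" "$2 $1")
;;=> "Ada Lovelace"

;; Quote the replacement so "$5" is inserted literally instead of
;; being read as a reference to a (non-existent) fifth group.
(str/replace "cost: N" #"N" (str/re-quote-replacement "$5"))
;;=> "cost: $5"
```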

Calculating the replacement with a function

You can replace matches with the return value of a function applied to the match:

(str/replace "mississippi" #"(.)i(.)"
  (fn [[_ b a]]
    (str (str/upper-case b)
        "—"
        (str/upper-case a)))) ;;=> "M—SS—SS—Ppi"

You can replace just the first occurrence with clojure.string/replace-first.
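Two quick illustrations of the points above, as a sketch. The first reuses the earlier "obb" example with replace-first. The second shows that when the regex has no groups, the replacement function receives the matched string rather than a vector.

```clojure
(require '[clojure.string :as str])

;; Only the first "i.." is replaced.
(str/replace-first "mississippi" #"i.." "obb") ;;=> "mobbissippi"

;; No groups in the regex, so m is a plain string.
(str/replace "a1b22" #"\d+" (fn [m] (str "<" m ">"))) ;;=> "a<1>b<22>"
```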

Splitting a string by a regex

Let's say you want to split a string on some character pattern, like one or more whitespace. You can use clojure.string/split:

(str/split "This is a string    that I am splitting." #"\s+")
  ;;=> ["This" "is" "a" "string" "that" "I" "am" "splitting."]

Again, we see the same argument pattern: The string to match comes first, since the clojure.string functions are about strings.
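clojure.string/split also takes an optional third argument, a limit on the number of parts. The sketch below assumes JVM Clojure, where the limit is handed to Java's split, so trailing empty strings are dropped by default and a negative limit keeps them.

```clojure
(require '[clojure.string :as str])

;; At most two parts; the remainder stays unsplit.
(str/split "a,b,c,d" #"," 2) ;;=> ["a" "b,c,d"]

;; Trailing empty strings are dropped by default on the JVM...
(str/split "a,b,," #",")     ;;=> ["a" "b"]
;; ...and kept with a negative limit.
(str/split "a,b,," #"," -1)  ;;=> ["a" "b" "" ""]
```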

Creating a case insensitive regex in Clojure (and other flags)

Some languages have syntax that allows you to put modifiers on the regex, such as the i modifier, which makes it a case-insensitive match. Here is an example from JavaScript:

/jjj/i;

This regex will match three j's regardless of case: "jJj" and "JJj" both match. Modifiers like i are called flags.

Unfortunately, Clojure's regex literal has no dedicated syntax for flags. You have to rely on the native host mechanisms for setting them.

1. JVM Clojure

On the JVM, there are two ways to use flags.

JVM Regex Flags Method 1: Special flag syntax

The JVM regexes allow for a special syntax to enable flags within the regex.

;; no flags (case-sensitive)
#"abc"     ;;=> #"abc"
;; case-insensitive flag set
#"(?i)abc" ;;=> #"(?i)abc"

These flags can be turned on and off partway through the regex. For instance:

#"ab(?i)cdef(?-i)ghi" ;;=> #"ab(?i)cdef(?-i)ghi"

The flag starts off, so ab is case-sensitive. Then the first (?i) turns it on, so cdef is case-insensitive. Then (?-i) turns it off (due to the -), so ghi is case-sensitive.

You can even selectively turn them on or off in non-capturing groups:

#"ab(?iu:cdef)ghi" ;;=> #"ab(?iu:cdef)ghi"

This turns on the i and u flags for just the cdef part.
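Here is how those inline flags behave when matching, as a REPL sketch for JVM Clojure only (the test strings are made up):

```clojure
;; Without the flag, case matters.
(re-matches #"abc" "ABC")                      ;;=> nil
(re-matches #"(?i)abc" "ABC")                  ;;=> "ABC"

;; Only the cdef part is case-insensitive.
(re-matches #"ab(?i)cdef(?-i)ghi" "abCdEfghi") ;;=> "abCdEfghi"
(re-matches #"ab(?i)cdef(?-i)ghi" "ABcdefghi") ;;=> nil

;; The same scoping, written with a non-capturing group.
(re-matches #"ab(?i:cdef)ghi" "abCDEFghi")     ;;=> "abCDEFghi"
(re-matches #"ab(?i:cdef)ghi" "abcdefGHI")     ;;=> nil
```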

You can read about the JVM regex flags syntax and the available flags in the java.util.regex.Pattern Javadoc.

The JVM regex flags syntax is quite powerful, and, if I had to guess, I would say that it's the main reason setting global flags using other syntax is hard.

JVM Regex Flags Method 2: Create a regular expression by using the host classes

We will be using the java.util.regex.Pattern class, so we should import it for easier typing:

(ns my-app.core
  (:import (java.util.regex Pattern)))

Now we can use it to compile a regex:

;; These two are equivalent:
#"abc"                  ;;=> #"abc"
(Pattern/compile "abc") ;;=> #"abc"

To add flags, we have to refer to them by their name. It's not very convenient to type, but here it is:

(Pattern/compile "abc" Pattern/CASE_INSENSITIVE) ;;=> #"abc"

Notes

  • This makes a case-insensitive regular expression.
  • The regular expressions using flags like this print the same as the regexes without flags.
  • The flag applies to the entire regex.
  • You can find out the flags on a regex using the .flags method.
  • You will need to escape backslashes (\) twice since you're using a string literal, not a regex literal.

You can combine flags using + (each flag is a distinct bit, so bit-or works as well):

(Pattern/compile "abc" (+ Pattern/CASE_INSENSITIVE
                          Pattern/UNICODE_CASE)) ;;=> #"abc"
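Since the printed form hides the flags, it can help to check them at the REPL. A minimal sketch, assuming JVM Clojure. The var name ci is made up, and bit-or is used here instead of +, which gives the same result for distinct flags.

```clojure
(import 'java.util.regex.Pattern)

(def ci
  (Pattern/compile "abc" (bit-or Pattern/CASE_INSENSITIVE
                                 Pattern/UNICODE_CASE)))

;; The flags take effect when matching...
(re-matches ci "ABC") ;;=> "ABC"

;; ...and .flags reports them, even though the source text does not.
(= (.flags ci)
   (+ Pattern/CASE_INSENSITIVE Pattern/UNICODE_CASE)) ;;=> true
(str ci)              ;;=> "abc"
```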

It's not convenient to type, but at least it's explicit. You can read about the available flags on the JVM.

There is a trick I've used to make escaping a little easier. You can use a regex literal (#""), then convert it to a string to pass it to Pattern/compile:

;; double escaped
(Pattern/compile "\\d"       Pattern/CASE_INSENSITIVE)
;; more ergonomic
(Pattern/compile (str #"\d") Pattern/CASE_INSENSITIVE)

2. ClojureScript

In ClojureScript, we will construct a JavaScript RegExp. If you don't need flags, you can construct one like this:


;; These two are equivalent:
#"abc"             ;;=> #"abc"
(js/RegExp. "abc") ;;=> #"abc"

To add flags, just add a second argument, a string containing the letter codes:

(js/RegExp. "abc" "iu") ;;=> #"abc"

Unfortunately, regexes with flags print the same as regexes without flags, so be careful.

You can read about the available flags in JavaScript.

Find whether a string contains another

I commonly use regexes to determine if a string contains another string. That's easy to do with re-find:

(re-find #"needle" "Find a needle in a haystack.") ;;=> "needle"
(re-find #"needle" "Empty haystack.")              ;;=> nil

Because the return is truthy or falsey, you can use it as the condition of an if.

But if you're just using a substring match (and not using fancy regex features like flags, character classes, and repetition), you can use clojure.string/includes?:

(str/includes? "Find a needle in a haystack." "needle") ;;=> true
(str/includes? "Empty haystack." "needle")              ;;=> false

Regexes are nice because you can anchor a match to the beginning or the end of the string (by default, ^ and $ refer to the whole input rather than to each line):

(re-find #"^This string" "This string starts with ...") ;;=> "This string"
(re-find #"end$" "Find a string at the end")            ;;=> "end"
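If you do want ^ and $ to apply to each line of a multi-line string, the JVM's inline (?m) flag switches them to line mode. A short sketch, assuming JVM Clojure (the sample text is made up):

```clojure
(def text "first line\nsecond line")

;; Default: ^ only matches at the very start of the input.
(re-find #"^second" text)     ;;=> nil
;; With (?m), ^ also matches at the start of each line.
(re-find #"(?m)^second" text) ;;=> "second"

;; Likewise, $ matches at the end of every line in (?m) mode.
(re-seq #"line$" text)        ;;=> ("line")
(re-seq #"(?m)line$" text)    ;;=> ("line" "line")
```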

There are functions for that (again, only if you don't need regex features), clojure.string/starts-with? and clojure.string/ends-with?:

(str/starts-with? "This string starts with ..." "This string") ;;=> true
(str/ends-with?   "Find a string at the end"    "end")         ;;=> true

Remember, we commonly alias clojure.string to str in the ns declaration:

(ns my-app.core
  (:require [clojure.string :as str]))

Escaping regex characters in a string

Sometimes you have a string that contains some special characters that are meaningful as part of a regex.

"(??^$]" ;; A string I want to match literally

Escaping each of those characters by hand is painful and easy to get wrong:

#"\(\?\?\^\$\]" ;; every special character escaped by hand

On the JVM, the java.util.regex.Pattern class has a static method that's useful for quoting such strings:

(Pattern/quote "(??^$]") ;;=> "\\Q(??^$]\\E"

You can then pass it to compile:

(-> "(??^$]" Pattern/quote Pattern/compile) ;;=> #"\Q(??^$]\E"
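Quoting combines well with the inline flag syntax from earlier. The sketch below, for JVM Clojure only, builds a case-insensitive regex that matches an arbitrary string literally. The name literal-ci-pattern is hypothetical and not part of any library.

```clojure
(import 'java.util.regex.Pattern)

(defn literal-ci-pattern
  "Hypothetical helper: a regex that matches s literally, ignoring case."
  [s]
  (re-pattern (str "(?i)" (Pattern/quote s))))

;; Special characters in the input are matched as plain text.
(re-find (literal-ci-pattern "(??^$]") "see (??^$] here") ;;=> "(??^$]"
;; The dot is literal, and case is ignored.
(re-find (literal-ci-pattern "a.b") "xA.By")              ;;=> "A.B"
(re-find (literal-ci-pattern "a.b") "aXb")                ;;=> nil
```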

Notable libraries

Regal

Regal is a library from Lambda Island Open Source. It makes regular expressions more readable and translatable between JavaScript, JVM 8, and JVM 9.

It lets you write this regex:

[\w.%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,4}

As this edn:

[:cat
 [:+ [:class :word ".%+-"]]
 "@"
 [:+ [:class ["A" "Z"] ["a" "z"] ["0" "9"] ".-"]]
 "."
 [:repeat [:class ["A" "Z"] ["a" "z"]] 2 4]]

Check out this interactive tutorial.

Other rarely-used functions

Those are all of the functions I use routinely. There are some more, which are useful when you need them.

re-pattern

Construct a regex from a String.
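re-pattern is mostly useful when the regex is only known at run time. A minimal sketch that builds an alternation from a list of words. The names are made up, and it assumes the words contain no regex special characters (otherwise they would need quoting first).

```clojure
(require '[clojure.string :as str])

(def animals ["cat" "dog"])

;; Builds the equivalent of #"\b(?:cat|dog)\b" from the vector above.
(def animal-re
  (re-pattern (str "\\b(?:" (str/join "|" animals) ")\\b")))

;; "catalog" is skipped because of the word boundaries.
(re-seq animal-re "cat, dog, bird, catalog") ;;=> ("cat" "dog")
```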

re-matcher

This one is not available in ClojureScript. On the JVM, it creates a java.util.regex.Matcher, which is used for iterating over subsequent matches. This is not so useful since re-seq exists.

If you find yourself with a Matcher, you can call re-find on it to get the next match (instead of the first). You can also call re-groups on it to get the groups from the most recent match. You can also use a Matcher to get named capture groups. See this example.

Unless you need a Matcher for some Java API, stick to re-seq. Matchers are mutable and don't work well with threads.
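For the rare cases where a Matcher is what you have, here is a sketch of both uses described above, for JVM Clojure only. The function names are hypothetical. The named groups are read with Java interop, via the Matcher's .matches and .group methods.

```clojure
(defn first-n-matches
  "Hypothetical helper: calls re-find on one Matcher n times.
  Each call advances to the next match; nil means no more matches."
  [re s n]
  (let [m (re-matcher re s)]
    (vec (repeatedly n #(re-find m)))))

(first-n-matches #"\d+" "a1 b22 c333" 4) ;;=> ["1" "22" "333" nil]

(defn parse-year-month
  "Hypothetical helper: reads named capture groups from a Matcher."
  [s]
  (let [m (re-matcher #"(?<year>\d{4})-(?<month>\d{2})" s)]
    (when (.matches m)                    ; whole string must match
      {:year  (.group m "year")
       :month (.group m "month")})))

(parse-year-month "2024-05") ;;=> {:year "2024", :month "05"}
(parse-year-month "May")     ;;=> nil
```

Each call creates its own Matcher and keeps it local, which sidesteps the mutability and threading concerns mentioned above.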