SenseTalk Pattern Language Basics

SenseTalk provides several ways to search for text in scripts. You can use text comparison operators and text operators to check for the presence or location of specific characters or substrings within a larger text string. SenseTalk's chunk expressions make it convenient to search for text by using lines, words, delimited items, and individual characters.

However, sometimes finding a specific string isn't enough. For example, you might need to find a phone number within some text without knowing what the phone number is. Or you might need to determine whether sensitive information, such as a credit card number or Social Security number, is exposed.

These tasks require the ability to recognize a pattern within the text. SenseTalk scripts support two methods for defining patterns: the SenseTalk pattern language and regular expressions (regex).

The pattern language is typically the most natural way to use pattern matching in SenseTalk scripting. However, if you are familiar with regular expressions or already have regex definitions from other projects, you should be able to use them with your SenseTalk code.

Note: SenseTalk's pattern recognition capability for both the pattern language and regex is based on the International Components for Unicode (ICU) regular expressions engine. Regular expressions written for other engines should work to the extent that their syntax is compatible with the ICU engine. See the ICU User Guide for more information.

SenseTalk's Pattern Language

SenseTalk's pattern language lets you search within text by describing patterns. A pattern is a combination of elements and can be as simple or as complex as you need to describe what you want to locate. In the SenseTalk pattern language, the pattern description is enclosed in angle brackets (< ... >).

As an example, you could use pattern matching to look for a US Social Security number. A Social Security number consists of nine digits and is typically written in the form 999-99-9999: three digits, a dash, two digits, another dash, and then the final four digits. In SenseTalk, one way to write this pattern would be:

<3 digits then dash then 2 digits then dash then 4 digits>

However, SenseTalk is flexible. You can separate the elements of a pattern with commas, with the word then, by placing each element on a new line, or with any combination of these options. A quoted literal such as "-" can also stand in for the word dash. So the following examples are all equivalent:

<3 digits then "-", 2 digits then "-", 4 digits>

<3 digits, "-", 2 digits, "-", 4 digits>

<3 digits then "-"
2 digits, "-",
then 4 digits>

You can assign a pattern to a variable for convenient reuse:

set ssn to <3 digits then dash then 2 digits then dash then 4 digits>
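
For readers who think in regex, the ssn pattern above corresponds roughly to the expression \d{3}-\d{2}-\d{4}. The sketch below uses Python's re module rather than SenseTalk, purely to show what such a pattern would and would not match. The regex translation, the function name find_ssns, and the sample text are illustrative assumptions and are not part of SenseTalk.

```python
import re

# Assumed regex translation of
# <3 digits then dash then 2 digits then dash then 4 digits>.
SSN_REGEX = re.compile(r"\d{3}-\d{2}-\d{4}")


def find_ssns(text):
    """Return every substring of text that has the 999-99-9999 shape."""
    return SSN_REGEX.findall(text)


if __name__ == "__main__":
    # Hypothetical sample text; the second number lacks the 3-2-4 shape.
    sample = "Applicant 078-05-1120 called from 555-0100."
    print(find_ssns(sample))
```

Note that this simple form checks only the shape of the number. It would also match inside a longer run of digits, so a stricter pattern might add boundary checks around it.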

Regex in SenseTalk

A regex is a special sequence of characters that defines a search pattern. While regex is popular in many programming languages for pattern matching, its syntax can be complex, difficult to understand, and challenging to deploy. Beyond the simplest cases, regex patterns can be hard to decipher, even for the person who wrote them.

If you are familiar with creating regex definitions, you can incorporate that syntax in SenseTalk. You can include regex patterns in SenseTalk scripts by using the word pattern followed by the regex string or expression enclosed in double quotes. For example, a regex pattern for a US zip code could be written in SenseTalk as:

pattern "\d{5}(?:-\d{4})?"

You could also store the regex pattern in a variable for reuse:

set zipcode to pattern {{
\d{5}(?:-\d{4})?
}}

For comparison, the same US zip code pattern written in the SenseTalk pattern language might look like this:

set zipcode to <5 digits then preferably ("-", 4 digits)>
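
To make the behavior of the zip code regex concrete, here is a minimal sketch that runs the same expression, \d{5}(?:-\d{4})?, through Python's re module. Python's engine is not the ICU engine that SenseTalk uses, but this particular expression uses only syntax common to both. The helper names is_zip and first_zip and the sample strings are hypothetical.

```python
import re

# The zip code regex from this section: 5 digits, then an optional
# non-capturing group holding a dash and 4 more digits.
ZIP_REGEX = re.compile(r"\d{5}(?:-\d{4})?")


def is_zip(text):
    """True only when the whole string is a 5-digit or 5+4-digit code."""
    return ZIP_REGEX.fullmatch(text) is not None


def first_zip(text):
    """Return the first zip-like substring of text, or None if absent."""
    match = ZIP_REGEX.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(is_zip("80301"), is_zip("80301-1234"), is_zip("8030"))
    print(first_zip("Boulder, CO 80301-1234 USA"))
```

Because the optional group is greedy, a search takes the four-digit suffix whenever it is present and falls back to the five-digit code when it is not.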

Next Steps

The pattern language provides a powerful method of creating dynamic matches against source text. For more information about building patterns and implementing pattern matching in your scripts, see the related pattern topics in the SenseTalk documentation.