Elements of the SenseTalk Pattern Language

The SenseTalk pattern language lets you define patterns, in easy-to-read syntax, that you can use to match strings in text. As explained in SenseTalk Pattern Language Basics, its pattern-matching capabilities are built on top of regular expressions (regex).

You can create pattern definitions for simple patterns, such as the occurrence of any three digits. You can also define complex patterns that have optional or alternative portions and that can vary in length. Every pattern, however simple or complex, is built from a number of basic pattern elements.

Pattern Language Syntax

Pattern definitions in the SenseTalk pattern language consist of the pattern description enclosed in angle brackets (< ... >).

Syntax:

{pattern} < patternLanguageExpression >

Note: The word pattern is optional and is typically omitted.

A pattern definition (represented in the syntax above by patternLanguageExpression) can be a single element, such as 7 digits to find any occurrence of seven digits in a row. However, most patterns include a sequence of elements, or subpatterns. To specify a sequence, list the subpatterns one after another, separating them with commas, with the word then, or by placing each element on a new line. You can also combine these options.

Therefore, the following examples are all equivalent methods of representing a pattern definition for a Social Security identification number:

<3 digits then "-", 2 digits then "-", 4 digits>

 

<3 digits, "-", 2 digits, "-", 4 digits>

 

<3 digits then "-"
2 digits, "-",
then 4 digits>
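Because the pattern language is built on regular expressions, all three definitions above describe the same shape and can be compared with a single regex. The following is a rough Python sketch of that shape, not SenseTalk's actual translation; the names SSN_REGEX and looks_like_ssn are hypothetical.

```python
import re

# Hypothetical regex analogue of <3 digits, "-", 2 digits, "-", 4 digits>.
# [0-9] is used instead of \d so that only the digits 0 to 9 match.
SSN_REGEX = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def looks_like_ssn(text):
    """Return True when the whole string has the 3-2-4 digit shape."""
    return SSN_REGEX.fullmatch(text) is not None


print(looks_like_ssn("123-45-6789"))  # True
print(looks_like_ssn("123-456-789"))  # False
```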

You can use the word or to specify alternative choices in a subpattern. For example,

<"cat" or "cow">

matches either cat or cow.

You can use parentheses to group elements when necessary. For example,

<"cat" or "cow" then 2 digits>

matches text like cat24 or cow17. But

<"cat" or ("cow" then 2 digits)>

matches either cat (with no digits needed) or something like cow97.
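The effect of the parentheses can be compared with grouping in a regex. The sketch below is an assumed Python analogue of the two patterns above, with hypothetical names. In regex syntax the alternation operator binds more loosely than a sequence, so the grouping must be written explicitly in the first case rather than the second.

```python
import re

# Assumed analogue of <"cat" or "cow" then 2 digits>:
# the two words are alternatives, and the digits follow either one.
EITHER_THEN_DIGITS = re.compile(r"(?:cat|cow)[0-9]{2}", re.IGNORECASE)

# Assumed analogue of <"cat" or ("cow" then 2 digits)>:
# the digits belong to "cow" only.
CAT_OR_COW_DIGITS = re.compile(r"cat|cow[0-9]{2}", re.IGNORECASE)


def whole_match(regex, text):
    """Return True when the regex matches the entire string."""
    return regex.fullmatch(text) is not None


print(whole_match(EITHER_THEN_DIGITS, "cat24"))  # True
print(whole_match(EITHER_THEN_DIGITS, "cat"))    # False
print(whole_match(CAT_OR_COW_DIGITS, "cat"))     # True
print(whole_match(CAT_OR_COW_DIGITS, "cow97"))   # True
```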

Pattern Elements

A pattern definition is made up of individual elements, or subpatterns. An element can be the characters included in a quoted string, such as "cat". It can also be a SenseTalk variable, an expression in parentheses, or one of several pattern elements described in the Pattern Definition Elements table below.

Note the following about how elements can be defined:

  • Most pattern elements can be singular or plural.
    • Singular elements match exactly one of the specified value.
    • Plural elements match a sequence of one or more of the indicated value.
  • Quantifiers can be used to explicitly control the number of characters or values you want an element to match.

Pattern Definition Elements

Elements
    Definition

"quoted string"
    Use an exact string of characters

variable
    Use any valid element or combination of elements stored in a variable

expression
    Use any SenseTalk expression, in parentheses, that yields a string

character
characters
    Matches any characters

letter
letters
    Matches letters of any alphabet

nonletter
nonletters
    Matches characters other than letters of an alphabet

lowercase letter
lowercase letters
    Matches lowercase letters in any language

nonlowercase letter
nonlowercase letters
    Matches letters other than lowercase letters in any language

uppercase letter, capital letter
uppercase letters, capital letters
    Matches uppercase letters in any language

nonuppercase letter, noncapital letter
nonuppercase letters, noncapital letters
    Matches letters other than uppercase letters in any language

digit
digits
    Matches digits from 0 to 9

nondigit
nondigits
    Matches characters other than digits

letterOrDigit, alphanumeric
lettersOrDigits, alphanumerics
    Matches either letters or digits

nonLetterOrDigit, nonAlphanumeric
nonLettersOrDigits, nonAlphanumerics
    Matches characters other than letters or digits

whitespace character
whitespace characters
    Matches white space characters (space, tab, line separator, etc.)

nonwhitespace character
nonwhitespace characters
    Matches characters other than white space characters

word character
word characters
    Matches word characters (letters or digits)

nonword character
nonword characters
    Matches characters other than word characters

punctuation character
punctuation characters
    Matches punctuation characters

nonpunctuation character
nonpunctuation characters
    Matches nonpunctuation characters

character [of | in | from] characterSet
characters [of | in | from] characterSet
    Matches characters that are in the characterSet (string, range, character class identifier, or list of these items)

character not [of | in | from] characterSet
characters not [of | in | from] characterSet
    Matches characters that are not in the characterSet

Note: You can abbreviate character as char, and characters as chars, in all cases.
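To make the singular, plural, and character-set rows more concrete, here is a small Python sketch of regex analogues for a few of them. The mapping in ANALOGUES and the helper whole_match are assumptions made for illustration, not SenseTalk's documented translation. Case is ignored to mirror SenseTalk's default behavior.

```python
import re

# Hypothetical regex analogues for a few rows of the table above.
ANALOGUES = {
    "digit": r"[0-9]",                        # singular: exactly one
    "digits": r"[0-9]+?",                     # plural: one or more (lazy)
    'chars from "ABCDEF"': r"[ABCDEF]+?",     # characters in the set
    'chars not in "ABCDEF"': r"[^ABCDEF]+?",  # characters outside the set
}


def whole_match(element, text):
    """Return True when the element's analogue matches the entire string."""
    return re.fullmatch(ANALOGUES[element], text, re.IGNORECASE) is not None


print(whole_match("digit", "7"))                   # True
print(whole_match("digit", "42"))                  # False
print(whole_match("digits", "42"))                 # True
print(whole_match('chars from "ABCDEF"', "fab"))   # True
```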

Quantifiers and Elements

When you specify a singular pattern element, such as letter or digit, the element matches exactly one character of that type. A plural form, such as letters or digits, matches one or more characters of that type.

In addition, there are a number of quantifiers that you can use to specify how many times an element should appear to create a pattern match.

Note: The following descriptions use the term character for simplicity, but any element term can be used.

Terms that Mean Exactly One Character

character

a character

one character

exactly one character

Example:

set myPattern to < "(" then a character then ")" >

Matches Patterns Like: (w) ( ) (7) (.) ())

Doesn't Match: () (42) (salamander) otherStuff

Terms that Mean an Exact Number of Characters

2 characters

exactly 2 characters

The number 2 is shown in these examples, but any positive integer can be used. A variable whose value is an integer can also be used, but the word "exactly" must be used in this case.

Example:

set myPattern to < "(" then 2 characters then ")" >

Matches Patterns Like: (42) (CO) (())

Doesn't Match: () (w) ( ) (7) (.) ()) (salamander) otherStuff

Terms that Mean Zero or One Character

maybe character

maybe a character

maybe one character

zero or one character

zero or maybe one character

Example:

set myPattern to < "(" then maybe a character then ")" >

Matches Patterns Like: () (w) ( ) (7) (.)

Doesn't Match: (42) (salamander) otherStuff

Note: This pattern can also match ()) but will prefer to match just () unless it needs to match all three characters in some context.

Terms that Mean One or More Characters

characters

some characters

one or more characters

Example:

set myPattern to < "(" then characters then ")" >

Matches Patterns Like: (w) ( ) (7) (.) ()) (42) (salamander)

Doesn't Match: () otherStuff

In the example sentence under Lazy vs. Greedy Quantifiers below, this pattern matches two strings: (a woman) and (her cat)

Terms that Mean Zero or More Characters

maybe characters

maybe some characters

zero or more characters

Example:

set myPattern to < "(" then zero or more characters then ")" >

Matches Patterns Like: () (w) ( ) (7) (.) (42) (salamander)

Doesn't Match: otherStuff

Note: This pattern can also match ()) but will prefer to match just () unless it's necessary to match all three characters.

In the example sentence under Lazy vs. Greedy Quantifiers below, this pattern matches two strings: (a woman) and (her cat)

Terms that Specify a Minimum Number of Characters

at least 2 characters

2 or more characters

2 or maybe more characters

at least 2 or more characters

at least 2 and maybe more characters

The number 2 is shown in these examples, but any positive integer can be used. A variable whose value is an integer can also be used in place of a specific number.

Example:

set myPattern to < "(" then 2 or more characters then ")" >

Matches Patterns Like: (42) (salamander)

Doesn't Match: () (w) ( ) (7) (.) ()) otherStuff

In the example sentence under Lazy vs. Greedy Quantifiers below, this pattern matches two strings: (a woman) and (her cat)

Terms that Specify a Minimum and Maximum Number of Characters

2 to 4 characters

from 2 to 4 characters

A range of 2 to 4 is shown in these examples, but any range of positive integers can be used. A variable whose value is an integer can also be used in place of either or both numbers.

Example:

set myPattern to < "(" then 2 to 4 characters then ")" >

Matches Patterns Like: (42)

Doesn't Match: () (w) ( ) (7) (.) ()) (salamander) otherStuff
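The quantifier terms described so far can be summarized side by side. The sketch below pairs each term with an assumed lazy regex analogue in Python and checks it against a few of the sample strings used above. The dictionary LAZY_ANALOGUES and the helper matching_samples are hypothetical names, and the regexes are an approximation rather than SenseTalk's actual output.

```python
import re

# Hypothetical lazy regex analogues of the quantifier terms above,
# each wrapped in literal parentheses as in the examples.
LAZY_ANALOGUES = {
    "a character": r"\(.\)",
    "2 characters": r"\(.{2}\)",
    "maybe a character": r"\(.??\)",
    "characters": r"\(.+?\)",
    "zero or more characters": r"\(.*?\)",
    "2 or more characters": r"\(.{2,}?\)",
    "2 to 4 characters": r"\(.{2,4}?\)",
}

SAMPLES = ["()", "(w)", "(42)", "(salamander)", "otherStuff"]


def matching_samples(term):
    """Return the samples that the term's analogue matches in full."""
    regex = re.compile(LAZY_ANALOGUES[term])
    return [s for s in SAMPLES if regex.fullmatch(s)]


for term in LAZY_ANALOGUES:
    print(f"{term}: {matching_samples(term)}")
```

Whole-string matching is used here, so the lazy markers do not change which samples match; they only matter when a pattern is searched for inside longer text.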

Lazy vs. Greedy Quantifiers

When you create a pattern that can match a varying number of values, the match can be either lazy or greedy. A lazy subpattern matches as few occurrences as possible while still providing a match to the overall pattern. A greedy subpattern matches as many as possible.

In SenseTalk, pattern matches are lazy by default. So the pattern

< "(" , characters , ")" >

will match a sequence beginning with ( followed by one or more characters up to and including the first occurrence of ) that is encountered. The pattern

< "(" , lots of characters , ")" >

on the other hand, will match a sequence beginning with ( then greedily consume as many characters as it can, up to and including the final occurrence of ) in the source text.

To illustrate the difference, consider this example sentence:

Amy (a woman) and Flossie (her cat) lie down to take a nap.

And this pattern:

< "(" then one or more characters then ")" >

Now, assuming the example text is read into a variable, myText, the following code would return two values:

set myPattern to < "(" then one or more characters then ")" >

put every occurrence of myPattern in myText into myList

put myList

Output: ((a woman),(her cat))

It works this way because one or more characters in SenseTalk is lazy—it consumes the smallest number of characters needed to satisfy the pattern.

However, sometimes you might need to create a greedy match. The greedy behavior can be achieved in this example by changing the pattern to this:

< "(" then one or preferably more characters then ")" >

By specifying that you prefer to have the pattern match more characters if possible, the expression every occurrence of myPattern in myText now returns just one value that includes everything from the first ( to the last ):

(a woman) and Flossie (her cat)
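The same lazy and greedy distinction exists in ordinary regular expressions. As a point of comparison, the Python sketch below runs assumed regex analogues of the two patterns over the example sentence; the variable names are hypothetical and the regexes are not taken from SenseTalk itself.

```python
import re

my_text = "Amy (a woman) and Flossie (her cat) lie down to take a nap."

# Assumed lazy analogue of
# < "(" then one or more characters then ")" >
lazy_matches = re.findall(r"\(.+?\)", my_text)

# Assumed greedy analogue of
# < "(" then one or preferably more characters then ")" >
greedy_matches = re.findall(r"\(.+\)", my_text)

print(lazy_matches)    # ['(a woman)', '(her cat)']
print(greedy_matches)  # ['(a woman) and Flossie (her cat)']
```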

Terms that Mean Zero or One Character, But Prefer One

preferably character

preferably a character

preferably one character

zero or preferably one character

Example:

set myPattern to < "(" then preferably a character then ")" >

Matches Patterns Like: () (w) ( ) (7) (.) ())

Doesn't Match: (42) (salamander) otherStuff

Note: If the pattern encounters ()), it will greedily match all three characters unless it's in a context where the second ) is needed to satisfy a later part of the pattern, in which case it will match just ().

Terms that Mean One or More Characters, But Prefer as Many as Possible

lots of characters

preferably lots of characters

one or lots of characters

one or preferably lots of characters

one or preferably more characters

Example:

set myPattern to < "(" then lots of characters then ")" >

Matches Patterns Like: (w) ( ) (7) (.) ()) (42) (salamander)

Doesn't Match: () otherStuff

In the sentence example above, this pattern matches one string: (a woman) and Flossie (her cat)

Terms that Mean Zero or More Characters, But Prefer as Many as Possible

preferably characters

maybe lots of characters

zero or lots of characters

zero or preferably lots of characters

zero or preferably more characters

Example:

set myPattern to < "(" then zero or preferably more characters then ")" >

Matches Patterns Like: () (w) ( ) (7) (.) ()) (42) (salamander)

Doesn't Match: otherStuff

Note: If the search encounters ()), this pattern will greedily match all three characters unless it's in a context where the second ) is needed to satisfy a later part of the pattern, in which case it will match just ().

In the sentence example above, this pattern matches one string: (a woman) and Flossie (her cat)

Terms that Specify a Minimum Number of Characters, but Prefer as Many as Possible

at least 2 or lots of characters

2 or lots of characters

2 or preferably lots of characters

2 or preferably more characters

at least 2 and preferably lots of characters

at least 2 and preferably more characters

The number 2 is shown in these examples, but any positive integer can be used.

Example:

set myPattern to < "(" then 2 or preferably more characters then ")" >

Matches Patterns Like: (42) (salamander)

Doesn't Match: () (w) ( ) (7) (.) ()) otherStuff

In the sentence example above, this pattern matches one string: (a woman) and Flossie (her cat)

Terms that Specify a Minimum and Maximum Number of Characters, but Prefer as Many as Possible

2 to 4 characters greedily

from 2 to 4 characters greedily

A range of 2 to 4 is shown in these examples, but any range of positive integers can be used.

Example:

set myPattern to < "(" then from 2 to 4 characters greedily then ")" >

Matches Patterns Like: (42)

Doesn't Match: () (w) ( ) (7) (.) ()) (salamander) otherStuff

Explicitly Forcing Greedy or Lazy Behavior

You can use the terms greedily or lazily after any term that represents a variable number of characters to explicitly enforce greedy or lazy behavior. When used, these terms always take precedence, regardless of the usual behavior of the preceding term.

Here are a few examples to illustrate this behavior:

maybe a character greedily -- Zero or one, but prefers one

some characters greedily -- One or more, but prefers as many as possible

lots of characters greedily -- One or more, but prefers as many as possible (the usual behavior made explicit)

lots of characters lazily -- One or more, but prefers as few as possible (overriding the usual behavior)
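The difference between the lazy and greedy forms of an optional character shows up with the source text ()) discussed in the notes above. The Python sketch below uses assumed regex analogues and hypothetical variable names. It also shows a greedy element giving back a character when a later part of the pattern needs it.

```python
import re

source = "())"

# Assumed analogue of "maybe a character" (lazy): prefers zero characters.
lazy = re.search(r"\(.??\)", source).group()

# Assumed analogue of "maybe a character greedily": prefers one character.
greedy = re.search(r"\(.?\)", source).group()

print(lazy)    # ()
print(greedy)  # ())

# When a later element needs the final ")", the greedy form gives it back,
# so the parenthesized part captures only "()".
needs_trailing = re.fullmatch(r"(\(.?\))\)", source)
print(needs_trailing.group(1))  # ()
```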

Greedy Synonyms

All of these terms can be used interchangeably when specifying a greedy quantifier:

lots

lots of

many

max

maximum

the maximum number of

the most

Case Sensitivity in Patterns

Text comparisons in SenseTalk are typically not case sensitive, as explained in Expressions. This behavior also applies to pattern matches. Therefore, when your pattern looks for letters, it matches both capital and lowercase letters. You can change the default behavior to require case sensitivity by using the caseSensitive local property.

Most comparison operators also allow case sensitivity to be specified directly as an option of the operator, which overrides the caseSensitive property setting. These options apply to pattern comparisons as well as ordinary text comparisons, as shown in the following example:

put <3 chars from "ABCDEF"> matches "fab" --> True

put <3 chars from "ABCDEF"> matches "fab" case sensitive --> False
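For readers who know regular expressions, the two comparisons above behave much like a regex run with and without an ignore-case flag. The Python sketch below is an assumed analogue of <3 chars from "ABCDEF">; the function name matches_set and its case_sensitive parameter are hypothetical.

```python
import re


def matches_set(text, case_sensitive=False):
    """Assumed analogue of <3 chars from "ABCDEF"> matched against text."""
    # Case is ignored by default, mirroring SenseTalk's default behavior.
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(r"[ABCDEF]{3}", text, flags) is not None


print(matches_set("fab"))                       # True
print(matches_set("fab", case_sensitive=True))  # False
```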

Case-Sensitivity Settings Within a Pattern

You can incorporate case-sensitivity settings directly into a pattern definition. These in-line settings always take precedence over the default settings or options specified in the command. You can include these options within a pattern for fine-grained control over case-sensitivity in different parts of the pattern.

Case-Sensitivity Elements Syntax:

case sensitive -- All later elements of the pattern must be case sensitive

case insensitive -- All later elements of the pattern won't be case sensitive

case sensitive: element -- The specified element must be case sensitive

case insensitive: element -- The specified element won't be case sensitive

case sensitive: ( subPattern ) -- The specified subpattern must be case sensitive

case insensitive: ( subPattern ) -- The specified subpattern won't be case sensitive

Note: Wherever case sensitive is shown, any of its synonyms can be used: case-sensitive or caseSensitive or considering case or with case.

Wherever case insensitive is shown, any of its synonyms can be used: case-insensitive or caseInsensitive or ignoring case or without case.

Example:

put "abc" matches <case sensitive, 3 chars from "DCBA"> --> False

Example:

set partNum to <"ABC", digit, case-sensitive: char in "JQXZ", digit>

put partNum matches "Abc9Q2" --> True ("ABC" is not case sensitive)

put partNum matches "ABC7x3" --> False ("x" is not capitalized)

Example:

set code to <with case, character of "ABC", ignoring case, character of "XYZ">

put "aZ" matches code --> False ("a" must be uppercase to match)

put "Bx" matches code --> True ("x" can be upper- or lower-case)

Note: The case-sensitivity indicator, when present, is effectively not part of the pattern definition itself, but rather affects the entire definition to which it applies. Therefore, specifying a case-sensitivity value doesn't consume characters from the source text for making matches of the pattern.
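In-line case settings resemble scoped flags in regular expressions, which also change how later text is compared without consuming any characters themselves. The Python sketch below gives assumed regex analogues of the partNum and code patterns from the examples above. The names PART_NUM and CODE are hypothetical, and the scoped-flag syntax requires Python 3.6 or later.

```python
import re

# Assumed analogue of
# <"ABC", digit, case-sensitive: char in "JQXZ", digit>
# The whole regex ignores case except the one scoped group.
PART_NUM = re.compile(r"ABC[0-9](?-i:[JQXZ])[0-9]", re.IGNORECASE)

# Assumed analogue of
# <with case, character of "ABC", ignoring case, character of "XYZ">
# The regex is case sensitive except the scoped second character.
CODE = re.compile(r"[ABC](?i:[XYZ])")

print(PART_NUM.fullmatch("Abc9Q2") is not None)  # True
print(PART_NUM.fullmatch("ABC7x3") is not None)  # False
print(CODE.fullmatch("aZ") is not None)          # False
print(CODE.fullmatch("Bx") is not None)          # True
```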

 

This topic was last updated on August 19, 2021, at 03:30:51 PM.
