Using Capture Groups with Pattern Definitions

When you define a pattern with SenseTalk's pattern language, the pattern definition can be created from multiple elements, or subpatterns, as described in Elements of the SenseTalk Pattern Language. You use curly braces around a subpattern to mark it as a capture group.

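Capture groups serve three useful purposes:

  • Capturing information about specific parts of a match
  • Referring to a previous subpattern match later in the same pattern
  • Manipulating text based on pattern matches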

You can name each capture group within a pattern by using the syntax:

{ name : subPattern }

You can omit the name for a capture group, in which case a default name is assigned, such as Group_1, Group_2, and so forth.

Important: Because the pattern as a whole is always reported in the match property list with the property name text, you should not use text as a capture group name.
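As a minimal sketch of an unnamed capture group, the following wraps a subpattern in curly braces without giving it a name. The variable name parenPattern and the sample string are illustrative only. Displaying the whole match property list is a simple way to see which default name (such as Group_1) was assigned.

```sensetalk
// The curly braces mark a capture group; no name is given,
// so a default group name is assigned.
set parenPattern to <"(" then {some characters} then ")">

// Display the full match property list to inspect the default group name.
put the match of parenPattern in "John Jacob (Jingleheimer) Smith"
```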

Capturing Information

Using capture groups lets you capture additional information about pattern matches. A capture group returns the specific matched text of the capture group and the range of the match as part of the match property list when an overall pattern match is made.

When you use the match() and everyMatch() functions as well as match chunk expressions, SenseTalk returns a property list that includes the complete text that was matched and the range within the source where the text was matched. When you use capture groups, the match property list includes two additional properties for each capture group that was matched:

  • name: A property whose name is the name of the capture group, and whose value is the text that the group matched.
  • name_range: A property with _range appended to the group name, and a value that is the range of characters where the group was matched.

For complete information about these functions, see Match(), EveryMatch() Functions.

The following example shows a pattern definition with no capture groups:

set parenPattern to <"(" then some characters then ")">
put the match of parenPattern in "John Jacob (Jingleheimer) Smith"

In this case, parenPattern is a simple pattern that matches text enclosed in parentheses. The resulting match is:

{text:"(Jingleheimer)", text_range:"12" to "25"}

The same pattern could be defined by using a capture group:

set parenPattern to <"(" then {content: some characters} then ")">
put the match of parenPattern in "John Jacob (Jingleheimer) Smith"

Here, the some characters portion of the pattern has been enclosed in curly braces and labeled with the group name content. The resulting match is now:

{content:"Jingleheimer", content_range:"13" to "24", text:"(Jingleheimer)", text_range:"12" to "25"}

The text and text_range properties are still present with information about the overall match, but the content and content_range properties have been added with information about the capture group, that is, the text matched between the parentheses.
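Because the match is an ordinary property list, the captured values can be read individually. The following sketch stores the match in a variable (the name parenMatch is an assumption for illustration) and reads the capture group's properties with dot notation:

```sensetalk
set parenPattern to <"(" then {content: some characters} then ")">

// Keep the match property list in a variable so its properties can be read.
set parenMatch to the match of parenPattern in "John Jacob (Jingleheimer) Smith"

put parenMatch.content        // the text matched by the capture group
put parenMatch.content_range  // where the group matched in the source
```

Given the match shown above, the first line displayed should be Jingleheimer.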

A pattern can include multiple capture groups to capture different parts of the matched pattern. Typically, each capture group should be given a different name. However, if your pattern definition uses alternative options based on or conditions, you might want to use the same name for a capture group on both sides of that condition, where only one or the other can be returned.

As an example, the following code defines a pattern for a telephone area code that can appear enclosed in parentheses or not:

set areaCode to <({ac:3 digits} followed by non-digit) or ("(" then {ac:3 digits} then ")" then maybe space)>

In this example, the capture group ac appears twice, once on each side of the or operator.

Note: Some capture groups might not participate as part of the overall match if they belong to an or alternative that was not used. A capture group that isn't used as part of the overall match isn't reported in the match property list.
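As a sketch of how the shared group name behaves, the following applies the areaCode pattern to two sample phone numbers, one for each alternative. The phone numbers and the variable names plainMatch and parenMatch are made up for illustration. In both cases, the captured digits should be reported under the same ac property, regardless of which side of the or matched.

```sensetalk
set areaCode to <({ac:3 digits} followed by non-digit) or ("(" then {ac:3 digits} then ")" then maybe space)>

// Sample input without parentheses: expected to use the first alternative.
set plainMatch to the match of areaCode in "303-555-1234"

// Sample input with parentheses: expected to use the second alternative.
set parenMatch to the match of areaCode in "(303) 555-1234"

put plainMatch.ac
put parenMatch.ac
```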

Referring to a Previous Subpattern Match

The text matched by the subpattern in a capture group can be used in a back-reference later in the overall pattern to match the same text again. A back-reference takes the form:

{: name }

where name is the name of the previous capture group.

For example, this pattern can be used to find words that appear twice in a row in some text:

set doubleWord to <word break, {word: word chars}, nonword chars, {:word}, word break>
put every instance of doubleWord in "Come to to Paris in the the spring"

This example displays two matches from the source string:

["to to","the the"]

The first part of the pattern captures a word (word characters after a word break and up until nonword characters) and assigns it the group name word. The pattern then uses the back-reference {:word} to refer to the captured value.
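To see just the repeated word rather than the full matched text, the capture group can be read from each match. This sketch assumes that every match of returns a list of match property lists (the chunk-expression counterpart of everyMatch()). It uses the hypothetical group name repeated instead of word so that the property name is not confused with the word chunk type, and the variable names allDoubles and oneMatch are also illustrative.

```sensetalk
// Same idea as doubleWord above, with a different (hypothetical) group name.
set doubleWord to <word break, {repeated: word chars}, nonword chars, {:repeated}, word break>

// Assumption: "every match of" yields one property list per match.
set allDoubles to every match of doubleWord in "Come to to Paris in the the spring"

repeat with each item oneMatch of allDoubles
	put oneMatch.repeated  // the captured word for this match
end repeat
```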

Another situation where a back-reference is especially useful is in finding matching pairs of tags in HTML or XML, such as <body> and </body>, where there might be many other <tag> entries in between. Here's one way to write such a pattern using a capture group and a back-reference:

set tagPattern to < "<", {tag: chars}, ">", chars, "</", {:tag}, ">" >
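A short usage sketch for this pattern, with a made-up one-element source string and an illustrative variable name tagMatch. With this input, the tag property of the match is expected to hold p, since that is the only captured text that can be matched again before the closing ">".

```sensetalk
set tagPattern to < "<", {tag: chars}, ">", chars, "</", {:tag}, ">" >

// Sample source containing a single matching pair of tags.
set tagMatch to the match of tagPattern in "<p>Hello</p>"

put tagMatch.tag  // the tag name captured by the group
put tagMatch      // the full match property list
```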

Back-References without Names

You can use back-references without assigning names to each capture group. To do this, use { groupNum } as the back-reference, where groupNum is the sequential number of the capture group. So {1} refers to the value of the first capture group in the pattern, {2} to the second group, and so on.

For example, the double-word example above could be defined without naming the capture group like this:

set doubleWord to <word break, {word chars}, nonword chars, {1}, word break>

Text Manipulation Based on Pattern Matches

You can use the value matched by a capture group in a replacement string with the Replace command. The approach is similar to a back-reference within a pattern. That is, you use the reference to the previous capture group, {:name}, within the replacement string. You can use this technique to transform pattern matches in any number of ways, either replacing or rearranging text from the source.

For example, consider a list of names in the form LastName, FirstName, like this:

set nameList to {{
Disney, Walter Elias
Earhart, Amelia
Einstein, Albert
Tolkien, J.R.R.
}}

The task is to reverse the order of the last name and first name on each line, while removing the comma that separates them. First, you need a pattern that matches each name, capturing the relevant parts:

set namePattern to < line start, {lastname: chars}, ", ", {firstname: chars}, line end >

You can then use references to the named capture groups within the replacement string of a Replace command to perform the transformation:

replace namePattern in nameList with "{:firstname} {:lastname}"
put nameList

The result is the transformed list:

Walter Elias Disney
Amelia Earhart
Albert Einstein
J.R.R. Tolkien
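The same technique can rearrange more than two pieces. As a further sketch, the following rewrites dates from a month/day/year layout into a year-month-day layout. The pattern, the group names mm, dd, and yyyy, the variable names, and the sample text are all assumptions made for illustration, and the sketch relies on the Replace command replacing each match in the source, as in the name example above.

```sensetalk
// Capture the three numeric parts of a date such as 03/15/2024.
set datePattern to < {mm: 2 digits}, "/", {dd: 2 digits}, "/", {yyyy: 4 digits} >

set shipNote to "Shipped 03/15/2024, delivered 03/18/2024"

// Reorder the captured parts in the replacement string.
replace datePattern in shipNote with "{:yyyy}-{:mm}-{:dd}"
put shipNote
```

With this input, the text displayed should read Shipped 2024-03-15, delivered 2024-03-18.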