Using Capture Groups with Pattern Definitions
When you define a pattern with SenseTalk's pattern language, the pattern definition can be created from multiple elements, or subpatterns, as described in Elements of the SenseTalk Pattern Language. You use curly braces around a subpattern to mark it as a capture group.
Capture groups serve three useful purposes:
- Capturing Information
- Referring to a Previous Subpattern Match
- Text Manipulation Based on Pattern Matches
You can name each capture group within a pattern by using the syntax:
{ name : subPattern }
You can omit the name for a capture group, in which case a default name is assigned, such as Group_1, Group_2, and so forth.
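As a minimal sketch of the unnamed form, the following lines wrap a subpattern in curly braces without a name. The variable name unnamedPattern and the sample string are illustrative only. Based on the rule above, the captured text would be expected to appear under a default name such as Group_1 rather than a name you chose.

```sensetalk
-- Unnamed capture group: no "name :" prefix inside the curly braces
set unnamedPattern to <"(" then {some characters} then ")">
put the match of unnamedPattern in "John Jacob (Jingleheimer) Smith"
```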
Important: Because the pattern as a whole is always reported in the match property list with the property name text, you should not use text as a capture group name.
Capturing Information
Using capture groups lets you capture additional information about pattern matches. A capture group returns the specific matched text of the capture group and the range of the match as part of the match property list when an overall pattern match is made.
When you use the match() and everyMatch() functions as well as match chunk expressions, SenseTalk returns a property list that includes the complete text that was matched and the range within the source where the text was matched. When you use capture groups, the match property list includes two additional properties for each capture group that was matched:
- name: This property has the actual name of the capture group, and the value is the text that was matched.
- name_range: A property with _range appended to the group name, and a value that is the range of characters where the group was matched.
For complete information about these functions, see Match(), EveryMatch() Functions.
The following example shows a pattern definition with no capture groups:
set parenPattern to <"(" then some characters then ")">
put the match of parenPattern in "John Jacob (Jingleheimer) Smith"
In this case, parenPattern is a simple pattern that matches text enclosed in parentheses. The resulting match is:
{text:"(Jingleheimer)", text_range:"12" to "25"}
The same pattern could be defined by using a capture group:
set parenPattern to <"(" then {content: some characters} then ")">
put the match of parenPattern in "John Jacob (Jingleheimer) Smith"
Here, the some characters portion of the pattern has been enclosed in curly braces and labeled with the group name content. The resulting match is now:
{content:"Jingleheimer", content_range:"13" to "24", text:"(Jingleheimer)", text_range:"12" to "25"}
The text and text_range properties are still present with information about the overall match, but the content and content_range properties have been added with information about the capture group, that is, the text matched between the parentheses.
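Because the match is returned as a property list, the values for a capture group can be read by property name. The following is a small sketch, assuming standard SenseTalk dot notation for property access. The variable name nameMatch is hypothetical.

```sensetalk
set parenPattern to <"(" then {content: some characters} then ")">
set nameMatch to the match of parenPattern in "John Jacob (Jingleheimer) Smith"
-- Text captured by the content group (Jingleheimer in the match shown above)
put nameMatch.content
-- Range of the captured text within the source string
put nameMatch.content_range
```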
A pattern can include multiple capture groups to capture different parts of the matched pattern. Typically, each capture group should be given a different name. However, if your pattern definition uses alternative options based on or conditions, you might want to use the same name for a capture group on both sides of that condition, where only one or the other can be returned.
As an example, the following code defines a pattern for a telephone area code that can appear enclosed in parentheses or not:
set areaCode to <({ac:3 digits} followed by non-digit) or ("(" then {ac:3 digits} then ")" then maybe space)>
In this example, the capture group ac appears twice, once on each side of the or operator.
Some capture groups might not participate as part of the overall match if they belong to an or alternative that was not used. A capture group that isn't used as part of the overall match isn't reported in the match property list.
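To see this behavior, the areaCode pattern can be tried against one input of each form. This is a sketch only: the sample phone numbers are made up for illustration. In each case, only the alternative that actually matched should contribute an ac property, and its value is expected to be the three digits of the area code.

```sensetalk
set areaCode to <({ac:3 digits} followed by non-digit) or ("(" then {ac:3 digits} then ")" then maybe space)>
-- Parenthesized form: expected to match through the second alternative
put the match of areaCode in "(303) 555-1234"
-- Bare form: expected to match through the first alternative
put the match of areaCode in "303-555-1234"
```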