Chunk Types
Chunk expressions let you work with all of these chunk types:
Type | Definition |
---|---|
characters | individual characters within text |
words | words separated by any amount of white space (spaces, tabs, returns) within text |
lines | paragraphs separated by any of several standard line endings (CR, LF, CRLF, etc.) |
text items | portions of text separated by commas |
list items | the individual items in a list |
bytes | the bytes within binary data |
occurrences | the text matches of a defined pattern |
matches | the text matches and text range of a defined pattern and its capture groups |
In addition, you can specify custom delimiters to be used in identifying text items, lines, and words, giving even greater functionality. These three text chunk types each have distinctive types of delimiters: text items are delimited by a single text string, lines are delimited by any of a list of text strings, and words are delimited by any number and combination of characters from a set of characters.
Characters
The simplest type of chunk is the character chunk. A character is one character of text, whether visible or invisible (invisible characters include control characters such as tab, carriage return, and linefeed). The word character may be abbreviated as char.
put "The quick brown fox" into animal
put character 1 of animal --> T
put the last char of animal --> x
put chars 3 to 7 of animal --> e qui
Words
A single word is defined as a sequence of characters not containing any whitespace characters, or a sequence of characters contained in quotation marks. A range of words includes all characters from the first word specified through the last word specified, including all intervening words and whitespace. Whitespace characters are spaces, tabs, and returns (newlines).
put "Sometimes you feel like a nut; sometimes you don’t." into slogan
put the second word of slogan --> you
put word 6 of slogan --> nut;
put words 1 to 3 of slogan --> Sometimes you feel
Note that quoted phrases are ordinarily treated as a single word, including the quotation marks:
put <<Mary said "Good day" to John.>> into sentence
put the third word of sentence --> "Good day"
Related Local and Global Properties
SenseTalk includes local and global properties you can use to govern aspects of working with words in chunks. The set of characters that are used to identify words can be changed to something other than Space, Tab, and Return by setting the wordDelimiter local property or the defaultWordDelimiter global property. The quote characters used to identify a quoted word (or whether word quoting should be disabled completely) can be specified with the wordQuotes local property or the defaultWordQuotes global property.
These properties are defined on Local and Global Properties for Chunk Expressions:
the wordDelimiter, the defaultWordDelimiter
the wordQuotes, the defaultWordQuotes
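As a minimal sketch of the local property in use, the following assumes that the wordDelimiter accepts a string of delimiter characters, in the same way as the delimited by form of a word chunk. The variable name partNumber and its value are hypothetical.

```sensetalk
-- Assumption: the wordDelimiter takes a string of delimiter characters
set the wordDelimiter to "-_"
put "AX-_-200_rev3" into partNumber -- hypothetical sample value
put word 2 of partNumber --> 200
put word 3 of partNumber --> rev3
```

Because any number and combination of the delimiter characters counts as a single word break, the run "-_-" separates only two words.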
Lines
A line chunk expression allows you to specify one or more lines or paragraphs of text within the subject text, where lines are initially defined as the characters between any of the standard line ending characters.
put "line 1" & return & "line 2" & return & "line 3" into text
put the second line of text --> line 2
put line 6 of text --> ""
put lines 2 to 3 of text
--> line 2
--> line 3
Related Local and Global Properties
SenseTalk includes two properties you can use to govern aspects of working with lines in chunks. The set of line endings (delimiter strings) that define what a line is can be changed to something other than the default by setting the lineDelimiter local property. Setting the lineDelimiter to empty causes it to return to the default list.
The defaultLineDelimiter global property defines the default set of line delimiters. This property is initially set to: CRLF, Return, CarriageReturn, LineSeparator, ParagraphSeparator.
These properties are defined on Local and Global Properties for Chunk Expressions:
the lineDelimiter, the defaultLineDelimiter
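The following sketch shows one way the lineDelimiter might be changed and then restored. It assumes the property accepts a list of delimiter strings, as the description above suggests. The variable name sample and its value are hypothetical.

```sensetalk
put "alpha;beta|gamma" into sample -- hypothetical sample value
-- Assumption: the lineDelimiter accepts a list of delimiter strings
set the lineDelimiter to [";", "|"] -- either string now ends a line
put line 2 of sample --> beta
put line 3 of sample --> gamma
set the lineDelimiter to empty -- return to the default list of line endings
```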
Text Items
An item within text is usually defined as the portion of text between commas:
put "A man, a plan, a canal. Panama!" into palindrome
put item 2 of palindrome --> " a plan"
The separation (delimiter) character can be specified as something other than a comma by setting the itemDelimiter property. The default value of the itemDelimiter is determined by the defaultItemDelimiter global property. These two properties are defined on Local and Global Properties for Chunk Expressions:
the itemDelimiter, the defaultItemDelimiter
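As a brief illustration, this sketch sets the itemDelimiter to a semicolon before reading text items. The variable name colorText and its value are hypothetical.

```sensetalk
put "red;green;blue" into colorText -- hypothetical sample value
set the itemDelimiter to ";" -- text items are now separated by semicolons
put item 2 of colorText --> green
put item 3 of colorText --> blue
```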
List Items
The word items can also refer to the elements in a list.
put ["red", "green", "blue"] into colors
put item 2 of colors --> green
SenseTalk decides whether item refers to text items or list items depending on whether the value is a list. When the value is a list, SenseTalk assumes the reference is to list items; otherwise it assumes text items. However, if the itemDelimiter is set to "" (empty), items refers to list items rather than text items. You may explicitly refer to list items or text items instead of the more generic items if you need to control the way items are treated. This is especially important if you are trying to create a list by putting values into individual items, like this:
put 1 into myText -- 1
put 2 into item 2 of myText
put myText --> "1,2"
The code above will generate a text string, with the middle character being the itemDelimiter (unless the itemDelimiter has been set to empty). To generate a list instead of text, specify list item:
put 1 into myList -- 1
put 2 into list item 2 of myList
put myList --> [1,2]
See Lists and Property Lists for more information on working with lists.
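The explicit forms can also be used when reading values. This sketch uses text item and list item so that the intent does not depend on the current itemDelimiter setting. The variable names and values are hypothetical.

```sensetalk
put "red,green,blue" into colorText -- plain text containing commas
put ["red", "green", "blue"] into colorList -- a list
put text item 2 of colorText --> green
put list item 2 of colorList --> green
```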
Bytes
A byte chunk can be used to refer to a portion of binary data.
set the defaultDataFormat to "auto"
put <3f924618> into binaryData
put byte 2 of binaryData --> <92>
See Binary Data Manipulation for more information on byte chunks.
Occurrences
The words occurrence and occurrences let you access pattern matches as chunks of a string and return the matched text. Use occurrence to access a specific single occurrence of a pattern within a string, and use occurrences to return a list of occurrences.
You can use instance as a synonym for occurrence, and instances in place of occurrences, in all cases.
Example:
put occurrence 4 of <digit> in "V2.7 for 4/3/18" --> 3
Example:
set proverb to "If wishes were horses, beggars would ride"
set wordEndingWithS to <start of word, word chars, word ending with "s">
put occurrence 2 of wordEndingWithS in proverb --> horses
Requesting a range of occurrences returns a list of values rather than a substring of the source string.
Example:
This example uses the set commands from the previous example:
put instances 1 to 3 of wordEndingWithS in proverb --> [wishes,horses,beggars]
Example:
put instances 3 to 5 of <digit> in "V2.7 for 4/3/18" --> [4,3,1]
put the first 3 occurrences of <max digits> in "42-16gh9-88" --> [42,16,9]
For information about using patterns, see SenseTalk Pattern Language Basics.
Matches
The matches keyword lets you access pattern matches as chunks of a string. The value returned is a match property list that contains the full text of the match as the text property and the range of the match as the text_range property.
Requesting a range by using matches returns a list of property lists, one for each match of the pattern.
put the second match of <3 digits> in "987654321" --> {text:"654", text_range:"4" to "6"}
put the last 2 matches of <max digits> in "42-16gh9-88" --> [{text:"9", text_range:"8" to "8"},{text:"88", text_range:"10" to "11"}]
For information about using patterns, see SenseTalk Pattern Language Basics.
Custom Chunks
The standard word, line, and text item chunks are useful for many things just as they are. Sometimes, however, you may have text in specific formats that you would like to divide in other ways. For example, many programs can produce data files containing several values separated by tab characters on each line of the file.
One way to work with such data would be to set the itemDelimiter to tab and then access the items of each line. But suppose that each tab-separated item contains several values separated by commas. To access these values individually would require switching the itemDelimiter back and forth between tab and comma.
SenseTalk offers an easier alternative for such cases: specify the delimiter to be used as part of each chunk, using the phrase delimited by:
add 1 to item 3 delimited by "," of item 5 delimited by tab \
of line 18 of file complexDataFile
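The same nesting can be tried on an in-memory value rather than a file. In this sketch the variable name dataLine and its contents are hypothetical. It reads a comma-separated value from inside a tab-separated item without changing the itemDelimiter.

```sensetalk
put "a,b,c" & tab & "d,e,f" into dataLine -- hypothetical sample value
put item 2 delimited by tab of dataLine --> d,e,f
put item 3 delimited by "," of item 2 delimited by tab of dataLine --> f
```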
The same syntax may be used with line chunks if you like:
get line 6 delimited by creturn of oddLineBreakText
The delimiters used to separate text items and lines are not restricted to a single character:
put item 2 delimited by "<>" of "12<>A19<>X" --> A19
Custom delimiters are also allowed with word chunks, but the behavior differs from that of items and lines. Words are normally separated by spaces, tabs, and line breaks. Any number of these "whitespace" characters may appear in sequence between two words. If you specify a custom delimiter for a word chunk, the "words" will be delimited by any number and combination of the characters contained in the delimiter string you supply:
put word 2 delimited by "<>" of "12><<>>A19><>X" --> A19
The following example may help to illustrate the difference between the use of custom delimiters for line chunks (which treat each delimiter string found as a separate chunk) and for word chunks (which treat each sequence of delimiter characters as a single word break):
put each line delimited by ["<",">"] of "12><<>>A19><>X" --> ["12","","","","","A19","","","X"]
put each word delimited by "<>" of "12><<>>A19><>X" --> ["12","A19","X"]