Lesson #5: Using RegexTaggers

This lesson uses Objective-C in order that we might work with NSRange objects as opposed to Range<String.Index> objects

Regular expressions refer to patterns that can be identified in strings. For example, emails often have a pattern __@___.__, where the underscores represent different combinations of alphanumeric characters. You can use such patterns to validate input or help you in data mining. If you want to learn more about regular expression, click here

This lesson is provided mainly for instrumental purposes, as regular expressions are not used to a large extent in the current project. However, they can be a useful tool whose availability should be known to any programmer doing computational linguistics. Below is the text of a short story stored in an NSStringobject named story

		
			 NSString* story = @"You should hear this: A long time ago, there was this bird. This bird could swim, but it couldn't fly.  But it wanted to be like the other birds.  What would you've done? I may've tried to make friends with some fish, but that would've been weird.  How might you've advised the bird?";
   
    RegexParser* parser = [[RegexParser alloc] initWith:story];
    
    [parser getModalsInText:^(NSString * _Nonnull modal, NSRange range) {
        NSLog(@"A modal (%@) occurred in this story at location %d in the text",modal,(int)range.location);
    }];

A little side note before explaining the code above: in English, a modal refers to words such as "should, could, would, may, and might," which be used to express politeness, uncertainty, probability, or advice, depending on the context. Therefore, knowing where modals are, especially if they are part of a verb phrase, can provide helpful informaton about text. Therefore, we want to be able to find words and tag them as "modals" in any given text.

In addition to their normal present tense form, modals can also be contracted, and these contractions can have negative inflections (e.g. shouldn't, wouldn't, couldn't) and past tense inflections (e.g. should've, could've, might've). Overall, though, they have an easily identifiable pattern that makes them easy to identify using regular expressions, which is a way of finding such patterns in string data.

We instantiate RegexParser object,w hich we will define shortly. THen, we call a method getModalsInText on the RegexParser object, passing in a completion handler that takes a string for the modal word identified by the parser along with its range.

Here is the output generated by the code above:

		
			2019-09-15 06:28:49.643850-0400 RegexPractice[2303:33094] A modal (should) occurred in this story with at location 4 in the text
2019-09-15 06:28:49.644071-0400 RegexPractice[2303:33094] A modal (could) occurred in this story with at location 70 in the text
2019-09-15 06:28:49.644216-0400 RegexPractice[2303:33094] A modal (couldn't) occurred in this story with at location 89 in the text
2019-09-15 06:28:49.644317-0400 RegexPractice[2303:33094] A modal (would) occurred in this story with at location 153 in the text
2019-09-15 06:28:49.644402-0400 RegexPractice[2303:33094] A modal (may've) occurred in this story with at location 174 in the text
2019-09-15 06:28:49.644483-0400 RegexPractice[2303:33094] A modal (would've) occurred in this story with at location 228 in the text
2019-09-15 06:28:49.644602-0400 RegexPractice[2303:33094] A modal (might) occurred in this story with at location 254 in the text

In addition to the code above, which instantiates a RegexParser with an input text, we want to be able to call static methods on the RegexParser class that will be able to tag words in a set of word tokens, as follows:

	
			  [RegexParser getModalsInTokenArray:@[@"You",@"should",@"eat",@"more",@"vegetables",@"and",@"you",@"could",@"use",@"some",@"rest"] withCompletionHandler:^(NSString * _Nonnull modal, int index) {

        NSLog(@"The modal word - %@ - occurred in the token array at index %d",modal,index);
    }];

The output generated by the code above is shown below:

		
			2019-09-15 06:26:56.393425-0400 RegexPractice[2144:30672] The modal word - should - occurred in the token array at index 1
2019-09-15 06:26:56.393844-0400 RegexPractice[2144:30672] The modal word - could - occurred in the token array at index 7

In order to define our RegexParser class, we will define data model RangeWord, which will store a word token along with its location in a given text (hence the name "RangeObect," where "Range" refers to the NSRange property that is used to store the location of the word). Here is the header file for the class RangeWord:

		
			
@interface RangeWord : NSObject

@property (nonatomic,strong)NSString*wordToken;
@property NSRange wordRange;

-(id)initWith:(NSString*)wordToken andWithRange: (NSRange)rangeInText;

@end

The implementation file for the RangeWord class is given below. It's pretty self-explanatory - just an initializer that takes an NSString representing the word token and an NSString representing the location of the object. RangeWord objects are intended to store data and not have any associated functionality.

		

#import "RangeWord.h"

@implementation RangeWord

-(id)initWith:(NSString*)wordToken andWithRange: (NSRange)rangeInText{
    self = [super init];
    
    if(self){
        
        _wordToken = wordToken;
        _wordRange = rangeInText;
    }
    
    return self;
}
@end

Now on to the RegexParser. We define an initializer that takes an NSString as input. This will be the text that we will be analyzing for the presence of modals or other regular expression patterns. In addition, we define an instance method getModalsInText, which you saw in the code above. This will pass all identified modal words to a completion handler along with their respective locations in the text being analyzed. In addition, we define a static method getModalsInText that can be used to identify modal words in an array of word tokens. This static method has completion handler that only provides identified modal along with its index in an array of word tokens.


	
@interface RegexParser : NSObject

@property (nonatomic,strong) NSString* text;

-(id) initWith:(NSString*)text;

-(void)getModalsInText:(void(^)(NSString*,NSRange))completion;

+(void)getModalsInTokenArray:(NSArray*)tokenArray withCompletionHandler:(void(^)(NSString*,int))completion;

@end

In order to implement the class, we need to import RangeWord.h along with Apple's NaturalLanguage framework in our implementation file.


	#import "RegexParser.h"
#import <NaturalLanguage/NaturalLanguage.h>
#import "RangeWord.h"

We define an extension in our implementation file as well:


	
@interface RegexParser ()

@property (readonly)NLTokenizer* tokenizer;
@property (readonly) NSRange fullTextRange;
@property (readwrite) bool hasBeenTokenized;
@property (nonatomic,strong) NSMutableArray* rangeWords;

@end

In the extension, you'll notice that we have a property for an NLTokenizer instance, which will be used to tokenize the input text. The fullTextRange property is simply a computed property that provides the entire range of the input text. This is a useful property used in instantiating the NLTokenizer. Furthermore, the bool property hasBeenTokenized is simply a flag used to make sure we only instantiate the tokenizer one time. Finally, rangeWords is an array of words tokens with associated NSRange data that we will get after tokenizing the text. The regex objects in Objective-C provide such range objects when identifying patterns in text. Therefore, we want to be able to determine use range objects provided by the built-in regex expressions to determine where exactly in the text our modal words occur.

Here is the implementation for our RegexParser:

	
		
@implementation RegexParser

@end

In the implementation section, we define a static constant for the regex pattern that is used to identify all forms of modals, including negative contracted forms and past tense contracted forms.

	
		static NSString*const modalRegexPattern = @"\\b((sh|c|w)ould((n't)|('ve))*)|(may('ve)*)|(might('ve)*)\\b";

NLTokenizer* _tokenizer;

Here is our initializer:

	
		

-(id) initWith:(NSString*)text{
    
    self = [super init];
    
    if(self){
        _text = text;
        _rangeWords = [[NSMutableArray alloc] init];
    }
    
    return self;
}

	
		
- (NSMutableArray<RangeWord *> *)rangeWords{
    return _rangeWords;
}

/** Computed property that returns the range for the user-provided text sample **/
- (NSRange)fullTextRange{
    return NSMakeRange(0, _text.length);
}

/** An instance of NL tokenizer that breaks up a user-provided text into words **/

- (NLTokenizer *)tokenizer{
    if(_tokenizer){
        return _tokenizer;
    } else {
        _tokenizer = [[NLTokenizer alloc] initWithUnit:NLTokenUnitWord];
        [_tokenizer setString:_text];
        return _tokenizer;
    }
    
}

	
		
/** Helper function that tokenizes a text.  This allows for indexing of the words in a text, which allows for calling functions to associate word index with other data, such as linguistic part-of-speech tags **/

-(void) tokenize{
    
    
    [self.tokenizer enumerateTokensInRange:self.fullTextRange usingBlock:^(NSRange tokenRange, NLTokenizerAttributes flags, BOOL * _Nonnull stop) {
            
            NSString* wordToken = [self->_text substringWithRange:tokenRange];
        
        
            RangeWord* newRangeWord = [[RangeWord alloc] initWith:wordToken andWithRange:tokenRange];
        
            [self->_rangeWords addObject:newRangeWord];
        }];
        
    
    
    _hasBeenTokenized = YES;
    
}


	-(void)findTokensWith:(NSString*)pattern withCompletionHandler:(void(^)(NSString*,NSRange))completion{
    
    NSRegularExpression* regexExpression = [[NSRegularExpression alloc] initWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:nil];
    
    NSArray<NSTextCheckingResult*>* results = [regexExpression matchesInString:_text options:NSMatchingReportCompletion range:NSMakeRange(0, [_text length])];
    
    
    for(NSTextCheckingResult*result in results){
        
        NSRange matchedRange = [result range];
        
        NSString* wordToken = [_text substringWithRange:matchedRange];
        
        completion(wordToken,matchedRange);
        
        
    }
}


	/* This method will tokenize the string and then pass into a closure any matched modals along their corresponding range*/
- (void)getModalsInText:(void(^)(NSString*,NSRange))completion{
    

    if(_hasBeenTokenized){
    
        [self findTokensWith:modalRegexPattern withCompletionHandler:completion];
    
    } else{
        
        [self tokenize];
        [self findTokensWith:modalRegexPattern withCompletionHandler:completion];
    }
}


	/** Helper function that accepts a set of word tokens (i.e. an array of strings) along with a RegEx pattern and a completion handler that allows the user to determine how to further handle matched words and their index in a given text **/

+(void)findModalsIn:(NSArray<NSString*>*)tokenArray withPattern:(NSString*)pattern andWithCompletionHandler: (void(^)(NSString*,int))completion{
    
      NSRegularExpression* regexExpression = [[NSRegularExpression alloc] initWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:nil];
    
        for(int i=0; i<tokenArray.count; i++){
            
            NSString* token = [tokenArray objectAtIndex:i];
            
            NSTextCheckingResult* result = [regexExpression firstMatchInString:token options:NSMatchingReportCompletion range:NSMakeRange(0, [token length])];
            
            NSRange range = [result range];
            
            if(range.location != NSNotFound && range.length != 0){
                completion(token,i);
            }

        }
}


	/** Given a set of tokens, this function will accept a set of tokens (i.e. an array of strings) and pass a modal word along with its index in said array into a completion handler**/

+(void)getModalsInTokenArray:(NSArray<NSString*>*)tokenArray withCompletionHandler:(void(^)(NSString*,int))completion{
    
        [self findModalsIn:tokenArray withPattern:modalRegexPattern andWithCompletionHandler:completion];
}