iOS: Indexing local HTML file contents for search
Situation: You’re serving a bunch of local HTML files as content in your iOS application, and you’d like to let the user search all of them for a keyword, not just the page currently on screen. The user would then get some form of interface for choosing which of the matching pages to jump to.
I have seen many JavaScript examples where you include a search method in your HTML files, but that only works for the HTML currently loaded in your UIWebView control. You want the user to be able to search across all the files.
My quick solution does not use a .plist file (but it could). It also doesn’t take into account page titles, etc. All I am capturing is which keywords are associated with which URLs, so that I have a starting point for this kind of search functionality.
So this is just a jumping-off point, but it uses NSLinguisticTagger to pick out nouns to use as keywords. Since we’re working with HTML, I want to strip out all of the tags first. My local method currently looks like this:
- (NSString *)flattenHTML:(NSString *)html {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    while (![scanner isAtEnd]) {
        NSString *tag = nil;
        // Find the start of the next tag.
        [scanner scanUpToString:@"<" intoString:nil];
        // Capture everything from "<" up to, but not including, ">".
        if ([scanner scanUpToString:@">" intoString:&tag]) {
            // Replace the found tag with a space
            // (you can filter multi-spaces out later if you wish).
            html = [html stringByReplacingOccurrencesOfString:
                    [NSString stringWithFormat:@"%@>", tag]
                                                   withString:@" "];
        }
    }
    return [html stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceCharacterSet]];
}
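The comment in flattenHTML mentions filtering out multi-spaces later. Here is a minimal sketch of one way to do that. CollapseWhitespace is a hypothetical helper name, not something from my project, and it is written as a plain function so it can be dropped in anywhere:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): collapses every run of
// spaces, tabs, and newlines into a single space and drops leading/trailing
// whitespace.
static NSString *CollapseWhitespace(NSString *text) {
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *words = [NSMutableArray array];
    for (NSString *piece in [text componentsSeparatedByCharactersInSet:whitespace]) {
        // Adjacent separators produce empty strings; skip them.
        if ([piece length] > 0) {
            [words addObject:piece];
        }
    }
    return [words componentsJoinedByString:@" "];
}
```

You could run the result of flattenHTML through this before handing the text to the tagger, although the tagger copes with extra spaces either way.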
Now for the guts, in my indexing method. It creates an NSLinguisticTagger so it can check for nouns, gets all the .html files in the application bundle as an array, and loops through them. For each file it grabs the path, reads the contents into a string, strips out the HTML tags, finds the nouns, and keeps a mutable array of file names for each noun. It’s kind of slick.
- (void)indexAllHTML
{
    NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc]
        initWithTagSchemes:[NSArray arrayWithObjects:NSLinguisticTagSchemeTokenType,
                            NSLinguisticTagSchemeLexicalClass,
                            NSLinguisticTagSchemeNameType, nil]
                   options:0];
    // Parse through the HTML files.
    NSMutableArray *index = [[NSMutableArray alloc] init];
    // myDictionary is declared elsewhere (for example as an instance variable).
    // It maps each noun to an array of file names.
    myDictionary = [[NSMutableDictionary alloc] init];
    // The resource type is the file extension without the leading dot.
    NSArray *pages = [[NSBundle mainBundle] pathsForResourcesOfType:@"html" inDirectory:nil];
    for (NSUInteger i = 0; i < [pages count]; i++) {
        NSString *path = [pages objectAtIndex:i];
        NSString *pageData = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
        NSString *shortPath = [path lastPathComponent];
        // Skip files that could not be read as UTF-8.
        if (pageData == nil) continue;
        // We want all nouns for the pages to build an index from.
        pageData = [self flattenHTML:pageData];
        [tagger setString:pageData];
        NSRange textRange = NSMakeRange(0, [pageData length]);
        [tagger enumerateTagsInRange:textRange scheme:NSLinguisticTagSchemeLexicalClass options:0 usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
            // Compare the tag's contents, not its pointer.
            if ([tag isEqualToString:NSLinguisticTagNoun]) {
                NSString *word = [pageData substringWithRange:tokenRange];
                // Trim out stragglers we don't want to be indexed. Your mileage will vary!
                if (![word isEqualToString:@"reg"] && ![word isEqualToString:@"III"] && ![word isEqualToString:@"labeled"]
                    && ![word isEqualToString:@"are"] && ![word isEqualToString:@"13"]) {
                    //NSLog(@"%@: %@", shortPath, word);
                    if ([myDictionary objectForKey:word] == nil) {
                        // Key doesn't exist yet, create an NSMutableArray for it.
                        NSMutableArray *anArray = [[NSMutableArray alloc] init];
                        [anArray addObject:shortPath];
                        [myDictionary setObject:anArray forKey:word];
                    } else {
                        // Key already exists, add to the NSMutableArray in it.
                        NSMutableArray *arr = [myDictionary objectForKey:word];
                        // No repeats for the same URL. Why bother.
                        if (![arr containsObject:shortPath]) {
                            [arr addObject:shortPath];
                        }
                    }
                    if (![index containsObject:word]) {
                        [index addObject:word];
                    }
                }
            }
        }];
    }
    [index sortUsingSelector:@selector(localizedCaseInsensitiveCompare:)];
    // Some random logging so you get the picture.
    NSLog(@"done finding nouns. Number nouns = %lu", (unsigned long)[index count]);
    // Guard the lookups so a small index doesn't throw a range exception.
    if ([index count] >= 20) {
        NSLog(@"1: %@, 20: %@", [index objectAtIndex:0], [index objectAtIndex:19]);
    }
    // Swap @"power" for a noun that appears in your own content.
    NSMutableArray *temp = [myDictionary objectForKey:@"power"];
    NSLog(@"power paths count: %lu", (unsigned long)[temp count]);
    if ([temp count] >= 3) {
        NSLog(@"first path: %@, 2: %@, 3: %@", [temp objectAtIndex:0], [temp objectAtIndex:1], [temp objectAtIndex:2]);
    }
    NSLog(@"\n");
    for (NSString *myKey in myDictionary) {
        NSLog(@"> %@ :: %@\n", myKey, [myDictionary objectForKey:myKey]);
    }
}
And there you have it. All your HTML pages indexed (right now by short path only).
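To show how the index might be used once it is built, here is a rough sketch of a lookup. PagesMatchingQuery is a hypothetical helper, and the case-insensitive substring match is just one possible choice (an assumption on my part, not something the indexing code requires). It expects a dictionary shaped like myDictionary: noun keys, each mapped to an array of file names.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): returns the file names
// of every page whose indexed nouns contain `query`, ignoring case.
static NSArray *PagesMatchingQuery(NSDictionary *index, NSString *query) {
    NSMutableArray *pages = [NSMutableArray array];
    // An empty or nil query matches nothing.
    if ([query length] == 0) {
        return pages;
    }
    for (NSString *keyword in index) {
        NSRange match = [keyword rangeOfString:query options:NSCaseInsensitiveSearch];
        if (match.location == NSNotFound) {
            continue;
        }
        // Merge this keyword's pages, skipping ones we already have.
        for (NSString *page in [index objectForKey:keyword]) {
            if (![pages containsObject:page]) {
                [pages addObject:page];
            }
        }
    }
    // Sort so the result list is stable for display.
    [pages sortUsingSelector:@selector(localizedCaseInsensitiveCompare:)];
    return pages;
}
```

Calling PagesMatchingQuery(myDictionary, @"power") would give you the list of pages to present to the user, for example in a table view.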
Titles for the pages aren’t retained in this code. I’ve run this on about 80 HTML files as a test, and it all logs in around 236 ms on an iPad 2. Not too shabby.
The .plist could be faster in the end, but for now there isn’t enough of a hit to speed or memory to warrant storage like that.
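If you did want to go the .plist route, a minimal sketch might look like the following. The helper names, the SearchIndex.plist file name, and the choice of the Caches directory are all assumptions for illustration. It relies on the index containing only strings and arrays of strings, which is what the indexing code produces.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: where the cached index lives. File name and directory
// are assumptions; pick whatever suits your app.
static NSString *IndexCachePath(void) {
    NSArray *dirs = NSSearchPathForDirectoriesInDomains(NSCachesDirectory, NSUserDomainMask, YES);
    return [[dirs objectAtIndex:0] stringByAppendingPathComponent:@"SearchIndex.plist"];
}

// Writes the index as a property list. Returns NO if the write fails or the
// dictionary contains anything that is not a property list type.
static BOOL SaveIndex(NSDictionary *index, NSString *path) {
    return [index writeToFile:path atomically:YES];
}

// Reads the index back. Returns nil if the file is missing or unreadable,
// in which case you would rebuild the index from the HTML files.
static NSDictionary *LoadIndex(NSString *path) {
    return [NSDictionary dictionaryWithContentsOfFile:path];
}
```

One thing to keep in mind: the arrays inside a loaded dictionary are immutable, so this suits a read-only index at search time rather than one you keep adding to.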
Enjoy.
