iOS: indexing local HTML files contents for search

Situation: your iOS application serves a bunch of local HTML files as content, and you’d like to let the user search all of those files, not just the one on screen, for a keyword. The user would then get some form of user interface for choosing which of the matching pages to jump to.

I have seen many JavaScript examples where you include a search function in your HTML files, but that only works for the HTML currently loaded in your UIWebView control. You want the user to be able to search across all the files.

My quick solution does not use a .plist file (but it could). It also doesn’t take page titles, etc. into account. All I am capturing is which keywords are associated with which URLs, so that I have a starting point for this kind of search functionality.

So this is just a jumping-off point, but it uses NSLinguisticTagger to pick out nouns to use as keywords. Since we’re working with HTML, I want to be able to strip out all of the tags easily first. My method for that currently looks like this:

- (NSString *)flattenHTML:(NSString *)html {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *tag = nil;
    while (![scanner isAtEnd]) {
        // skip ahead to the start of the next tag
        [scanner scanUpToString:@"<" intoString:NULL];
        // capture the tag up to (but not including) its closing ">";
        // nothing is captured once the last tag has been passed
        if ([scanner scanUpToString:@">" intoString:&tag]) {
            // replace the found tag with a space
            // (you can filter multi-spaces out later if you wish)
            html = [html stringByReplacingOccurrencesOfString:
                    [NSString stringWithFormat:@"%@>", tag]
                                                   withString:@" "];
        }
    }
    return [html stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceCharacterSet]];
}
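The comment in that method mentions filtering out multi-spaces later. Here is a minimal sketch of one way to do that, assuming you simply want every run of spaces, tabs, or newlines collapsed to a single space. CollapseWhitespace is a hypothetical helper name, not something from my project:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original code): collapses every run of
// whitespace or newlines into a single space and drops leading/trailing runs.
static NSString *CollapseWhitespace(NSString *text) {
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    // Splitting on single whitespace characters yields empty strings wherever
    // two separators were adjacent, so those are filtered out below.
    NSArray *parts = [text componentsSeparatedByCharactersInSet:whitespace];
    NSMutableArray *words = [NSMutableArray array];
    for (NSString *part in parts) {
        if ([part length] > 0) {
            [words addObject:part];
        }
    }
    return [words componentsJoinedByString:@" "];
}
```

You could run the result of flattenHTML: through this before handing the text to the tagger. It shouldn't change which nouns are found, but it makes the flattened text much easier to read in logs.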

Now for the indexing method, where the guts are. It creates an NSLinguisticTagger so it can check for nouns, gets the paths of all the .html files in the application bundle as an array, and loops through that array. For each file it reads the contents into a string, strips out the HTML tags, finds the nouns, and records the file name under each noun in a mutable array. Those arrays live in myDictionary, an NSMutableDictionary declared outside the method (an instance variable, for example). It’s kind of slick.

- (void)indexAllHTML
{    
    NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] 
                                  initWithTagSchemes:[NSArray arrayWithObjects: NSLinguisticTagSchemeTokenType,
                                                      NSLinguisticTagSchemeLexicalClass,
                                                      NSLinguisticTagSchemeNameType,nil]
                                                                        options:0];

    //Parse through the HTML files.

    NSMutableArray *index = [[NSMutableArray alloc] init];
    myDictionary = [[NSMutableDictionary alloc] init];

    NSArray *pages = [[NSBundle mainBundle] pathsForResourcesOfType:@"html" inDirectory:nil];
    for(int i=0; i<[pages count]; i++){
        NSString *path = [pages objectAtIndex:i];
        NSString *pageData = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
        if (pageData == nil) {
            continue; // skip files that can't be read as UTF-8
        }
        NSString *shortPath = [path lastPathComponent];

        //We want all nouns for the pages to build an index from.

        pageData = [self flattenHTML:pageData]; 

        [tagger setString:pageData];
        NSRange textRange = NSMakeRange(0, [pageData length]);

        [tagger enumerateTagsInRange:textRange scheme:NSLinguisticTagSchemeLexicalClass options:0 usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {

            if([tag isEqualToString:NSLinguisticTagNoun]){
                NSString *word = [pageData substringWithRange:tokenRange];

                //Trim out stragglers we don't want to be indexed. Your mileage will vary!

                if(![word isEqualToString:@"reg"] && ![word isEqualToString:@"III"] && ![word isEqualToString:@"labeled"]
                   && ![word isEqualToString:@"are"] && ![word isEqualToString:@"13"]){
                    //NSLog(@"%d. %@", i, [pageData substringWithRange:tokenRange]);

                    if([myDictionary objectForKey:word] == nil){

                        //key doesn't exist yet, create nsmutablearray for it.

                        NSMutableArray *anArray = [[NSMutableArray alloc] init];
                        [anArray addObject:shortPath];
                        [myDictionary setObject:anArray forKey:word];
                    } else {

                        //key already exists, add to the nsmutablearray in it.

                        NSMutableArray *arr = [myDictionary objectForKey:word];

                        //No repeats for same url. Why bother.

                        if(![arr containsObject:shortPath]){
                           [arr addObject:shortPath]; 
                        }                        
                    }                    

                    if(![index containsObject:word]){
                        [index addObject:word];
                    }
                }
            }
        }];
    }

    [index sortUsingSelector:@selector(localizedCaseInsensitiveCompare:)];

    //Some random logging so you get the picture.

    NSLog(@"done finding nouns. Number nouns = %lu", (unsigned long)[index count]);
    if ([index count] >= 20) {
        NSLog(@"1: %@, 20: %@", [index objectAtIndex:0], [index objectAtIndex:19]);
    }

    NSMutableArray *temp = [myDictionary objectForKey:@"power"];
    NSLog(@"power paths count: %lu", (unsigned long)[temp count]);
    if ([temp count] >= 3) {
        NSLog(@"first path: %@, 2: %@, 3: %@", [temp objectAtIndex:0], [temp objectAtIndex:1], [temp objectAtIndex:2]);
    }
    NSLog(@"\n");
    for (NSString *myKey in myDictionary) {
        NSLog(@"> %@ :: %@\n", myKey, [myDictionary objectForKey:myKey]);
    }
}

And there you have it. All your HTML pages indexed (right now by short path only).
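To actually search the index, you look the user's text up against the keys. The following is only a sketch under a few assumptions: the index has the same shape as myDictionary above (keyword mapped to an array of file names), a case-insensitive prefix match is what you want, and PagesMatchingQuery is a hypothetical helper name rather than anything from my project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the file names of every page indexed under a
// keyword that starts with `query` (case-insensitive), sorted, no duplicates.
static NSArray *PagesMatchingQuery(NSDictionary *index, NSString *query) {
    NSMutableSet *matches = [NSMutableSet set];
    if ([query length] == 0) {
        return [NSArray array]; // nothing typed yet, so nothing to show
    }
    for (NSString *keyword in index) {
        // NSAnchoredSearch restricts the match to the start of the keyword.
        NSRange range = [keyword rangeOfString:query
                                       options:(NSCaseInsensitiveSearch | NSAnchoredSearch)];
        if (range.location != NSNotFound) {
            [matches addObjectsFromArray:[index objectForKey:keyword]];
        }
    }
    return [[matches allObjects]
            sortedArrayUsingSelector:@selector(localizedCaseInsensitiveCompare:)];
}
```

The returned array could back whatever list you show the user for choosing a page. This walks every key on each call, which should be fine for an index of this size; an exact-match search would just be a single objectForKey: lookup.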

Titles for the pages aren’t retained in this code. I’ve run this on about 80 HTML files as a test, and it all logs in around 236 ms on an iPad 2. Not too shabby.

The .plist could be faster in the end, but for now there isn’t enough of a hit to speed or memory to warrant storage like that.
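If you did want the .plist route, a dictionary of strings and arrays like this one can be written out as a property list directly. This is a rough sketch, not something I have wired into the project: SaveIndex and LoadIndex are hypothetical helper names, and the path is whatever writable location you choose (a file in the Caches directory would be one reasonable assumption).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: writes the keyword index to `path` as a property list.
// Returns NO if the write fails or the dictionary holds non-plist objects.
static BOOL SaveIndex(NSDictionary *index, NSString *path) {
    return [index writeToFile:path atomically:YES];
}

// Hypothetical helper: reads the index back, or returns nil if the file is
// missing or is not a valid dictionary property list.
static NSDictionary *LoadIndex(NSString *path) {
    return [NSDictionary dictionaryWithContentsOfFile:path];
}
```

On launch you would try LoadIndex first and only fall back to indexAllHTML when it returns nil. Treat the containers you get back as immutable; make mutable copies if you plan to keep adding to them.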

Enjoy.

10 thoughts on “iOS: indexing local HTML files contents for search”

  1. Philip Borges

    Hi Eric.

    I came across your post here about searching through HTML files. I have a Bible app wherein I’d like to implement a search function. Your post here seems to be the answer. However, I was wondering if you could provide a sample project that illustrates in full how it works. I’m unsure how to hook up the code you posted.

    Thanks for posting this.

    /Philip

  2. Eric Post author

    Philip,

    When I get some down time I can whip up a quick project for you. Basically though it’s all there… make a project with .html files in it. Then call the indexAllHTML method in your viewDidLoad or whatever and see what you get in the NSLogs. If you really think you need a project, I’ll get around to it.

  3. Philip Borges

    Thanks Eric for taking the time to make this project. It works well. I’ll take a look at it and see how I can implement it.

    1. Eric Post author

      I can add a method of using NSMutableDictionary for storing away the title, the word, etc. if you’d like. It would just be changing the storage mechanism and pulling the bits back out. Also you might want to consider storing results if you’re going to do a Bible application… with under 100 HTML files the code runs fast enough. With something massive, you might want to consider a different way of doing things.

  4. Philip Borges

    That would be very helpful, Eric. Thank you. The application contains 66 HTML files, one for each Bible book. If you could provide an interface that demonstrates this feature fully, it would really be helpful: a text field for searching, the results displayed in a table view, and tapping a table row taking you to the respective file where the keywords are. I appreciate any help you can provide in this area.

    /Philip

  5. Eric Dolecki

    Well… I might as well build the whole app :) Use a UISearchBar and its delegate. 66 HTML files isn’t all that many, so you are probably okay with the system provided.

  6. Philip Borges

    Well, that would be extremely helpful, Eric :) But I don’t subscribe to letting others write every line of code for me. Having said that, a sample project that fully illustrates how to search through HTML files would be so useful, not only for myself but for others too. :-) It would allow me to see how it works and get me started on tailoring the project to my own.

  7. Scott

    Thanks for posting this Eric, I really appreciate it! I was wondering if you could help me figure out how to modify your code to search for a specific word (as opposed to searching for all nouns). For example, if I wanted to search all the HTML files in your example for the word “love” (in your example the only result it would find would be in the second HTML file).

    Thanks!!

    Scott

    1. Eric Post author

      The word “love” is indeed a noun. So all you’d have to do is check in that collection of nouns:


      if ([word isEqualToString:@"love"]) {
          NSLog(@"I found the word \"love\"");
      }
