Friday, January 18, 2013

A note on Lucene-Mahout integration

I've been playing around with Mahout 0.7's LDA implementation (cvb0) for topic modelling. As input I fed it a Lucene index with a largish text field (Wiki pages with HTML tags removed). The resulting topics were noisy: many of the top terms looked a lot like stop-words. So I decided to clean up my index, using a StopFilter to filter out the noise terms (high doc-frequency, not much discriminative power within my domain). I ran my document updates and then ran Mahout's lucene.vector tool again. Looking at the dictionary file, I found to my dismay that the terms I had intended to remove from the index were still there. I printed out some of the TermFreqVectors to check, and these were clean: none of the stopwords were anywhere to be found.

Investigating the lucene.vector code, I found that the tool acquires a Lucene TermEnum and iterates over it to collect every term that goes into the dictionary. This is a bit counter-intuitive, because lucene.vector also requires the field to be indexed with TermVectors, so I expected it to derive the dictionary from the TermFreqVectors. Since TermFreqVectors are written afresh for each updated document, we'd be safe if lucene.vector actually used these. Instead it uses TermEnum, which iterates over the Lucene index dictionary and takes no account of which documents in the posting lists have since been deleted. It may well be that the actual Mahout vectors are formed using TermFreqVectors, but the --numTerms parameter to cvb relies on the unique term count written to the dictionary file, which was wrong in this case.

This got me thinking about how Lucene implements document updates: it records the old document id as deleted in a del file and then "adds back the document", generating a new document id. And this is crazy fast. So it should be the case that the dictionary is not touched during a document delete. A quick check by acquiring my own TermEnum confirmed this: the old terms that I had now stopworded out were still present in the dictionary, with the highest docFreqs they had ever reached!

Turns out that if you want to generate Mahout vectors from a Lucene index, you need to make sure that the Lucene dictionary is in sync with the current logical state of the index. To ensure this without purging your index and adding all documents afresh, one option is to call the IndexWriter.forceMergeDeletes() method, making sure that the MergePolicy is set to something like a TieredMergePolicy with ForceMergeDeletesPctAllowed set to 0.0d. This is a horribly expensive call and can temporarily take up a large amount of disk space.
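To make the mechanics concrete, here is a toy model in plain Java. It does not use Lucene at all: ToyIndex and every method on it are made-up names for illustration, and the real index format is far more involved. It only sketches the behaviour described above, assuming that an update is "mark old id deleted, add a new doc", that the dictionary is left alone on delete, and that a merge rebuilds everything from the live documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: these names are hypothetical and are NOT Lucene classes.
class ToyIndex {
    // One entry per physical document; the list position is the doc id.
    private final List<Set<String>> docs = new ArrayList<>();
    // Doc ids marked as deleted (the stand-in for the del file).
    private final Set<Integer> deleted = new HashSet<>();
    // term -> posting list of doc ids (the stand-in for the index dictionary).
    private final TreeMap<String, Set<Integer>> dictionary = new TreeMap<>();

    static Set<String> terms(String... ts) {
        return new TreeSet<>(Arrays.asList(ts));
    }

    int add(Set<String> docTerms) {
        int id = docs.size();
        docs.add(new TreeSet<>(docTerms));
        for (String t : docTerms) {
            dictionary.computeIfAbsent(t, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Update = mark the old id deleted, then add a fresh document.
    // Note that the dictionary and posting lists are not pruned here.
    int update(int oldId, Set<String> docTerms) {
        deleted.add(oldId);
        return add(docTerms);
    }

    // What a dictionary walk sees: every term ever indexed since the last merge.
    Set<String> dictionaryTerms() {
        return new TreeSet<>(dictionary.keySet());
    }

    // Doc frequency as stored in the dictionary, deleted docs included.
    int docFreq(String term) {
        Set<Integer> postings = dictionary.get(term);
        return postings == null ? 0 : postings.size();
    }

    // What you see if you only read the per-document term lists of live docs.
    Set<String> liveTerms() {
        Set<String> out = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.contains(id)) {
                out.addAll(docs.get(id));
            }
        }
        return out;
    }

    // Stand-in for a merge that expunges deletes: rebuild from live docs only.
    void mergeDeletes() {
        List<Set<String>> live = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.contains(id)) {
                live.add(docs.get(id));
            }
        }
        docs.clear();
        deleted.clear();
        dictionary.clear();
        for (Set<String> d : live) {
            add(d);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int a = index.add(terms("the", "lucene", "index"));
        int b = index.add(terms("the", "mahout", "vector"));
        // Re-index both documents with "the" stopworded out.
        index.update(a, terms("lucene", "index"));
        index.update(b, terms("mahout", "vector"));
        System.out.println("dictionary before merge: " + index.dictionaryTerms());
        System.out.println("live terms before merge: " + index.liveTerms());
        index.mergeDeletes();
        System.out.println("dictionary after merge:  " + index.dictionaryTerms());
    }
}
```

In this model, "the" is still in the dictionary after both updates, with its old docFreq, even though no live document contains it. It only disappears once mergeDeletes() rebuilds the structures, which is the role forceMergeDeletes() played for my real index.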

Fixed my index thus, and fed it to mahout lucene.vector. Voila! Dictionary was clean of those pesky stopwords. I am a happy man. Mahout cvb now running 200 iterations or till perplexity convergence, whichever is earlier.

1 comment:

  1. The TermEnum is used so that we can know the Document Frequency, which is not available via the TermFreqVectors. Without it, you won't have TF/IDF weights. See the CachedTermInfo class and the TermEntry class.

    You are correct about the delete case.