~farhaven/entropy

Text classifier based on Huffman Codes


#Entropy

This is a simple text classifier based on Huffman Codes.

Use cmd/train to create/update a classifier with text from stdin:

; cat ham.msg | train -class /tmp/ham.gob
; cat spam.msg | train -class /tmp/spam.gob

Then use cmd/classify to pass stdin through previously created classifiers:

; cat hello.msg | classify -class /tmp/ham.gob -class /tmp/spam.gob
spam 80408 0.505279
ham  82124 0.494721

For each classifier, cmd/classify outputs the length in bits of the message read from stdin when compressed with that classifier, along with a score. Classifiers in the output are sorted by shortest length first; that is, the classifier that compresses the message into the shortest bit string is at the top. Very roughly, this means that classifier was trained on text with a byte distribution similar to what was read from stdin.

The score is an estimation of how likely the message is to match the data the classifier was trained on. It is always relative to how well the other classifiers performed.

The name of a classifier is its file name without any extension it may have, so /tmp/spam.gob becomes spam.

If you pass the flag -inject to classify, it will treat its input as an RFC822-style email, copy it to its output and inject a header that contains its verdict. The header looks like this:

X-Entropy: spam 75059 0.5052789604508651

for a message that most closely matched /tmp/spam.gob with an encoded length of 75059 bits. The classifier's relative score (compared to all other classifiers used for generating this verdict) is roughly 0.505. That's not much, but this classifier was trained on very little data and the others were trained on nearly identical input.
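Injecting the header amounts to inserting one line before the blank line that separates an RFC822-style message's headers from its body. A rough sketch of that step (not the actual implementation in cmd/classify):

```go
package main

import (
	"fmt"
	"strings"
)

// injectHeader inserts an X-Entropy header just before the blank line
// that separates the message headers from the body. If no blank line
// is found, the header is appended at the end.
func injectHeader(msg, verdict string) string {
	header := "X-Entropy: " + verdict
	head, body, found := strings.Cut(msg, "\n\n")
	if !found {
		return msg + "\n" + header + "\n"
	}
	return head + "\n" + header + "\n\n" + body
}

func main() {
	msg := "From: a@example.com\nSubject: hi\n\nhello\n"
	fmt.Print(injectHeader(msg, "spam 75059 0.5052789604508651"))
}
```

The sketch splits on the first empty line only, so header lines in the body are left untouched.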

#How?

The classifier and trainer scan over their input in overlapping 2-byte windows. That means that the symbols for the Huffman code are 16 bits long, and that messages need to be at least 2 bytes long.
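The windowing can be sketched like this (`windows` is a hypothetical helper name, not from the repo):

```go
package main

import "fmt"

// windows returns the overlapping 2-byte windows of msg as 16-bit
// symbols: bytes (0,1), (1,2), (2,3), and so on. A message of length n
// yields n-1 symbols, which is why input must be at least 2 bytes long.
func windows(msg []byte) []uint16 {
	if len(msg) < 2 {
		return nil
	}
	syms := make([]uint16, 0, len(msg)-1)
	for i := 0; i+1 < len(msg); i++ {
		syms = append(syms, uint16(msg[i])<<8|uint16(msg[i+1]))
	}
	return syms
}

func main() {
	// "spam" decomposes into the windows "sp", "pa", "am".
	for _, s := range windows([]byte("spam")) {
		fmt.Printf("%c%c ", byte(s>>8), byte(s))
	}
	fmt.Println()
}
```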

The classifier sums the bit lengths of subsequent windows encoded with each of the Huffman codes it has been given. If a 16-bit symbol is encountered that the code has not yet seen, the encoder returns a bit length that is longer than that of any symbol it has seen. All unencountered symbols get the same fallback length, so this is a bit imprecise.

#Why?

I like building text classifiers from simple building blocks. I've already done a Bayesian filter with an approximate counting Bloom filter for frequency storage; now I thought I'd try my hand at an entropy coder.

#This doesn't work

That's fine, it's just a toy.

#Misc

There's an issue tracker that I use for loosely tracking things I may want to do with this.

Funnily enough, it looks like not training for spam (only ham) is actually a pretty reasonable strategy:

; echo ham | train -class ham.gob
; echo spam | train -class spam.gob

Then train ham.gob on a few more messages (just a handful) and that's it. For spam, the result looks roughly like this:

spam 426061  0.807123
ham  1782913 0.192877

And vice-versa for ham. That's really weird and somewhat unexpected.