This is a simple text classifier based on Huffman Codes.
Use cmd/train to create/update a classifier with text from stdin:
; cat ham.msg | train -class /tmp/ham.gob
; cat spam.msg | train -class /tmp/spam.gob
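The on-disk format isn't described here, but conceptually the trainer only needs to count symbols and persist the table. Here's a rough sketch of that idea, assuming (purely for illustration) that a classifier file is a gob-encoded frequency table of 16-bit window symbols; the real cmd/train may store something else entirely:

	package main

	import (
		"bufio"
		"encoding/gob"
		"flag"
		"io"
		"log"
		"os"
	)

	func main() {
		class := flag.String("class", "", "classifier file to create/update")
		flag.Parse()

		// Start from an existing table if the classifier file is already there.
		freq := make(map[uint16]uint64)
		if f, err := os.Open(*class); err == nil {
			if err := gob.NewDecoder(f).Decode(&freq); err != nil {
				log.Fatal(err)
			}
			f.Close()
		}

		// Count overlapping 2-byte windows of stdin as 16-bit symbols.
		in := bufio.NewReader(os.Stdin)
		prev, err := in.ReadByte()
		if err != nil {
			log.Fatal("need at least 2 bytes of input")
		}
		for {
			b, err := in.ReadByte()
			if err == io.EOF {
				break
			}
			if err != nil {
				log.Fatal(err)
			}
			freq[uint16(prev)<<8|uint16(b)]++
			prev = b
		}

		// Persist the table; a Huffman code can be built from it later.
		out, err := os.Create(*class)
		if err != nil {
			log.Fatal(err)
		}
		defer out.Close()
		if err := gob.NewEncoder(out).Encode(freq); err != nil {
			log.Fatal(err)
		}
	}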
Then use cmd/classify to pass stdin through previously created classifiers:
; cat hello.msg | classify -class /tmp/ham.gob -class /tmp/spam.gob
spam 80408 0.505279
ham 82124 0.494721
The output of cmd/classify is, for each classifier given on the command line, the length in bits of the message read from stdin when compressed with that classifier, plus a score. Classifiers in the output are sorted by shortest length first, that is, the classifier that compresses the message into the shortest bit string is at the top. This very roughly means that the classifier was trained on text with a distribution of bytes similar to what was read on stdin.
The score is an estimate of how likely the message is to match the data the classifier was trained on. It is always relative to how well the other classifiers performed.
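The exact formula isn't spelled out here, but the sample numbers above are reproduced, at least for two classifiers, by crediting each classifier with the combined lengths of the others and normalising so the scores sum to one. A sketch of that guess:

	package main

	import "fmt"

	// scores turns encoded bit lengths into relative scores: shorter
	// encodings get higher scores, and all scores sum to 1. This is a
	// guess that matches the two-classifier examples in this document;
	// it assumes at least two classifiers (otherwise it divides by zero).
	func scores(lengths map[string]int) map[string]float64 {
		total := 0
		for _, l := range lengths {
			total += l
		}
		n := len(lengths)
		out := make(map[string]float64, n)
		for name, l := range lengths {
			out[name] = float64(total-l) / float64((n-1)*total)
		}
		return out
	}

	func main() {
		// Bit lengths taken from the example output above.
		fmt.Println(scores(map[string]int{"spam": 80408, "ham": 82124}))
		// prints scores matching the example output, up to float formatting
	}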
The name of a classifier is its file name without any extension it may have, so /tmp/spam.gob becomes spam.
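In Go, that naming rule could look something like this (a sketch, not necessarily how classify actually does it):

	package main

	import (
		"fmt"
		"path/filepath"
		"strings"
	)

	// className derives a classifier's display name from its file path by
	// dropping the directory and any extension.
	func className(path string) string {
		base := filepath.Base(path)
		return strings.TrimSuffix(base, filepath.Ext(base))
	}

	func main() {
		fmt.Println(className("/tmp/spam.gob")) // spam
	}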
If you pass the flag -inject to classify, it will treat its input as an RFC822-style email, copy it to its output, and inject a header that contains its verdict. The header looks like this:
X-Entropy: spam 75059 0.5052789604508651
for a message that most closely matched /tmp/spam.gob with an encoded length of 75059 bits. The classifier's relative score (comparing it to all other classifiers used for generating this verdict) is roughly 0.505. That's not much, but this classifier was trained on very little data and the others were trained on nearly identical input.
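The injection itself is mostly plumbing: copy the header block, slip in one line, copy the body. A sketch of that kind of header injection, with a hard-coded placeholder verdict standing in for the real one:

	package main

	import (
		"bufio"
		"fmt"
		"io"
		"log"
		"os"
	)

	func main() {
		verdict := "spam 75059 0.5052789604508651" // placeholder; classify computes this

		in := bufio.NewReader(os.Stdin)
		out := bufio.NewWriter(os.Stdout)
		defer out.Flush()

		// Copy the header block; when the blank separator line (or the end
		// of input) is reached, slip the verdict header in just before it.
		// Assumes newline-terminated header lines.
		for {
			line, err := in.ReadString('\n')
			end := err != nil || line == "\n" || line == "\r\n"
			if end {
				fmt.Fprintf(out, "X-Entropy: %s\n", verdict)
			}
			out.WriteString(line)
			if end {
				break
			}
		}

		// Copy the body unchanged.
		if _, err := io.Copy(out, in); err != nil {
			log.Fatal(err)
		}
	}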
The classifier and trainer scan over their input in overlapping 2-byte windows. That means that the symbols for the Huffman code are 16 bits long, and that messages need to be at least 2 bytes long.
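A sketch of that windowing, packing each 2-byte window into a 16-bit symbol (which byte ends up in the high half is an arbitrary choice made here):

	package main

	import "fmt"

	// windows returns the overlapping 2-byte windows of msg as 16-bit
	// symbols, e.g. "abc" -> [0x6162, 0x6263]. Messages shorter than
	// 2 bytes yield no symbols.
	func windows(msg []byte) []uint16 {
		if len(msg) < 2 {
			return nil
		}
		syms := make([]uint16, 0, len(msg)-1)
		for i := 0; i+1 < len(msg); i++ {
			syms = append(syms, uint16(msg[i])<<8|uint16(msg[i+1]))
		}
		return syms
	}

	func main() {
		fmt.Printf("%#x\n", windows([]byte("abc"))) // [0x6162 0x6263]
	}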
The classifier sums the bit lengths of successive windows encoded with each of the Huffman codes it has been provided with. If a 16-bit symbol is encountered that a code has not yet seen, the encoder returns a bit length that is longer than that of any symbol it has seen. All unencountered symbols get the same fallback length, so this is a bit imprecise.
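Summing those per-window lengths for one classifier could look roughly like this; picking "longest known code plus one" as the fallback is just one way to satisfy the description above:

	package main

	import "fmt"

	// encodedLength sums the code lengths, in bits, of a sequence of 16-bit
	// window symbols under one classifier's code. codeLen maps each symbol
	// the code knows about to its Huffman code length; every unseen symbol
	// gets the same fallback length, chosen here as one bit more than the
	// longest known code (the exact fallback is an assumption of this sketch).
	func encodedLength(syms []uint16, codeLen map[uint16]int) int {
		longest := 0
		for _, l := range codeLen {
			if l > longest {
				longest = l
			}
		}
		fallback := longest + 1

		total := 0
		for _, s := range syms {
			if l, ok := codeLen[s]; ok {
				total += l
			} else {
				total += fallback
			}
		}
		return total
	}

	func main() {
		// "ham" as overlapping windows: "ha" (0x6861) and "am" (0x616d).
		syms := []uint16{0x6861, 0x616d}
		codeLen := map[uint16]int{0x6861: 2, 0x616d: 3}
		fmt.Println(encodedLength(syms, codeLen)) // 2 + 3 = 5
	}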
I like building text classifiers from simple building blocks. I've already done a Bayesian filter with an approximate counting Bloom filter for frequency storage; now I thought I'd try my hand at an entropy coder.
That's fine, it's just a toy.
There's an issue tracker that I use for loosely tracking things I may want to do with this.
Funnily enough, it looks like not training for spam (only ham) is actually a pretty reasonable strategy:
; echo ham | train -class ham.gob
; echo spam | train -class spam.gob
Then train a few more messages (just a handful) as ham.gob and that's it. For spam, this is roughly the result:
spam 426061 0.807123
ham 1782913 0.192877
And vice versa for ham. That's really weird and somewhat unexpected.