Forwards anomaly detection using LZW

My brain fried a little bit when I was researching this, so feel free to ask me any questions you have.

https://docs.google.com/document/d/1MqBXjWytfrUxEy2jZ5E7TG0LCk9KC0H8hEhmzLo6_QM/edit

It’s a pretty crazy idea, and the best part is that it works. It’s as fast as compressing the data, and since it runs forwards it could be applied to streaming data.

Very cool. I love the writing style. Are you doing this stuff for work?

A few questions:

  1. What are the multiple lines on the figures after Fig 16? The number of patterns you allow to be recorded in the dictionary?
  2. Are patterns generally stored in the compressed file?
  3. Is my interpretation correct? For anomaly detection you would run this algorithm on past data samples to create a dictionary, then run the compression on new cases, perhaps without further dictionary additions, and see which parts have poor compression ratios (implying that they contain different patterns than the original/previous data).

I did it with a number of data samples from work, using easy-to-visualize data such as time series to confirm that it works. It’s mostly a hobby project. A professor told me about a friend of his who applied this method as a security layer wrapped around important software, where the time-series input was function calls: anomalous sets of function calls could denote “bad” or “corrupted” or “malicious” behavior.

1: Yeah, that’s where my explanations started to break down. Every data point there was calculated as the raw score of a single pattern’s ratio of input size to output size. The smooth lines you see in later graphs, or in the Shakespeare graph, are a moving average over a window of the surrounding 10,000 data points; the graphs you’re asking about show the raw values. The multiple lines appear because only a finite number of discrete values can exist for a single pattern’s ratio. Think 1/2, 2/3, 3/4, 1/4, etc. It’s trippy as fuck!
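Here’s a rough Python sketch of what I mean, not my actual scoring code: the per-term score below is a made-up stand-in (some fixed output cost divided by how much input the pattern covers, so only a handful of discrete fractions can occur), and the smoothing is a trailing window rather than the centered 10,000-point window from the plots.

```python
from collections import deque

def per_term_ratios(pattern_lengths, code_cost=2):
    # Hypothetical raw score for each emitted pattern: a roughly fixed output
    # cost over the input it covers. Because pattern lengths are small integers,
    # the scores can only take a few discrete values -- hence the multiple
    # "lines" in the raw plots.
    return [code_cost / (n + 1) for n in pattern_lengths]

def moving_average(scores, window=10_000):
    # The smooth curves: average the raw scores over a sliding window.
    buf, total, out = deque(), 0.0, []
    for s in scores:
        buf.append(s)
        total += s
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out
```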

2: The patterns are only held in memory at runtime in LZW. If you’re asking how someone can go backwards from a compressed LZW file to the uncompressed file, it’s because LZW writes its output as basically a concatenation of many, many terms of {binary value mapped to a previously seen pattern}{single new bit/element that makes the “new” pattern “new”}. So to go in reverse, you read the compressed data, see each new element that defined each already-made pattern, and reconstruct the dictionary as you go. This explanation blows; I suggest reading the wiki page.
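If it helps, here’s a minimal textbook-style LZW sketch in Python (not my project’s code, and the details differ a bit from my hand-wavy description above). The point it shows is the answer to your question: the encoder only writes codes, and the decoder rebuilds the exact same dictionary from the code stream, so the pattern table never needs to be stored in the file.

```python
def lzw_compress(data: bytes) -> list[int]:
    dictionary = {bytes([i]): i for i in range(256)}  # seed with all single bytes
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in dictionary:
            w = wc                                    # keep growing the known pattern
        else:
            out.append(dictionary[w])                 # emit code for longest known pattern
            dictionary[wc] = len(dictionary)          # record the new, longer pattern
            w = bytes([b])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    dictionary = {i: bytes([i]) for i in range(256)}  # same seed as the encoder
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        entry = dictionary.get(code, w + w[:1])       # the one "code not seen yet" edge case
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[:1]   # rebuild the dictionary as we read
        w = entry
    return b"".join(out)

data = b"TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decompress(lzw_compress(data)) == data
```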

3: Your interpretation is correct, although the way I was envisioning it was running four different streams of the same data, each running for a duration X and each offset from the others by X/4. The “old” data you seed each run with is actually just real data in real time, but you don’t take it seriously until the stream is about X/4 old; then you start comparing against the other streams to see if multiple streams are seeing similar anomalies. Anomaly detection is done by quickly trying to discern whether you have a significant positive slope. One implementation might be that your data point D at each moment in the stream is the average of the current and the past 100 raw data points d; then you consider yourself in an “up” state if N out of the last L data points D are greater than the value of D before them. So instead of measuring the real slope, you get sets of binary measurements of “is it going up RIGHT NOW”. This is less perfect, but it is very fast.
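A rough Python sketch of that slope check (the 100-point smoothing is from the description above; the values for L and N here are made-up placeholders):

```python
from collections import deque

def up_states(values, smooth=100, last_l=20, need_n=12):
    # smooth=100 matches "average of the current and the past 100 points";
    # last_l / need_n are hypothetical choices for L and N.
    window = deque(maxlen=smooth + 1)  # current point plus the past `smooth` points
    ticks = deque(maxlen=last_l)       # binary "did D go up right now?" flags
    prev_d = None
    for v in values:
        window.append(v)
        d = sum(window) / len(window)  # smoothed data point D
        if prev_d is not None:
            ticks.append(d > prev_d)   # binary slope measurement, no real derivative
        prev_d = d
        # "up" state: at least N of the last L smoothed points increased
        yield len(ticks) == last_l and sum(ticks) >= need_n

# e.g. flag an anomaly when a stream of per-pattern scores starts trending up:
# flags = list(up_states(scores))
```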

I want to apply it to more stuff! I want to find data streams that are fast and repetitive, and where people care about quickly finding anomalies.