22/03/21

Anomaly Detection using K-means Clustering in BQ ML

By Varun Chitale

Use Case

In the jolly good month of December, while you were out of office enjoying what little outdoors COVID let you enjoy, two of your team’s developers used some service accounts (or possibly even their personal ones) to run a bunch of rogue queries, to see if you could identify them. They refuse to define “rogue” and want to see whether ML is any good in the sphere of cybersecurity.

Spoiler: It is.

Approach

We will collect metadata about all transactions in our BQ database, use BQ Machine Learning (BQ ML) to train an (unsupervised) k-means model to cluster those transactions, and then use the clustering information to flag outliers.

Generating the datasets

CREATE OR REPLACE TABLE
  `analysis.ref_data_training` AS
SELECT
  *
FROM
  `region-xx`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  DATE(creation_time) BETWEEN "2020-11-01" AND "2020-12-01"

Let’s consider the data between the above dates as our training data. This assumes that no rogue transactions happened during this window, and we generate our ‘normal’ clusters from it. region-xx should be replaced by the appropriate region. Additionally, let’s assume the presence of rogue/outlier transactions in the period between 2020-12-01 and 2021-01-01, and create a similar table, analysis.ref_data_testing, covering that period.
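
For completeness, the testing table can be built with the same query, only over the later date range. A minimal sketch, again with region-xx as a placeholder:

CREATE OR REPLACE TABLE
  `analysis.ref_data_testing` AS
SELECT
  *
FROM
  `region-xx`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  DATE(creation_time) BETWEEN "2020-12-01" AND "2021-01-01"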

The table contains the following columns that we will use to build the model:

job_id
creation_time
end_time, start_time
statement_type
query
total_bytes_processed
total_slot_ms
destination_table.dataset_id

Creating the model

CREATE OR REPLACE MODEL
  `analysis.anomaly_kmeans_001` OPTIONS (model_type='kmeans',
    standardize_features = TRUE) AS
SELECT
  r.job_id,
  EXTRACT(HOUR FROM creation_time) AS hod,
  EXTRACT(DAYOFWEEK FROM creation_time) AS dow,
  TIMESTAMP_DIFF(end_time, start_time, MINUTE) AS duration_mins,
  statement_type,
  ML.NGRAMS(bt.words_array, [1, 2]) AS query_grams,
  query,
  total_bytes_processed,
  total_slot_ms,
  destination_table.dataset_id AS dest_dataset_id
FROM
  `analysis.ref_data_training` r
JOIN (
  SELECT
    REGEXP_EXTRACT_ALL(LOWER(query), '[a-z0-9]+') AS words_array,
    job_id
  FROM
    `analysis.ref_data_training`) AS bt
ON
  r.job_id = bt.job_id
ORDER BY
  RAND() -- a WHERE RAND() < sample_rows/total_rows filter usually works well instead
LIMIT
  100000 -- depending on your data size, you might want to skip this

We also craft features such as hour of the day (hod) and day of the week (dow). In addition, we have one main feature, query_grams, which will be the defining characteristic of the clusters. You can extend the n-grams beyond 2 for experimentation.
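
To get a feel for what the n-gram feature looks like, here is a small standalone sketch (the query string is made up); changing [1, 2] to [1, 3] should also include trigrams:

-- Tokenize a (made-up) query string and build unigrams and bigrams from it,
-- the same way it is done inside the CREATE MODEL statement above.
SELECT
  ML.NGRAMS(
    REGEXP_EXTRACT_ALL(LOWER('SELECT name FROM analysis.users'), '[a-z0-9]+'),
    [1, 2]) AS query_grams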

We do not specify the optimal number of clusters during model creation; we let BQ ML take care of that.
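
If you do want to control this yourself, k-means in BQ ML accepts a num_clusters option. A minimal sketch with a hypothetical model name, repeating only two of the features above for brevity:

CREATE OR REPLACE MODEL
  `analysis.anomaly_kmeans_fixed_k` OPTIONS (model_type='kmeans',
    num_clusters = 8, -- fixed by you instead of chosen by BQ ML
    standardize_features = TRUE) AS
SELECT
  EXTRACT(HOUR FROM creation_time) AS hod,
  total_bytes_processed
FROM
  `analysis.ref_data_training`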

Once your model finishes training, be sure to check out its stats. You do this the same way you normally preview tables in BQ, and it comes with a lot of handy visualisations.
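
If you prefer SQL to the UI, the same information is also queryable; something along these lines should work (the exact output columns depend on the model type):

-- Per-feature centroid values (numeric and categorical)
SELECT
  *
FROM
  ML.CENTROIDS(MODEL `analysis.anomaly_kmeans_001`);

-- Overall clustering metrics, e.g. Davies-Bouldin index and mean squared distance
SELECT
  *
FROM
  ML.EVALUATE(MODEL `analysis.anomaly_kmeans_001`);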

[Figure: statement_type distribution across the centroids, an example categorical feature (Image by Author)]

[Figure: centroid values for the numeric features; all of these could use some additional preprocessing (Image by Author)]

Predicting clusters for unseen samples

CREATE OR REPLACE TABLE
  `analysis.predicted_clusters` AS
WITH
  refs_table AS (
  SELECT
    r.job_id,
    EXTRACT(HOUR FROM creation_time) AS hod,
    EXTRACT(DAYOFWEEK FROM creation_time) AS dow,
    TIMESTAMP_DIFF(end_time, start_time, MINUTE) AS duration_mins,
    statement_type,
    ML.NGRAMS(bt.words_array, [1, 2]) AS query_grams,
    query,
    total_bytes_processed,
    total_slot_ms,
    destination_table.dataset_id AS dest_dataset_id
  FROM
    `analysis.ref_data_testing` r
  JOIN (
    SELECT
      REGEXP_EXTRACT_ALL(LOWER(query), '[a-z0-9]+') AS words_array,
      job_id
    FROM
      `analysis.ref_data_testing`) AS bt
  ON
    r.job_id = bt.job_id)
SELECT
  * EXCEPT(query_grams) -- we don't want to clutter the output table
FROM
  ML.PREDICT( MODEL `analysis.anomaly_kmeans_001`,
    (
    SELECT
      *
    FROM
      refs_table))

ML.PREDICT returns two important columns: CENTROID_ID and the nested NEAREST_CENTROIDS_DISTANCE. The latter contains an array of (centroid_id, distance) pairs. Needless to say, an instance is assigned to the cluster whose centroid is at the minimum distance.

This nested field can also be used for other purposes.
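
For instance, you can flatten it to inspect how far each job is from every centroid, not just its assigned one. A quick sketch against the table we just created:

-- Distance of each job to every centroid, closest first
SELECT
  c.job_id,
  c.CENTROID_ID AS assigned_centroid,
  ncd.CENTROID_ID AS candidate_centroid,
  ncd.DISTANCE AS distance
FROM
  `analysis.predicted_clusters` c,
  UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
ORDER BY
  c.job_id,
  distance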

Calculating Percentiles

As you might have guessed by now, the idea is that outliers (rogue transactions) will lie relatively far from their cluster centroids. We create a table to store the percentiles of these distances (9990 is shorthand for the 99.90th percentile; in other words, 99.90% of the classified instances have a distance smaller than this value).

Two notes:

  1. Depending on the size of your data, you can use different percentiles (e.g. 75%, 90% and 95%).
  2. You do not need to store these percentiles; I did so for persistence.

CREATE OR REPLACE TABLE
  `analysis.percentiles_anomaly` AS
SELECT
  percentiles[OFFSET(9500)] AS p9500,
  percentiles[OFFSET(9900)] AS p9900,
  percentiles[OFFSET(9990)] AS p9990,
  percentiles[OFFSET(9999)] AS p9999
FROM (
  SELECT
    APPROX_QUANTILES(dist_from_centroid, 10000) AS percentiles
  FROM (
    SELECT
      job_id,
      c.CENTROID_ID,
      MIN(ncd.DISTANCE) AS dist_from_centroid
    FROM
      `analysis.predicted_clusters` c,
      UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
    GROUP BY
      1,
      2) t) AS t2

Additional Note

Ideally, we should calculate percentiles for each cluster independently since clusters are not uniformly spread out. Modify the above query to reflect this.
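
One way to do that, keeping the same distance subquery but grouping the quantiles by cluster (a sketch, storing only the 99.90th percentile):

SELECT
  CENTROID_ID,
  APPROX_QUANTILES(dist_from_centroid, 10000)[OFFSET(9990)] AS p9990
FROM (
  SELECT
    job_id,
    c.CENTROID_ID,
    MIN(ncd.DISTANCE) AS dist_from_centroid
  FROM
    `analysis.predicted_clusters` c,
    UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
  GROUP BY
    1,
    2)
GROUP BY
  CENTROID_ID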

I see you, anomaly

It’s time to weed out rogue transactions based on the predicted data and the distance percentiles. Start with the highest percentile to see how many instances are classified as outliers and work your way downwards.

CREATE OR REPLACE TABLE
 `analysis.anomalies9999` AS
WITH
 anomalies AS (
 SELECT
   job_id,
   c.CENTROID_ID,
   MIN(ncd.DISTANCE) AS dist_from_centroid
 FROM
    `analysis.predicted_clusters` c,
   UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
 GROUP BY
   1,
   2
 HAVING
    MIN(ncd.DISTANCE) >= 34.3668866 -- example 99.99th percentile value from analysis.percentiles_anomaly
 ORDER BY
   dist_from_centroid DESC)
SELECT
 d.*
FROM
  `analysis.ref_data_testing` d -- full job metadata for the test period
JOIN
 anomalies a
ON
 a.job_id = d.job_id

Service accounts running routine queries stayed close to their centroids, so they did not show up among the potential anomalies above. Some of the instances that did show up had the following characteristics:

  1. Unique queries run from personal user accounts with a small number of total lifetime queries. user_email is a feature we chose to ignore, since we want to check integrity at the transaction level rather than the user level.
  2. Erroneous queries run by all kinds of accounts.
  3. Queries whose features (listed above) fall in the tail of their respective distributions, e.g. an enormously high total_bytes_processed value.
  4. dow and hod filters on these anomalies helped narrow them down further, as sketched below. The same goes for statement_type and user_email.
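
As an illustration, here is a hypothetical follow-up filter on the anomalies table; the working hours, weekend definition and statement type are assumptions you would adapt to your own environment:

-- Anomalous jobs created outside working hours or on weekends, restricted to SELECTs
SELECT
  job_id,
  user_email,
  statement_type,
  creation_time,
  total_bytes_processed
FROM
  `analysis.anomalies9999`
WHERE
  (EXTRACT(HOUR FROM creation_time) NOT BETWEEN 8 AND 19
    OR EXTRACT(DAYOFWEEK FROM creation_time) IN (1, 7)) -- 1 = Sunday, 7 = Saturday
  AND statement_type = 'SELECT'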

As always, Delivery Hero is still hiring, so check out some of our latest openings, or join our Talent Community to stay up to date with what’s going on at Delivery Hero and receive customized job alerts!