Anomaly Detection using K-means Clustering in BQ ML

By Varun Chitale
In the jolly good month of December, while you were out of office enjoying what little outdoors COVID let you enjoy, two of your team's developers used some service accounts (and possibly even their private ones) to run a bunch of rogue queries, to see if you could identify them. They refuse to define "rogue" and want to see whether ML is any good in the sphere of cybersecurity.
Spoiler: It is.
We will collect metadata about all transactions in our BQ database, use BQ Machine Learning (BQ ML) to train an (unsupervised) k-means algorithm that clusters transactions, and then use the resulting cluster assignments and distances to flag outliers.
Generating the datasets
CREATE OR REPLACE TABLE `analysis.ref_data_training` AS
SELECT *
FROM `region-xx`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time) BETWEEN "2020-11-01" AND "2020-12-01"
Let's consider the data between the above dates as our training data. This assumes that no rogue transactions happened in this period, so the 'normal' clusters we generate from it represent legitimate activity.

region-xx should be replaced by the appropriate region. Additionally, let's assume the presence of rogue/outlier transactions in the period between 2020-12-01 and 2021-01-01. We create a similar table, analysis.ref_data_testing, covering this period.
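A similar statement creates it; only the date range changes:

-- Same schema as the training table, but covering the suspect period
CREATE OR REPLACE TABLE `analysis.ref_data_testing` AS
SELECT *
FROM `region-xx`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time) BETWEEN "2020-12-01" AND "2021-01-01"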
The table contains the following columns that we will use to build the model:
job_id, creation_time, end_time, start_time, statement_type, query, total_bytes_processed, total_slot_ms, destination_table.dataset_id
Creating the model
CREATE OR REPLACE MODEL `analysis.anomaly_kmeans_001`
OPTIONS (model_type = 'kmeans', standardize_features = TRUE) AS
SELECT
  r.job_id,
  EXTRACT(HOUR FROM creation_time) AS hod,
  EXTRACT(DAYOFWEEK FROM creation_time) AS dow,
  TIMESTAMP_DIFF(end_time, start_time, MINUTE) AS duration_mins,
  statement_type,
  ML.NGRAMS(bt.words_array, [1, 2]) AS query_grams,
  query,
  total_bytes_processed,
  total_slot_ms,
  destination_table.dataset_id AS dest_dataset_id
FROM `analysis.ref_data_training` r
JOIN (
  SELECT
    REGEXP_EXTRACT_ALL(LOWER(query), '[a-z0-9]+') AS words_array,
    job_id
  FROM `analysis.ref_data_training`) AS bt
ON r.job_id = bt.job_id
ORDER BY RAND() -- usually WHERE RAND() < number_rows/total_rows works well
LIMIT 100000 -- depending on your data size, you might want to skip this
We also craft features like the hour of the day (hod) and the day of the week (dow). In addition, we have one main feature, query_grams, which will be the defining characteristic of the clusters. You can extend the grams beyond 2 for experimentation.
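As a quick, standalone illustration of what ML.NGRAMS produces when extended to trigrams (the token array here is a made-up example):

-- Unigrams, bigrams and trigrams from a tokenized query
SELECT ML.NGRAMS(['select', 'count', 'from', 'orders'], [1, 3]) AS query_grams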
We do not specify the optimal number of clusters during model creation; we let BQ ML take care of that.
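If you would rather pin the number of clusters yourself (for example after an elbow analysis), BQ ML accepts it as an option. A minimal sketch, with a hypothetical model name, an illustrative num_clusters value, and a reduced feature set for brevity:

CREATE OR REPLACE MODEL `analysis.anomaly_kmeans_fixed_k`
OPTIONS (model_type = 'kmeans',
         num_clusters = 8, -- illustrative; tune for your data
         standardize_features = TRUE) AS
SELECT
  EXTRACT(HOUR FROM creation_time) AS hod,
  EXTRACT(DAYOFWEEK FROM creation_time) AS dow,
  total_bytes_processed,
  total_slot_ms
FROM `analysis.ref_data_training`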
Once your model finishes training, be sure to check out its stats. You would do this the same way you normally preview tables in BQ; the model page has a lot of handy visualisations.
Visualizing centroid points for an example categorical feature and some numeric features
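If you prefer SQL to the UI, the same centroid information, plus overall clustering metrics, can be pulled programmatically; a quick sketch:

-- Per-centroid feature values (what the UI visualises)
SELECT * FROM ML.CENTROIDS(MODEL `analysis.anomaly_kmeans_001`);

-- Davies-Bouldin index and mean squared distance for the model
SELECT * FROM ML.EVALUATE(MODEL `analysis.anomaly_kmeans_001`);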
Predict the right clusters for unseen samples
CREATE OR REPLACE TABLE `analysis.predicted_clusters` AS
WITH refs_table AS (
  SELECT
    r.job_id,
    EXTRACT(HOUR FROM creation_time) AS hod,
    EXTRACT(DAYOFWEEK FROM creation_time) AS dow,
    TIMESTAMP_DIFF(end_time, start_time, MINUTE) AS duration_mins,
    statement_type,
    ML.NGRAMS(bt.words_array, [1, 2]) AS query_grams,
    query,
    total_bytes_processed,
    total_slot_ms,
    destination_table.dataset_id AS dest_dataset_id
  FROM `analysis.ref_data_testing` r
  JOIN (
    SELECT
      REGEXP_EXTRACT_ALL(LOWER(query), '[a-z0-9]+') AS words_array,
      job_id
    FROM `analysis.ref_data_testing`) AS bt
  ON r.job_id = bt.job_id)
SELECT * EXCEPT(query_grams) -- we don't want to complicate the output table
FROM ML.PREDICT(
  MODEL `analysis.anomaly_kmeans_001`,
  (SELECT * FROM refs_table))
Two important columns are returned by ML.PREDICT: CENTROID_ID and a nested NEAREST_CENTROIDS_DISTANCE. The latter contains an array of centroid_id and distance pairs. Needless to say, the instance is assigned to the cluster whose centroid_id has the minimum distance.
This nested field can also be used for other purposes.
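For instance, here is a sketch that lists each job's distance to every centroid, nearest first, against the prediction table we just created:

-- Distances from each job to every centroid, nearest first
SELECT
  job_id,
  CENTROID_ID AS assigned_centroid,
  ncd.CENTROID_ID AS candidate_centroid,
  ncd.DISTANCE AS distance
FROM `analysis.predicted_clusters`,
  UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
ORDER BY job_id, distance
LIMIT 20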
As you might have guessed by now, the idea is that outliers (rogue transactions) will lie relatively far from the cluster centroids. We create a table to store the distance percentiles (9990 is shorthand for the 99.90th percentile: APPROX_QUANTILES(x, 10000) returns 10,001 boundary values, so OFFSET(9990) picks the 99.90th). If it's unclear: 99.90% of the classified instances have a distance less than this value.
- Depending on the size of your data, you may get by with different percentiles (e.g. 75%, 90% and 95%)
- You do not need to store these percentiles; I did so for persistence.
CREATE OR REPLACE TABLE `analysis.percentiles_anomaly` AS
SELECT
  percentiles[OFFSET(9500)] AS p9500,
  percentiles[OFFSET(9900)] AS p9900,
  percentiles[OFFSET(9990)] AS p9990,
  percentiles[OFFSET(9999)] AS p9999
FROM (
  SELECT APPROX_QUANTILES(dist_from_centroid, 10000) AS percentiles
  FROM (
    SELECT
      job_id,
      c.CENTROID_ID,
      MIN(ncd.DISTANCE) AS dist_from_centroid
    FROM `analysis.predicted_clusters` c,
      UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
    GROUP BY 1, 2) t) AS t2
Ideally, we should calculate the percentiles for each cluster independently, since clusters are not uniformly spread out. Modify the above query to reflect this; one possible shape is sketched below.
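A minimal sketch of that per-cluster variant, computing a single 99.90th percentile per centroid (extend the OFFSETs as needed):

-- One distance threshold per cluster instead of one global threshold
SELECT
  CENTROID_ID,
  APPROX_QUANTILES(dist_from_centroid, 10000)[OFFSET(9990)] AS p9990
FROM (
  SELECT
    job_id,
    c.CENTROID_ID,
    MIN(ncd.DISTANCE) AS dist_from_centroid
  FROM `analysis.predicted_clusters` c,
    UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
  GROUP BY 1, 2)
GROUP BY CENTROID_ID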
I see you, anomaly
It’s time to weed out rogue transactions based on the predicted data and the distance percentiles. Start with the highest percentile to see how many instances are classified as outliers and work your way downwards.
CREATE OR REPLACE TABLE `analysis.anomalies9999` AS
WITH anomalies AS (
  SELECT
    job_id,
    c.CENTROID_ID,
    MIN(ncd.DISTANCE) AS dist_from_centroid
  FROM `analysis.predicted_clusters` c,
    UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
  GROUP BY 1, 2
  HAVING MIN(ncd.DISTANCE) >= 34.3668866 -- example value: the 99.99th percentile
)
SELECT d.*
FROM `analysis.ref_data_testing` d
JOIN anomalies a
ON a.job_id = d.job_id
ORDER BY a.dist_from_centroid DESC
Service accounts running routine queries sat close to their centroids, so they did not show up in the potential anomalies above. Some of the instances that did show up had the following characteristics:
- Unique queries run from personal user accounts with a small number of total lifetime queries.
user_email is a feature we chose to ignore, since we want to check integrity at the transaction level rather than the user level.
- Erroneous queries run by all kinds of accounts.
- Queries whose features fall in the tail parts of their respective distributions, e.g. enormously high total_bytes_processed or total_slot_ms. Applying hod filters on these anomalies helped narrow them down further; the same goes for dow (see the sketch below).
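A hypothetical example of such an hod filter (the 07:00-19:00 working window is purely an assumption):

-- Keep only anomalies created outside an assumed working-hours window
SELECT a.*
FROM `analysis.anomalies9999` a
WHERE EXTRACT(HOUR FROM a.creation_time) NOT BETWEEN 7 AND 19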
As always, Delivery Hero is still hiring, so check out some of our latest openings, or join our Talent Community to stay up to date with what’s going on at Delivery Hero and receive customized job alerts!