Scikit-Learn : Anomaly Detection

Here, we will learn what anomaly detection is in scikit-learn and how it is used to identify unusual data points.

Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories −

  • Point anomalies − These occur when an individual data instance is anomalous with respect to the rest of the data.
  • Contextual anomalies − These are context specific: a data instance is anomalous only in a particular context.
  • Collective anomalies − These occur when a collection of related data instances is anomalous with respect to the entire dataset, even though the individual values may not be.

Methods

Two methods, namely outlier detection and novelty detection, can be used for anomaly detection. It is necessary to see the distinction between them.

Outlier detection

The training data contains outliers, which are defined as observations that are far from the rest of the data. For that reason, outlier detection estimators try to fit the regions where the training data is most concentrated while ignoring the deviant observations. It is also known as unsupervised anomaly detection.

Novelty detection

It is concerned with detecting an unobserved pattern in new observations that is not included in the training data. Here, the training data is not polluted by outliers, and the goal is to decide whether a new observation is an outlier (a novelty). It is also known as semi-supervised anomaly detection.

scikit-learn provides a set of ML tools that can be used for both outlier detection and novelty detection. These tools are implemented as objects that learn from the data in an unsupervised way by using the fit() method as follows −

estimator.fit(X_train)

The new observations can then be sorted as inliers (labeled 1) or outliers (labeled -1) by using the predict() method as follows −

estimator.predict(X_test)

The estimator first computes the raw scoring function, and the predict method then applies a threshold to that raw score. We can access the raw scoring function with the score_samples method and control the threshold with the contamination parameter.

We can also use the decision_function method, which assigns negative values to outliers and non-negative values to inliers −

estimator.decision_function(X_test)
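
As a minimal sketch of the workflow above, the following script uses IsolationForest as a stand-in estimator; the other estimators described in this chapter expose the same methods. The synthetic data, the variable names, and the contamination value of 0.1 are assumptions made only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for this sketch): a tight cluster for training,
# plus a test set that contains one obviously distant point.
rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_test = np.array([[0.1, -0.2], [8.0, 8.0]])

estimator = IsolationForest(contamination=0.1, random_state=0)
estimator.fit(X_train)

labels = estimator.predict(X_test)               # 1 = inlier, -1 = outlier
raw_scores = estimator.score_samples(X_test)     # lower means more abnormal
decision = estimator.decision_function(X_test)   # raw score shifted by offset_

# The decision function is the raw score minus the fitted offset_, and
# predict() labels a sample -1 exactly when its decision value is negative.
print(np.allclose(decision, raw_scores - estimator.offset_))
print(np.array_equal(labels, np.where(decision < 0, -1, 1)))
```

Both checks should print True, which shows how predict(), score_samples() and decision_function() relate to one another.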

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution, such as a Gaussian distribution. For outlier detection, scikit-learn provides an object named covariance.EllipticEnvelope.

This object fits a robust covariance estimate to the data and thus fits an ellipse to the central data points. It ignores the points outside the central mode.

Parameters

The following table lists the parameters used by the sklearn.covariance.EllipticEnvelope class −

Sr.No  Parameter & Description
1      store_precision − Boolean, optional, default = True. Specifies whether the estimated precision is stored.
2      assume_centered − Boolean, optional, default = False. If False, the robust location and covariance are computed directly with the FastMCD algorithm. If True, the support of the robust location and covariance estimates is computed, and a covariance estimate is recomputed from it without centering the data.
3      support_fraction − float in (0., 1.), optional, default = None. The proportion of points to be included in the support of the raw MCD estimates.
4      contamination − float in (0., 0.5], optional, default = 0.1. The proportion of outliers in the data set.
5      random_state − int, RandomState instance or None, optional, default = None. Controls the pseudo-random number generator used while shuffling the data. The options are as follows:
         • int − random_state is the seed used by the random number generator.
         • RandomState instance − random_state is the random number generator.
         • None − the random number generator is the RandomState instance used by np.random.

Attributes

The following table lists the attributes exposed by the sklearn.covariance.EllipticEnvelope class −

Sr.No  Attribute & Description
1      support_ − array-like, shape (n_samples,). The mask of the observations used to compute the robust estimates of location and shape.
2      location_ − array-like, shape (n_features,). The estimated robust location.
3      covariance_ − array-like, shape (n_features, n_features). The estimated robust covariance matrix.
4      precision_ − array-like, shape (n_features, n_features). The estimated pseudo-inverse matrix.
5      offset_ − float. Used to define the decision function from the raw scores: decision_function = score_samples - offset_
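
To make these attributes concrete, here is a small sketch that fits an EllipticEnvelope and inspects them. The synthetic Gaussian data and the variable names are assumptions for illustration. The script also checks that the raw scores are the negative squared Mahalanobis distances returned by the estimator's mahalanobis() method.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic Gaussian data (an assumption for this sketch).
rng = np.random.RandomState(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.3], [0.3, 0.5]], size=300)

env = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)

print(env.location_.shape)     # (n_features,)
print(env.covariance_.shape)   # (n_features, n_features)
print(env.support_.shape)      # (n_samples,), boolean mask

# Raw scores are the negative squared Mahalanobis distances to the robust fit.
scores = env.score_samples(X)
print(np.allclose(scores, -env.mahalanobis(X)))

# The decision function shifts the raw scores by offset_.
print(np.allclose(env.decision_function(X), scores - env.offset_))
```

Points far from location_ under the fitted covariance_ get large Mahalanobis distances, hence low scores and a negative decision value.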

Implementation Example

import numpy as np
from sklearn.covariance import EllipticEnvelope
# Note: this matrix is not positive semi-definite, so NumPy may emit a RuntimeWarning when sampling.
true_cov = np.array([[.5, .6], [.6, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0], cov=true_cov, size=500)
cov = EllipticEnvelope(random_state=0).fit(X)
# Now we can use the predict method. It returns 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0], [2, 2]])

Output

array([ 1, -1])

Isolation Forest

For high-dimensional datasets, one efficient way to perform outlier detection is to use random forests. scikit-learn provides the ensemble.IsolationForest class, which isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Here, the number of splits needed to isolate a sample is equivalent to the path length from the root node to the terminating node. Averaged over a forest of such random trees, this path length measures normality: anomalies tend to have noticeably shorter paths.
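
The splitting idea can be sketched in a few lines of NumPy. This is an illustrative toy, not scikit-learn's implementation. The helper name isolation_path_length, the synthetic data, and the cap on the number of splits are all assumptions.

```python
import numpy as np

def isolation_path_length(X, target, rng, max_splits=100):
    """Count the random splits needed to isolate X[target] (toy sketch)."""
    idx = np.arange(len(X))          # samples still sharing a node with the target
    depth = 0
    for _ in range(max_splits):
        if len(idx) <= 1:
            break                    # the target is alone: isolated
        feature = rng.randint(X.shape[1])
        col = X[idx, feature]
        low, high = col.min(), col.max()
        if low == high:
            continue                 # constant feature in this node, pick another
        threshold = rng.uniform(low, high)
        # Keep only the samples that fall on the same side as the target.
        go_left = X[target, feature] < threshold
        idx = idx[(col < threshold) == go_left]
        depth += 1
    return depth

rng = np.random.RandomState(0)
# 100 clustered points plus one distant point at index 100.
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

mean_outlier = np.mean([isolation_path_length(X, 100, rng) for _ in range(200)])
mean_inlier = np.mean([isolation_path_length(X, 0, rng) for _ in range(200)])
print(mean_outlier, mean_inlier)
```

On average the distant point is isolated after far fewer splits than a point inside the cluster, which is the signal the forest uses to score anomalies.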

Parameters

The following table lists the parameters used by the sklearn.ensemble.IsolationForest class −

Sr.No  Parameter & Description
1      n_estimators − int, optional, default = 100. The number of base estimators in the ensemble.
2      max_samples − int or float, optional, default = "auto". The number of samples to be drawn from X to train each base estimator. If int, it draws max_samples samples. If float, it draws max_samples * X.shape[0] samples. If "auto", it uses max_samples = min(256, n_samples).
3      contamination − "auto" or float, optional, default = "auto". The proportion of outliers in the data set. If set to "auto", the threshold is determined as in the original paper. If set to a float, the value should be in the range (0, 0.5].
4      random_state − int, RandomState instance or None, optional, default = None. Controls the pseudo-random number generator used to select features and split values. The options are as follows:
         • int − random_state is the seed used by the random number generator.
         • RandomState instance − random_state is the random number generator.
         • None − the random number generator is the RandomState instance used by np.random.
5      max_features − int or float, optional, default = 1.0. The number of features to be drawn from X to train each base estimator. If int, it draws max_features features. If float, it draws max_features * X.shape[1] features.
6      bootstrap − Boolean, optional, default = False. If False, sampling is performed without replacement. If True, individual trees are fit on random subsets of the training data sampled with replacement.
7      n_jobs − int or None, optional, default = None. The number of jobs to run in parallel for both the fit() and predict() methods.
8      verbose − int, optional, default = 0. Controls the verbosity of the tree building process.
9      warm_start − Boolean, optional, default = False. If True, the solution of the previous call to fit is reused and more estimators are added to the ensemble. If False, a whole new forest is fit.

Attributes

The following table lists the attributes exposed by the sklearn.ensemble.IsolationForest class −

Sr.No  Attribute & Description
1      estimators_ − list of ExtraTreeRegressor instances. The collection of all fitted sub-estimators.
2      max_samples_ − integer. The actual number of samples used.
3      offset_ − float. Used to define the decision function from the raw scores: decision_function = score_samples - offset_

Implementation Example

The Python script below uses the sklearn.ensemble.IsolationForest class to fit 10 trees on the given data −

from sklearn.ensemble import IsolationForest
import numpy as np
X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
OUTDclf = IsolationForest(n_estimators = 10)
OUTDclf.fit(X)

Output

IsolationForest(
   behaviour = 'old', bootstrap = False, contamination='legacy',
   max_features = 1.0, max_samples = 'auto', n_estimators = 10, n_jobs=None,
   random_state = None, verbose = 0
)
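
Fitting alone does not show which sample the forest considers anomalous. As a hedged follow-up sketch on the same five points, the script below ranks them with score_samples. The larger n_estimators value, the fixed random_state, and the variable names are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])

# random_state is fixed only to make this sketch reproducible (an assumption).
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

scores = forest.score_samples(X)        # lower means more abnormal
most_anomalous = X[np.argmin(scores)]   # the sample that is easiest to isolate

print(scores)
print(most_anomalous)
```

The point [-50, 60] lies far from the other four, so it is isolated after very few splits and receives the lowest score.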

Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is another efficient way to perform outlier detection on high-dimensional data. scikit-learn provides the neighbors.LocalOutlierFactor class, which computes a score, called the local outlier factor, reflecting the degree of abnormality of the observations. The main idea of this algorithm is to detect samples that have a substantially lower density than their neighbors. That is why it measures the local density deviation of a given data point with respect to its neighbors.

Parameters

The following table lists the parameters used by the sklearn.neighbors.LocalOutlierFactor class −

Sr.No  Parameter & Description
1      n_neighbors − int, optional, default = 20. The number of neighbors used by default for kneighbors queries. All samples are used if n_neighbors is larger than the number of samples provided.
2      algorithm − optional. The algorithm used to compute the nearest neighbors:
         • ball_tree − uses the BallTree algorithm.
         • kd_tree − uses the KDTree algorithm.
         • brute − uses a brute-force search.
         • auto − attempts to decide the most appropriate algorithm based on the values passed to the fit() method.
3      leaf_size − int, optional, default = 30. Affects the speed of construction and query, as well as the memory required to store the tree. This parameter is passed to the BallTree or KDTree algorithm.
4      contamination − "auto" or float, optional, default = "auto". The proportion of outliers in the data set. If set to "auto", the threshold is determined as in the original paper. If set to a float, the value should be in the range (0, 0.5].
5      metric − string or callable, default = "minkowski". The metric used for distance computation.
6      p − int, optional, default = 2. The parameter for the Minkowski metric. p = 1 is equivalent to using manhattan_distance (L1), whereas p = 2 is equivalent to using euclidean_distance (L2).
7      novelty − Boolean, default = False. By default, the LOF algorithm is used for outlier detection, but it can be used for novelty detection if we set novelty = True.
8      n_jobs − int or None, optional, default = None. The number of parallel jobs to run for the neighbors search.

Attributes

The following table lists the attributes exposed by the sklearn.neighbors.LocalOutlierFactor class −

Sr.No  Attribute & Description
1      negative_outlier_factor_ − numpy array, shape (n_samples,). The opposite of the LOF of the training samples.
2      n_neighbors_ − integer. The actual number of neighbors used for neighbors queries.
3      offset_ − float. Used to define the binary labels from the raw scores.
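
The novelty parameter changes how the estimator is used. A minimal sketch, assuming clean synthetic training data and hypothetical variable names: with novelty = True, the model is fitted on the training set and then predict and score_samples are applied to new, unseen observations.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Clean training data and two new observations (assumptions for this sketch).
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])

# novelty=True enables predict/score_samples on data not seen during fit.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

new_labels = lof.predict(X_new)         # 1 = inlier, -1 = novelty
new_scores = lof.score_samples(X_new)   # opposite of the LOF, lower = more abnormal

print(new_labels)
print(lof.negative_outlier_factor_.shape)   # one value per training sample
```

The point at the center of the training cloud should be labeled as an inlier, while the distant point should be flagged as a novelty.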

Implementation Example

The Python script given below uses sklearn.neighbors.NearestNeighbors, which exposes the same kind of k-nearest-neighbors query that LocalOutlierFactor is built on, to construct a neighbor searcher from an array representing our data set −

from sklearn.neighbors import NearestNeighbors
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
LOFneigh = NearestNeighbors(n_neighbors = 1, algorithm = "ball_tree",p=1)
LOFneigh.fit(samples)

Output

NearestNeighbors(
   algorithm = 'ball_tree', leaf_size = 30, metric='minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 1, p = 1, radius = 1.0
)

Example

Now, we can ask this fitted object which point is closest to [0.5, 1., 1.5] by using the following Python script −

print(LOFneigh.kneighbors([[.5, 1., 1.5]]))

Output

(array([[1.5]]), array([[2]]))
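
The neighbor query above only finds the closest training points. To obtain actual outlier labels, LocalOutlierFactor itself can be used. The following is a small sketch on a one-dimensional toy data set; the data values and variable names are chosen for illustration and are not taken from the examples above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Three nearby values and one far away value (toy data for this sketch).
X = np.array([[-1.1], [0.2], [101.1], [0.3]])

lof = LocalOutlierFactor(n_neighbors=2)

# In the default outlier detection mode, fit_predict labels the training data.
labels = lof.fit_predict(X)
print(labels)                          # array([ 1,  1, -1,  1])

# The more negative the value, the more abnormal the sample.
print(lof.negative_outlier_factor_)
```

The value 101.1 sits in a far sparser region than its two nearest neighbors, so its local outlier factor is very large and it is labeled -1.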

One-Class SVM

The One-Class SVM, introduced by Schölkopf et al., performs unsupervised outlier detection. It is also very efficient on high-dimensional data and estimates the support of a high-dimensional distribution. It is implemented in the Support Vector Machines module as the sklearn.svm.OneClassSVM object. To define a frontier, it requires a kernel (RBF is the most commonly used) and a scalar parameter.

For better understanding, let's fit our data with the svm.OneClassSVM object −

Example

from sklearn.svm import OneClassSVM
X = [[0], [0.89], [0.90], [0.91], [1]]
OSVMclf = OneClassSVM(gamma = 'scale').fit(X)

Now, we can get the score_samples for the input data as follows −

OSVMclf.score_samples(X)

Output

array([1.12218594, 1.58645126, 1.58673086, 1.58645127, 1.55713767])
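
The raw scores become labels through predict and decision_function. As a hedged sketch, the script below trains a One-Class SVM on synthetic data and classifies two new points. The data, the variable names, and the choice nu = 0.05 for the scalar parameter are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic training cloud and two new observations (assumptions for this sketch).
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])

# nu roughly bounds the fraction of training points left outside the frontier.
svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

new_labels = svm.predict(X_new)               # 1 = inside the frontier, -1 = outside
new_decision = svm.decision_function(X_new)   # signed distance to the frontier

print(new_labels)
print(new_decision)
```

The point at the center of the training cloud falls inside the learned frontier, while the distant point has a negative decision value and is labeled -1.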
