Machine Learning Confusion Matrix

Subhashyadav
9 min read · Jun 6, 2021

When building a machine learning model, the first step is to acknowledge that real-world data is imperfect: different problems call for different approaches and tools, and choosing the right model usually involves trade-offs. Creating a machine learning model involves the following steps:

>>Gathering data

>>Preparing that data

>>Choosing a model

>>Training

>>Evaluation

>>Hyperparameter tuning

>>Prediction.

In this article, we explain the techniques used in evaluating how well a machine learning model generalizes to new, previously unseen data.

Model Evaluation Techniques

The above issues can be handled by evaluating the performance of a machine learning model, which is an integral component of any data science project. Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

Model Evaluation Metrics

Model evaluation metrics are required to quantify model performance. The choice of evaluation metrics depends on the given machine learning task (such as classification, regression, ranking, clustering, or topic modeling, among others). Some metrics, such as precision and recall, are useful for multiple tasks.

Classification Metrics

  • Confusion matrix

A confusion matrix is a performance measurement technique for machine learning classification. It is a table that shows how a classification model performs on a set of test data for which the true values are known. The idea of a confusion matrix is simple, but the related terminology can be a little confusing, so a simple explanation is given here.

Four outcomes of the confusion matrix

The confusion matrix visualizes the accuracy of a classifier by comparing the actual and predicted classes. The binary confusion matrix is composed of four cells:

  • TP: True Positive: Positive values correctly predicted as positive
  • FP: False Positive: Negative values incorrectly predicted as positive
  • FN: False Negative: Positive values incorrectly predicted as negative
  • TN: True Negative: Negative values correctly predicted as negative
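As a minimal sketch, the four outcomes can be counted directly from a list of actual labels and a list of predicted labels. The function name `count_outcomes` and the sample labels are made up for illustration; they are not from any particular library.

```python
def count_outcomes(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive and a != positive:
            fp += 1          # predicted positive, actually negative
        elif p != positive and a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up labels for illustration: 1 = positive, 0 = negative
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = count_outcomes(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```

Every sample falls into exactly one of the four cells, so the four counts always add up to the number of samples.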

You can compute the accuracy from the confusion matrix:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example of Confusion Matrix:

A confusion matrix is a useful machine learning tool that lets you measure recall, precision and accuracy, and it supplies the true positive and false positive rates from which the AUC-ROC curve is built. The example below illustrates the terms True Positive, True Negative, False Positive, and False Negative.

True Positive:

You predicted positive, and it turned out to be true. For example, you predicted that you would pass the exam, and you really passed.

True Negative:

You predicted negative, and it turned out to be true. You predicted that you would fail the exam, and you really failed.

False Positive:

Your prediction is positive, and it is false.

You predicted that you would pass the exam, but you failed.

False Negative:

Your prediction is negative, and it is false.

You predicted that you would not pass the exam, but you passed.

Remember that Positive and Negative describe the predicted value, while True and False describe whether that prediction matched the actual value.

Type I and Type II Error representation in a Confusion Matrix:

How can we measure the error in a classification problem?

Type I Error (False Positive) and Type II Error (False Negative) help us judge the accuracy of our model, and both can be read from the confusion matrix. If we sum the Type I and Type II errors, we get the Total Error = False Negative + False Positive.

Accuracy is higher when the error is lower, and vice versa. The better the accuracy, the better the performance, and that is exactly what we want.
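The relationship between total error and accuracy can be checked with a few lines of arithmetic. This is only a sketch: the function name `error_and_accuracy` and the counts passed to it are made up for illustration.

```python
def error_and_accuracy(tp, tn, fp, fn):
    """Return (total_error, error_rate, accuracy) from the four cell counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    total_error = fp + fn            # Type I errors + Type II errors
    error_rate = total_error / total
    accuracy = (tp + tn) / total     # share of correct predictions
    return total_error, error_rate, accuracy


# Made-up counts for illustration
total_error, error_rate, accuracy = error_and_accuracy(tp=40, tn=45, fp=5, fn=10)

print(total_error)            # 15
print(f"{error_rate:.2f}")    # 0.15
print(f"{accuracy:.2f}")      # 0.85
```

Accuracy and error rate always sum to 1, which is why a lower error means a higher accuracy.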

Precision

You can also use a confusion matrix to calculate the precision. The precision, along with the true positive rate (also known as “recall”), will be needed later on to calculate the area under the precision-recall curve (AUPRC), another popular performance metric.

Precision = True Positives / (True Positives + False Positives)
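A small sketch of the precision formula, together with recall (the true positive rate), is shown here. The function names and counts are illustrative, and returning 0.0 when the denominator is zero is an assumption made for this sketch, not a universal convention.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    if tp + fp == 0:
        return 0.0  # assumption: no positive predictions, report 0.0
    return tp / (tp + fp)


def recall(tp, fn):
    """Recall (true positive rate) = TP / (TP + FN)."""
    if tp + fn == 0:
        return 0.0  # assumption: no actual positives, report 0.0
    return tp / (tp + fn)


# Made-up counts for illustration
p = precision(tp=40, fp=5)
r = recall(tp=40, fn=10)

print(f"{p:.3f}")  # 0.889
print(f"{r:.3f}")  # 0.800
```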


Here, I think we should take a close look at Type I and Type II errors.

Let’s say we have a lung cancer prediction model.

What do you think?

In lung cancer classification from CT scan images, is a false positive or a false negative more important to care about?

First, let us make some definitions:

A false positive = person is considered sick but is actually healthy

A false negative = person is considered healthy but is actually sick.

What does it mean?

False positive cases might be tolerable at some point, if we don’t consider the patient’s lost time and money, because the patient is not losing their life.

A false negative case means that your patients get sicker or die.

It is very difficult to make the right decisions.

So it is important to have an accurate model, although we can’t achieve a 100% accurate model.
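To see why the two error types deserve separate attention, here is a sketch comparing two hypothetical screening models. All counts are invented for illustration (1,000 patients, 100 of whom are sick); they are not results from any real study.

```python
# FN = sick patient the model calls healthy (a missed case)
# FP = healthy patient the model calls sick (a false alarm)
model_a = {"TP": 90, "FN": 10, "FP": 60, "TN": 840}
model_b = {"TP": 50, "FN": 50, "FP": 20, "TN": 880}


def accuracy(m):
    """Share of all patients that were classified correctly."""
    return (m["TP"] + m["TN"]) / sum(m.values())


def missed_rate(m):
    """Share of sick patients the model failed to detect (FN rate)."""
    return m["FN"] / (m["FN"] + m["TP"])


for name, m in (("A", model_a), ("B", model_b)):
    print(name, f"accuracy={accuracy(m):.2f}", f"missed={missed_rate(m):.2f}")
# A accuracy=0.93 missed=0.10
# B accuracy=0.93 missed=0.50
```

Both hypothetical models reach the same accuracy, yet model B misses five times as many sick patients, which accuracy alone does not reveal.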

Here we will see a case study on measuring the accuracy of cyber security intrusion detection via the confusion matrix.

Introduction:

The evolution of malicious software (malware) poses a critical challenge to the design of intrusion detection systems (IDS). Malicious attacks have become more sophisticated and the foremost challenge is to identify unknown and obfuscated malware, as the malware authors use different evasion techniques for information concealing to prevent detection by an IDS.

With the increasing volume of computer malware, the development of improved IDSs has become extremely important. In the last few decades, machine learning has been used to improve intrusion detection, and currently there is a need for an up-to-date, thorough taxonomy and survey of this recent work. There are a large number of related studies using either the KDD-Cup 99 or DARPA 1999 dataset to validate the development of IDSs; however there is no clear answer to the question of which data mining techniques are more effective. Secondly, the time taken for building IDS is not considered in the evaluation of some IDSs techniques, despite being a critical factor for the effectiveness of ‘on-line’ IDSs.

Signature-based intrusion detection systems (SIDS):

Signature intrusion detection systems (SIDS) are based on pattern matching techniques to find a known attack. In SIDS, matching methods are used to find a previous intrusion. In other words, when an intrusion signature matches the signature of a previous intrusion that already exists in the signature database, an alarm signal is triggered. For SIDS, the host’s logs are inspected to find sequences of commands or actions which have previously been identified as malware. SIDS have also been labelled in the literature as Knowledge-Based Detection or Misuse Detection.
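The signature matching idea can be sketched in a few lines. This is a toy illustration only: the signature database, the command names and the helper functions are all hypothetical, and real SIDS products use far richer signatures.

```python
def contains_sequence(log, signature):
    """Return True if `signature` appears as a contiguous run inside `log`."""
    n = len(signature)
    return any(
        tuple(log[i:i + n]) == signature
        for i in range(len(log) - n + 1)
    )


def match_signatures(log, signature_db):
    """Return the names of all known signatures found in the host log."""
    return [name for name, sig in signature_db.items()
            if contains_sequence(log, sig)]


# Hypothetical signature database: name -> sequence of commands
SIGNATURE_DB = {
    "download-and-run": ("wget", "chmod", "exec"),
    "log-wipe": ("rm", "history -c"),
}

# Hypothetical host log (one command per entry)
host_log = ["ls", "wget", "chmod", "exec", "ls"]

alerts = match_signatures(host_log, SIGNATURE_DB)
print(alerts)  # ['download-and-run']
```

Because the match relies on the database, a command sequence that is not already stored there raises no alarm in this sketch.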

Anomaly-based intrusion detection system (AIDS):

AIDS has drawn interest from a lot of scholars due to its capacity to overcome the limitation of SIDS. In AIDS, a normal model of the behavior of a computer system is created using machine learning, statistical-based or knowledge-based methods. Any significant deviation between the observed behavior and the model is regarded as an anomaly, which can be interpreted as an intrusion. The assumption for this group of techniques is that malicious behavior differs from typical user behavior. The behaviors of abnormal users which are dissimilar to standard behaviors are classified as intrusions.

Development of AIDS comprises two phases: the training phase and the testing phase. In the training phase, the normal traffic profile is used to learn a model of normal behavior, and then in the testing phase, a new data set is used to establish the system’s capacity to generalise to previously unseen intrusions. AIDS can be classified into a number of categories based on the method used for training, for instance, statistical-based, knowledge-based and machine learning-based.

The main advantage of AIDS is the ability to identify zero-day attacks, because recognizing abnormal user activity does not rely on a signature database. AIDS triggers a danger signal when the examined behavior differs from the usual behavior. Furthermore, AIDS has various benefits. First, it has the capability to discover internal malicious activities. If an intruder starts making transactions in a stolen account that are unidentified in the typical user activity, it creates an alarm. Second, it is very difficult for a cybercriminal to recognize what is normal user behavior without producing an alert, as the system is constructed from customized profiles.

Intrusion data sources

The previous two sections categorised IDS on the basis of the methods used to identify intrusions. IDS can also be classified based on the input data sources used to detect abnormal activities. In terms of data sources, there are generally two types of IDS technologies, namely Host-based IDS (HIDS) and Network-based IDS (NIDS).

HIDS inspect data that originates from the host system and audit sources, such as operating system, window server logs, firewall logs, application system audits, or database logs. HIDS can detect insider attacks that do not involve network traffic (Creech & Hu, 2014a).

NIDS monitors the network traffic that is extracted from a network through packet capture, NetFlow, and other network data sources. Network-based IDS can be used to monitor many computers that are joined to a network. NIDS is able to monitor the external malicious activities that could be initiated from an external threat at an earlier phase, before the threats spread to another computer system. On the other hand, NIDSs have limited ability to inspect all data in a high bandwidth network because of the volume of data passing through modern high-speed communication networks. NIDS deployed at a number of positions within a particular network topology, together with HIDS and firewalls, can provide a concrete, resilient, and multi-tier protection against both external and insider attacks.

Techniques for implementing AIDS:

Statistics-based techniques: A statistics-based IDS builds a distribution model of the normal behaviour profile, then detects low-probability events and flags them as potential intrusions. Statistical AIDS essentially takes into account statistical metrics such as the median, mean, mode and standard deviation of packets. In other words, rather than inspecting data traffic, each packet is monitored, which signifies the fingerprint of the flow. Statistical AIDS are employed to identify any type of difference between the present behavior and normal behavior.
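A minimal sketch of the statistics-based idea follows, assuming packet size is the monitored quantity and that "low probability" means more than `k` standard deviations from the mean. Both choices, the function names and the traffic values are assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_profile(normal_sizes):
    """Training phase: summarise normal traffic by its mean and std dev."""
    return mean(normal_sizes), stdev(normal_sizes)


def flag_anomalies(sizes, profile, k=3.0):
    """Testing phase: flag values more than k std devs from the mean."""
    mu, sigma = profile
    return [s for s in sizes if abs(s - mu) > k * sigma]


# Made-up packet sizes (bytes) observed during normal operation
normal_traffic = [500, 520, 480, 510, 490, 505, 495, 515]
profile = fit_profile(normal_traffic)

# Made-up new observations to be checked against the profile
new_traffic = [498, 512, 1500, 40]
flagged = flag_anomalies(new_traffic, profile)
print(flagged)  # [1500, 40]
```

The threshold `k` controls the trade-off between false positives and false negatives: a smaller `k` flags more traffic, a larger `k` flags less.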

AIDS based on machine learning techniques:

Machine learning is the process of extracting knowledge from large quantities of data. Machine learning models comprise a set of rules, methods, or complex “transfer functions” that can be applied to find interesting data patterns, or to recognise or predict behaviour. Machine learning techniques have been applied extensively in the area of AIDS. Several algorithms and techniques, such as clustering, neural networks, association rules, decision trees, genetic algorithms, and nearest neighbour methods, have been applied for discovering knowledge from intrusion datasets.

Performance metrics for IDS

There are many classification metrics for IDS, some of which are known by multiple names. The confusion matrix for a two-class classifier can be used for evaluating the performance of an IDS. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. IDS are typically evaluated based on the following standard performance measures:

True Positive Rate (TPR): It is calculated as the ratio between the number of correctly predicted attacks and the total number of attacks. If all intrusions are detected then the TPR is 1, which is extremely rare for an IDS. TPR is also called the Detection Rate (DR) or the Sensitivity. The TPR can be expressed mathematically as:

TPR = TP / (TP + FN)

False Positive Rate (FPR): It is calculated as the ratio between the number of normal instances incorrectly classified as an attack and the total number of normal instances. The FPR can be expressed mathematically as:

FPR = FP / (FP + TN)

False Negative Rate (FNR): A false negative occurs when a detector fails to identify an anomaly and classifies it as normal. The FNR can be expressed mathematically as:

FNR = FN / (FN + TP)

Classification rate (CR) or Accuracy: The CR measures how accurate the IDS is in detecting normal or anomalous traffic behavior. It is described as the percentage of all correctly predicted instances out of all instances:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
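The four IDS measures can be computed together from the cell counts. This sketch assumes the test set contains at least one attack and at least one normal instance, so no denominator is zero; the function name `ids_metrics` and the counts are made up for illustration.

```python
def ids_metrics(tp, tn, fp, fn):
    """Compute TPR, FPR, FNR and accuracy from confusion matrix counts.

    Assumes at least one attack (tp + fn > 0) and at least one
    normal instance (fp + tn > 0) in the test data.
    """
    return {
        "TPR": tp / (tp + fn),                        # detection rate
        "FPR": fp / (fp + tn),                        # false alarm rate
        "FNR": fn / (fn + tp),                        # missed attack rate
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),  # classification rate
    }


# Made-up counts: 100 attacks and 1,000 normal connections
metrics = ids_metrics(tp=80, tn=900, fp=100, fn=20)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# TPR: 0.800
# FPR: 0.100
# FNR: 0.200
# Accuracy: 0.891
```

TPR and FNR share the same denominator (the total number of attacks), so they always sum to 1.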

Why do we need a confusion matrix?

Here are the benefits of using a confusion matrix.

  • It shows how a classification model is confused when it makes predictions.
  • It gives you insight not only into the errors being made by your classifier but also into the types of errors that are being made.
  • This breakdown helps you overcome the limitation of using classification accuracy alone.
  • Every column of the confusion matrix represents the instances of a predicted class.
  • Each row of the confusion matrix represents the instances of an actual class.
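The row and column convention (rows are actual classes, columns are predicted classes) can be sketched as a small 2x2 table. The function name, the label names and the sample data are made up for illustration.

```python
def confusion_matrix_2x2(actual, predicted, labels=("attack", "normal")):
    """Build a 2x2 matrix: rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Made-up labels for illustration
actual = ["attack", "attack", "normal", "normal", "normal", "attack"]
predicted = ["attack", "normal", "normal", "attack", "normal", "attack"]

matrix = confusion_matrix_2x2(actual, predicted)
print(matrix)  # [[2, 1], [1, 2]]

# With "attack" as the positive class:
tp, fn = matrix[0]  # actual attacks: detected, missed
fp, tn = matrix[1]  # actual normal: false alarms, correctly passed
```

Reading along a row shows what happened to one actual class, while reading down a column shows what one predicted class is made of.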

That’s all for this article. Thank you.
