Sample question
Assume you are working with data on a population of size n. The data is divided into two classes: a fraction x of the rows (instances) belongs to the class labeled positive (P), and the remaining fraction (1 - x) belongs to the class labeled negative (N). Also assume that the majority of the data is labeled N, so that x < 0.5 (Baron, 2013).
Formulate the ground truth
Population Size = n
Fraction of instances labeled positive = x
Fraction of instances labeled negative = (1 - x)
Number of instances labeled positive (P) = xn
Number of instances labeled negative (N) = (1 - x) n
Design a confusion matrix with rows reflecting the ground truth and columns reflecting the machine learning classifier classifications.
Confusion matrix
                Predicted P    Predicted N    Total
Actual P        TP             FN             P
Actual N        FP             TN             N
Total                                         n
P = TP + FN = xn ... (1)
N = FP + TN = (1 - x) n ... (2)
n = TP + FP + TN + FN = P + N ... (3)
Where
TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative
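As a minimal sketch of how the four cells are counted, the snippet below tallies a confusion matrix from two label lists and checks identities (1) to (3). The function name confusion_counts and the ten example labels are illustrative assumptions, not data from the question.

```python
def confusion_counts(actual, predicted, positive="P"):
    """Count TP, FN, FP, TN with rows = ground truth and columns = prediction."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1          # actual positive, predicted negative
        elif p == positive:
            fp += 1          # actual negative, predicted positive
        else:
            tn += 1
    return tp, fn, fp, tn

# Hypothetical example: n = 10 instances, 3 of them positive (x = 0.3).
actual    = ["P", "P", "N", "N", "N", "N", "P", "N", "N", "N"]
predicted = ["P", "N", "N", "P", "N", "N", "P", "N", "P", "N"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
P = tp + fn                      # equation (1)
N = fp + tn                      # equation (2)
n = tp + fp + tn + fn            # equation (3)
print(tp, fn, fp, tn, P, N, n)   # 2 1 2 5 3 7 10
```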
Show two different ways to build a non-machine learning classifier and compute their accuracy.
Simple non-machine learning classifiers that assign labels based on the class proportions found in the training data include the following (Ripley, 2007):
Random Guess Classifier, which randomly assigns half of the labels to P and the other half to N
Weighted Guess Classifier, which randomly assigns a fraction x of the labels as P and the remaining fraction (1 - x) as N
Majority Class Classifier, which assigns all of the labels as N (the majority class in the data)
Random Guess Classifier
Let's build a model by randomly assigning half of the labels to positive and half to negative and find its accuracy. This random guessing is just another classifier.
Since this model is labeling half the data as positive and another half as negative
TP + FP = n / 2 ... (4)
TN + FN = n / 2 ...... (5)
The random model labels half of the actual positives as positive and half of the actual negatives as negative:
TP = FN ... (6)
FP = TN ....... (7)
Based on the above equations and some algebra:
TP = P - FN
= xn - FN ... from (1)
= xn - TP ... from (6)
so 2 TP = xn, and TP = xn / 2
TN = N - FP
= N - TN ... from (7)
so 2 TN = N, and TN = N / 2
= (1 - x) n / 2 ... from (2)
Accuracy = (TP + TN) / n = (xn / 2 + (1 - x) n / 2) / n = (n / 2) / n = 0.5 ... (8)
Recall = TP / (TP + FN) = TP / P = (xn / 2) / (xn) = 0.5 ... (9)
Precision = TP / (TP + FP) = (xn / 2) / (n / 2) = x ... (10)
For multiclass classification with k classes, where x_i is the fraction of instances belonging to class i (i = 1 to k), it can similarly be shown that the accuracy of uniform random guessing is 1 / k. Class i has a precision of x_i, the fraction of instances in the data that belong to class i, and a recall of 1 / k.
In the language of probability, the accuracy is the probability that the randomly selected class matches the true class, which is 0.5 for two classes (binary classification) and 1 / k for k classes.
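The random-guess results can be checked with exact arithmetic. The sketch below works with cell counts divided by n (so n cancels out) and assumes the expected-value split described above; the function name random_guess_metrics and the choice x = 1/5 are illustrative assumptions.

```python
from fractions import Fraction

def random_guess_metrics(x):
    """Expected accuracy, recall, precision of the coin-flip classifier.

    x is the fraction of positives; every cell is expressed as a share of n.
    """
    x = Fraction(x)
    tp = x / 2            # half of the positives are guessed positive
    fn = x / 2            # the other half are guessed negative
    tn = (1 - x) / 2      # half of the negatives are guessed negative
    fp = (1 - x) / 2      # the other half are guessed positive
    accuracy = tp + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision

acc, rec, prec = random_guess_metrics(Fraction(1, 5))
print(acc, rec, prec)   # 1/2 1/2 1/5
```

Accuracy and recall stay at 1/2 for any x, while precision equals x, which matches equations (8) to (10).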
Weighted Guess Classifier
Let's build a model by assigning the labels as positive and negative based on a weighted guess instead of randomly. The weights are based on the distribution of classes in the data. Find the accuracy of the classifier in which the classes have been assigned as shown below.
A fraction x of the actual positives is assigned as positive by this model (i.e., TP = xP), and a fraction (1 - x) of the actual negatives is assigned as negative (i.e., TN = (1 - x) N).
TP = xP ... (11)
TN = (1 - x) N ... (12)
FP = N - TN
= N - (1 - x) N ... from (12)
= xN
FN = P - TP
= P - xP ... from (11)
= (1 - x) P
Accuracy = (TP + TN) / n
= (xP + (1 - x) N) / n
= (x * xn + (1 - x) (1 - x) n) / n ... from (1) and (2)
= x^2 + (1 - x)^2 ... (13)
Recall = TP / P = xP / P = x ... (14)
Precision = TP / (TP + FP) = xP / (xP + xN) = P / (P + N) = P / n = x ... (15)
For multiclass classification with k classes, where x_i is the fraction of instances belonging to class i (i = 1 to k):
Accuracy = sum over i = 1 to k of x_i^2
Precision and recall for class i both equal x_i, the fraction of instances in the data that belong to class i.
In the language of probability, xn / n = x is the probability that an instance in the data is positive, and the model assigns the positive label with the same probability x. Because there are more negative instances in the data, the model is more likely to assign an instance as negative. The probability of a true positive is therefore x * x, and the probability of a true negative is (1 - x) * (1 - x). The accuracy is the sum of these two probabilities.
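To compare the baselines side by side, the sketch below evaluates the expected accuracy of each one for a list of class fractions. The function names are hypothetical. The majority-class accuracy is not derived in the text; it is included under the assumption that labeling every instance with the largest class is correct exactly for that class's share of the data.

```python
from fractions import Fraction

def weighted_guess_accuracy(class_fractions):
    """Sum of x_i^2: class i is guessed with probability x_i."""
    fracs = [Fraction(f) for f in class_fractions]
    if sum(fracs) != 1:
        raise ValueError("class fractions must sum to 1")
    return sum(f * f for f in fracs)

def uniform_guess_accuracy(class_fractions):
    """1 / k: each of the k classes is guessed with equal probability."""
    return Fraction(1, len(class_fractions))

def majority_class_accuracy(class_fractions):
    """Assumption: always predicting the largest class is right max(x_i) of the time."""
    return max(Fraction(f) for f in class_fractions)

binary = [Fraction(1, 5), Fraction(4, 5)]   # x = 0.2 positives, 0.8 negatives
print(weighted_guess_accuracy(binary))      # 17/25
print(uniform_guess_accuracy(binary))       # 1/2
print(majority_class_accuracy(binary))      # 4/5
```

With x = 0.2 the weighted guess reaches 0.68, above the 0.5 of the coin flip but below the 0.8 obtained by always predicting N, which is one reason accuracy alone can be misleading on imbalanced data.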
Sample question 2
Suppose you have 100,000 patients, of which 200 (0.2%) have cancer. Show the test results and explain how you can overcome the problem of choosing the cutoff (Rupp, 2010). Define ROC analysis and explain its significance (Krzanowski, 2009).
                    Test positive    Test negative    Total
Patient diseased    160              40               200
Patient healthy     29,940           69,860           99,800
Total               30,100           69,900           100,000
For the above data,
Sensitivity = TP / (TP + FN) = 160 / (160 + 40) = 80.0%
Specificity = TN / (TN + FP) = 69,860 / (69,860 + 29,940) = 70.0%
The test will correctly identify 80% of the people with the disease, but it will also incorrectly test positive for 30% of the healthy people. Important information is lost by considering only the sensitivity (or only the accuracy) of the test. Considering the wrong results as well as the correct ones gives greater insight into the performance of the classifier.
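The figures for this test can be reproduced directly from the four cells of the table. The sketch below also computes accuracy and precision, using the definitions from the first sample question, to illustrate how much a single number can hide; the variable names are illustrative.

```python
# Cell counts taken from the test-result table above.
tp, fn = 160, 40           # diseased patients: test positive / test negative
fp, tn = 29_940, 69_860    # healthy patients: test positive / test negative

sensitivity = tp / (tp + fn)                 # share of diseased patients detected
specificity = tn / (tn + fp)                 # share of healthy patients cleared
false_positive_rate = fp / (fp + tn)         # equals 1 - specificity
accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)                   # share of positive tests that are true

print(f"sensitivity = {sensitivity:.1%}")            # 80.0%
print(f"specificity = {specificity:.1%}")            # 70.0%
print(f"false positive rate = {false_positive_rate:.1%}")  # 30.0%
print(f"accuracy = {accuracy:.2%}")                  # 70.02%
print(f"precision = {precision:.2%}")                # 0.53%
```

Because only 0.2% of the patients are diseased, fewer than 1 in 100 positive test results corresponds to an actual case, even though the sensitivity is 80%.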
One way to overcome the problem of having to choose a cutoff is to start with a threshold of 0.0, so that every case is classified as positive. All of the positive cases are then correctly classified, and all of the negative cases are incorrectly classified. The threshold is then moved over every value between 0.0 and 1.0, progressively decreasing the number of false positives and, at higher thresholds, the number of true positives as well. The true positive rate (sensitivity) is then plotted against the false positive rate (1 - specificity) for each threshold. The resulting graph is known as the Receiver Operating Characteristic (ROC) curve.
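The threshold sweep can be sketched in a few lines. The snippet below assumes each case has a score between 0.0 and 1.0 and is classified as positive when its score is at or above the threshold; the function name roc_points and the four example scores are illustrative assumptions, not data from the question.

```python
import math

def roc_points(scores, labels):
    """Return one (false positive rate, true positive rate) pair per threshold.

    labels use 1 for positive and 0 for negative; a case is called positive
    when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start at 0.0 (everything positive), visit every distinct score,
    # and finish above all scores (nothing positive).
    thresholds = sorted({0.0, *scores}) + [math.inf]
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical classifier scores
labels = [0, 0, 1, 1]            # hypothetical ground truth
for fpr, tpr in roc_points(scores, labels):
    print(fpr, tpr)
```

The first point is (1.0, 1.0), where everything is called positive, and the last is (0.0, 0.0), where nothing is; raising the threshold never increases either rate.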
ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently of the class distribution. For a perfect classifier, the ROC curve goes straight up the Y-axis and then horizontally along the top of the plot. A classifier that has no power sits on the diagonal, whereas most classifiers fall somewhere in between.
ROC curves are used to select a threshold for a classifier that maximizes the true positives while minimizing the false positives.
ROC curves give the ability to assess the performance of the classifier over its operating range. The area under the curve (AUC) is a widely used measure of performance. The AUC for a classifier with no power, essentially random guessing, is 0.5 because the curve follows the diagonal. An AUC of less than 0.5 might signify that something is wrong. A very low AUC might indicate that the problem was set up wrongly: the classifier is finding a relationship in the data which is, essentially, the opposite of what is expected. In such situations, inspecting the entire ROC curve might give some clues as to what is going on. The negatives and the positives might have been mislabeled.
ROC curves are also used to compare the performance of two or more classifiers. A single threshold can be selected and the classifiers' performance at that point compared, or the overall performance can be compared by considering the AUC. Comparing AUCs in absolute terms, if classifier 1 has an AUC of 0.85 and classifier 2 has an AUC of 0.79, then classifier 1 is better by this measure.
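One common way to obtain the AUC from a set of ROC points is the trapezoidal rule. The sketch below is a minimal version under that assumption; the function name auc_trapezoid and the example point lists are illustrative, with each point given as (false positive rate, true positive rate).

```python
def auc_trapezoid(points):
    """Area under a ROC curve given as (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)   # order by FPR, then TPR, so the curve runs left to right
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # width times average height
    return area

no_power = [(0.0, 0.0), (1.0, 1.0)]                 # the diagonal
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]      # up the Y-axis, then across
example = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

print(auc_trapezoid(no_power))   # 0.5
print(auc_trapezoid(perfect))    # 1.0
print(auc_trapezoid(example))    # 0.75
```

A curve that lies below the diagonal gives a value under 0.5, which is the situation described above as a sign that the labels or the problem setup may be reversed.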
References
Baron, M. (2013). Probability and Statistics for Computer Scientists. CRC Press.
Krzanowski, J. W. (2009). ROC Curves for Continuous Data. CRC Press.
Ripley, D. B. (2007). Pattern Recognition and Neural Networks. Cambridge University Press.
Rupp, A. T. (2010). Diagnostic Measurement: Theory, Methods, and Applications. Guilford Press.