
Machine Learning Algorithms Using CNV Data to Classify Cancers - Literature Review Example

2021-08-10
6 pages
1612 words
University/College: Boston College
Type of paper: Literature review

Cancer research has evolved continuously in recent years (Hanahan and Weinberg, 2011). Various methods, such as early-stage screening, have been applied to characterize types of cancer before symptoms appear. In addition, scientists have developed new strategies for the early prediction of cancer therapy outcomes. These technological advances have led to the collection of large amounts of cancer data, which are now available for scientific research. However, accurately predicting the occurrence of cancer remains one of the most challenging tasks. Consequently, machine learning techniques have become popular among medical researchers for discovering patterns and relationships in complex datasets and for predicting the future occurrence of a cancer type.

Several studies have reported strategies that could enable early cancer diagnosis and prediction (Fortunato et al., 2014; Heneghan, Miller, and Kerin, 2010; Zen and Zhang, 2010). Specifically, these studies describe approaches based on circulating miRNA profiles, which are promising for cancer classification. However, these methods show low sensitivity when applied to early-stage cancer screening. In addition, they are limited in discriminating benign from malignant tumours. Other studies discuss aspects of cancer prediction based on gene expression signatures (Koscielny, 2010; Michiels, Koscielny, and Hill, 2005). These studies explore the potential and limitations of microarrays for classifying cancers. Although gene signatures can help in understanding, classifying, and predicting outcomes for cancer patients, these techniques have seen limited application in clinical settings. Studies using larger data samples and better validation are therefore needed before gene expression profiling becomes useful in clinical practice.

Machine learning concerns the characterization of data samples for the purpose of drawing inferences (Bishop, 2006; Mitchell, 2006; Witten, Frank, Hall, & Pal, 2005). It consists of estimating unknown dependencies in a large dataset and using these dependencies to predict new outcomes. In biomedical research, generalizations are obtained by searching through a given biological dataset using different algorithms and techniques (Niknejad and Petrovic, 2013). The two main types of machine learning are supervised and unsupervised learning. In supervised learning, a labelled set of training data is used to estimate a mapping from the input data to the desired output. In unsupervised learning, however, the machine is trained with unlabelled data and the expected outcome of the learning process is not known in advance.

Consequently, the learning model becomes useful for finding patterns within the input data. The main function of supervised learning is the classification of data into a finite set of groups (Bishop, 2006; Mitchell, 2006; Niknejad and Petrovic, 2013; Witten and Frank, 2005). Other functions of machine learning include regression and clustering. In regression problems, the model maps the data to a continuous variable, so each new sample is assigned an estimated value. Unsupervised learning is commonly applied to clustering problems, where the model groups samples in order to describe the dataset. As a result, each new sample is assigned to the group with which it shares similar characteristics. Suppose, for example, that a hospital has a large sample of medical records related to prostate cancer. Classifying these data can help predict whether a tumour is benign or malignant depending on its size. In machine learning terms, the question is to estimate the probability that the tumour is malignant (1 = Yes, 0 = No).
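The tumour-size example can be made concrete with a small sketch. The code below is only an illustration, not a method taken from the reviewed studies: it assumes a logistic regression model fitted by plain gradient descent, and the sizes, labels, learning rate, and function names (fit_logistic, predict_proba) are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(sizes, labels, lr=0.1, epochs=5000):
    """Fit P(malignant | size) = sigmoid(w * size + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(sizes)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(sizes, labels):
            err = sigmoid(w * x + b) - y  # gradient of the log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: tumour size in cm, label 1 = malignant, 0 = benign.
sizes = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(sizes, labels)


def predict_proba(size):
    """Estimated probability that a tumour of the given size is malignant."""
    return sigmoid(w * size + b)


print(f"P(malignant | 1.0 cm) = {predict_proba(1.0):.3f}")
print(f"P(malignant | 4.0 cm) = {predict_proba(4.0):.3f}")
```

Thresholding the estimated probability at 0.5 turns the model into a benign/malignant classifier; real clinical data would need many more features than size alone.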

Besides supervised and unsupervised techniques, semi-supervised learning, which combines the two, has been widely applied to the classification and prediction of cancers (Park et al., 2013). Semi-supervised learning uses both labelled and unlabelled training examples to develop a learning model. In many practical situations, the cost of labelling data is high because it requires skilled human expertise (Kourou et al., 2015). Therefore, when labels are available for only a few of the observations, semi-supervised algorithms are strong candidates for model building.
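One common way to combine a few labelled samples with many unlabelled ones is self-training, sketched below. This is an assumed illustration and not the approach of Park et al. (2013): it uses a nearest-centroid classifier on a single feature, and the data, the margin parameter, and the function names are hypothetical.

```python
def centroid(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def class_centroids(labelled):
    """Return the mean feature value of class 0 and of class 1."""
    c0 = centroid([x for x, y in labelled if y == 0])
    c1 = centroid([x for x, y in labelled if y == 1])
    return c0, c1


def self_train(labelled, unlabelled, margin=0.5, max_rounds=10):
    """Self-training with a nearest-centroid classifier on 1-D features.

    labelled:   list of (feature, label) pairs, label in {0, 1}
    unlabelled: list of features without labels
    margin:     minimum gap between the two centroid distances
                for a predicted label to be trusted
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        c0, c1 = class_centroids(labelled)
        confident, remaining = [], []
        for x in pool:
            d0, d1 = abs(x - c0), abs(x - c1)
            if abs(d0 - d1) >= margin:
                # Confident prediction: adopt it as a pseudo-label.
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from
        labelled.extend(confident)
        pool = remaining
    return class_centroids(labelled), labelled, pool


# Hypothetical data: two expert-labelled samples and seven unlabelled ones.
seed = [(1.0, 0), (4.0, 1)]
unlabelled = [0.8, 1.2, 1.5, 2.4, 3.6, 3.9, 4.2]

(c0, c1), final_labelled, leftover = self_train(seed, unlabelled)
print(c0, c1, leftover)
```

Samples that stay ambiguous, such as the one lying midway between the two groups here, remain unlabelled and could be passed to an expert for manual labelling.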

Data samples are the most basic components in the application of a machine learning model. Every sample is described quantitatively by its features, which can take different types of values. Prior knowledge about the data also helps in selecting the right techniques and tools for the analysis. Because of the importance of primary data to machine learning, preprocessing the data to make it more suitable for learning is critical. Preprocessing addresses data quality issues such as duplicate or missing data, noise, and outliers, and thereby improves the quality of both the data and the resulting analysis. Commonly applied preprocessing techniques include dimensionality reduction, feature selection, and feature extraction. According to Tan, Steinbach and Kumar (2006), dimensionality reduction enables machine learning algorithms to work better. Furthermore, it can eliminate irrelevant features and reduce noise in the data.
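A minimal sketch of such preprocessing is given below. It rests on assumptions that do not come from the reviewed literature: duplicates are exact row copies, missing values are encoded as None and filled with the column median, and a simple variance filter stands in for feature selection. The copy-number values and the function names are hypothetical.

```python
from statistics import median, pvariance


def preprocess(rows):
    """Drop exact duplicate rows, then fill missing values (None) with the column median."""
    # Remove duplicates while preserving the original order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    # Median imputation, column by column.
    n_cols = len(unique[0])
    for j in range(n_cols):
        observed = [r[j] for r in unique if r[j] is not None]
        fill = median(observed)
        for r in unique:
            if r[j] is None:
                r[j] = fill
    return unique


def select_by_variance(rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols)
            if pvariance([r[j] for r in rows]) > threshold]
    return [[r[j] for j in keep] for r in rows], keep


# Hypothetical copy-number matrix: rows are samples, columns are genomic regions.
raw = [
    [2, 3, 2],
    [2, 1, 2],
    [2, 3, 2],     # exact duplicate of the first sample
    [2, None, 2],  # missing value in the second region
    [2, 4, 2],
]

clean = preprocess(raw)
reduced, kept = select_by_variance(clean)
print(clean)
print(reduced, kept)
```

Columns that are constant across all samples carry no information for a classifier, so the filter removes them; real pipelines would typically use a higher threshold or a supervised selection criterion.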

Given the importance of personalized care and the growing trend in the use of machine learning techniques, this review presents previous studies that have successfully applied these techniques to cancer classification (Polley et al., 2013). In addition, the review discusses the types of machine learning methods applied to CNV data and the overall performance of each proposed algorithm.

Methodology

A systematic literature review of various databases was conducted for relevant data on machine learning algorithms using CNV data to classify cancers. Secondary data plays an important role in research because it provides information on other studies in the area of interest. Evidence was gathered through database searches and reference searches. The predetermined search strategy is shown in Table 1.

Keywords: machine learning algorithms AND CNV data AND cancer classification OR prediction

Databases: NCBI, PubMed, and Google Scholar

Date range: 2000-2018

Inclusion criteria: articles written in English; clinical and genomic data; related to machine learning, cancer classification, and cancer prediction; published between 2000 and 2018

Table 1. The keywords used for the search, the databases searched in this systematic literature review, the date range assessed and the inclusion criteria.

The strategy began with identifying relevant keywords for the database search. In addition, a review of PubMed's medical headings revealed further keywords for the search. This strategy was replicated for the other databases listed above, adapting the keywords to each database's functionality in order to optimize search sensitivity. The key terms are expected to reproduce the same results when the searches are run in the future. However, the exact sources cannot be guaranteed at that point because databases continually update their journal entries.

 

Results

The initial search using the criteria set out in Table 1 yielded 38 articles. After screening titles and abstracts, assessing the full texts, and removing incomplete, duplicate, and non-English articles, 87 articles were excluded from the review. An overview of the results is shown in Figure 1.

Results of the initial database search (n=38)

Studies recorded after duplicates removed (n=38)

Records excluded by either title or abstract (n=3)

Records screened (n=35)

Full-text unavailable (n=0); Not in English (n=0); Insufficient data (n=0)

Full-text articles assessed for eligibility (n=35)

Off-topic articles (n=0); Books (n=1)

Studies recorded via citation searches (n=35)

Figure 1: Summary of the search results according to the PRISMA protocol

Discussion

As mentioned earlier, the main objective of machine learning techniques is to produce a model capable of classifying data or of predicting or estimating an outcome. Classification, which is the primary role of these models, is important in biomedical research. A classification model developed by machine learning may exhibit training errors (misclassifications on the training data) and generalization errors (errors on unseen test data). Nevertheless, a good classification model should both fit the training dataset and accurately classify samples it has not seen. The bias-variance decomposition is a way of describing the generalization error of a machine learning algorithm. The bias of a learning algorithm measures its systematic error, that is, how far its average prediction lies from the true value. The variance measures how much its predictions change when it is trained on different training sets. Together with irreducible noise, the bias and variance terms make up the overall expected error of a model.
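For reference, the decomposition can be written out for the squared-error case. The notation below is an assumption introduced for illustration and does not come from the reviewed studies: the observed outcome is y = f(x) + ε with noise of mean zero and variance σ², and the fitted model is written as a function of the training set over which the expectation is taken. For the 0-1 loss used in classification, analogous but less tidy decompositions exist.

```latex
% Assumed setting: y = f(x) + \varepsilon, \; \mathbb{E}[\varepsilon] = 0, \; \mathrm{Var}(\varepsilon) = \sigma^2.
% \hat{f} is the model fitted on a random training set; expectations are over training sets and noise.
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Note that the bias enters squared, so a model can trade a small increase in bias for a larger reduction in variance and still lower its expected error.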

After obtaining a classification model, it is important to test its performance. The performance metrics of a classifier include its sensitivity, specificity, accuracy, and area under the curve. Sensitivity and specificity are the proportions of true positives and true negatives, respectively, that the model identifies correctly. Predictive accuracy, on the other hand, is the proportion of all test samples that the model classifies correctly, and it serves as an estimate of the generalization error. However, the reliability of these estimates depends on the size of the samples and their independence. In addition, the labels of the testing sets must be known.
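These metrics follow directly from the counts of a confusion matrix, as the short sketch below shows. The labels and predictions are hypothetical, and the area under the curve is left out because it requires predicted scores rather than hard class labels.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy of binary predictions (1 = positive)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical held-out test set: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With three of four malignant cases detected and five of six benign cases correctly cleared, the sketch gives a sensitivity of 0.75, a specificity of about 0.83, and an accuracy of 0.8.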

Machine learning for classification of cancers

This literature review shows that several studies have addressed the survival prediction problem using statistical approaches and artificial neural networks. Nonetheless, only a few relate to medical diagnosis and recurrence using machine learning tools (Zhou and Jiang, 2003; Delen, Walker, & Kadam, 2005; Ahmad et al., 2013). Using the SEER cancer incidence database, Delen and colleagues applied three data mining tools, namely artificial neural networks, decision trees, and logistic regression, to develop machine learning models for breast cancer survival. Several other studies have also investigated the use of artificial neural networks in the prediction of breast cancer survival (Singh, Gupta, & Sharma, 2010). These studies are some exa...
