Skin Lesion Detection and Classification Using Convolutional Neural Network for Deep Feature Extraction and Support Vector Machine

Pigmented skin lesion identification is essential for detecting harmful pathologies of the skin, especially cancer. A review of the methods and projects developed over the years to diagnose these illnesses shows that new computer-aided technologies have become very useful tools for identifying melanoma, dermatofibroma, and basal cell carcinoma, among other skin lesions. The most common diagnosis is based on dermoscopy and the dermatologist's expertise, and its accuracy can be improved with computer-based image detection and classification techniques. Therefore, this study aims to develop software models able to detect and classify skin cancer. The work uses dermoscopy images obtained from the HAM10000 dataset, a database with 10000 images previously tested and validated for research use. The main process is divided into three parts: image segmentation, feature extraction (FE) using ten different pre-trained Convolutional Neural Networks (CNNs), and Support Vector Machine (SVM) training to establish a classification model. According to the results, the classification models performed very well with the image segmentation step, showing average accuracies between 80.67% (Xception) and 90.34% (AlexNet), whereas without image segmentation no method reached 60%. The AlexNet plus SVM model showed the shortest running time and the highest accuracy (90.34%) for the correct identification and classification of the seven categories of cutaneous lesions considered.


I. INTRODUCTION
Skin cancer is currently one of the most hazardous forms of cancer in humans. In Ecuador, according to the Society to Fight Cancer in Ecuador (SOLCA), this illness has the highest incidence rate in men and women [1]. Skin cancer occurs in various types, such as melanoma and basal or squamous cell carcinoma, among which melanoma is regarded as the most unpredictable and the most aggressive, causing the largest number of deaths and the highest metastatic rates [2]. If melanoma is diagnosed and treated in its early stages, it can be cured. If it is detected late, it can grow deeper into the skin and spread to other parts of the body, a process called metastasis [2], [3]. Previously, the diagnosis of this disease was based on a complete inspection of the patient's integument without aids. Nowadays, clinicians use devices and computer vision techniques that provide useful insights [4]-[8]. The techniques used include confocal laser scanning microscopy (CLSM) and optical coherence tomography (OCT), among others [2]. Dermoscopy is the most common imaging technique used to help dermatologists in their assessments; it is a non-invasive procedure that makes the visual examination of subsurface structures of the skin possible [2], [7], [9]. It has also become possible to build databases of dermatoscopic images for research purposes and for the development of techniques that can improve the accuracy of skin cancer diagnosis. The International Skin Imaging Collaboration (ISIC) develops repositories of these kinds of images (ISIC Archive) for clinical training and for supporting research. These databases are used to develop biomedical image analysis tools that classify the images among different diagnostic categories [10]. The Human Against Machine with 10000 training images (HAM10000) dataset is a collection of dermatoscopic images available through a Harvard repository and created at the University of Vienna, acquired and stored by different modalities.
This database has been tested and validated, and its images are well suited for training computer models. More than 50% of the lesions in the dataset have been confirmed by pathology. The ground truth for the remaining cases was follow-up, expert consensus, or confirmation by in-vivo confocal microscopy [11].
Segmentation of skin lesions is crucial for identifying, classifying, and detecting melanoma by dermoscopy, since it removes unnecessary elements that could mislead the classification by convolutional neural networks [12]. Various pre-processing methods have been applied to identify the lesions precisely, avoiding complications caused by poor image quality and reducing CNN errors. For example, some studies perform lighting correction and segmentation steps [13] or analyze manually delimited regions [14], [15]. A dataset with good resolution is important for identification and classification through computational methods. Because the size of the dataset influences classification performance and is often a limitation, the lack of data is commonly addressed with data augmentation [16], together with tasks such as lesion segmentation and detection of dermoscopic characteristics [17].
However, when dermatologists perform these methods, the accuracy of melanoma diagnosis is estimated to be about 75-84%, and it depends on the dermatologist's training [2], [9]. Because human interpretation is difficult and subjective, the development of computerized image analysis methods is of paramount importance. It has become an important research area aimed at minimizing the errors that result from visual interpretation and at obtaining better accuracy [18].
Image processing makes it possible to remove abnormalities and retain the relevant features of the original picture. It also produces a more detailed image that contains more information and less noise [12], which is crucial for recognizing the specific pathology shown in the image and, consequently, for providing a proper diagnosis [9], [18]. An adequate segmentation leads to a better classification of the regions of interest, allowing skin pathology to be identified more precisely. It also improves image quality and consequently contributes to the success of the CNN's classification algorithm [12]. Segmentation in this project is carried out with the Otsu method [19], which extracts the object of interest from the image background by comparing pixel values against a threshold that separates the two components.
CNNs are a class of deep learning neural networks that have been widely applied to image classification tasks in medicine [20]. In recent decades, skin cancer detection and classification by CNNs have been broadly studied to achieve easy and accurate recognition [21]-[24]. However, training deep CNNs from scratch for image classification requires a large amount of labeled training data, extensive computational and memory resources, and a great deal of expertise to ensure proper convergence, which results in an extremely time-consuming process [20]. Given these limitations, multiple methods have been developed to recognize and classify images, among which 'End-to-end Learning' and the use of CNNs as feature extractors stand out, especially for the recognition and classification of these types of lesions [21]. Several scientific publications have achieved high accuracies for the classification of malignant versus benign skin pathologies, as in [14], [25], [26]. However, identifying the specific category is crucial for choosing the correct subsequent medical procedure. Accordingly, the present study is based on FE from ten pre-trained CNNs and SVM training for cutaneous cancer classification, using images from the HAM10000 database that were previously segmented and filtered.
This article focuses on developing diagnosis software to detect and classify seven different kinds of skin lesions. First, the images of the HAM10000 database are sorted by category, followed by a segmentation of the healthy and unhealthy regions of the images using the Otsu technique, which gives a more precise region of interest. The resulting segmented images are then used for FE with each of the ten pre-trained CNNs, followed by SVM training, to finally obtain a highly precise skin lesion classification model (see Fig. 1).

This study proposes a method based on three key steps for developing computational models that can detect and classify seven skin pathologies, as described in Fig. 1. The first step, image classification, consists of obtaining the dermoscopic images and sorting them according to the pathology that each one presents for later use. The second step, image processing and filtering, applies filters and segmentation techniques to the images to remove or retain the features that matter for the next step. The final step comprises the FE and SVM processes, in which the images are classified through different models to determine which one has the best accuracy.

A. Skin lesion images
The database used to build the neural network models is the HAM10000 dataset [11], for the reasons given above. It includes all relevant diagnostic categories of pigmented lesions. Actinic keratoses and intraepithelial carcinoma (akiec) are considered variants of squamous cell carcinoma. These lesions are commonly non-invasive and commonly devoid of pigment. Basal cell carcinoma (bcc) can be dangerous if left untreated. It appears in different morphological variants; the lesion may be darker but still look translucent [3]. Benign keratosis-like lesions (bkl) are commonly non-cancerous skin lesions that originally appear in the outer layer of the skin. They can have various colors, from light tones to black. Dermatofibroma (df) is a skin lesion considered benign. Its most common presentation shows lines at the periphery with a central white patch. Melanoma (mel) is a malignant cell growth derived from melanocytes that may have different variants. It can be cured at an early stage, but it can also be invasive and very harmful, and it generally has no defined shape [1]. Melanocytic nevi (nv) are benign neoplasms of melanocytes that appear in different forms. Finally, vascular lesions (vasc) range from cherry angiomas to angiokeratomas and pyogenic granulomas, including hemorrhages. These are dermatoscopically characterized by red or purple color and well-defined structures [11].
The database consists of three components: two folders containing all the images, and the metadata. The downloaded images were not organized by category, so the first step was to sort the images into folders according to the seven pathologies present.
The metadata contains the following parameters: 'lesion_id', the identifier of the lesion within the database; 'image_id', the unique identifier of each picture; 'dx', the skin illness category; 'dx_type', the method used to confirm the diagnosis; 'age'; 'sex'; and 'localization', the place on the body where the lesion appears. The parameters needed for sorting the images were 'image_id' and 'dx'. A filter was applied to obtain only the entries belonging to each pathology category. A new CSV document was then created with the pathologies as column headers and, under each header, the images that present that illness. Once the metadata was organized, an algorithm was developed in Python that first creates seven folders named after the seven skin lesion categories of the dataset, and then reads the name of each image, compares it with the names in the new .csv file, and moves the image into the correct folder, as described in Fig. 2.
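The sorting step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' script: it reads the 'image_id' and 'dx' columns directly from the metadata file instead of building the intermediate per-pathology CSV, and the function name, arguments, and the `.jpg` extension are assumptions made for the example.

```python
import csv
import shutil
from pathlib import Path

# The seven diagnostic codes used in the 'dx' column of the metadata.
CATEGORIES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]


def sort_images_by_diagnosis(metadata_csv, image_dirs, output_dir, extension=".jpg"):
    """Move each image into a folder named after its 'dx' label.

    Hypothetical helper: returns a dict mapping each category to the
    number of images that were moved into its folder.
    """
    output_dir = Path(output_dir)
    for category in CATEGORIES:
        (output_dir / category).mkdir(parents=True, exist_ok=True)

    # Build an image_id -> dx lookup from the metadata file.
    with open(metadata_csv, newline="") as handle:
        labels = {row["image_id"]: row["dx"] for row in csv.DictReader(handle)}

    counts = {category: 0 for category in CATEGORIES}
    for image_dir in image_dirs:
        for image_path in sorted(Path(image_dir).glob("*" + extension)):
            dx = labels.get(image_path.stem)
            if dx not in counts:
                continue  # no known label for this file: leave it in place
            shutil.move(str(image_path), str(output_dir / dx / image_path.name))
            counts[dx] += 1
    return counts
```

The returned counts can be compared against the per-category totals reported for the dataset as a quick consistency check.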

B. Image Processing
The main challenges of image processing techniques are identifying specific features, eliminating noise interference, and allowing clinicians to diagnose more accurately and efficiently through the use of segmentation results [18], [27], [28]. The most common segmentation methods are gradient vector flow (GVF) [29]; the split-and-merge segmentation method [30]; the edge-based approach [31]; the fuzzy-based split-and-merge algorithm (FBSM) [32]; and feature-space grouping, histogram thresholding, the pixel-based method, and the watershed technique [21], [30], [32]. The main goal of medical image processing is to transform the images through a stable and robust recognition system whose analysis results coincide with the elements of interest [17], [28]. Therefore, the result of the segmentation process depends largely on the type of method used and on the FE of the network analysis.
In this study, most of the images analyzed from the dataset show a notable light color tone and features that are clearly distinguishable from the skin around the lesion. These characteristics are of particular importance in the proposed procedure. Because of this marked difference, the Otsu technique is used as an automatic, clustering-based method for selecting the image threshold. The selection is based on the bimodal principle of extracting the object of interest from the image background [33]. Thresholding works well with high-contrast images, making it possible to identify the specific areas affected by various skin pathologies [21]. The process iterates over the intensity values until the optimum threshold value, considered the separability criterion, is found. The pixels of the skin lesion images are thus separated into two classes, foreground and background. Pixels above the separability criterion are assigned to the foreground, while the remaining pixels are assigned to the background.
Finally, the Otsu process applies a selection mask between the background and the plane of the affected area on the original image, allowing the regions of interest to be extracted with great precision. The lesion is thereby isolated, as shown in Fig. 3C, which is useful for FE and the subsequent training of the classification models. The whole segmentation process is represented in Fig. 3.
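The threshold search and masking described above can be illustrated with a short standard-library Python sketch. It assumes an 8-bit grayscale image stored as a list of rows; the function names and the `foreground_above` switch are hypothetical and are not taken from the authors' implementation. The threshold is the gray level that maximizes the between-class variance, which is the separability criterion of the Otsu method.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes the between-class variance."""
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(levels):
        weight_bg += histogram[level]      # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg      # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance up to a constant factor of 1 / total**2.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def apply_mask(gray_image, threshold, foreground_above=True):
    """Keep the foreground pixels and set every background pixel to zero."""
    masked = []
    for row in gray_image:
        if foreground_above:
            masked.append([v if v > threshold else 0 for v in row])
        else:
            masked.append([v if v <= threshold else 0 for v in row])
    return masked


# Toy 3x3 image: a dark patch (30-40) on a bright surround (200).
image = [[200, 200, 30], [200, 40, 35], [200, 200, 200]]
threshold = otsu_threshold([v for row in image for v in row])
dark_region = apply_mask(image, threshold, foreground_above=False)
```

The `foreground_above` switch is included only because which side of the threshold holds the lesion depends on the image contrast; the text above assigns pixels above the threshold to the foreground.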

C. Deep Feature Extraction and SVM Classification
FE consists of acquiring a feature vector from a specific deep layer of a previously trained CNN [34]. In this case, AlexNet, DenseNet201, GoogleNet, MobileNetV2, Resnet18, Resnet50, Resnet101, VGG16, VGG19, and Xception were employed. These CNNs were trained on more than a million images from the ImageNet database [35] and can classify images into 1000 categories. Here, the pre-trained networks were repurposed to classify the skin lesions described above instead of the categories they were originally trained on. To this end, 112 images from each category of the dataset were used for FE with the mentioned CNNs. This set was split randomly into 70% for model training and 30% for validation.
After splitting the input dataset, a data augmentation step is applied to the training and validation sets to fit each CNN's required input properties. It was performed in MATLAB using the 'augmentedImageDatastore' command, which provides the needed image size (obtained with 'net.Layers(1).InputSize') and RGB conversion, making all input images compatible with each CNN's input layer. Subsequently, the 'activations' function was applied to each network to carry out the FE by computing the output of the specific CNN layer detailed in Table I. In addition, a minibatch size of 32k was set for GPU memory, and the output feature vectors were arranged in columns to fit the linear SVM training. The SVM training was then carried out using the 'fitcecoc' function (fit class error-correcting output codes), which returns a fully trained multiclass error-correcting output codes (ECOC) model. The 'fitcecoc' function used the extracted features to train a multiclass SVM classifier with a fast Stochastic Gradient Descent (SGD) solver, with its 'Learners' parameter set to 'Linear' and a one-versus-all coding strategy. After that, features were extracted from the validation dataset and passed to the SVM classifier through the 'predict' function. Finally, confusion matrices were computed to evaluate the accuracy of each model. All of these steps are summarized in Fig. 4. The classification models were implemented in MATLAB 2019b and run on a Lenovo G50-80 laptop with a 5th-generation Core i7, Windows 10, 8 Gigabytes (GB) of RAM, and a 240 GB SSD.
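As a small illustration of the random 70%/30% split, the Python sketch below draws 112 images per category and divides them into training and validation subsets. The function name, the dictionary layout, and the fixed seed are assumptions for the example; the study itself performed the split in MATLAB. With 112 images, 70% rounds down to 78 training images, leaving 34 for validation.

```python
import random


def split_per_category(images_by_category, per_category=112,
                       train_fraction=0.7, seed=0):
    """Randomly pick `per_category` images per class and split them.

    Hypothetical helper: `images_by_category` maps a category name to a
    list of image identifiers (each list needs at least `per_category`
    entries).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_train = int(per_category * train_fraction)  # 78 when per_category is 112
    train, validation = {}, {}
    for category, images in images_by_category.items():
        chosen = rng.sample(list(images), per_category)
        train[category] = chosen[:n_train]
        validation[category] = chosen[n_train:]
    return train, validation
```

Using the same number of images for every category keeps the classes balanced, so each model is trained and validated under the same conditions.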
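To make the classification stage more concrete, the following Python sketch trains a one-versus-all linear SVM with stochastic gradient descent on already extracted feature vectors. It is a plain-Python stand-in written for illustration and does not reproduce the MATLAB solver; the hinge-loss update rule, the hyperparameter values, and all function names are assumptions.

```python
import random


def train_binary_svm(features, targets, epochs=50, learning_rate=0.01,
                     reg=0.01, seed=0):
    """Train a linear SVM with hinge loss by SGD; targets are +1 or -1."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # one training example per update
            x, y = features[i], targets[i]
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # L2 regularization shrinks the weights at every step.
            weights = [w * (1 - learning_rate * reg) for w in weights]
            if margin < 1:  # hinge loss is active: move toward the correct side
                weights = [w + learning_rate * y * v for w, v in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def train_one_vs_all(features, labels, **kwargs):
    """Train one binary SVM per class (that class against all the others)."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = train_binary_svm(features, targets, **kwargs)
    return models


def predict(models, x):
    """Return the class whose binary SVM gives the highest score for x."""
    def score(model):
        weights, bias = model
        return sum(w * v for w, v in zip(weights, x)) + bias
    return max(models, key=lambda cls: score(models[cls]))
```

In the study, the feature vectors come from a deep layer of each pre-trained CNN, so each row of `features` would hold the activations for one image and `labels` the corresponding lesion category.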

A. Results Description
Once all the steps described above were completed, the models were validated in MATLAB. For each category, 112 images from the HAM10000 database were used, of which 78 images (70%) were used for training and 34 images (30%) for validation. After segmentation, the images were tested with each of the CNN plus SVM models to assess two important parameters: the mean accuracy of each model and the time each one takes to run and give results.
All the HAM10000 dataset images were sorted into different folders according to the pathology to which each one belongs. The images were thus stored in seven folders, each with a specific category name and number of images, as shown in Table II. The time and accuracy of each model using images with and without segmentation are compared in Table III. Testing each CNN with the database without segmentation gives much lower accuracies than with the processed data. In that case, the highest accuracy belongs to the AlexNet model, with a mean of ~58%. Image segmentation, on the other hand, allows for better accuracies. The lowest accuracy is ~81% (Xception), and the highest value corresponds to AlexNet with 90.34%. Taking the ten networks in the order AlexNet, DenseNet201, GoogleNet, MobileNetV2, Resnet18, Resnet50, Resnet101, VGG16, VGG19, and Xception, the accuracies were approximately 90%, 86%, 84%, 85%, 84%, 89%, 88%, 86%, 85%, and 81%, respectively. From another perspective, classification takes less time without segmentation; the segmented images show a 38% increase in running time. As mentioned, ten CNNs were used as principal feature extractors, including AlexNet, which achieves the best time and accuracy. AlexNet is a supervised learning CNN with eight learned layers: five convolutional layers and three fully connected layers. Several of its features are important to the functionality of this network. For example, AlexNet uses Rectified Linear Units (ReLU), an advantage in training time, since networks with ReLUs train several times faster than their equivalents with tanh (hyperbolic tangent) units. It also performs the training on multiple GPUs (graphics processing units), which expands the network's maximum size by placing half of the model's neurons on one GPU and the other half on the other.
This network also uses overlapping max-pooling and suppresses the overfitting problem through data augmentation and dropout techniques; overfitting would otherwise lead the network to memorize the training images instead of learning to differentiate one image from another, as observed in Fig. 5 [36]. The confusion matrices computed for each CNN plus SVM model have 'Predicted Class' and 'Real Class' axes with each of the seven skin pathologies. They show that the number of images of a given skin pathology (e.g., bcc) used as input for the model is not always the same as the number of cases that the model classifies correctly. Fig. 6 A and B show the confusion matrices of the CNN plus SVM models with the highest and the lowest accuracy, respectively. Once the validation process was done, the Confusion Matrix (CM) of each model was obtained. This step was performed to determine the accuracy of each of the ten individual classification algorithms implemented, reported in Table III. Finally, Fig. 7 A and B show the two main characteristics of interest of each CNN model: the accuracy of each method in percent and the time it takes to run in seconds (s), respectively. The data are summarized in two histograms, which make it easy to see which models are the quickest to run and which provide results with the best accuracy.
Fig. 7 Histogram representation of A) the accuracy of each CNN model, where the lowest percentage is more than 80%, and the best value is more than 90%. B) the time each CNN takes to run with a range of 50 to 650 seconds (s).
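The way an accuracy value is read from a confusion matrix can be sketched in Python as follows. The helper names and the five example labels are made up for illustration and are not results from the study; rows correspond to the real class and columns to the predicted class.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions: rows are real classes, columns are predicted classes."""
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """For each real class, the fraction of its samples classified correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Made-up labels for illustration only.
real = ["bcc", "bcc", "mel", "mel", "nv"]
predicted = ["bcc", "mel", "mel", "mel", "nv"]
cm = confusion_matrix(real, predicted, ["bcc", "mel", "nv"])
```

In this toy case one of the two bcc samples is predicted as mel, so the bcc row has an off-diagonal entry, which is the situation described above where the number of input images of a class differs from the number classified correctly.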

B. Results Analysis
The separation of the images from the HAM10000 database into seven categorized folders, the process described in Fig. 2, was crucial for data management. In this study, the image pre-processing step plays a significant role in the subsequent construction of the prediction models. Image segmentation allows the greatest amount of information of interest to be grouped into defined areas. Most pigmented spots show tones and textures that are observable and distinguishable from the area surrounding the lesion, which is essential for identifying the skin pathology.
With the Otsu technique, the models in this work reached accuracies between 80.67% and 90.34% in classifying pathologies across all the CNNs used. Concerning time, the values vary from one network to another. Using segmented images requires somewhat more processing time than using non-segmented ones, as detailed in Table III. However, this additional time is not significant given the gain in precision. Therefore, the segmentation of images improves the classification of the image database, increasing accuracy by 35% on average.
These results confirm that the selected method eliminates unnecessary characteristics in images. The accuracy improves because the CNNs receive only the strictly necessary information on the cutaneous pathology, without the surrounding interference. Overlapping the threshold selection mask on the original image leads to a proper extraction of the desired area. Furthermore, the implemented models achieve a high level of classification, similar to [14], [21], [25], [26], and a prompt response time for the pathologies considered in this work.
FE has been reported as a suitable method that provides acceptable accuracy for classification models [20], [24]; it extracts relevant attributes from images and avoids redundant characteristics that do not contribute to the process [34]. In addition, many scientific publications have used Support Vector Machine systems to classify images [8], [14], [37], [38]. This study uses the SGD method for training a linear SVM to achieve efficient and accurate responses. As the results show, the proposed methods produce relatively fast responses, ranging from seconds to minutes. The computational time is low because the SGD method updates the model using a single training example at each step [39]. This allows rapid convergence, with running times in the range of 43.87 to 650.21 seconds. The lowest computational time corresponds to the AlexNet plus SVM training, which also presents the highest accuracy, owing to its structure of five convolutional layers and three fully connected layers with high-performance operation [36], [40]. On the other hand, the systems based on VGG19, DenseNet201, VGG16, and Xception showed the slowest running times, 650.21 s, 586.83 s, 517.74 s, and 447.76 s, respectively, as shown in Fig. 7B, which is a result of their deep architectures [36].
The highest accuracy reached is 90.34%, using the AlexNet plus SVM algorithm, as shown in Fig. 7A. A similar approach is taken by Kawahara et al. [26], who reported a model for image recognition of 10 different types of skin lesions with an accuracy of 81.8%. In that work, the features of the 'fc6' layer of the AlexNet CNN are extracted to train a linear logistic regression classifier without any segmentation pre-processing. In another report, Codella et al. [14] describe a method based on feature extraction, sparse coding, and SVM to classify melanoma vs. non-melanoma lesions and melanoma vs. atypical lesions, reaching accuracies of 93.1% and 73.9%, respectively. Similarly, Khan et al. [41] describe a method for skin lesion segmentation and recognition of malignant versus benign lesions, based on FE using ResNet50 and ResNet101, followed by feature selection using kurtosis-controlled principal component analysis (KcPCA) to feed an SVM classifier. They used the HAM10000, ISBI 2017, and ISBI 2016 databases separately and reported 95.60% accuracy using the ISBI 2017 database.
Most scientific reports on skin lesion classification based on FE and SVM training do not include the computing time required to run each model, so this parameter is difficult to compare. Nevertheless, the methodology proposed in this study has achieved good accuracies with short running times.
The limitations of this study lie in the amount of data used for training and validating the models. Here, 112 images were used for each of the seven categories, of which 70% were randomly selected for training and the rest for validation, to maintain the same conditions for all ten CNN plus SVM algorithms. Moreover, some images were removed from the dataset because they could not be segmented appropriately, owing to the differences in contrast between the lesion and the skin tones. Hence, the 'df' category was reduced from 115 to 112 images, and 112 became the maximum number of images used per category. The results obtained can be compared with the reviewed works that use FE plus SVM algorithms [14], [26] and with the one that includes segmentation as image pre-processing [41]. These comparisons support the effectiveness of classification preceded by a segmentation phase.
As future work, the authors plan to expand the dataset by combining HAM10000 with the ISBI or ISIC public databases, expecting to obtain higher accuracy and to extend the classification to more than seven types of skin pathologies. It is also appropriate to carry out an exhaustive study of pre-trained CNNs to develop reliable classification models using dermatoscopic images as training and validation sources. Finally, developing a local database and working together with dermatology institutes would help in gaining a correct understanding of the skin pathologies that affect the population.

IV. CONCLUSION
The HAM10000 dataset proved to be an adequate database for FE and SVM training, owing to its size and the diversity of its pathologies. The dermatoscopic images could be successfully classified and organized in folders according to the lesion or pathology they present, for their subsequent segmentation. The Otsu technique proved reliable in identifying and detecting essential characteristics in the given images, reducing unnecessary elements that could mislead or decrease the accuracy of the subsequent classification performed by the CNN plus SVM models.
The application of the Otsu technique for image segmentation, followed by the FE and SVM algorithm, allowed the implementation of accurate and fast classification models for the different skin lesion types. The mean accuracies ranged between 80.67% and 90.34%. The lowest value was obtained with the Xception plus SVM model, while the highest one corresponded to the AlexNet plus SVM model. Similarly, although all the developed classification models responded rapidly, the AlexNet model was the fastest and the VGG19 model was the slowest. All the results are shown in Fig. 7A and Fig. 7B.
To sum up, ten different classification models were successfully developed, capable of detecting and classifying seven skin pathologies using the dermatoscopic images from the HAM10000 database. The methodology proposed in this study resulted in efficient skin lesion classification methods with high accuracy and short running times.