DATA MINING & MACHINE LEARNING
UNSUPERVISED MACHINE LEARNING ANALYSIS
The unsupervised analysis section provides several methods for clustering and visualizing unlabeled data. These algorithms are designed for data exploration: they reveal groups of similar samples (clusters) without any prior information about existing groups. Using these methods, unknown classes inherent in the data can be uncovered by analyzing sample similarities based on the chosen input features. There are several options for preparing input data, and examples of dealing with project-specific datasets. Start by uploading your data; the input should be a tab-delimited “.txt” file. After selecting the “START” button, you will be prompted to choose one of the methods. Each method opens a similar pop-up with an explanation, and some methods have input parameters you can modify. To complete a pipeline, connect your method to the “END” button, name the pipeline, and click “Run Pipeline”.

Hierarchical Clustering
Agglomerative (“bottom-up”) hierarchical clustering starts by assigning each sample to its own cluster. It then finds the pair of most similar clusters and merges them into one cluster, which is subsequently treated as a single “quasi-sample.” This procedure is repeated until one cluster remains that contains all the samples. The result is usually plotted as a dendrogram (tree), which shows the order in which clusters were merged. The height at which two sub-clusters join traditionally visualizes their similarity level: a larger height (in other words, a larger distance from the initial samples, the tree leaves) means higher dissimilarity.
The hierarchical clustering algorithm proceeds as follows, given an input file where columns are objects and rows are features:
1. Assign each object to a separate cluster (group).
2. Evaluate all pairwise distances between clusters using a chosen distance metric.
3. Construct a distance matrix from these distance values.
4. Find the pair of clusters with the shortest distance and merge them into a single cluster.
5. Repeat from step 2 until a single cluster contains all samples (objects).
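The following is a minimal sketch of this procedure using SciPy; it is an illustration, not the platform's internal implementation. The toy data and the linkage settings are assumptions, and note that SciPy expects samples in rows, so an input file with objects in columns would be transposed first.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: rows are samples (objects), columns are features.
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [8.2, 7.9], [4.5, 5.0]])

# Steps 2-5 in one call: compute pairwise distances, then repeatedly
# merge the two closest clusters until one cluster remains.
Z = linkage(X, method="average", metric="euclidean")

# The dendrogram shows the merge order; the merge height reflects the
# dissimilarity between the sub-clusters being joined.
dendrogram(Z, labels=[f"s{i}" for i in range(len(X))])
plt.ylabel("distance (dissimilarity)")
plt.show()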
K-Means Clustering
In k-means, the number of clusters k is taken as an input parameter, and k initial “centroids” are selected at random in the sample space (assuming we cluster samples; if we cluster features, the centroids would lie in the feature space). Each sample is then assigned to the cluster associated with the closest centroid. Next, for each cluster, the initial centroid (used to separate the samples into clusters) is replaced by the actual centroid (mean) of the samples currently forming that cluster, and the next iteration starts with the new set of centroids. The process continues until convergence or until a predefined stop condition is met (e.g., reaching a maximum number of iterations). K-means has no visual output; instead, it produces a table of samples with a cluster assigned to each sample.
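A minimal sketch of this loop with scikit-learn follows; the toy data and the choice of k are illustrative, not platform defaults.

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [8.2, 7.9], [4.5, 5.0]])

# k initial centroids are chosen, each sample is assigned to the nearest
# centroid, centroids are recomputed as cluster means, and the loop
# repeats until convergence or max_iter is reached.
km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0).fit(X)

# As on the platform, the useful output is a sample-to-cluster table.
for i, label in enumerate(km.labels_):
    print(f"sample_{i}\tcluster_{label}")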


SUPERVISED MACHINE LEARNING
The supervised analysis section provides several options for building and applying supervised classification models. These methods take labeled training data and unlabeled test data as input. Classifiers are constructed from the training data and subsequently tested to determine whether known groups can be identified automatically from the chosen features. Methods for feature transformation and feature set optimization are also included. There are several options for preparing input data, and examples of dealing with project-specific datasets. Input files should be tab-delimited “.txt” files. Two files should be provided, with “train.txt” and “test.txt” in their file names. Both files should contain samples in rows and features in columns. The first column of the training data file should contain the class label of the respective sample. The chosen supervised models are constructed from the “train.txt” data and subsequently used to classify the unlabeled “test.txt” data.
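A minimal sketch of reading files in this layout with pandas is shown below; the file names are hypothetical (any names containing “train.txt” and “test.txt”), and a header row is assumed.

import pandas as pd

# Tab-delimited input files, samples in rows, features in columns.
train = pd.read_csv("my_experiment_train.txt", sep="\t")
test = pd.read_csv("my_experiment_test.txt", sep="\t")

y_train = train.iloc[:, 0]   # first column of the training file: class labels
X_train = train.iloc[:, 1:]  # remaining columns: features
X_test = test                # unlabeled samples to classify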
Decision Tree
A decision tree solves a machine learning problem by transforming the data into a tree representation: each internal node of the tree tests an attribute (feature), and each leaf node carries a class label. Various splitting criteria can be used to decide how to split a node into two or more sub-nodes. Decision tree algorithms can be applied to both regression and classification problems.
The Decision Tree method yields prediction statistics (a confusion matrix of true versus predicted labels), a decision tree plot, and predicted train and test files.
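A minimal sketch of an equivalent workflow with scikit-learn's DecisionTreeClassifier follows, reusing X_train, y_train, and X_test from the loading sketch above; the max_depth setting is illustrative.

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Confusion matrix of true vs. predicted labels on the training data.
print(confusion_matrix(y_train, clf.predict(X_train)))

# Tree plot: internal nodes test a feature, leaves carry class labels.
plot_tree(clf, feature_names=list(X_train.columns), filled=True)
plt.show()

# Predicted labels for the unlabeled test samples.
test_pred = clf.predict(X_test)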


Random Forest
A natural extension of the decision tree algorithm is the “Random Forest”. Random Forest is a technique that applies decision trees to smaller portions of the full dataset through random initialization (“bagging”: random selection of a subset of the data) and voting. Different portions of the data (samples, in our case) can be separated using different thresholds on various genes.
This algorithm builds multiple decision trees, each applied to a portion of the data. After all of the data has been analyzed this way, the trees' predictions are combined and the majority vote is accepted. The algorithm is extremely useful for more complex patterns, where smoother borders between classes are needed. Because of this process of evaluating many decision trees, the algorithm can also be used to evaluate features: the more often a “feature” (in our case, a gene) appears in these decision trees, the better it “performs.” Going through many trees this way, we can estimate feature significance. This is an important point, because Random Forest can help us identify the features that are most useful for classification, measured by their effect on accuracy or impurity. We will discuss this further in the next lesson, which covers Feature Selection.
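Below is a minimal sketch of bagging, voting, and impurity-based feature scoring with scikit-learn's RandomForestClassifier, again reusing X_train and y_train from the loading sketch; the number of trees is illustrative.

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Each of the 500 trees is fit on a bootstrap sample of the data and a
# random subset of features; class predictions are decided by majority
# vote across the trees.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Mean-decrease-in-impurity importance: features (genes) that appear in
# many informative splits score higher.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))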
SVM (Support Vector Machine)
Often, no linear discrimination is possible, and finding a quadratic function to delineate the groups is practically impossible, which reduces prediction accuracy. In those cases, we can use another important classification method that you will see applied to RNA-seq data. SVM, or support vector machine, differs from other classification methods in that it can find a complex class boundary efficiently. To do so, it transforms the feature space using a kernel function.

The SVM output centers on predictions for the test dataset. If the classes are well separated, high accuracy will be achieved; if the class separation is poor and the cost parameter is high, low accuracy will be reported. Changing the kernel type and reducing the number of features in the train and test datasets can help improve accuracy.
In our case, optimal separation can be achieved after feature selection. This can be done using stepwise or greedy approaches, as well as tree-based selection by voting, as we will learn in the next section.
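A minimal sketch of an SVM with a non-linear kernel in scikit-learn follows, reusing X_train, y_train, and X_test from the loading sketch above; the kernel choice and the cost parameter C are the knobs discussed above, and the values shown are illustrative.

from sklearn.svm import SVC

# The RBF kernel implicitly transforms the feature space, letting the
# classifier find a complex class boundary; a very large C penalizes
# misclassification heavily and can overfit.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)

# Predicted labels for the unlabeled test samples.
test_pred = svm.predict(X_test)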