Benchmarking Lung Cancer Prognosis Models: A Comparative Study of Machine Learning and Cox Regression on RNA-seq and Personalised Metabolic Data

  • Pradip Moon

Student thesis: Doctoral Thesis

Abstract

Lung cancer has among the highest incidence and mortality rates of any cancer worldwide.
The National Cancer Institute projects that new cancer cases worldwide will reach 29.5
million annually by 2040, with an anticipated 16.4 million cancer-related deaths.
According to GLOBOCAN 2020, there were approximately 19.3 million cancer
cases and 10 million cancer-related deaths globally. Of these, lung cancer accounted for
2.21 million cases (11.4%) and 1.79 million deaths (18% of all cancer deaths). To
maximise patient survival, it is essential to understand the contributing factors. Recently,
multi-omics analysis has emerged as a method for classifying patient groups based on
prognosis and obtaining a more individualised interpretation of prognosis to support personalised
medicine. Additionally, genome-scale metabolic models (GSMMs) can be used
to refine the phenotypic predictions of patients using multi-omics data. Fluxomic features
were derived from a genome-scale metabolic model based on transcriptomic data. The
transcriptomic dataset was downloaded from www.cbioportal.org. This study aimed to
investigate genome-scale metabolic modelling in a predictive setting to determine whether
it adds significant information to transcriptomic data when used for prediction. We also investigated
whether traditional regression machine learning techniques are superior to semiparametric
regression machine learning methods in predicting lung cancer patient survival
using only gene expression and fluxomic datasets. Our analysis aimed to identify
novel biomarkers and pathways involved in lung adenocarcinoma, which will help identify
at-risk patients and support doctors in clinical treatment selection and outcome prediction.
Personalised metabolic datasets, known as fluxomics, were extracted from the gene expression
datasets using genome-scale metabolic modelling and used to train machine learning models.
The models included Decision Trees (DT), Gradient Boosting Trees (GBT),
Random Forests (RF), Support Vector Regression (SVR), and Cox Proportional Hazards
(Cox-PH). Additionally, we used Principal Component Analysis (PCA) and RF feature
selection to identify effective predictors. In terms of accuracy, SVR performed best
with both PCA and RF feature selection. This suggests that SVR is better suited than
the other methods for predicting the outcome variables in these datasets. Moreover, metabolic
features provided important information when added to the transcriptomic datasets.
Furthermore, we identified 17 previously unknown biomarkers and
4 pathways that could be used to predict the survival of lung cancer patients. These
findings not only contribute to a broader understanding of lung cancer patients, but also
hold the potential to drive the advancement of personalised treatment approaches.
Date of Award: 16 Feb 2024
Original language: English
Awarding Institution
  • Teesside University
Supervisor: Claudio Angione (Supervisor), Maria Angeles Juanes Ortiz (Supervisor) & The Anh Han (Supervisor)
