Peng Chen, Zhenlei Li, Zhaolin Hong, Haoran Zheng, Rong Zeng. Tumor type classification and candidate cancer-specific biomarkers discovery via semi-supervised learning[J]. Biophysics Reports. doi: 10.52601/bpr.2023.230005

Tumor type classification and candidate cancer-specific biomarkers discovery via semi-supervised learning

doi: 10.52601/bpr.2023.230005
  • Corresponding author: hrzheng@ustc.edu.cn (H. Zheng); zr@sibcb.ac.cn (R. Zeng)
  • Received Date: 22 March 2023
  • Accepted Date: 26 April 2023
  • Available Online: 25 May 2023
  • Identifying cancer-related differentially expressed genes provides significant information for diagnosing tumors, predicting prognoses, and selecting effective treatments. Recently, deep learning methods have been applied to differential gene expression analysis of microarray-based high-throughput gene profiling data and have achieved good results. In this study, we propose a robust multiple-datasets-based semi-supervised learning model, MSSL, for tumor type classification and candidate cancer-specific biomarker discovery across multiple tumor types and multiple datasets. MSSL addresses the following long-standing obstacles: (1) the volume of any existing single dataset is too small to fully exploit the advantages of deep learning; (2) many datasets from different research institutions cannot be used effectively because of inconsistent internal variances and low quality; (3) deep learning methods have limited effectiveness on relatively uncommon cancers. We applied MSSL to pan-cancer normalized level-3 RNA-seq data from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) and obtained a final classification accuracy of 97.6%, a significant improvement over previous approaches. Finally, we ranked the importance of genes for each cancer type based on the classification results and validated that the top-ranked genes are biologically meaningful for the corresponding tumors and that some of them have already been used as biomarkers, which demonstrates the efficacy of our method.

  • Peng Chen, Zhenlei Li, Zhaolin Hong, Haoran Zheng and Rong Zeng declare that they have no conflict of interest.
    This article does not contain any studies with human or animal subjects performed by any of the authors.


    Figures (5) / Tables (2)
