A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking

Varma, Umang (2022) A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking. Doctoral thesis, UIN SAIZU Purwokerto.


Abstract

A driving force behind the development of machine learning techniques is the availability of vast amounts of continuously generated data and the enormous potential of using these data. It is, however, not uncommon to have a paucity of data. This paucity can take different forms: we may have too little data to perform the task at hand, we may have missing entries, or we may be working with aggregate statistics that capture only a fraction of the data we seek to study. Extracting meaningful conclusions from scarce data is challenging and requires creative approaches that account for the scarcity in their assumptions, find other means to make up for the insufficient data, or both. It is not always possible to adapt the standard algorithm for a given problem, and the challenge often lies in finding the right basis from which to build a viable solution. This thesis takes on two problems marked by a paucity of data and develops techniques to overcome the challenges such paucities can pose.
In single cell RNA-sequencing, each entry of a gene expression matrix counts the molecules observed for a gene in a cell, and only a small fraction of the molecules in the cell are observed. This is particularly challenging because biologists are often concerned with whether a gene is expressed in a cell at all; however, a zero entry in a gene expression matrix only says that no such molecules were seen in the given cell, not that there were no such molecules in the given cell. In practice, a vast majority (often over 90%) of entries in gene expression matrices are zero. We focus on the problem of feature selection: we give theoretical guarantees for information-theoretic algorithms and address practical issues involved in their implementation.
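The point that a zero entry does not imply an absent gene can be illustrated with a small simulation. The sketch below is not the thesis's model: it assumes Poisson-distributed true molecule counts and independent binomial capture of each molecule, and the names `simulate_observed_counts` and `capture_rate`, along with all numeric settings, are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_observed_counts(true_counts, capture_rate, rng):
    """Keep each molecule independently with probability capture_rate.

    Assumption: binomial downsampling stands in for the sequencing
    process; real capture is more complicated than this.
    """
    return rng.binomial(true_counts, capture_rate)


rng = np.random.default_rng(0)

# Hypothetical ground truth: 200 cells x 500 genes, lowly expressed.
true_counts = rng.poisson(lam=2.0, size=(200, 500))

# Assumed capture rate of 5%: only a small fraction of molecules is seen.
observed = simulate_observed_counts(true_counts, capture_rate=0.05, rng=rng)

true_zero_fraction = np.mean(true_counts == 0)
observed_zero_fraction = np.mean(observed == 0)

# Among observed zeros, the share of entries where the gene was present.
false_zero_share = np.mean(true_counts[observed == 0] > 0)

print(f"zeros in true matrix:     {true_zero_fraction:.3f}")
print(f"zeros in observed matrix: {observed_zero_fraction:.3f}")
print(f"observed zeros that hide expression: {false_zero_share:.3f}")
```

Under these assumed settings the observed counts are Poisson with mean 0.1, so the expected fraction of observed zeros is exp(-0.1), about 0.90, even though only about 14% of the true entries are zero. Most observed zeros in this toy setup therefore correspond to genes that were in fact expressed.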
The problem of ranking items from pairwise comparisons is also often constrained by a paucity of data, since collecting pairwise comparisons made by humans can be expensive, and we propose ways to incorporate other data to compensate for a small number of pairwise comparison observations. We give a better sample complexity bound for the RankCentrality algorithm, with simpler proofs that use matrix concentration inequalities. We also introduce λ-regularized RankCentrality, which is capable of giving non-trivial output even when the number of observations is small and whose theoretical analysis depends on our new proofs for RankCentrality, and a similarity-based regularization that can use features of the items being ranked to significantly improve performance in the small-sample regime.
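For readers unfamiliar with RankCentrality, the following is a minimal sketch of the standard algorithm: build a random walk that moves from item i toward item j in proportion to how often j beat i, then score items by the walk's stationary distribution. The abstract does not define the λ-regularization, so the `lam` parameter here is only an assumption: it adds `lam` pseudo-wins in each direction for every pair, which may differ from the thesis's definition. The function name, the example matrices, and the power-iteration settings are likewise illustrative.

```python
import numpy as np


def rank_centrality(wins, lam=0.0, n_iter=10_000, tol=1e-12):
    """Score items from pairwise comparisons; wins[i, j] = times i beat j.

    lam is a hypothetical smoothing term (pseudo-wins for every pair),
    not necessarily the regularization used in the thesis.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    smoothed = wins + lam * (1.0 - np.eye(n))
    totals = smoothed + smoothed.T

    # frac_lost[i, j]: fraction of i-vs-j comparisons that j won.
    frac_lost = np.divide(
        smoothed.T, totals, out=np.zeros_like(totals), where=totals > 0
    )

    # Divide by the maximum number of distinct opponents so rows sum to <= 1.
    d_max = max(int((totals > 0).sum(axis=1).max()), 1)
    P = frac_lost / d_max
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # self-loops absorb the rest

    # Power iteration for the stationary distribution pi = pi @ P.
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        nxt = pi @ P
        converged = np.abs(nxt - pi).sum() < tol
        pi = nxt
        if converged:
            break
    return pi / pi.sum()


# Three items with plenty of comparisons: item 0 usually wins.
dense_wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]]
scores = rank_centrality(dense_wins)

# Four items, only two comparisons observed: 0 beat 1, and 2 beat 3.
sparse_wins = np.zeros((4, 4))
sparse_wins[0, 1] = 1
sparse_wins[2, 3] = 1
smoothed_scores = rank_centrality(sparse_wins, lam=1.0)

print(np.argsort(-scores))
print(smoothed_scores)
```

In the sparse example the comparison graph is disconnected, so the unregularized walk has no unique stationary distribution. With the assumed smoothing, every pair is connected and each item receives a positive score, which is one way to obtain non-trivial output from very few observations.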

Item Type:	Thesis (Doctoral)
Subjects:	500 Natural sciences and mathematics > 510 Mathematics
Divisions:	Perpustakaan
Depositing User:	sdr prakerin 22
Date Deposited:	20 May 2022 02:11
Last Modified:	20 May 2022 02:11
URI:	http://repository.uinsaizu.ac.id/id/eprint/13538
