Date of Award
2-19-2025
Publication Type
Doctoral Thesis
Degree Name
Ph.D.
Department
Computer Science
Keywords
AI security, Machine learning security, Privacy-preserving machine learning
Supervisor
Dima Alhadidi
Rights
info:eu-repo/semantics/embargoedAccess
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Abstract
Modern data analysis and machine learning applications face the dual challenge of handling high-dimensional data while ensuring data privacy. The challenge is sharpened by privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA), which protects health information, and the European General Data Protection Regulation (GDPR), which protects all sensitive personal data. This thesis integrates advanced clustering techniques, secure computing frameworks, and federated learning optimizations to present a unified progression of methodologies for scalable and privacy-preserving machine learning. The contributions span diverse domains, offering practical solutions for social network analysis, single-cell genomics, and natural language processing (NLP) applications while laying the groundwork for future innovations in secure and efficient data analysis. The thesis presents three frameworks, NICASN, PPPCT, and Trustformer, that address these challenges in social network, genomics, and federated learning contexts, respectively. Nonnegative matrix factorization and independent component analysis for clustering social networks (NICASN) is a hybrid clustering approach that combines nonnegative matrix factorization (NMF), independent component analysis (ICA), and $k$-means with advanced centroid initialization. NICASN detects communities in large-scale social networks by reducing dimensionality and improving clustering quality while maintaining computational efficiency. The success of NICASN in handling high-dimensional social network data inspired its adaptation to biological datasets, leading to the development of privacy-preserving parallel clustering for transcriptomics data (PPPCT). PPPCT addresses the unique challenges of single-cell RNA sequencing (scRNA-seq) data: high dimensionality, sparsity, and stringent privacy concerns.
By incorporating NMF for dimensionality reduction, parallel $k$-means clustering, and Intel Software Guard Extensions (SGX) for secure computation, PPPCT ensures privacy while achieving state-of-the-art clustering quality and scalability on sensitive genomic datasets. PPPCT in turn paved the way for more complex privacy-preserving machine learning tasks, culminating in the development of Trustformer. Trustformer builds on the foundations of PPPCT, applying its clustering and privacy-preserving strategies to the training of deep learning models on decentralized sensitive data. Trustformer is a novel federated learning (FL) framework designed to train large Transformer models. Unlike traditional FL methods that transmit entire model weights, Trustformer incorporates $k$-means clustering into the FL aggregation mechanism. It applies $k$-means clustering to the parameters of each layer of a deep neural network to find per-layer centroids, and the central server then aggregates these centroids instead of the full model parameters. The result is significantly reduced communication overhead and enhanced privacy.
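The centroid-based aggregation described for Trustformer can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`kmeans_1d`, `client_update`, `server_aggregate`), the layer names, the choice of k = 8, the quantile initialization, and the rule of averaging sorted centroids on the server are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's k-means on scalar weights; returns k sorted centroids."""
    values = np.asarray(values, dtype=np.float64).ravel()
    # Assumption: quantile initialization, chosen only to keep the run deterministic.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = values[labels == j]
            if members.size:
                centroids[j] = members.mean()
    return np.sort(centroids)


def client_update(layers, k):
    """Hypothetical client step: cluster each layer and keep only its centroids."""
    return {name: kmeans_1d(weights, k) for name, weights in layers.items()}


def server_aggregate(client_centroids):
    """Hypothetical server step: average the sorted centroids layer by layer."""
    names = client_centroids[0].keys()
    return {
        name: np.mean([c[name] for c in client_centroids], axis=0)
        for name in names
    }


# Three simulated clients, each holding two made-up layers of random weights.
rng = np.random.default_rng(0)
clients = [
    {
        "attn.weight": rng.normal(size=(64, 64)),
        "ffn.weight": rng.normal(size=(64, 256)),
    }
    for _ in range(3)
]

k = 8
uploads = [client_update(layers, k) for layers in clients]
global_centroids = server_aggregate(uploads)

# Compare what one client would upload with and without clustering.
full_size = sum(w.size for w in clients[0].values())
sent_size = sum(c.size for c in uploads[0].values())
print(f"values per client: {full_size} full weights vs {sent_size} centroids")
```

In this toy setting each client uploads k values per layer rather than every weight, which is the source of the communication saving the abstract describes. How the centroids are mapped back onto model parameters is not covered here.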
Recommended Citation
Abbasi Tadi, Ali, "Privacy-preserving techniques in modern machine learning models" (2025). Electronic Theses and Dissertations. 9689.
https://scholar.uwindsor.ca/etd/9689