Publications | Zhenan Fan (范喆楠)

2025

EJOR

Artificial intelligence for optimization: Unleashing the potential of parameter generation, model formulation, and solution methods

Zhenan Fan, Bissan Ghaddar†, Xinglu Wang, Linzi Xing, Yong Zhang, and Zirui Zhou

European Journal of Operational Research, 2025

Abs arXiv DOI

The rapid advancement of artificial intelligence (AI) techniques has opened up new opportunities to revolutionize various fields, including operations research (OR). This survey paper explores the integration of AI within the OR process (AI4OR) to enhance its effectiveness and efficiency across multiple stages, such as parameter generation, model formulation, and model optimization. By providing a comprehensive overview of the state-of-the-art and examining the potential of AI to transform OR, this paper aims to inspire further research and innovation in the development of AI-enhanced OR methods and tools. The synergy between AI and OR is poised to drive significant advancements and novel solutions in a multitude of domains, ultimately leading to more effective and efficient decision-making.
COLING

DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning

Raymond Li, Yuxi Feng, Zhenan Fan, Giuseppe Carenini, Weiwei Zhang, Mohammadreza Pourreza, and Yong Zhang†

International Conference on Computational Linguistics (COLING), 2025

Abs DOI PDF

While in-context Learning (ICL) has proven to be an effective technique to improve the performance of Large Language Models (LLMs) in a variety of complex tasks, notably in translating natural language questions into Structured Query Language (NL2SQL), the question of how to select the most beneficial demonstration examples remains an open research problem. While prior works often adapted off-the-shelf encoders to retrieve examples dynamically, an inherent discrepancy exists in the representational capacities between the external retrievers and the LLMs. Further, optimizing the selection of examples is a non-trivial task, since there are no straightforward methods to assess the relative benefits of examples without performing pairwise inference. To address these shortcomings, we propose Detriever, a novel demonstration retrieval framework that learns a weighted combination of LLM hidden states, where rich semantic information is encoded. To train the model, we propose a proxy score that estimates the relative benefits of examples based on the similarities between output queries. Experiments on two popular NL2SQL benchmarks demonstrate that our method significantly outperforms the state-of-the-art baselines for the NL2SQL tasks.
ICML

Efficiently serving large multimedia models using EPD Disaggregation

Gursimran Singh, Xinglu Wang, Ivan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang†, and Zhenan Fan†

International Conference on Machine Learning (ICML), 2025

Abs arXiv

Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively impacting key Service Level Objectives (SLOs) like time to first token (TTFT) and end-to-end throughput (E2ETP). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouple these steps unlocking new opportunities and optimizations. These include a new mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize encoding load within a request, a module to find the optimal resource allocation for disaggregated serving, and a novel role switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15× less utilization), batch sizes (up to 22× larger), 10× more images/request, and 2.2× larger KV caches. Further, it leads to significant improvements in latency metrics (TTFT up to 71% reduction) and end-to-end throughput (up to 57% reduction), compared to systems that do not disaggregate.
KDD

Enhancing Learned Knowledge in LoRA Adapters Through Efficient Contrastive Decoding on Ascend NPUs

Morgan Lindsay Heisler, Linzi Xing, Ge Shi, Hanieh Sadri, Gursimran Singh, Weiwei Zhang, Tao Ye, Ying Xiong, Yong Zhang, and Zhenan Fan†

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2025

Abs arXiv DOI PDF

Huawei Cloud users leverage LoRA (Low-Rank Adaptation) as an efficient and scalable method to fine-tune and customize large language models (LLMs) for application-specific needs. However, tasks that require complex reasoning or deep contextual understanding are often hindered by biases or interference from the base model when using typical decoding methods like greedy or beam search. These biases can lead to generic or task-agnostic responses from the base model instead of leveraging the LoRA-specific adaptations. In this paper, we introduce Contrastive LoRA Decoding (CoLD), a novel decoding framework designed to maximize the use of task-specific knowledge in LoRA-adapted models, resulting in better downstream performance. CoLD uses contrastive decoding by scoring candidate tokens based on the divergence between the probability distributions of a LoRA-adapted expert model and the corresponding base model. This approach prioritizes tokens that better align with the LoRA’s learned representations, enhancing performance for specialized tasks. While effective, a naive implementation of CoLD is computationally expensive because each decoding step requires evaluating multiple token candidates across both models. To address this, we developed an optimized kernel for Huawei’s Ascend NPU. CoLD achieves up to a 5.54% increase in task accuracy while reducing end-to-end latency by 28% compared to greedy decoding. This work provides practical and efficient decoding strategies for fine-tuned LLMs in resource-constrained environments and has broad implications for applied data science in both cloud and on-premises settings.
TechReport

xDeepServe: Model-as-a-Service on Huawei CloudMatrix384

XDS Team @ Huawei Cloud

Technical Report, 2025

Abs arXiv

The rise of scaled-out LLMs and scaled-up SuperPods signals a new era in large-scale AI infrastructure. LLMs continue to scale out via MoE, as seen in recent models like DeepSeek, Kimi, and Qwen. In parallel, AI hardware is scaling up, with Huawei’s CloudMatrix384 SuperPod offering hundreds of GB/s high-speed interconnects. Running large MoE models on SuperPod-scale hardware brings new challenges. It requires new execution models, scalable scheduling, efficient expert load balancing, and elimination of single points of failure. This paper presents xDeepServe, Huawei Cloud’s LLM serving system designed for SuperPod-scale infrastructure. At its core is Transformerless, a disaggregated architecture that decomposes transformer models into modular units–attention, feedforward, and MoE–executed independently on NPUs connected via high-speed fabric. We implement this design in two forms: disaggregated prefill-decode and disaggregated MoE-attention. This fully disaggregated setup enables independent scaling of compute and memory without sacrificing performance. To support this architecture, we propose XCCL, a communication library that leverages CloudMatrix384’s global shared memory to implement efficient point-to-point and all-to-all primitives. We also extend our serving engine FlowServe with system-level techniques, enabling scalable inference across hundreds of NPUs.
Submitted

ExpertWeave: Efficiently Serving Expert-Specialized Fine-Tuned Adapters at Scale

Ge Shi, Hanieh Sadri, Qian Wang, Yu Zhang, Ying Xiong, Yong Zhang†, and Zhenan Fan†

Submitted, 2025

Abs arXiv

Expert-Specialized Fine-Tuning (ESFT) adapts Mixture-of-Experts (MoE) large language models to enhance their task-specific performance by selectively tuning the top-activated experts for the task. Serving these fine-tuned models at scale is challenging: deploying merged models in isolation is prohibitively resource-hungry, while existing multi-adapter serving systems with LoRA-style additive updates are incompatible with ESFT’s expert-oriented paradigm. We present ExpertWeave, a system that serves multiple ESFT adapters concurrently over a single shared MoE base model, drastically reducing the memory footprint and improving resource utilization. To seamlessly integrate into existing inference pipelines for MoE models with non-intrusive modifications and minimal latency overhead, ExpertWeave introduces a virtual-memory-assisted expert weight manager that co-locates base-model and adapter experts without incurring memory overhead from fragmentation, and a fused kernel for batched rerouting to enable lightweight redirection of tokens to the appropriate experts at runtime. Our evaluations show that ExpertWeave can simultaneously serve multiple adapters of a 16B MoE model on a single accelerator where the baseline runs out of memory, or provides up to 94x more KV cache capacity and achieves up to 18% higher throughput while using comparable resources, all without compromising model accuracy. ExpertWeave maintains low overhead even when scaling to 20 adapters, with a 4-11% latency increase compared with serving the base model alone. Source code will be released soon.
Submitted

HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang†, and Zhenan Fan†

Submitted, 2025

Abs arXiv

Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures. We present \textbfHyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs. It features a multi-SLO-aware scheduler that leverages budget estimation and request prioritization to ensure proactive SLO compliance for both new and ongoing requests. The system supports prefill- and decode-stage multi-SLO scheduling for P/D-disaggregated architectures and KV cache transfers. It also enables cost-effective scaling decisions, prefill-decode instance linking during scaling, and rapid P/D role transitions. To accelerate scaling and reduce cold-start latency, a device-to-device (D2D) weight transfer mechanism is proposed that lowers weight loading overhead by up to \textbf19.39. These optimizations allow the system to achieve up to \textbf4.44 higher SLO attainment, \textbf65.82% lower request latency, and cost parity with state-of-the-art baselines. The code will be released soon.
Submitted

ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models

Gursimran Singh, Timothy Yu, Haley Li, Cheng Chen, Hanieh Sadri, Qintao Zhang, Yu Zhang, Ying Xiong, Yong Zhang†, and Zhenan Fan†

Submitted, 2025

Abs arXiv

Mixture-of-Experts (MoE) models promise efficient scaling of large language models (LLMs) by activating only a small subset of experts per token, but their parallelized inference pipelines make elastic serving challenging. Existing strategies fall short: horizontal scaling provisions entire replicas of the current configuration, often tens to hundreds of accelerators, leading to coarse granularity, long provisioning delays, and costly overprovisioning. Vertical scaling offers finer adjustments but typically requires instance restarts, incurring downtime. These limitations make current approaches ill-suited for the bursty, short-lived traffic patterns common in cloud deployments. We present ElasticMoE, an elastic scaling framework for MoE LLMs that achieves fine-grained, low-latency, and zero-downtime scaling. ElasticMoE decouples inference execution from memory operations, enabling scaling steps to proceed concurrently with serving. An HBM Management Module (HMM) reuses weights and KV caches via zero-copy remapping, while high-bandwidth peer-to-peer transfers bring newly added accelerators online without interrupting service. A virtual memory based expert redistribution mechanism migrates MoE experts without costly buffer reallocations, reducing peak memory usage during expert parallelism reconfiguration. Our evaluation on Ascend NPUs with three popular MoE LLMs shows that ElasticMoE achieves up to 9x lower scale-up latency, up to 2x better throughput during scaling, and significantly improves SLO attainment compared to baselines. By enabling fine-grained, concurrent scaling with minimal disruption, ElasticMoE advances the practicality of deploying massive MoE LLMs in dynamic cloud environments.

2024

ICLR

Fair and efficient contribution valuation for vertical federated learning

Zhenan Fan, Huang Fang, Xinglu Wang, Zirui Zhou, Jian Pei, Michael P Friedlander, and Yong Zhang†

International Conference on Learning Representations (ICLR), 2024

Abs PDF

Federated learning is a popular technology for training machine learning models on distributed data sources without sharing data. Vertical federated learning or feature-based federated learning applies to the cases that different data sources share the same sample ID space but differ in feature space. To ensure the data owners’ long-term engagement, it is critical to objectively assess the contribution from each data source and recompense them accordingly. The Shapley value (SV) is a provably fair contribution valuation metric originated from cooperative game theory. However, computing the SV requires extensively retraining the model on each subset of data sources, which causes prohibitively high communication costs in federated learning. We propose a contribution valuation metric called vertical federated Shapley value (VerFedSV) based on SV. We show that VerFedSV not only satisfies many desirable properties for fairness but is also efficient to compute, and can be adapted to both synchronous and asynchronous vertical federated learning algorithms. Both theoretical analysis and extensive experimental results verify the fairness, efficiency, and adaptability of VerFedSV.
TechReport

Machine Learning Insides OptVerse AI Solver: Design Principles and Applications

OptVerse Team @ Huawei Cloud

Technical Report, 2024

Abs arXiv

In an era of digital ubiquity, efficient resource management and decision-making are paramount across numerous industries. To this end, we present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud’s OptVerse AI Solver, which aims to mitigate the scarcity of real-world mathematical programming instances, and to surpass the capabilities of traditional optimization techniques. We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem. Furthermore, we introduce a training framework leveraging augmentation policies to maintain solvers’ utility in dynamic environments. Besides the data generation and augmentation, our proposed approaches also include novel ML-driven policies for personalized solver strategies, with an emphasis on applications like graph convolutional networks for initial basis selection and reinforcement learning for advanced presolving and cut selection. Additionally, we detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance. Compared with traditional solvers such as Cplex and SCIP, our ML-augmented OptVerse AI Solver demonstrates superior speed and precision across both established benchmarks and real-world scenarios, reinforcing the practical imperative and effectiveness of machine learning techniques in mathematical programming solvers.
TechReport

Sql-encoder: Improving nl2sql in-context learning through a context-aware encoder

Mohammadreza Pourreza, Davood Rafiei, Yuxi Feng, Raymond Li, Zhenan Fan, and Weiwei Zhang†

Technical Report, 2024

Abs arXiv

Detecting structural similarity between queries is essential for selecting examples in in-context learning models. However, assessing structural similarity based solely on the natural language expressions of queries, without considering SQL queries, presents a significant challenge. This paper explores the significance of this similarity metric and proposes a model for accurately estimating it. To achieve this, we leverage a dataset comprising 170k question pairs, meticulously curated to train a similarity prediction model. Our comprehensive evaluation demonstrates that the proposed model adeptly captures the structural similarity between questions, as evidenced by improvements in Kendall-Tau distance and precision@k metrics. Notably, our model outperforms strong competitive embedding models from OpenAI and Cohere. Furthermore, compared to these competitive models, our proposed encoder enhances the downstream performance of NL2SQL models in 1-shot in-context learning scenarios by 1-2% for GPT-3.5-turbo, 4-8% for CodeLlama-7B, and 2-3% for CodeLlama-13B.
COLING

Towards human-aligned evaluation for linear programming word problems

Linzi Xing, Xinglu Wang, Yuxi Feng, Zhenan Fan, Jing Xiong, Zhijiang Guo, Xiaojin Fu, Rindra Ramamonjison, Mahdi Mostajabdaveh, Xiongwei Han, Zirui Zhou, and Yong Zhang†

Conference on Computational Linguistics (COLING), 2024

Abs DOI PDF

Math Word Problem (MWP) is a crucial NLP task aimed at providing solutions for given mathematical descriptions. A notable sub-category of MWP is the Linear Programming Word Problem (LPWP), which holds significant relevance in real-world decision-making and operations research. While the recent rise of generative large language models (LLMs) has brought more advanced solutions to LPWPs, existing evaluation methodologies for this task still diverge from human judgment and face challenges in recognizing mathematically equivalent answers. In this paper, we introduce a novel evaluation metric rooted in graph edit distance, featuring benefits such as permutation invariance and more accurate program equivalence identification. Human evaluations empirically validate the superior efficacy of our proposed metric when particularly assessing LLM-based solutions for LPWP.
AAAI

Learn2Aggregate: Supervised Generation of Chvátal-Gomory Cuts Using Graph Neural Networks

Arnaud Deza, Elias B Khalil, Zhenan Fan, Zirui Zhou, and Yong Zhang†

Annual AAAI Conference on Artificial Intelligence (AAAI), 2024

Abs DOI PDF

We present Learn2Aggregate, a machine learning (ML) framework for optimizing the generation of Chvátal-Gomory (CG) cuts in mixed integer linear programming (MILP). The framework trains a graph neural network to classify useful constraints for aggregation in CG cut generation. The ML-driven CG separator selectively focuses on a small set of impactful constraints, improving runtimes without compromising the strength of the generated cuts. Key to our approach is the formulation of a constraint classification task which favours sparse aggregation of constraints, consistent with empirical findings. This, in conjunction with a careful constraint labeling scheme and a hybrid of deep learning and feature engineering, results in enhanced CG cut generation across five diverse MILP benchmarks. On the largest test sets, our method closes roughly twice as much of the integrality gap as the standard CG method while running 40$% faster. This performance improvement is due to our method eliminating 75% of the constraints prior to aggregation.

2023

ICML

Smart Initial Basis Selection for Linear Programs

Zhenan Fan*, Xinglu Wang*, Oleksandr Yakovenko*, Abdullah Ali Sivas, Owen Ren, Yong Zhang, and Zirui Zhou†

International Conference on Machine Learning (ICML), 2023

Abs PDF Code

The simplex method, introduced by Dantzig more than half a century ago, is still to date one of the most efficient methods for solving large-scale linear programming (LP) problems. While the simplex method is known to have the finite termination property under mild assumptions, the number of iterations until optimality largely depends on the choice of initial basis. Existing strategies for selecting an advanced initial basis are mostly rule-based. These rules usually require extensive expert knowledge and empirical study to develop. Yet, many of them fail to exhibit consistent improvement, even for LP problems that arise in a single application scenario. In this paper, we propose a learning-based approach for initial basis selection. We employ graph neural networks as a building block and develop a model that attempts to capture the relationship between LP problems and their optimal bases. In addition, during the inference phase, we supplement the learning-based prediction with linear algebra tricks to ensure the validity of the generated initial basis. We validate the effectiveness of our proposed strategy by extensively testing it with state-of-the-art simplex solvers, including the open-source solver HiGHS and the commercial solver OptVerse. Through these rigorous experiments, we demonstrate that our strategy achieves substantial speedup and consistently outperforms existing rule-based methods. Furthermore, we extend the proposed approach to generating restricted master problems for column generation methods and present encouraging numerical results.
OJMO

Cardinality-constrained structured data-fitting problems

Zhenan Fan*, Huang Fang*, and Michael P Friedlander†

Open Journal of Mathematical Optimization, 2023

Abs DOI PDF

A memory-efficient framework is described for the cardinality-constrained structured data-fitting problem. Dual-based atom-identification rules are proposed that reveal the structure of the optimal primal solution from near-optimal dual solutions. These rules allow for a simple and computationally cheap algorithm for translating any feasible dual solution to a primal solution that satisfies the cardinality constraint. Rigorous guarantees are provided for obtaining a near-optimal primal solution given any dual-based method that generates dual iterates converging to an optimal dual solution. Numerical experiments on real-world datasets support confirm the analysis and demonstrate the efficiency of the proposed approach.

2022

TechReport

A dual approach for federated learning

Zhenan Fan*, Huang Fang*, and Michael P Friedlander†

Technical Report, 2022

Abs arXiv Code

We study the federated optimization problem from a dual perspective and propose a new algorithm termed federated dual coordinate descent (FedDCD), which is based on a type of coordinate descent method developed by Necora et al. [Journal of Optimization Theory and Applications, 2017]. Additionally, we enhance the FedDCD method with inexact gradient oracles and Nesterov’s acceleration. We demonstrate theoretically that our proposed approach achieves better convergence rates than the state-of-the-art primal federated optimization algorithms under mild conditions. Numerical experiments on real-world datasets support our analysis.
ICDE

Improving Fairness for Data Valuation in Horizontal Federated Learning

Zhenan Fan, Huang Fang, Zirui Zhou, Jian Pei, Michael P Friedlander, Changxin Liu, and Yong Zhang†

2022

Abs arXiv DOI

Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. The success of federated learning depends largely on the participation of data owners. To sustain and encourage data owners’ participation, it is crucial to fairly evaluate the quality of the data provided by the data owners and reward them correspondingly. Federated Shapley value, recently proposed by Wang et al. [Federated Learning, 2020], is a measure for data value under the framework of federated learning that satisfies many desired properties for data valuation. However, there are still factors of potential unfairness in the design of federated Shapley value because two data owners with the same local data may not receive the same evaluation. We propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. The design depends on completing a matrix consisting of all the possible contributions by different subsets of the data owners. It is shown under mild conditions that this matrix is approximately low-rank by leveraging concepts and tools from optimization. Both theoretical analysis and empirical evaluation verify that the proposed measure does improve fairness in many circumstances.
IEEE-TSP

Polar Deconvolution of Mixed Signals

Zhenan Fan, Halyun Jeong, Babhru Joshi, and Michael P Friedlander†

IEEE Transactions on Signal Processing, 2022

Abs arXiv DOI Code

The signal demixing problem seeks to separate a superposition of multiple signals into its constituent components. This paper describes a two-stage approach that first decompresses and subsequently deconvolves the noisy and undersampled observations of the superposition. Probabilistic error bounds are given on the accuracy with which this process approximates the individual signals. The theory of polar convolution of convex sets and gauge functions plays a central role in the analysis and solution process. If the measurements are random and the noise is bounded, this approach stably recovers low-complexity and mutually incoherent signals, with high probability and with optimal sample complexity. We develop an efficient algorithm, based on level-set and conditional-gradient methods, that solves the convex optimization problems with sublinear iteration complexity and linear space requirements. Numerical experiments on both real and synthetic data confirm the theory and the efficiency of the approach.
TechReport

Knowledge-Injected Federated Learning

Zhenan Fan, Zirui Zhou, Jian Pei, Michael P Friedlander, Jiajie Hu, Chengliang Li, and Yong Zhang†

Technical Report, 2022

Abs arXiv Code

Federated learning is an emerging technique for training models from decentralized data sets. In many applications, data owners participating in the federated learning system hold not only the data but also a set of domain knowledge. Such knowledge includes human know-how and craftsmanship that can be extremely helpful to the federated learning task. In this work, we propose a federated learning framework that allows the injection of participants’ domain knowledge, where the key idea is to refine the global model with knowledge locally. The scenario we consider is motivated by a real industry-level application, and we demonstrate the effectiveness of our approach to this application.

2021

TechReport

Achieving Model Fairness in Vertical Federated Learning

Changxin Liu*, Zhenan Fan*, Zirui Zhou, Yang Shi, Jian Pei, Lingyang Chu, and Yong Zhang†

Technical Report, 2021

Abs arXiv Code

Vertical federated learning (VFL) has attracted greater and greater interest since it enables multiple parties possessing non-overlapping features to strengthen their machine learning models without disclosing their private data and model parameters. Similar to other machine learning algorithms, VFL faces demands and challenges of fairness, i.e., the learned model may be unfairly discriminatory over some groups with sensitive attributes. To tackle this problem, we propose a fair VFL framework in this work. First, we systematically formulate the problem of training fair models in VFL, where the learning task is modelled as a constrained optimization problem. To solve it in a federated and privacy-preserving manner, we consider the equivalent dual form of the problem and develop an asynchronous gradient coordinate-descent ascent algorithm, where some active data parties perform multiple parallelized local updates per communication round to effectively reduce the number of communication rounds. The messages that the server sends to passive parties are deliberately designed such that the information necessary for local updates is released without intruding on the privacy of data and sensitive attributes. We rigorously study the convergence of the algorithm when applied to general nonconvex-concave min-max problems. We prove that the algorithm finds a δ-stationary point of the dual objective in O(δ^-4) communication rounds under mild conditions. Finally, the extensive experiments on three benchmark datasets demonstrate the superior performance of our method in training fair models.

2020

ICLR

Fast convergence of stochastic subgradient method under interpolation

Huang Fang, Zhenan Fan, and Michael Friedlander†

International Conference on Learning Representations (ICLR), 2020

Abs PDF

This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized nonsmooth optimization problems that satisfy an interpolation condition. By leveraging the composite structure of the empirical risk minimization problems, we prove that when interpolation holds, SSGD converges at the same rate as the stochastic gradient descent (SGD) method applied to smooth problems. Our analysis provides a partial explanation for the empirical observation that sometimes SGD and SSGD behave similarly for training smooth and nonsmooth machine learning models. We also prove that the rates we derive are optimal for the subgradient method in the convex and interpolation setting.
AISTATS

Greed Meets Sparsity: Understanding and Improving Greedy Coordinate Descent for Sparse Optimization

Huang Fang, Zhenan Fan, Yifan Sun, and Michael Friedlander†

International Conference on Artificial Intelligence and Statistics (AISTATS), 2020

Abs PDF Code

We consider greedy coordinate descent (GCD) for composite problems with sparsity inducing regularizers, including 1-norm regularization and non-negative constraints. Empirical evidence strongly suggests that GCD, when initialized with the zero vector, has an implicit screening ability that usually selects at each iteration coordinates that at are nonzero at the solution. Thus, for problems with sparse solutions, GCD can converge significantly faster than randomized coordinate descent. We present an improved convergence analysis of GCD for sparse optimization, and a formal analysis of its screening properties. We also propose and analyze an improved selection rule with stronger ability to produce sparse iterates. Numerical experiments on both synthetic and real-world data support our analysis and the effectiveness of the proposed selection rule.
FnT-OPT

Atomic Decomposition via Polar Alignment: The Geometry of Structured Optimization

Zhenan Fan, Halyun Jeong, Yifan Sun, and Michael P. Friedlander†

Foundations and Trends in Optimization, 2020

Abs DOI PDF

Structured optimization uses a prescribed set of atoms to assemble a solution that fits a model to data. Polarity, which extends the familiar notion of orthogonality from linear sets to general convex sets, plays a special role in a simple and geometric form of convex duality. This duality correspondence yields a general notion of alignment that leads to an intuitive and complete description of how atoms participate in the final decomposition of the solution. The resulting geometric perspective leads to variations of existing algorithms effective for large-scale problems. We illustrate these ideas with many examples, including applications in matrix completion and morphological component analysis for the separation of mixtures of signals.

2019

ACSSC

Bundle methods for dual atomic pursuit

Zhenan Fan, Yifan Sun, and Michael P Friedlander†

Asilomar Conference on Signals, Systems, and Computers (ACSSC), 2019

Abs arXiv DOI

The aim of structured optimization is to assemble a solution, using a given set of (possibly uncountably infinite) atoms, to fit a model to data. A two-stage algorithm based on gauge duality and bundle method is proposed. The first stage discovers the optimal atomic support for the primal problem by solving a sequence of approximations of the dual problem using a bundle-type method. The second stage recovers the approximate primal solution using the atoms discovered in the first stage. The overall approach leads to implementable and efficient algorithms for large problems.