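Guanghua Lecture Series: Forum of Public Figures and Entrepreneurs, Session 6642
Topic: Conformal Prediction Intervals and Predictive Distributions
Speaker: Prof. Jing Qin, NIAID
Host: Prof. Lin Huazhen (林華珍), School of Statistics
Time: October 16, 16:00-17:00
Venue: Conference Room 408, Hongyuan Building, Liulin Campus
Organizers: Center of Statistical Research and School of Statistics; Office of Scientific Research
About the speaker: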
Dr. Jing Qin is a Mathematical Statistician at the Biostatistics Research Branch of the National Institute of Allergy and Infectious Diseases (NIAID). He earned his Ph.D. in 1992 from the University of Waterloo and subsequently became an Assistant Professor at the University of Maryland, College Park. Before joining the National Institutes of Health (NIH) in 2004, Dr. Qin spent five years at the Memorial Sloan-Kettering Cancer Center. His research interests span a wide range of topics, including empirical likelihood methods, case-control studies, various biased sampling problems, econometrics, survival analysis, missing data, causal inference, genetic mixture models, generalized linear models, survey sampling, and microarray data analysis. Recently, Dr. Qin’s work has focused on conformal inference for quantifying uncertainty in machine learning. In 2006, he was elected a Fellow of the American Statistical Association. He is also the author of the 2017 monograph Biased Sampling, Over-identified Parametric Problems, and Beyond (Springer, ICSA Book Series in Statistics).
Abstract:
Conformal prediction (CP) is a machine learning framework for uncertainty quantification that produces statistically valid prediction regions (prediction intervals) for any underlying point predictor (whether statistical, machine, or deep learning), assuming only exchangeability of the data. Consider a scenario where we have training data that include both the feature variable X and the outcome Y, together with test data that include only the feature variable X. The objective is to construct a 95% prediction interval for the outcome Y in the test data. Lawless and Fredette (2005) addressed this problem within parametric frameworks, using a pivotal-based approach. Their method yields prediction intervals and predictive distributions with well-calibrated frequentist probability interpretations. However, as the dimension of the feature variable grows, modeling the conditional distribution of Y | X becomes increasingly difficult.
In this talk, we aim to extend their work by removing the parametric assumption for the predictive interval. Unfortunately, without parametric assumptions about the conditional distribution of Y | X, accurate conditional coverage cannot be achieved. Instead, we will build on conformal inference (Vovk et al. 2005), which requires only accurate unconditional (marginal) coverage. Although the conformal predictive interval is inherently distribution-free, the choice of working conditional model can significantly affect the resulting interval length: a well-chosen conditional model yields shorter intervals, so careful modeling remains practically important even in distribution-free settings.
Furthermore, we will discuss the application of conformal prediction intervals in more complex scenarios, including covariate shift between training and test data and cases where the outcome Y may be right-censored.
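As an illustration of the distribution-free idea described in the abstract, the following is a minimal sketch of split (inductive) conformal prediction, one common variant of the framework. It is not taken from the talk: the least-squares line used as the working model, the simulated data, and the function name split_conformal_interval are all assumptions made for this example.

```python
import numpy as np


def split_conformal_interval(x_train, y_train, x_cal, y_cal, x_test, alpha=0.05):
    """Split conformal interval around a least-squares line (hypothetical working model)."""
    # Fit the working model on the training part only.
    slope, intercept = np.polyfit(x_train, y_train, deg=1)

    def predict(x):
        return slope * np.asarray(x) + intercept

    # Nonconformity scores: absolute residuals on the held-out calibration part.
    scores = np.sort(np.abs(y_cal - predict(x_cal)))
    n = len(scores)

    # Finite-sample rank ceil((n + 1)(1 - alpha)); if it exceeds n, the interval is infinite.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = scores[k - 1] if k <= n else np.inf

    center = predict(x_test)
    return center - q, center + q


# Simulated, exchangeable data (an assumption for illustration only).
rng = np.random.default_rng(0)


def simulate(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = 2.0 * x + rng.standard_t(df=3, size=n)  # heavy-tailed noise, not Gaussian
    return x, y


x_train, y_train = simulate(1000)
x_cal, y_cal = simulate(1000)
x_test, y_test = simulate(5000)

lower, upper = split_conformal_interval(x_train, y_train, x_cal, y_cal, x_test)

# Empirical marginal coverage on the test set; it should be close to 1 - alpha.
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.3f}")
```

The coverage here is marginal (averaged over X), not conditional on X, and it holds whatever working model is used. A better-fitting working model makes the calibration residuals smaller and hence the interval shorter, which is the point the abstract makes about modeling in distribution-free settings.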