Big Data Statistics Workshop: Methods, Theory and Applications (2014.11.01-2014.11.04)
Hosts: School of Statistics and Management, Shanghai University of Finance and Economics; Shanghai Center for Mathematical Sciences (Fudan University)
Organizers: School of Statistics and Management, Shanghai University of Finance and Economics; School of Management, Fudan University
Program Committee: Shen Xiaotong, Ying Zhiliang, Zhou Yong, Huang Jian, Zhang Xinsheng, Zhu Ji, Zhu Liping, Feng Xingdong
Nov 1:
Morning:
Statistical computing and biostatistics
9:45am-10:25am, Li Ping, Rutgers University
10:25am-11:05am, Wang Yazhen, University of Wisconsin-Madison
11:05am-11:45am, Lin Huazhen, Southwestern University of Finance and
Economics
Afternoon:
Machine learning and biostatistics
2:00pm-2:40pm, Zhu Ji, University of Michigan
2:40pm-3:20pm, Pan Wei, University of Minnesota
Break
3:40pm-4:20pm, Hu Feifang, George Washington University
4:20pm-5:00pm, Zou Guohua, Chinese Academy of Sciences
Nov 2:
Morning:
Finance and business statistics, and machine learning
9:00am-9:40am, Fan Jianqing, Princeton University
9:40am-10:20am, Shen Haipeng, University of North Carolina at Chapel Hill
Break
10:40am-11:20am, Yang Lijian, Soochow University
11:20am-12:00pm, Shen Xiaotong, University of Minnesota
Nov 3:
Morning: Machine learning and Computation
9:00am-9:40am, Liu Chuanhai, Purdue University
9:40am-10:20am, Liu Yufeng, University of North Carolina at Chapel Hill
Break
10:40am-11:20am, Zhang Hao, University of Arizona
11:20am-12:00pm, Zhang Chunming, University of Wisconsin-Madison
Afternoon:
High-dimensional data
2:00pm-2:40pm, Zhu Hongtu, University of North Carolina at Chapel Hill
2:40pm-3:20pm, Feng Yang, Columbia University
Break
3:40pm-4:20pm, TBA
4:20pm-5:00pm, TBA
Nov 4:
Morning: Business statistics and high-dimensional data
9:00am-9:40am, Lin Dennis, The Pennsylvania State University
9:40am-10:20am, Qu Peiyong, University of Illinois at Urbana-Champaign
10:20am-11:00am, Wang Junhui, City University of Hong Kong
Abstracts
Speaker: Jianqing Fan, Princeton University
Title: Multi-task quantile regression under the transnormal model
Abstract: We consider estimating multi-task quantile regression under the
transnormal model, with focus on high-dimensional setting. We derive a
surprisingly simple closed-form solution through rank-based covariance
regularization. In particular, we propose the rank-based $\ell_1$ penalization with
positive definite constraints for estimating sparse covariance matrices, and the
rank-based banded Cholesky decomposition regularization for estimating banded
precision matrices. By taking advantage of the alternating direction method of multipliers,
a nearest correlation matrix projection is introduced that inherits the sampling properties of
the unprojected indefinite matrix. Our work combines strengths of quantile regression
and rank-based covariance regularization to simultaneously deal with nonlinearity,
nonnormality and high dimensionality for high-dimensional regression. Furthermore,
the proposed method strikes a nice balance between robustness and efficiency, and
achieves an ``oracle''-like convergence rate in the high-dimensional setting where the
dimension can grow at a nearly exponential rate in the sample size. The finite-sample
performance of the proposed method is also examined. The superior performance of
our proposed rank-based method is demonstrated in a real application to analyze the
call center arrival data.
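A common building block for the rank-based covariance regularization described above is the Kendall's-tau-plus-sine-transform estimator from the transnormal (Gaussian copula) literature. The sketch below is an illustrative stand-in, not necessarily the authors' exact estimator; the simulated data, sample size, and `rho` are made up for the demo.

```python
import numpy as np
from scipy.stats import kendalltau

# Simulate from a transnormal (Gaussian copula) model: latent normal data
# pushed through a monotone transformation.
rng = np.random.default_rng(1)
n, p, rho = 500, 4, 0.5
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = np.exp(Z)  # monotone transform: ranks are preserved, linearity is not

# Rank-based correlation estimate: Kendall's tau plus the sine transform,
# which is consistent for the latent correlation under the transnormal model.
R_hat = np.eye(p)
for j in range(p):
    for k in range(j + 1, p):
        tau, _ = kendalltau(X[:, j], X[:, k])
        R_hat[j, k] = R_hat[k, j] = np.sin(np.pi * tau / 2)

print(R_hat)
```

The raw `R_hat` can be indefinite in high dimensions, which is where the positive definite constraints and the nearest correlation matrix projection mentioned in the abstract come in.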
Speaker: Yang Feng, Columbia University
Title: Model Selection in High-Dimensional Misspecified Models
Abstract: Model selection is indispensable to high-dimensional sparse modeling in
selecting the best set of covariates among a sequence of candidate models. Most
existing work assumes implicitly that the model is correctly specified or of fixed
dimensions. Yet model misspecification and high dimensionality are common in real
applications. In this paper, we investigate two classical Kullback-Leibler divergence
and Bayesian principles of model selection in the setting of high-dimensional
misspecified models. Asymptotic expansions of these principles reveal that the effect
of model misspecification is crucial and should be taken into account, leading to the
generalized AIC and generalized BIC in high dimensions. With a natural choice of
prior probabilities, we suggest the generalized BIC with prior probability, which
involves a logarithmic factor of the dimensionality in penalizing model
complexity. We further establish the consistency of the covariance contrast matrix
estimator in a general setting. Our results and new method are supported by numerical
studies.
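To see how a logarithmic factor of the dimensionality can enter the penalty, here is a standard Bayesian sketch in the spirit of the extended-BIC literature (the generalized BIC of the abstract additionally corrects for misspecification, which this sketch omits): placing prior probability $P(M) \propto p^{-|M|}$ on a candidate model $M$ using $|M|$ of the $p$ covariates and applying the Laplace approximation to $-2\log P(M \mid \text{data})$ gives

```latex
-2\log L(\widehat{\beta}_M) + |M|\log n + 2|M|\log p,
```

so model complexity is penalized through $\log p$ in addition to the classical $\log n$ term.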
Speaker: Feifang Hu, George Washington University
Title: Personalized Medicine and Big Data: Some Statistical Challenges
Abstract: With today’s modern technology, it becomes easier and easier to collect
(big) data. Personalized medicine is a medical model emphasizing the systematic use
of information about an individual patient to select or optimize that patient's
preventative and therapeutic care. There are three main steps in developing
personalized medicine: (i) identify important biomarkers that could be related to
certain diseases, via bio-informatics, genomics, proteomics, metabolomics, etc.;
(ii) conduct well-designed clinical studies to confirm the significance of biomarkers
for certain diseases and treatments and identify suitable (new) treatments (drugs),
which are then approved by the FDA; (iii) implement the findings in healthcare. In this
presentation, I will focus on some statistical challenges: (1) how to identify important
biomarkers from bio-informatics studies (big data); (2) how to design good
clinical trials with many important covariates; and (3) statistical inference of clinical
studies for personalized medicine.
Speaker: Ping Li, Rutgers University
Title: BigData: Hashing Algorithms for Large-Scale Search and Learning
Abstract: The talk will begin with a fun story about the Cauchy distribution. Consider
two data vectors, u and v, and a vector R of i.i.d. Cauchy variables. Pr{sgn(<u,R>) =
sgn(<v,R>)} is essentially a monotonic function of the chi-square similarity (a
nonlinear kernel) between u and v. This observation leads to useful bigdata (LINEAR)
algorithms for building large-scale statistical models and searching for near neighbors,
in terms of the chi-square similarity (kernel). Chi-square similarity has been known
for its superb performance in data generated from histograms (e.g., computer vision
and NLP).
Modern applications of internet search and machine learning routinely encounter
datasets with (hundreds of) billions of examples in billion or even billion square
dimensions (e.g., documents represented by high-order n-grams). Developing novel
algorithms for efficient search and machine learning has become an active area of
research. Hashing can be very useful in many scenarios, for example,
(1) Some device only has limited computing/storage/power resources;
(2) To achieve higher accuracy, we may want to explicitly consider pairwise
or 3-way interactions (or high-order n-grams) in linear models.
(3) We hope to reduce the complexity of learning models (e.g., deep nets) by hashing
the inputs or hashing the outputs.
(4) Hashing is an effective way of indexing (and space partitioning), which allows
efficient sub-linear time near neighbor search.
(5) Perhaps surprisingly, our newest research can show that, if designed carefully,
hashing (which naturally leads to linear algorithms) can also model the nonlinear
effect (e.g., nonlinear kernels). Examples of such kernels include resemblance,
chi-square, and CoRE kernels.
This talk will cover a variety of hashing algorithms including sign Cauchy projections,
b-bit minwise hashing, one permutation, and densified one permutation hashing, etc.
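The core observation, that the sign-collision probability under Cauchy projections tracks the chi-square similarity, is easy to check numerically. Below is a minimal Monte Carlo sketch (vector sizes, seed, and the number of projections are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-negative data vectors, normalized like histograms.
u = rng.random(100)
v = rng.random(100)
u /= u.sum()
v /= v.sum()

# Chi-square similarity (a nonlinear kernel) between u and v.
chi2_sim = np.sum(2 * u * v / (u + v))

# Monte Carlo estimate of Pr{sgn(<u,R>) = sgn(<v,R>)} for i.i.d. Cauchy R.
K = 20_000  # number of random projections
R = rng.standard_cauchy(size=(K, 100))
collisions = np.mean(np.sign(R @ u) == np.sign(R @ v))

print(chi2_sim, collisions)
```

For similar vectors both quantities are close to 1; replacing v with a dissimilar vector lowers both, illustrating the monotone relationship the hashing algorithms exploit.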
Speaker: Dennis Lin, Department of Statistics,The Pennsylvania State University
Title: Statistics for Big Data
Abstract: After noting the relative absence of statisticians from the community of
practice engaged with big data, we explain what big data is, how it's done, and who's
working with it. Statisticians have much more to contribute to both the intellectual
vitality and the practical utility of big data. At the same time, big data challenges
statisticians to move out of some familiar habits: to engage less structured problems,
to become more comfortable with ambiguity, and to engage computer scientists in a
more fruitful discussion of what the various parties can bring to this new mode of
investigation. In this talk, we propose some potential directions for future research.
Speaker: Huazhen Lin, Southwestern University of Finance and Economics
Title: Semiparametric latent variable transformation models for multiple mixed
outcomes
Abstract: The surge of technological advances that allow multiple outcomes to be
routinely collected has created a high demand for valid statistical methods that can
summarize and study the latent variables underlying them. Mixed outcome data, e.g.
those with continuous and ordinal components, present further statistical challenges.
To address these challenges, we develop a new class of semiparametric latent
variable transformation models to summarize the multiple correlated outcomes of
mixed types in a data-driven way. We propose a series of estimating equation-based
and likelihood-based procedures for estimation and inference. The resulting
estimators are shown to be $n^{1/2}$-consistent (even for the nonparametric link
functions) and asymptotically normal. Simulations suggest robustness as well as
high efficiency, and the proposed approach is applied to assess the effectiveness of
recombinant tissue plasminogen activator on ischemic stroke patients.
Speaker: Chuanhai Liu, Department of Statistics, Purdue University
Title: Fisher, Tukey, and Beyond: Scientific Inference with Big Data
Abstract: As the tool for the science that converts experience, in the form of
observed data, to knowledge about unknown quantities of interest, Statistics will be
fundamental to the ultimate success of “big data science”. In this talk I will focus on (1)
scientific issues in big data analysis from a Fisherian point of view, (2) the Tukey
school on exploratory data analysis toward model building, (3) the importance of
computing and the best part of the development of computational statistics in the last
half century, (4) some of my current big-data and scientific inference research projects
with collaborators, including reasoning with uncertainty, large-scale multinomial
inference and its application in genome-wide association studies, parallel iterative
and simulation methods for statistical analysis of massive data, and statistical theory
and methods for Divide-and-Recombine analysis of large complex data, and (5) future
research topics on scientific inference with big data.
Speaker: Yufeng Liu, University of North Carolina at Chapel Hill
Title: Sparse Regression Incorporating Graphical Structure Among Predictors
Abstract: With the abundance of high dimensional data in various disciplines, sparse
regularized techniques are very popular these days. In this talk, we use the structure
information among predictors to improve sparse regression models. Typically, such
structure information can be modeled by the connectivity of an undirected graph.
Most existing methods use this graph edge-by-edge to encourage the regression
coefficients of corresponding connected predictors to be similar. However, such
methods may require expensive computation when the predictor graph has many
edges. Furthermore, they do not directly utilize the neighborhood information. In this
work, we incorporate the graph information node-by-node instead of edge-by-edge.
Our proposed method is quite general and it includes adaptive Lasso, group Lasso and
ridge regression as special cases. Both theoretical study and numerical study
demonstrate the effectiveness of the proposed method for simultaneous estimation,
prediction and model selection.
This talk is based on joint work with Guan Yu at UNC-Chapel Hill.
Speaker: Annie Qu, University of Illinois at Urbana-Champaign
Title: Weak Signal Identification and Inference in Penalized Model Selection
Abstract: Penalized model selection methods are developed to select variables and
estimate coefficients simultaneously, which is useful in high-dimensional variable
selection. However, identification and inference for weak signals are still quite
challenging and are not well-studied. Existing inference procedures for the penalized
estimators are mainly focused on strong signals. This motivates us to investigate
finite sample behavior for weak signal inference. We propose an identification
procedure for weak signals in finite samples, and provide a transition phase
in-between noise and strong signal strengths. A new two-step inferential method is
introduced to construct better inference for the weak signals being identified. Our
simulation studies show that the proposed method leads to better confidence
coverage for weak signals, compared with those using asymptotic inference,
perturbation and bootstrap resampling approaches. We also illustrate our method for
HIV antiretroviral drug susceptibility data to identify genetic mutations associated
with HIV drug resistance. This is joint work with Peibei Shi.
Speaker: Haipeng Shen, University of North Carolina at Chapel Hill
Title: Big Data Opportunities and Challenges in Business Analytics
Abstract: Big data are becoming increasingly common in our modern digital business
world. Business transaction data are being collected with ever-increasing volume,
dimensionality, and complexity, across multiple channels. I shall use real business
examples from financial service systems, healthcare, and mobile marketing, to discuss
opportunities and challenges offered by big data in business analytics.
Speaker: Xiaotong Shen, University of Minnesota
Title: Ordinal classification with unstructured predictors
Abstract: Unstructured data refers to information that lacks certain structures and
cannot be organized in a predefined fashion, involving words, graphs, objects or
multimedia types of files. In this presentation, I will focus on classification for
unstructured word predictors with ordered class categories, where imprecise
information concerning strengths between predictors is available for predicting the
class labels. However, the imprecise information is expressed in terms of a directed
graph, with each node representing a predictor and each directed edge containing the
pairwise strength between two nodes. One of the targeted applications for
unstructured data arises from sentiment analysis. Large margin ordinal classifiers
will be introduced, which integrate the imprecise predictor relations into linear
relational constraints over classification function coefficients subject to many linear
constraints. This work is joint with J. Wang (UIC), P. Qu (UIUC) and Y. Sun (UMN).
Speaker: Junhui Wang, City University of Hong Kong
Title: Model-free Variable Selection via Learning Gradients
Abstract: In recent years, variable selection has attracted enormous attention from
the statistics community. A wide spectrum of variable selection algorithms have been
proposed based on various model assumptions. In this talk, we will propose a general
model-free variable selection framework. As opposed to existing algorithms, the key
advantage of the proposed framework is that it assumes no distributional model,
admits general predictor effects, allows for efficient computation, and attains desirable
theoretical properties. The proposed framework is formulated in the form of gradient
learning in a reproducing kernel Hilbert space, which enjoys the power of an
extended representer theorem and thus enables efficient learning of sparse gradients.
The proposed framework is implemented via a scalable block coordinate descent
algorithm. The advantage is demonstrated in a variety of simulated experiments as
well as real datasets. If time permits, asymptotic consistency will be discussed.
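The gradient-learning idea can be illustrated without the RKHS machinery: estimate the gradient of the regression function at each sample point, then rank predictors by their average squared partial derivative. The sketch below uses plain local linear regression (an illustrative simplification, not the paper's RKHS formulation; the toy data and the bandwidth are made up for the demo):

```python
import numpy as np

# Toy data: only the first two of ten predictors affect the response.
rng = np.random.default_rng(3)
n, p = 300, 10
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

# Local linear regression at each sample point: the fitted slope vector
# estimates the gradient of the regression function there.
h = 1.0  # kernel bandwidth
grads = np.zeros((n, p))
for i in range(n):
    D = X - X[i]                          # predictors centered at point i
    w = np.exp(-np.sum(D ** 2, axis=1) / (2 * h ** 2))  # Gaussian weights
    Z = np.hstack([np.ones((n, 1)), D])   # intercept + local slopes
    A = Z.T @ (w[:, None] * Z)
    b = Z.T @ (w * y)
    grads[i] = np.linalg.solve(A, b)[1:]  # drop intercept, keep gradient

# Model-free importance score: average squared gradient per predictor.
score = np.mean(grads ** 2, axis=0)
print(np.argsort(score)[::-1])
```

Only the two truly active predictors receive large scores, which is the model-free selection signal; sparsity penalties and block coordinate descent, as in the abstract, make this scale to larger problems.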
Speaker: Yazhen Wang, Department of Statistics, University of Wisconsin-Madison
Title: Statistics in Quantum Paradigm
Abstract: Quantum computation and quantum information are of great current
interest in computer science, mathematics, physical sciences and engineering. They
will likely lead to a new wave of technological innovations in communication,
computation and cryptography. As the theory of quantum physics is fundamentally
stochastic, randomness and uncertainty are deeply rooted in quantum computation,
quantum information and quantum simulation. Thus statistics can play an important
role in quantum computation and quantum simulation, which in turn offer great
potential for revolutionizing statistical modeling and analysis of quantum
computing experimental data.
Speaker: Chunming Zhang, University of Wisconsin-Madison
Title: Single-index modulated multiple testing
Abstract: In the context of large-scale multiple testing, hypotheses are often
accompanied with certain prior information. In this paper, we present a single-index
modulated (SIM) multiple testing procedure, which maintains control of the false
discovery rate (FDR) while incorporating prior information, by assuming the
availability of a bivariate p-value, (p_1,p_2), for each hypothesis, where p_1 is a
preliminary p-value from prior information and p_2 is the primary p-value for the
ultimate analysis. To find the optimal rejection region for the bivariate p-value, we
propose a criterion based on the ratio of probability density functions of (p_1,p_2)
under the true null and non-null. This criterion in the bivariate normal setting further
motivates us to project the bivariate p-value to a single-index, p(θ), for a wide range
of directions θ. The true null distribution of p(θ) is estimated via parametric and
nonparametric approaches, leading to two procedures for estimating and controlling
the FDR. To derive the optimal projection direction θ, we propose a new approach
based on power comparison, which is further shown to be consistent under some mild
conditions. Simulation evaluations indicate that the SIM multiple testing procedure
improves the detection power significantly while controlling the FDR. An analysis of
a genomic dataset will also be presented.
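As a simplified numerical illustration of the single-index idea (a hedged sketch: the z-score projection, the fixed direction `theta`, and the Benjamini-Hochberg step are stand-ins for the paper's data-driven direction choice and null-distribution estimation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
m, m1 = 2000, 100  # total hypotheses, non-nulls

# Bivariate evidence as z-scores: prior (z1) and primary (z2) information.
z1 = rng.standard_normal(m)
z2 = rng.standard_normal(m)
z1[:m1] += 2.0  # non-nulls carry signal in both coordinates
z2[:m1] += 3.0

def bh_reject(p, alpha=0.1):
    """Benjamini-Hochberg step-up procedure; returns the number of rejections."""
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, len(p) + 1) / len(p)
    return int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0

# Project to a single index; under the null the projection is still N(0,1)
# (cos^2 + sin^2 = 1), so p(theta) is a valid p-value.
theta = np.pi / 3
p_theta = norm.sf(np.cos(theta) * z1 + np.sin(theta) * z2)

n_combined = bh_reject(p_theta)
n_primary = bh_reject(norm.sf(z2))  # using the primary p-value alone
print(n_combined, n_primary)
```

Combining both coordinates through the single index rejects more true signals at the same nominal FDR level than using the primary p-value alone, which is the power gain the abstract reports.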
Speaker: Hao Zhang, University of Arizona
Title: Component Selection and Estimation for Functional Additive Models
Abstract: The functional additive model provides a flexible yet simple framework for
regressions involving functional predictors. The utilization of a data-driven basis in an
additive rather than linear structure naturally extends the classical functional linear
model. However, the critical issue of selecting nonlinear additive components has
been less studied. In this work, we propose a new regularization framework for joint
component selection and estimation in the context of the Reproducing Kernel Hilbert
Space. The proposed approach takes advantage of the functional principal
components which greatly facilitates the implementation and the theoretical analysis.
The selection and estimation are achieved by penalized least squares using a penalty
which encourages a sparse structure of the additive components. Theoretical
properties, such as existence and rates of convergence, are investigated. The
empirical performance is demonstrated through simulation studies and a real data
application.
Speaker: Hongtu Zhu, UNC-Chapel Hill Biostatistics and Biomedical Research Imaging Center
Title: A Statistician’s Experience in Completing the Alzheimer’s Disease Big Data DREAM Challenge
Abstract: The goal of the Alzheimer's Disease Big Data DREAM Challenge #1
(AD#1) is to apply an open science approach to rapidly identify accurate predictive
AD biomarkers that can be used by the scientific, industrial and regulatory
communities to improve AD diagnosis and treatment. AD#1 will be the first in a
series of AD Data Challenges to leverage genetics and brain imaging in combination
with cognitive assessments, biomarkers and demographic information from cohorts
ranging from cognitively normal to mildly cognitively impaired to individuals with AD.
We will review a series of statistical learning methods including our new methods for
building prediction models based on ultra-high dimensional imaging and genetic
features. We will discuss the pros and cons of various methods and their performance
in all three challenges in AD DREAM challenge.
Speaker: Ji Zhu, University of Michigan
Title: Link Prediction for Partially Observed Networks
Abstract: Link prediction is one of the fundamental problems in
network analysis. In many applications, notably in genetics, a
partially observed network may not contain any negative examples of absent edges,
which creates a difficulty for many existing supervised learning approaches. We
develop a new method which treats the observed network as a sample of the true
network with different sampling rates for positive and negative examples. We obtain a
relative ranking of potential links by their probabilities, utilizing information on node
covariates as well as on network topology. Empirically, the method performs well
under many settings, including when the observed network is sparse. We apply the
method to a protein-protein interaction network and a school friendship network.
Speaker: Zou Guohua, Academy of Mathematics and Systems Science, Chinese
Academy of Sciences; Capital Normal University
Title: A Study of Sampling Survey Methods and Data Processing for Enterprise
Compensation
Abstract: China currently has no comprehensive statistical reporting on the
compensation (wages) of enterprise employees. To understand the compensation of
all categories of enterprise personnel across regions and industries nationwide,
sampling surveys offer a time- and cost-saving approach. This talk mainly introduces
the sampling survey scheme designed by our research group and the corresponding
data processing methods. To enable comparison with civil servants' wage levels, we
also propose a new measurement index that overcomes the drawbacks of the
Laspeyres index.
For details, please see:
http://ssm.shufe.edu.cn/structure/sy/tzggxx?infid=164111&categoryid=
http://www.scms.fudan.edu.cn/Workshop17/