Big Data Statistics Workshop: Methods, Theory and Applications (2014.11.01-2014.11.04)
Hosts: School of Statistics and Management, Shanghai University of Finance and Economics; Shanghai Center for Mathematical Sciences (Fudan University)
Organizers: School of Statistics and Management, Shanghai University of Finance and Economics; School of Management, Fudan University
Program Committee: Shen Xiaotong, Ying Zhiliang, Zhou Yong, Huang Jian, Zhang Xinsheng, Zhu Ji, Zhu Liping, Feng Xingdong
Nov 1:
Morning:
Statistical computing and biostatistics
9:45am-10:25am, Li Ping, Rutgers University
10:25am-11:05am, Wang Yazhen, University of Wisconsin-Madison
11:05am-11:45am, Lin Huazhen, Southwestern University of Finance and
Economics
Afternoon:
Machine learning and biostatistics
2:00pm-2:40pm, Zhu Ji, University of Michigan
2:40pm-3:20pm, Pan Wei, University of Minnesota
Break
3:40pm-4:20pm, Hu Feifang, George Washington University
4:20pm-5:00pm, Zou Guohua, Chinese Academy of Sciences
Nov 2:
Morning:
Finance and business statistics, and machine learning
9:00am-9:40am, Fan Jianqing, Princeton University
9:40am-10:20am, Shen Haipeng, University of North Carolina at Chapel Hill
Break
10:40am-11:20am, Yang Lijian, Soochow University
11:20am-12:00pm, Shen Xiaotong, University of Minnesota
Nov 3:
Morning: Machine learning and Computation
9:00am-9:40am, Liu Chuanhai, Purdue University
9:40am-10:20am, Liu Yufeng, University of North Carolina at Chapel Hill
Break
10:40am-11:20am, Zhang Hao, University of Arizona
11:20am-12:00pm, Zhang Chunming, University of Wisconsin-Madison
Afternoon:
High-dimensional data
2:00pm-2:40pm, Zhu Hongtu, University of North Carolina at Chapel Hill
2:40pm-3:20pm, Feng Yang, Columbia University
Break
3:40pm-4:20pm, TBA
4:20pm-5:00pm, TBA
Nov 4:
Morning: Business statistics and high-dimensional data
9:00am-9:40am, Lin Dennis, The Pennsylvania State University
9:40am-10:20am, Qu Peiyong, University of Illinois at Urbana-Champaign
10:20am-11:00am, Wang Junhui, City University of Hong Kong
Abstracts
Speaker: Jianqing Fan, Princeton University
Title: Multi-task quantile regression under the transnormal model
Abstract: We consider estimating multi-task quantile regression under the
transnormal model, with focus on high-dimensional setting. We derive a
surprisingly simple closed-form solution through rank-based covariance
regularization. In particular, we propose the rank-based $\ell_1$ penalization with
positive definite constraints for estimating sparse covariance matrices, and the
rank-based banded Cholesky decomposition regularization for estimating banded
precision matrices. By taking advantage of the alternating direction method of multipliers,
a nearest correlation matrix projection is introduced that inherits the sampling properties of
the unprojected indefinite matrix. Our work combines strengths of quantile regression
and rank-based covariance regularization to simultaneously deal with nonlinearity,
nonnormality and high dimensionality for high-dimensional regression. Furthermore,
the proposed method strikes a nice balance between robustness and efficiency, and
achieves an ``oracle''-like convergence rate in the high-dimensional setting where the
dimension can grow at a nearly exponential rate in the sample size. The finite-sample
performance of the proposed method is also examined. The superior performance of
our proposed rank-based method is demonstrated in a real application to analyze the
call center arrival data.
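A common building block for the rank-based covariance regularization described above is the Kendall's-tau-plus-sine-transform estimator from the transnormal (Gaussian copula) literature. The sketch below is an illustrative stand-in, not necessarily the authors' exact estimator; the simulated data, sample size, and `rho` are made up for the demo.

```python
import numpy as np
from scipy.stats import kendalltau

# Simulate from a transnormal (Gaussian copula) model: latent normal data
# pushed through a monotone transformation.
rng = np.random.default_rng(1)
n, p, rho = 500, 4, 0.5
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = np.exp(Z)  # monotone transform: ranks are preserved, linearity is not

# Rank-based correlation estimate: Kendall's tau plus the sine transform,
# which is consistent for the latent correlation under the transnormal model.
R_hat = np.eye(p)
for j in range(p):
    for k in range(j + 1, p):
        tau, _ = kendalltau(X[:, j], X[:, k])
        R_hat[j, k] = R_hat[k, j] = np.sin(np.pi * tau / 2)

print(R_hat)
```

The raw `R_hat` can be indefinite in high dimensions, which is where the positive definite constraints and the nearest correlation matrix projection mentioned in the abstract come in.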
Speaker: Yang Feng, Columbia University
Title: Model Selection in High-Dimensional Misspecified Models
Abstract: Model selection is indispensable to high-dimensional sparse modeling in
selecting the best set of covariates among a sequence of candidate models. Most
existing work assumes implicitly that the model is correctly specified or of fixed
dimensions. Yet model misspecification and high dimensionality are common in real
applications. In this paper, we investigate two classical Kullback-Leibler divergence
and Bayesian principles of model selection in the setting of high-dimensional
misspecified models. Asymptotic expansions of these principles reveal that the effect
of model misspecification is crucial and should be taken into account, leading to the
generalized AIC and generalized BIC in high dimensions. With a natural choice of
prior probabilities, we suggest the generalized BIC with prior probability, which
involves a logarithmic factor of the dimensionality in penalizing model
complexity. We further establish the consistency of the covariance contrast matrix
estimator in a general setting. Our results and new method are supported by numerical
studies.
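To see how a logarithmic factor of the dimensionality can enter the penalty, here is a standard Bayesian sketch in the spirit of the extended-BIC literature (the generalized BIC of the abstract additionally corrects for misspecification, which this sketch omits): placing prior probability $P(M) \propto p^{-|M|}$ on a candidate model $M$ using $|M|$ of the $p$ covariates and applying the Laplace approximation to $-2\log P(M \mid \text{data})$ gives

```latex
-2\log L(\widehat{\beta}_M) + |M|\log n + 2|M|\log p,
```

so model complexity is penalized through $\log p$ in addition to the classical $\log n$ term.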
Speaker: Feifang Hu, George Washington University
Title: Personalized Medicine and Big Data: Some Statistical Challenges
Abstract: With today’s modern technology, it becomes easier and easier to collect
(big) data. Personalized medicine is a medical model emphasizing the systematic use
of information about an individual patient to select or optimize that patient's
preventative and therapeutic care. There are three main steps in developing
personalized medicine: (i) identify important biomarkers that could be related to
certain diseases, via bio-informatics, genomics, proteomics, metabolomics, etc.;
(ii) conduct well-designed clinical studies to confirm the significance of biomarkers
for certain diseases and treatments and identify suitable (new) treatments (drugs),
which are then approved by the FDA; (iii) implement the findings in healthcare. In this
presentation, I will focus on some statistical challenges: (1) how to identify important
biomarkers from bio-informatics studies (big data); (2) how to design good
clinical trials with many important covariates; and (3) statistical inference of clinical
studies for personalized medicine.
Speaker: Ping Li, Rutgers University
Title: BigData: Hashing Algorithms for Large-Scale Search and Learning
Abstract: The talk will begin with a fun story about the Cauchy distribution. Consider
two data vectors, u and v, and a vector R of i.i.d. Cauchy variables. Pr{sgn(<u,R>) =
sgn(<v,R>)} is essentially a monotonic function of the chi-square similarity (a
nonlinear kernel) between u and v. This observation leads to useful bigdata (LINEAR)
algorithms for building large-scale statistical models and searching for near neighbors,
in terms of the chi-square similarity (kernel). Chi-square similarity has been known
for its superb performance in data generated from histograms (e.g., computer vision
and NLP).
Modern applications of internet search and machine learning routinely encounter
datasets with (hundreds of) billions of examples in billion or even billion square
dimensions (e.g., documents represented by high-order n-grams). Developing novel
algorithms for efficient search and machine learning has become an active area of
research. Hashing can be very useful in many scenarios, for example,
(1) Some device only has limited computing/storage/power resources;
(2) To achieve higher accuracy, we may want to explicitly consider pairwise
or 3-way interactions (or high-order n-grams) in linear models.
(3) We hope to reduce the complexity of learning models (e.g., deep nets) by hashing
the inputs or hashing the outputs.
(4) Hashing is an effective way of indexing (and space partitioning), which allows
efficient sub-linear time near neighbor search.
(5) Perhaps surprisingly, our newest research can show that, if designed carefully,
hashing (which naturally leads to linear algorithms) can also model the nonlinear
effect (e.g., nonlinear kernels). Examples of such kernels include resemblance,
chi-square, and CoRE kernels.
This talk will cover a variety of hashing algorithms including sign Cauchy projections,
b-bit minwise hashing, one permutation, and densified one permutation hashing, etc.
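The core observation, that the sign-collision probability under Cauchy projections tracks the chi-square similarity, is easy to check numerically. Below is a minimal Monte Carlo sketch (vector sizes, seed, and the number of projections are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-negative data vectors, normalized like histograms.
u = rng.random(100)
v = rng.random(100)
u /= u.sum()
v /= v.sum()

# Chi-square similarity (a nonlinear kernel) between u and v.
chi2_sim = np.sum(2 * u * v / (u + v))

# Monte Carlo estimate of Pr{sgn(<u,R>) = sgn(<v,R>)} for i.i.d. Cauchy R.
K = 20_000  # number of random projections
R = rng.standard_cauchy(size=(K, 100))
collisions = np.mean(np.sign(R @ u) == np.sign(R @ v))

print(chi2_sim, collisions)
```

For similar vectors both quantities are close to 1; replacing v with a dissimilar vector lowers both, illustrating the monotone relationship the hashing algorithms exploit.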
Speaker: Dennis Lin, Department of Statistics,The Pennsylvania State University
Title: Statistics for Big Data
Abstract: After noting the relative absence of statisticians from the community of
practice engaged with big data, we explain what big data is, how it's done, and who's
working with it. Statisticians have much more to contribute to both the intellectual
vitality and the practical utility of big data. At the same time, big data challenges
statisticians to move out of some familiar habits: to engage less structured problems,
to become more comfortable with ambiguity, and to engage computer scientists in a
more fruitful discussion of what the various parties can bring to this new mode of
investigation. In this talk, we propose some potential directions for future research.
Speaker: Huazhen Lin, Southwestern University of Finance and Economics
Title: Semiparametric latent variable transformation models for multiple mixed
outcomes
Abstract: The surge of technological advances that allow multiple outcomes to be
routinely collected has created a high demand for valid statistical methods that can
summarize and study the latent variables underlying them. Mixed outcome data, e.g.
those with continuous and ordinal components, present further statistical challenges.
To address these challenges, we develop a new class of semiparametric latent
variable transformation models to summarize the multiple correlated outcomes of
mixed types in a data-driven way. We propose a series of estimating equation-based
and likelihood-based procedures for estimation and inference. The resulting
estimators are shown to be $n^{1/2}$-consistent (even for the nonparametric link
functions) and asymptotically normal. Simulations suggest robustness as well as
high efficiency, and the proposed approach is applied to assess the effectiveness of
recombinant tissue plasminogen activator on ischemic stroke patients.
Speaker: Chuanhai Liu, Department of Statistics, Purdue University
Title: Fisher, Tukey, and Beyond: Scientific Inference with Big Data
Abstract: As the tool for the science that converts experience, in the form of
observed data, to knowledge about unknown quantities of interest, Statistics will be
fundamental to the ultimate success of “big data science”. In this talk I will focus on (1)
scientific issues in big data analysis from a Fisherian point of view, (2) the Tukey
school on exploratory data analysis toward model building, (3) the importance of
computing and the best part of the development of computational statistics in the last
half century, (4) some of my current big-data and scientific inference research projects
with collaborators, including reasoning with uncertainty, large-scale multinomial
inference and its application in genome-wide association studies, parallel iterative
and simulation methods for statistical analysis of massive data, and statistical theory
and methods for Divide-and-Recombine analysis of large complex data, and (5) future
research topics on scientific inference with big data.
Speaker: Yufeng Liu, University of North Carolina at Chapel Hill
Title: Sparse Regression Incorporating Graphical Structure Among Predictors
Abstract: With the abundance of high dimensional data in various disciplines, sparse
regularized techniques are very popular these days. In this talk, we use the structure
information among predictors to improve sparse regression models. Typically, such
structure information can be modeled by the connectivity of an undirected graph.
Most existing methods use this graph edge-by-edge to encourage the regression
coefficients of corresponding connected predictors to be similar. However, such
methods may require expensive computation when the predictor graph has many
edges. Furthermore, they do not directly utilize the neighborhood information. In this
work, we incorporate the graph information node-by-node instead of edge-by-edge.
Our proposed method is quite general and it includes adaptive Lasso, group Lasso and
ridge regression as special cases. Both theoretical study and numerical study
demonstrate the effectiveness of the proposed method for simultaneous estimation,
prediction and model selection.
This talk is based on joint work with Guan Yu at UNC-Chapel Hill.
Speaker: Annie Qu, University of Illinois at Urbana-Champaign
Title: Weak Signal Identification and Inference in Penalized Model Selection
Abstract: Penalized model selection methods are developed to select variables and
estimate coefficients simultaneously, which is useful in high-dimensional variable
selection. However, identification and inference for weak signals are still quite
challenging and are not well-studied. Existing inference procedures for the penalized
estimators are mainly focused on strong signals. This motivates us to investigate
finite sample behavior for weak signal inference. We propose an identification
procedure for weak signals in finite samples, and provide a transition phase
in-between noise and strong signal strengths. A new two-step inferential method is
introduced to construct better inference for the weak signals being identified. Our
simulation studies show that the proposed method leads to better confidence
coverage for weak signals, compared with those using asymptotic inference,
perturbation and bootstrap resampling approaches. We also illustrate our method for
HIV antiretroviral drug susceptibility data to identify genetic mutations associated
with HIV drug resistance. This is joint work with Peibei Shi.
Speaker: Haipeng Shen, University of North Carolina at Chapel Hill
Title: Big Data Opportunities and Challenges in Business Analytics
Abstract: Big data are becoming increasingly common in our modern digital business
world. Business transaction data are being collected with ever-increasing volume,
dimensionality, and complexity, across multiple channels. I shall use real business
examples from financial service systems, healthcare, and mobile marketing, to discuss
opportunities and challenges offered by big data in business analytics.
Speaker: Xiaotong Shen, University of Minnesota
Title: Ordinal classification with unstructured predictors
Abstract: Unstructured data refers to information that lacks certain structures and
cannot be organized in a predefined fashion, involving words, graphs, objects or
multimedia types of files. In this presentation, I will focus on classification for
unstructured word predictors with ordered class categories, where imprecise
information concerning strengths between predictors is available for predicting the
class labels. However, the imprecise information is expressed in terms of a directed
graph, with each node representing a predictor and each directed edge containing the
pairwise strength between two nodes. One of the targeted applications for
unstructured data arises from sentiment analysis. Large margin ordinal classifiers
will be introduced, which integrate the imprecise predictor relations into linear
relational constraints over classification function coefficients subject to many linear
constraints. This work is joint with J. Wang (UIC), P. Qu (UIUC) and Y. Sun (UMN).
Speaker: Junhui Wang, City University of Hong Kong
Title: Model-free Variable Selection via Learning Gradients
Abstract: In recent years, variable selection has attracted enormous attention from
the statistics community. A wide spectrum of variable selection algorithms have been
proposed based on various model assumptions. In this talk, we will propose a general
model-free variable selection framework. As opposed to existing algorithms, the key
advantage of the proposed framework is that it assumes no distributional model,
admits general predictor effects, allows for efficient computation, and attains desirable
theoretical properties. The proposed framework is formulated in the form of gradient
learning in a reproducing kernel Hilbert space, which enjoys the power of an
extended representer theorem and thus enables efficient learning of sparse gradients.
The proposed framework is implemented via a scalable block coordinate descent
algorithm. The advantage is demonstrated in a variety of simulated experiments as
well as real datasets. If time permits, asymptotic consistency will be discussed.
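The gradient-learning idea can be illustrated without the RKHS machinery: estimate the gradient of the regression function at each sample point, then rank predictors by their average squared partial derivative. The sketch below uses plain local linear regression (an illustrative simplification, not the paper's RKHS formulation; the toy data and the bandwidth are made up for the demo):

```python
import numpy as np

# Toy data: only the first two of ten predictors affect the response.
rng = np.random.default_rng(3)
n, p = 300, 10
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

# Local linear regression at each sample point: the fitted slope vector
# estimates the gradient of the regression function there.
h = 1.0  # kernel bandwidth
grads = np.zeros((n, p))
for i in range(n):
    D = X - X[i]                          # predictors centered at point i
    w = np.exp(-np.sum(D ** 2, axis=1) / (2 * h ** 2))  # Gaussian weights
    Z = np.hstack([np.ones((n, 1)), D])   # intercept + local slopes
    A = Z.T @ (w[:, None] * Z)
    b = Z.T @ (w * y)
    grads[i] = np.linalg.solve(A, b)[1:]  # drop intercept, keep gradient

# Model-free importance score: average squared gradient per predictor.
score = np.mean(grads ** 2, axis=0)
print(np.argsort(score)[::-1])
```

Only the two truly active predictors receive large scores, which is the model-free selection signal; sparsity penalties and block coordinate descent, as in the abstract, make this scale to larger problems.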
Speaker: Yazhen Wang, Department of Statistics, University of Wisconsin-Madison
Title: Statistics in Quantum Paradigm
Abstract: Quantum computation and quantum information are of great current
interest in computer science, mathematics, physical sciences and engineering. They
will likely lead to a new wave of technological innovations in communication,
computation and cryptography. As the theory of quantum physics is fundamentally
stochastic, randomness and uncertainty are deeply rooted in quantum computation,
quantum information and quantum simulation. Thus statistics can play an important
role in quantum computation and quantum simulation, which in turn offer great
potential for revolutionizing statistical modeling and analysis of quantum
computing experimental data.
Speaker: Chunming Zhang, University of Wisconsin-Madison
Title: Single-index modulated multiple testing
Abstract: In the context of large-scale multiple testing, hypotheses are often
accompanied with certain prior information. In this paper, we present a single-index
modulated (SIM) multiple testing procedure, which maintains control of the false
discovery rate (FDR) while incorporating prior information, by assuming the
availability of a bivariate p-value, (p_1,p_2), for each hypothesis, where p_1 is a
preliminary p-value from prior information and p_2 is the primary p-value for the
ultimate analysis. To find the optimal rejection region for the bivariate p-value, we
propose a criterion based on the ratio of probability density functions of (p_1,p_2)
under the true null and non-null. This criterion in the bivariate normal setting further
motivates us to project the bivariate p-value to a single-index, p(θ), for a wide range
of directions θ. The true null distribution of p(θ) is estimated via parametric and
nonparametric approaches, leading to two procedures for estimating and controlling
the FDR. To derive the optimal projection direction θ, we propose a new approach
based on power comparison, which is further shown to be consistent under some mild
conditions. Simulation evaluations indicate that the SIM multiple testing procedure
improves the detection power significantly while controlling the FDR. An analysis of
a genomic dataset will also be presented.
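As a simplified numerical illustration of the single-index idea (a hedged sketch: the z-score projection, the fixed direction `theta`, and the Benjamini-Hochberg step are stand-ins for the paper's data-driven direction choice and null-distribution estimation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
m, m1 = 2000, 100  # total hypotheses, non-nulls

# Bivariate evidence as z-scores: prior (z1) and primary (z2) information.
z1 = rng.standard_normal(m)
z2 = rng.standard_normal(m)
z1[:m1] += 2.0  # non-nulls carry signal in both coordinates
z2[:m1] += 3.0

def bh_reject(p, alpha=0.1):
    """Benjamini-Hochberg step-up procedure; returns the number of rejections."""
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, len(p) + 1) / len(p)
    return int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0

# Project to a single index; under the null the projection is still N(0,1)
# (cos^2 + sin^2 = 1), so p(theta) is a valid p-value.
theta = np.pi / 3
p_theta = norm.sf(np.cos(theta) * z1 + np.sin(theta) * z2)

n_combined = bh_reject(p_theta)
n_primary = bh_reject(norm.sf(z2))  # using the primary p-value alone
print(n_combined, n_primary)
```

Combining both coordinates through the single index rejects more true signals at the same nominal FDR level than using the primary p-value alone, which is the power gain the abstract reports.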
Speaker: Hao Zhang, University of Arizona
Title: Component Selection and Estimation for Functional Additive Models
Abstract: The functional additive model provides a flexible yet simple framework for
regressions involving functional predictors. The utilization of a data-driven basis in an
additive rather than linear structure naturally extends the classical functional linear
model. However, the critical issue of selecting nonlinear additive components has
been less studied. In this work, we propose a new regularization framework for joint
component selection and estimation in the context of the Reproducing Kernel Hilbert
Space. The proposed approach takes advantage of the functional principal
components which greatly facilitates the implementation and the theoretical analysis.
The selection and estimation are achieved by penalized least squares using a penalty
which encourages a sparse structure of the additive components. Theoretical
properties, such as existence and rates of convergence, are investigated. The
empirical performance is demonstrated through simulation studies and a real data
application.
Speaker: Hongtu Zhu, UNC-Chapel Hill Biostatistics and Biomedical Research Imaging Center
Title: A Statistician’s Experience in Completing the Alzheimer’s Disease Big Data DREAM Challenge
Abstract: The goal of the Alzheimer's Disease Big Data DREAM Challenge #1
(AD#1) is to apply an open science approach to rapidly identify accurate predictive
AD biomarkers that can be used by the scientific, industrial and regulatory
communities to improve AD diagnosis and treatment. AD#1 will be the first in a
series of AD Data Challenges to leverage genetics and brain imaging in combination
with cognitive assessments, biomarkers and demographic information from cohorts
ranging from cognitively normal to mildly cognitively impaired to individuals with AD.
We will review a series of statistical learning methods including our new methods for
building prediction models based on ultra-high dimensional imaging and genetic
features. We will discuss the pros and cons of various methods and their performance
in all three challenges in AD DREAM challenge.
Speaker: Ji Zhu, University of Michigan
Title: Link Prediction for Partially Observed Networks
Abstract: Link prediction is one of the fundamental problems in
network analysis. In many applications, notably in genetics, a
partially observed network may not contain any negative examples of absent edges,
which creates a difficulty for many existing supervised learning approaches. We
develop a new method which treats the observed network as a sample of the true
network with different sampling rates for positive and negative examples. We obtain a
relative ranking of potential links by their probabilities, utilizing information on node
covariates as well as on network topology. Empirically, the method performs well
under many settings, including when the observed network is sparse. We apply the
method to a protein-protein interaction network and a school friendship network.
Speaker: Zou Guohua, Academy of Mathematics and Systems Science, Chinese
Academy of Sciences; Capital Normal University
Title: A Study of Sampling Survey Methods and Data Processing for Enterprise
Compensation
Abstract: China currently has no comprehensive statistical reporting on the
compensation (wages) of enterprise employees. To understand the compensation of
all categories of enterprise personnel across regions and industries nationwide,
sampling surveys offer a time- and cost-saving approach. This talk mainly introduces
the sampling survey scheme designed by our research group and the corresponding
data processing methods. To enable comparison with civil servants' wage levels, we
also propose a new measurement index that overcomes the drawbacks of the
Laspeyres index.
For details, please see:
http://ssm.shufe.edu.cn/structure/sy/tzggxx?infid=164111&categoryid=
http://www.scms.fudan.edu.cn/Workshop17/