HAL: latest deposits from SAMM
Thursday 22 September 2016
[hal-01310409] Bayesian Variable Selection for Globally Sparse Probabilistic PCA
Sparse versions of principal component analysis (PCA) have established themselves as simple yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables is difficult since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure called globally sparse probabilistic PCA (GSPPCA) that yields several sparse components with the same sparsity pattern. This allows the practitioner to identify the original variables which are relevant to describe the data. To this end, using Roweis' probabilistic interpretation of PCA and a Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. To avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented. It makes it possible to find a path of models using a variational expectation-maximization algorithm. The exact marginal likelihood is then maximized over this path. This approach is illustrated on real and synthetic data sets. In particular, using unlabeled microarray data, GSPPCA infers much more relevant gene subsets than traditional sparse PCA algorithms.
Tuesday 20 September 2016
[hal-01207009] Weighted interpolation inequalities: a perturbation approach
We study optimal functions in a family of Caffarelli-Kohn-Nirenberg inequalities with a power-law weight, in a regime for which standard symmetrization techniques fail. We establish the existence of optimal functions, study their properties and prove that they are radial when the power in the weight is small enough. Radial symmetry up to translations is true for the limiting case where the weight vanishes, a case which corresponds to a well-known subfamily of Gagliardo-Nirenberg inequalities. Our approach is based on a concentration-compactness analysis and on a perturbation method which uses a spectral gap inequality. As a consequence, we prove that optimal functions are explicit and given by Barenblatt-type profiles in the perturbative regime.
Saturday 17 September 2016
[hal-01367308] A Class of Random Field Memory Models for Mortality Forecasting
This article proposes a parsimonious alternative approach for modeling the stochastic dynamics of mortality rates. Instead of the commonly used factor-based decomposition framework, we consider modeling mortality improvements using a random field specification with a given causal structure. Such a class of models introduces dependencies among adjacent cohorts, aiming to capture, among others, the cohort effects and cross-generation correlations. It also describes the conditional heteroskedasticity of mortality. The proposed model is a generalization of the now widely used AR-ARCH models for random processes. For this class of models, we propose an estimation procedure for the parameters. Formally, we use the quasi-maximum likelihood estimator (QMLE) and show its statistical consistency and the asymptotic normality of the estimated parameters. The framework being general, we investigate and illustrate a simple variant, called the three-level memory model, in order to fully understand and assess the effectiveness of the approach for modeling mortality dynamics.
Tuesday 30 August 2016
[hal-01356993] Discovering Patterns in Time-Varying Graphs: A Triclustering Approach
This paper introduces a novel technique to track structures in time varying graphs. The method uses a maximum a posteriori approach for adjusting a three-dimensional co-clustering of the source vertices, the destination vertices and the time, to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clusters of source vertices, destination vertices and time segments where the edge distributions across clusters of vertices follow the same evolution over the time segments. The main novelty of this approach lies in that the time segments are directly inferred from the evolution of the edge distribution between the vertices, thus not requiring the user to make any a priori quantization. Experiments conducted on artificial data illustrate the good behavior of the technique, and a study of a real-life data set shows the potential of the proposed approach for exploratory data analysis.
Saturday 20 August 2016
[hal-01354235] Modeling the Influence of Local Environmental Factors on Malaria Transmission in Benin and Its Implications for Cohort Study
Malaria remains endemic in tropical areas, especially in Africa. For the evaluation of new tools and to further our understanding of host-parasite interactions, knowing the environmental risk of transmission—even at a very local scale—is essential. The aim of this study was to assess how malaria transmission is influenced and can be predicted by local climatic and environmental factors. As the entomological part of a cohort study of 650 newborn babies in nine villages in the Tori Bossito district of Southern Benin between June 2007 and February 2010, human landing catches were performed to assess the density of malaria vectors and transmission intensity. Climatic factors as well as household characteristics were recorded throughout the study. Statistical correlations between Anopheles density and environmental and climatic factors were tested using a three-level Poisson mixed regression model. The results showed both temporal variations in vector density (related to season and rainfall), and spatial variations at the level of both village and house. These spatial variations could be largely explained by factors associated with the house’s immediate surroundings, namely soil type, vegetation index and the proximity of a watercourse. Based on these results, a predictive regression model was developed using a leave-one-out method, to predict the spatiotemporal variability of malaria transmission in the nine villages. This study points up the importance of local environmental factors in malaria transmission and describes a model to predict the transmission risk of individual children, based on environmental and behavioral characteristics.
Wednesday 10 August 2016
[hal-01352438] Variables selection by the LASSO method. Application to malaria data of Tori-Bossito (Benin)
This work deals with predicting the number of Anopheles using environmental and climate variables. Variable selection is performed by a GLMM (generalized linear mixed model) combined with the Lasso method and simple cross-validation. The selected variables are debiased, while the prediction is generated by a simple GLMM. Finally, the results prove qualitatively better, from both the selection and the prediction points of view, than those obtained by the reference method.
Friday 5 August 2016
Monday 11 July 2016
[hal-01308517] On the Krein-Milman-Ky Fan theorem for convex compact metrizable sets.
The Krein-Milman theorem (1940) states that every convex compact subset of a Hausdorff locally convex topological space is the closed convex hull of its extreme points. In 1963, Ky Fan extended the Krein-Milman theorem to the general framework of $\Phi$-convexity. Under general conditions on the class of functions $\Phi$, the Krein-Milman-Ky Fan theorem then asserts that every compact $\Phi$-convex subset of a Hausdorff space is the $\Phi$-convex hull of its $\Phi$-extremal points. We prove in this paper that, in the metrizable case, the situation is rather better. Indeed, we can replace the set of $\Phi$-extremal points by the smaller subset of $\Phi$-exposed points. We establish, under general conditions on the class of functions $\Phi$, that every $\Phi$-convex compact metrizable subset of a Hausdorff space is the $\Phi$-convex hull of its $\Phi$-exposed points. As a consequence, we obtain that each convex weak compact metrizable (resp. convex weak$^*$ compact metrizable) subset of a Banach space (resp. of a dual Banach space) is the closed convex hull of its exposed points (resp. the weak$^*$ closed convex hull of its weak$^*$ exposed points). This result fails in general for compact $\Phi$-convex subsets that are not metrizable.
Friday 1 July 2016
Monday 27 June 2016
[hal-01337476] Resolvent of non autonomous linear delay functional differential equations.
The aim of this paper is to give a complete proof of the formula for the resolvent of a non-autonomous linear delay functional differential equation given in the book of Hale and Verduyn Lunel, under the sole assumption that the right-hand side is continuous with respect to time, when a solution is understood as a function that is differentiable at each point and satisfies the equation at each point, and when the initial value is a continuous function.
Saturday 25 June 2016
[hal-01336316] Regression Trees and Random forest based feature selection for malaria risk exposure prediction
This paper deals with predicting the number of Anopheles, the main vector of malaria, using environmental and climate variables. Variable selection is based on an automatic machine learning method using regression trees and random forests combined with stratified two-level cross-validation. The minimum threshold of variable importance is assessed using the quadratic distance of variable importance, while the optimal subset of selected variables is used to perform predictions. Finally, the results prove qualitatively better, from the selection, prediction and CPU time points of view, than those obtained by the GLM-Lasso method.
Tuesday 21 June 2016
[hal-01279326] Weighted fast diffusion equations (Part I): Sharp asymptotic rates without symmetry and symmetry breaking in Caffarelli-Kohn-Nirenberg inequalities
In this paper we consider a family of Caffarelli-Kohn-Nirenberg interpolation inequalities (CKN), with two radial power law weights and exponents in a subcritical range. We address the question of symmetry breaking: are the optimal functions radially symmetric, or not? Our intuition comes from a weighted fast diffusion (WFD) flow: if symmetry holds, then an explicit entropy-entropy production inequality which governs the intermediate asymptotics is indeed equivalent to (CKN), and the self-similar profiles are optimal for (CKN). We establish an explicit symmetry breaking condition by proving the linear instability of the radial optimal functions for (CKN). Symmetry breaking in (CKN) also has consequences on entropy-entropy production inequalities and on the intermediate asymptotics for (WFD). Even when no symmetry holds in (CKN), asymptotic rates of convergence of the solutions to (WFD) are determined by a weighted Hardy-Poincaré inequality which is interpreted as a linearized entropy-entropy production inequality. All our results rely on the study of the bottom of the spectrum of the linearized diffusion operator around the self-similar profiles, which is equivalent to the linearization of (CKN) around the radial optimal functions, and on variational methods. Consequences for the (WFD) flow will be studied in Part II of this work.
[hal-01279327] Weighted fast diffusion equations (Part II): Sharp asymptotic rates of convergence in relative error by entropy methods
This paper is the second part of the study. In Part I, self-similar solutions of a weighted fast diffusion equation (WFD) were related to optimal functions in a family of subcritical Caffarelli-Kohn-Nirenberg inequalities (CKN) applied to radially symmetric functions. For these inequalities, the linear instability (symmetry breaking) of the optimal radial solutions relies on the spectral properties of the linearized evolution operator. Symmetry breaking in (CKN) was also related to large-time asymptotics of (WFD), at a formal level. A first purpose of Part II is to give a rigorous justification of this point, that is, to determine the asymptotic rates of convergence of the solutions to (WFD) in the symmetry range of (CKN) as well as in the symmetry breaking range, and even in regimes beyond the supercritical exponent in (CKN). Global rates of convergence with respect to a free energy (or entropy) functional are also investigated, as well as uniform convergence to self-similar solutions in the strong sense of the relative error. Differences with large-time asymptotics of fast diffusion equations without weights will be emphasized.
Saturday 18 June 2016
[hal-01333611] A Banach-Stone type Theorem for invariant metric groups
Given an invariant metric group $(X,d)$, we prove that the set $Lip^1_+(X)$ of all nonnegative and $1$-Lipschitz maps on $(X,d)$, endowed with the inf-convolution structure, is a monoid which completely determines the group completion of $(X,d)$. This gives a Banach-Stone type theorem for the inf-convolution structure in the group framework.
Wednesday 11 May 2016
[hal-01122393] The dynamic random subgraph model for the clustering of evolving networks
In recent years, many clustering methods have been proposed to extract information from networks. The principle is to look for groups of vertices with homogeneous connection profiles. Most of these techniques are suitable for static networks, that is to say, they do not take the temporal dimension into account. This work is motivated by the need to analyze evolving networks where a decomposition of the networks into subgraphs is given. Therefore, in this paper, we consider the random subgraph model (RSM) which was proposed recently to model networks through latent clusters built within known partitions. Using a state space model to characterize the cluster proportions, RSM is then extended in order to deal with dynamic networks. We call the latter the dynamic random subgraph model (dRSM). A variational expectation maximization (VEM) algorithm is proposed to perform inference. We show that the variational approximations lead to an update step which involves a new state space model from which the parameters along with the hidden states can be estimated using the standard Kalman filter and Rauch-Tung-Striebel (RTS) smoother. Simulated data sets are considered to assess the proposed methodology. Finally, dRSM along with the corresponding VEM algorithm are applied to an original maritime network built from printed Lloyd's voyage records.
Tuesday 10 May 2016
[hal-01312596] Exact ICL maximization in a non-stationary temporal extension of the stochastic block model for dynamic networks
The stochastic block model (SBM) is a flexible probabilistic tool that can be used to model interactions between clusters of nodes in a network. However, it does not account for interactions of time-varying intensity between clusters. The extension of the SBM developed in this paper addresses this shortcoming through a temporal partition: assuming interactions between nodes are recorded on fixed-length time intervals, the inference procedure associated with the model we propose makes it possible to cluster the nodes of the network and the time intervals simultaneously. The number of clusters of nodes and of time intervals, as well as the memberships to clusters, are obtained by maximizing an exact integrated complete-data likelihood, relying on a greedy search approach. Experiments on simulated and real data are carried out in order to assess the proposed methodology.
[hal-01312590] Mean Absolute Percentage Error for regression models
We study in this paper the consequences of using the Mean Absolute Percentage Error (MAPE) as a measure of quality for regression models. We prove the existence of an optimal MAPE model and we show the universal consistency of Empirical Risk Minimization based on the MAPE. We also show that finding the best model under the MAPE is equivalent to doing weighted Mean Absolute Error (MAE) regression, and we apply this weighting strategy to kernel regression. The behavior of the MAPE kernel regression is illustrated on simulated data.
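The equivalence between MAPE and weighted MAE mentioned in this abstract can be checked numerically. The sketch below is only an illustration, not the authors' code: the function names, the toy data, and the choice of weights 1/|y_i| with a division by n are assumptions made here, and targets equal to zero are assumed to be absent since the MAPE is undefined for them. It also shows that, for a constant model, the best prediction under this weighting is a weighted median of the targets.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error (as a fraction, not multiplied by 100)."""
    return sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)) / len(y_true)


def weighted_mae(y_true, y_pred, weights):
    """Mean of weighted absolute errors (normalised by n, an assumption of this sketch)."""
    return sum(w * abs(y - p) for y, p, w in zip(y_true, y_pred, weights)) / len(y_true)


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


# Hypothetical toy data: non-zero targets and arbitrary predictions.
y = [1.0, 2.0, 4.0, 10.0]
pred = [1.5, 1.0, 5.0, 8.0]
weights = [1.0 / abs(v) for v in y]

mape_value = mape(y, pred)
wmae_value = weighted_mae(y, pred, weights)

# Best constant predictor under MAPE: a weighted median with weights 1/|y_i|.
best_constant = weighted_median(y, weights)
```

Because the weights grow as the targets shrink, the weighted median is pulled towards small target values, which is one way to see why MAPE-optimal models tend to favour low predictions.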
Wednesday 13 April 2016
[halshs-01301794] Pay policy and remuneration systems in the French civil service since the early 2000s: changes and issues.
The State's pay policy has shifted significantly over the past decade. Parametric adjustments (freezing of the index point, de facto indexation of low salaries to the SMIC) and partial measures (regrading of certain categories) have been adopted, but more structural reforms of the remuneration system, although desired by the State, have not really come to fruition. At the same time, the State's pay policy has become more category-based. Beyond their limited effects on average purchasing power, these changes have had significant consequences for pay hierarchies and careers, and help explain the rise of substantial discontent over pay. All of these developments pose a challenge to trade unions, whose strategies, at various levels (central or local), range from opposition to accommodation.
Friday 8 April 2016
[hal-01299161] The Stochastic Topic Block Model for the Clustering of Networks with Textual Edges
Due to the significant increase in communications between individuals via social media (Facebook, Twitter) or electronic formats (email, web, co-authorship) in the past two decades, network analysis has become an unavoidable discipline. Many random graph models have been proposed to extract information from networks based on person-to-person links only, without taking into account information on the contents. In this paper, we develop the stochastic topic block model (STBM), a probabilistic model for networks with textual edges. We address here the problem of discovering meaningful clusters of vertices that are coherent from both the network interactions and the text contents. A classification variational expectation-maximization (C-VEM) algorithm is proposed to perform inference. Simulated data sets are considered in order to assess the proposed approach and highlight its main features. Finally, we demonstrate the effectiveness of our model on two real-world data sets: a communication network and a co-authorship network.
Saturday 13 February 2016
[hal-01270963] On combining wavelets expansion and sparse linear models for Regression on metabolomic data and biomarker selection
Wavelet thresholding of spectra has to be handled with care when the spectra are the predictors of a regression problem. Indeed, a blind thresholding of the signal followed by a regression method often leads to deteriorated predictions. The scope of this article is to show that sparse regression methods, applied in the wavelet domain, perform an automatic thresholding: the most relevant wavelet coefficients are selected to optimize the prediction of a given target of interest. This approach can be seen as a joint thresholding designed for a predictive purpose. The method is illustrated on a real-world problem where metabolomic data are linked to poison ingestion. This example shows the usefulness of wavelet expansion and the good behavior of sparse and regularized methods. A comparison study is performed between the two-step approach (wavelet thresholding and regression) and the one-step approach (selection of wavelet coefficients with a sparse regression). The comparison includes two types of wavelet bases, various thresholding methods, and various regression methods, and is evaluated by calculating prediction performances. Information about the location of the most important features on the spectra was also obtained and used to identify the most relevant metabolites involved in the mice poisoning.
[hal-01265147] Limited operators and differentiability
We characterize the limited operators by differentiability of convex continuous functions. Given Banach spaces $Y$ and $X$ and a linear continuous operator $T: Y \longrightarrow X$, we prove that $T$ is a limited operator if and only if, for every convex continuous function $f: X \longrightarrow \mathbb{R}$ and every point $y\in Y$, $f\circ T$ is Fréchet differentiable at $y\in Y$ whenever $f$ is Gâteaux differentiable at $T(y)\in X$.
Wednesday 10 February 2016
[hal-01263540] Modelling time evolving interactions in networks through a non stationary extension of stochastic block models
The stochastic block model (SBM) describes interactions between nodes of a network following a probabilistic approach. Nodes belong to hidden clusters and the probabilities of interactions only depend on these clusters. Interactions of time varying intensity are not taken into account. By partitioning the whole time horizon, in which interactions are observed, we develop a non stationary extension of the SBM, allowing us to simultaneously cluster the nodes of a network and the fixed time intervals in which interactions take place. The number of clusters as well as memberships to clusters are finally obtained through the maximization of the complete-data integrated likelihood relying on a greedy search approach. Experiments are carried out in order to assess the proposed methodology.
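To make the generative idea behind the SBM concrete, here is a minimal sketch of the basic static, undirected, binary version only; it does not implement the non-stationary extension or the greedy likelihood maximization described in the abstract. The function name `sample_sbm` and its parameters are hypothetical choices made for this illustration.

```python
import random


def sample_sbm(n_nodes, proportions, connect_prob, seed=0):
    """Draw hidden clusters, then edges whose probability depends only on the clusters.

    proportions  : cluster proportions (used as sampling weights)
    connect_prob : symmetric matrix, connect_prob[k][l] is the probability of an
                   edge between a node of cluster k and a node of cluster l
    """
    rng = random.Random(seed)
    n_clusters = len(proportions)
    # Each node is assigned independently to a hidden cluster.
    clusters = rng.choices(range(n_clusters), weights=proportions, k=n_nodes)
    edges = []
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            # The edge probability is determined by the two cluster labels only.
            if rng.random() < connect_prob[clusters[i]][clusters[j]]:
                edges.append((i, j))
    return clusters, edges


# Hypothetical example: two communities, dense inside and sparse between them.
clusters, edges = sample_sbm(
    n_nodes=30,
    proportions=[0.5, 0.5],
    connect_prob=[[0.8, 0.05], [0.05, 0.8]],
    seed=42,
)
```

Inference goes in the opposite direction: only the edges are observed, and the cluster labels have to be recovered from them.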
Tuesday 9 February 2016
[hal-01270293] Is the corporate elite disintegrating? Interlock boards and the Mizruchi hypothesis
This paper proposes an approach for comparing interlocked board networks over time to test for statistically significant change. In addition to contributing to the conversation about whether the Mizruchi hypothesis (that a disintegration of power is occurring within the corporate elite) holds or not, we propose novel methods to handle a longitudinal investigation of a series of social networks where the nodes undergo a few modifications at each time point. Methodologically, our contribution is twofold: we extend a Bayesian model hitherto applied to compare two time periods to a longer time period, and we define and employ the concept of a hull of a sequence of social networks, which makes it possible to circumvent the problem of changing nodes over time.
Wednesday 3 February 2016
Wednesday 27 January 2016
[hal-01261122] Country-scale Exploratory Analysis of Call Detail Records through the Lens of Data Grid Models
Call Detail Records (CDRs) are data recorded by telecommunications companies, consisting of basic information related to several dimensions of the calls made through the network: the source, destination, date and time of calls. CDR data analysis has received much attention in recent years since it might reveal valuable information about human behavior. It has shown high added value in many application domains, such as community analysis or network planning. In this paper, we suggest a generic methodology based on data grid models for summarizing information contained in CDR data. The method is based on a parameter-free estimation of the joint distribution of the variables that describe the calls. We also suggest several well-founded criteria that allow one to browse the summary at various granularities and to explore the summary by means of insightful visualizations. The method handles network graph data, temporal sequence data as well as user mobility data stemming from original CDR data. We show the relevance of our methodology on real-world CDR data from Ivory Coast for various case studies, such as network planning strategy and yield management pricing strategy.