HAL: latest deposits from SAMM

Friday 24 February 2017

  • [hal-01469509] Application of co-clustering to the exploratory analysis of a data table
    Co-clustering is an unsupervised analysis technique that extracts the underlying structure between the individuals and the variables of a data table in the form of homogeneous blocks. Since this technique is restricted to variables of the same type, either numerical or categorical, we propose to extend it with a two-step methodology. In the first step, all variables are binarized into a number of parts chosen by the analyst, by equal-frequency discretization in the numerical case or by keeping the most frequent values in the categorical case. The second step applies a co-clustering method to the individuals and the binary variables, yielding groups of individuals on the one hand and groups of variable parts on the other. We apply this methodology to several datasets, comparing the results with those of multiple correspondence analysis (MCA) applied to the same binarized data.
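
    A minimal sketch of the first (binarization) step, assuming a pandas DataFrame and a hypothetical binarize helper; the co-clustering step itself is not reproduced here:

        import pandas as pd

        def binarize(df, n_parts=4):
            """Equal-frequency discretization of numeric columns and
            top-value grouping of categorical columns (illustrative)."""
            out = {}
            for col in df.columns:
                if pd.api.types.is_numeric_dtype(df[col]):
                    # equal-frequency bins; duplicates="drop" guards against ties
                    out[col] = pd.qcut(df[col], q=n_parts, duplicates="drop").astype(str)
                else:
                    # keep the most frequent values, merge the rest
                    top = df[col].value_counts().index[: n_parts - 1]
                    out[col] = df[col].where(df[col].isin(top), other="OTHER")
            # one binary indicator column per (variable, part) pair
            return pd.get_dummies(pd.DataFrame(out))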

  • [hal-01469546] Co-clustering of mixed data based on mixture models
    Co-clustering is an unsupervised technique that extracts the underlying structure between the rows and the columns of a data table in the form of blocks. Several approaches have been studied and have demonstrated their ability to extract this type of structure from continuous, binary or contingency tables. However, little work has addressed the co-clustering of mixed data tables. In this article, we extend latent block model co-clustering to the case of mixed data (continuous and binary variables). We evaluate the effectiveness of this extension on simulated data and discuss its potential limitations.
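
    As a rough illustration of the extension, the sketch below scores fixed row and column partitions of a mixed table with a Gaussian density on continuous blocks and a Bernoulli mass on binary blocks; all names are hypothetical and the actual inference algorithm is not shown:

        import numpy as np
        from scipy.stats import bernoulli, norm

        def mixed_block_loglik(X, z, w, is_binary):
            """Completed log-likelihood of a mixed latent block model for fixed
            row labels z and column labels w; assumes each column cluster is
            type-homogeneous (illustrative sketch)."""
            ll = 0.0
            for k in np.unique(z):
                for l in np.unique(w):
                    block = X[np.ix_(z == k, w == l)].ravel()
                    if is_binary[w == l].all():
                        p = np.clip(block.mean(), 1e-6, 1 - 1e-6)
                        ll += bernoulli.logpmf(block.astype(int), p).sum()
                    else:
                        mu, sd = block.mean(), block.std() + 1e-6
                        ll += norm.logpdf(block, mu, sd).sum()
            return ll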

Friday 17 February 2017

  • [hal-01468548] Block modelling in dynamic networks with non-homogeneous Poisson processes and exact ICL
    We develop a model in which the interactions between the nodes of a dynamic network are counted by non-homogeneous Poisson processes. In a block-modelling perspective, nodes belong to hidden clusters (whose number is unknown) and the intensity functions of the counting processes depend only on the clusters of the nodes. In order to make inference tractable, we move to discrete time by partitioning the time horizon over which interactions are observed into fixed-length sub-intervals. First, we derive an exact integrated classification likelihood criterion and maximize it with a greedy search, which estimates the cluster memberships and the number of clusters simultaneously. Then a maximum-likelihood estimator is developed to estimate the integrated intensities nonparametrically. We discuss the over-fitting problems of the model and propose a regularized version that solves these issues. Experiments on real and simulated data assess the proposed methodology.
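
    The discretization step is straightforward to sketch: given interaction timestamps for each ordered pair of nodes, the horizon [0, T] is split into U fixed-length sub-intervals and interactions are counted per pair and sub-interval. The resulting tensor (function and argument names are hypothetical) would feed the greedy ICL search, which is not reproduced here:

        import numpy as np

        def interaction_counts(events, n_nodes, T, U):
            """events: iterable of (i, j, t) interaction triples on [0, T].
            Returns an (n_nodes, n_nodes, U) count tensor."""
            counts = np.zeros((n_nodes, n_nodes, U), dtype=int)
            edges = np.linspace(0.0, T, U + 1)
            for i, j, t in events:
                u = min(np.searchsorted(edges, t, side="right") - 1, U - 1)
                counts[i, j, u] += 1
            return counts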

  • [hal-01468083] Country-Scale Exploratory Analysis of Call Detail Records Through the Lens of Data Grid Models
    Call Detail Records (CDRs) are data recorded by telecommunications companies, consisting of basic information on several dimensions of the calls made through the network: the source, destination, date and time of calls. CDR data analysis has received much attention in recent years since it can reveal valuable information about human behavior. It has shown high added value in many application domains, such as community analysis or network planning. In this paper, we suggest a generic methodology based on data grid models for summarizing the information contained in CDR data. The method rests on a parameter-free estimation of the joint distribution of the variables that describe the calls. We also suggest several well-founded criteria that allow one to browse the summary at various granularities and to explore it by means of insightful visualizations. The method handles network graph data, temporal sequence data and user mobility data stemming from the original CDR data. We show the relevance of our methodology on real-world CDR data from Ivory Coast for various case studies, such as network planning strategy and yield-management pricing strategy.

Thursday 16 February 2017

  • [hal-01465340] Efficient interpretable variants of online SOM for large dissimilarity data
    Self-organizing maps (SOM) are a useful tool for exploring data. In its original version, the SOM algorithm was designed for numerical vectors. Since then, several extensions have been proposed to handle complex datasets described by (dis)similarities. Most of these extensions represent prototypes by a list of (dis)similarities with the entire dataset and suffer from several drawbacks: their complexity increases from linear to quadratic, their stability is reduced and the interpretability of the prototypes is lost. In the present article, we propose and compare two extensions of the stochastic SOM for (dis)similarity data: the first takes advantage of the online setting to maintain a sparse representation of the prototypes at each step of the algorithm, while the second uses a dimension reduction in a feature space defined by the (dis)similarity. Our contributions to the analysis of (dis)similarity data with topographic maps are thus twofold: first, we present a new version of the SOM algorithm which ensures a sparse representation of the prototypes through online updates; second, this approach is compared on several benchmarks to a standard dimension-reduction technique (K-PCA), itself adapted to large datasets with the Nyström approximation. Results demonstrate that both approaches reduce the dimensionality of the prototypes while providing accurate results in reasonable computational time. Selecting one of these two strategies depends on the dataset size, the need to easily interpret the results and the computational facilities available. The conclusion provides some recommendations to help the user make this choice.
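
    A minimal sketch of the first strategy, assuming a precomputed dissimilarity matrix D and a one-dimensional map; the update and sparsification rules below are a plausible reading of the online sparse variant, not the authors' exact algorithm:

        import numpy as np

        def sparse_relational_som(D, n_proto=10, n_iter=5000, eps=0.1,
                                  sigma=1.0, thresh=1e-3, seed=0):
            """Online relational SOM: each prototype is a convex combination
            of observations, kept sparse by thresholding (illustrative)."""
            rng = np.random.default_rng(seed)
            n = D.shape[0]
            grid = np.arange(n_proto)                        # 1-D map for simplicity
            alpha = rng.dirichlet(np.ones(n), size=n_proto)  # prototype coefficients
            for _ in range(n_iter):
                i = rng.integers(n)
                # relational distance between x_i and every prototype
                dist = D[i] @ alpha.T - 0.5 * np.einsum('kn,nm,km->k', alpha, D, alpha)
                bmu = dist.argmin()
                h = np.exp(-(grid - bmu) ** 2 / (2 * sigma ** 2))  # neighborhood
                e_i = np.zeros(n)
                e_i[i] = 1.0
                alpha += eps * h[:, None] * (e_i - alpha)
                alpha[alpha < thresh] = 0.0                  # sparsify prototypes
                alpha /= alpha.sum(axis=1, keepdims=True)    # keep convex weights
            return alpha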

Saturday 11 February 2017

  • [hal-01464489] Feature Extraction over Multiple Representations for Time Series Classification
    no abstract

  • [hal-01464487] Clustering of temporal event sequences
    We propose a new method for clustering and analyzing temporal sequences, based on three-dimensional grid models. The sequences are partitioned into clusters, the time dimension is discretized into intervals and the event dimension is partitioned into groups. The grid of 3D cells thus forms a piecewise-constant nonparametric estimator of the joint density of the sequences and of the temporal event dimensions. The sequences of a cluster are grouped because they follow a similar distribution of events over time. We also propose a method for exploiting the clustering by simplifying the grid, together with indicators for interpreting the clusters and characterizing the sequences they contain. Experiments on artificial data as well as on real data from DBLP demonstrate the soundness of our approach.

Friday 3 February 2017

  • [hal-01447605] Laguerre estimation under constraint at a single point
    This paper presents a general methodology for nonparametric estimation of a function s related to a nonnegative real random variable X, under a constraint of the type s(0) = c. Three different examples are investigated: the direct observations model (X is observed), the multiplicative noise model (Y = XU is observed, with U following a uniform distribution) and the additive noise model (Y = X + V is observed, where V is a nonnegative nuisance variable with known density). When a projection estimator of the target function is available, we explain how to modify it in order to obtain an estimator which satisfies the constraint. We extend risk bounds from the initial to the new estimator. Moreover, if the previous estimator is adaptive in the sense that a model selection procedure is available to perform the squared-bias/variance trade-off, we propose a new penalty also leading to an oracle-type inequality for the new constrained estimator. The procedure is illustrated on simulated data, for density and survival function estimation.

  • [hal-01420747] Spinoza's Substance, or Nature as a Lattice: a fixed-point story
    We propose a mathematical model, based on two axioms and on set theory, to approach the problem developed by the philosopher B. Spinoza in the Ethics. We then use the Knaster-Tarski theorem to prove the existence and uniqueness of the Substance asserted by Spinoza.
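
    For reference, the fixed-point theorem invoked here can be stated as follows (a standard formulation, not the paper's exact wording):

        \begin{theorem}[Knaster--Tarski]
        Let $(L, \le)$ be a complete lattice and let $f : L \to L$ be
        order-preserving, i.e. $x \le y \implies f(x) \le f(y)$. Then the set
        of fixed points of $f$ is non-empty and is itself a complete lattice;
        in particular, $\bigwedge \{ x \in L : f(x) \le x \}$ is the least
        fixed point of $f$ and $\bigvee \{ x \in L : x \le f(x) \}$ is the
        greatest.
        \end{theorem}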

Friday 27 January 2017

  • [hal-01442314] Markov and the Duchy of Savoy: Segmenting a Century with Regime-Switching Models
    Studying time is at the core of the historian's research. For the statistician, time is usually a supplementary parameter or variable that the models he develops have to take into account. This manuscript is the result of a collaboration between historians and mathematicians, with time as the starting point of the joint work. We are interested in a specific time series reporting the production of law related to the military logistics of the Duchy of Savoy during the 16th and 17th centuries. The expected outcome is a better understanding of the temporality and of the functioning of the state. Two models based on hidden Markov chains, taking into account the specificities of the data, are introduced. They are then estimated on the historical data and provide interesting results, which either confirm existing historical hypotheses on the subject or bring new insights into the studied period.
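
    As a hedged illustration of regime-switching segmentation (a generic hidden Markov model via the hmmlearn package, not the authors' tailored models; the data below are invented):

        import numpy as np
        from hmmlearn import hmm

        # yearly counts of military-logistics acts (invented for illustration)
        rng = np.random.default_rng(0)
        counts = np.concatenate([rng.poisson(3, 40), rng.poisson(12, 30),
                                 rng.poisson(5, 30)]).reshape(-1, 1).astype(float)

        model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                                n_iter=200, random_state=0)
        model.fit(counts)
        regimes = model.predict(counts)   # one hidden regime per year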

Tuesday 17 January 2017

  • [tel-01432630] Outlier detection, modeling and prediction: application to used-vehicle data
    The Autobiz company publishes and distributes information on the automotive sector. This thesis contributes to enriching that information and to a better understanding of the used-car market by building models that predict vehicle prices and the associated time-to-sale. We had at our disposal a real database of listings from diverse sources, containing a considerable number of outliers. The first part of the work is therefore devoted to the construction of outlier detection methods, ranging from simple empirical rules to a statistical test whose asymptotic properties are studied. Starting from a state of the art on used-vehicle price prediction, it appeared that existing studies call for a more rigorous analysis methodology. This methodology was developed with the goal of providing solutions that can be automated and that fit the constraints imposed by the experts. We then assume that the prices of vehicles of the same version depreciate with age and mileage according to a shape specific to that version. The last part of the work is dedicated to the analysis of times-to-sale. We first characterize the variable associated with the time-to-sale. We then model this variable by a regression, at the level of a segment corresponding to the make-model-body-fuel hierarchy, as a function of variables related to mileage, price and age. Finally, we discuss the possibility of modeling the number of vehicles sold in a given period with a negative binomial distribution.
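
    As a sketch of the kind of simple empirical rule mentioned above (an interquartile-range filter per vehicle segment; column names are hypothetical and this is not the thesis' statistical test):

        import pandas as pd

        def iqr_outliers(df, group_col="segment", price_col="price", k=1.5):
            """Flag listings whose price falls outside [Q1 - k*IQR, Q3 + k*IQR]
            within their own segment (illustrative rule)."""
            def flag(g):
                q1, q3 = g[price_col].quantile([0.25, 0.75])
                iqr = q3 - q1
                return (g[price_col] < q1 - k * iqr) | (g[price_col] > q3 + k * iqr)
            return df.groupby(group_col, group_keys=False).apply(flag)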

Saturday 14 January 2017

  • [hal-01430717] Multiple change points detection and clustering in dynamic network
    The increasing amount of data stored in the form of dynamic interactions between actors necessitates methodologies to automatically extract relevant information. The interactions can be represented by dynamic networks, in which most existing methods look for clusters of vertices to summarize the data. In this paper, a new framework is proposed in order to cluster the vertices while detecting change points in the intensities of the interactions. These change points are key to understanding the temporal interactions. The model involves non-homogeneous Poisson point processes with cluster-dependent piecewise-constant intensity functions and common discontinuity points. A variational expectation-maximization algorithm is derived for inference. We show that the pruned exact linear time (PELT) method, originally developed for univariate time series, can be used for the maximization step, which allows the detection of both the number of change points and their location. Experiments on artificial and real datasets are carried out and the proposed approach is compared with related methods. Keywords: dynamic networks, non-homogeneous Poisson point processes, stochastic block model, variational EM, PELT.
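
    The maximization step relies on PELT (pruned exact linear time). As a sketch, the ruptures package applies PELT to a univariate signal; the data and penalty below are invented for illustration:

        import numpy as np
        import ruptures as rpt

        # piecewise-constant count series with two true change points (invented)
        rng = np.random.default_rng(0)
        signal = np.concatenate([rng.poisson(2, 100), rng.poisson(8, 80),
                                 rng.poisson(4, 120)]).astype(float)

        algo = rpt.Pelt(model="l2", min_size=5).fit(signal)
        change_points = algo.predict(pen=10)   # index ending each segment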

Wednesday 14 December 2016

  • [tel-01413985] Statistical analysis of networks and applications in Social Sciences
    Over the last two decades, network structure analysis has experienced rapid growth through its construction and its application in many fields, such as communication networks, financial transaction networks, gene regulatory networks, disease transmission networks and mobile telephone networks. Social networks are now commonly used to represent interactions between groups of people; for instance, ourselves, our professional colleagues, our friends and family are often part of online networks such as Facebook, Twitter and email. In a network, many factors can exert influence or make analyses easier to understand. Among these, we find two important ones: the time factor and the network context. The former involves the evolution of connections between nodes over time. The network context can be characterized by different types of information, such as text messages (emails, tweets, Facebook posts, etc.) exchanged between nodes, categorical information on the nodes (age, gender, hobbies, status, etc.), interaction frequencies (e.g., number of emails sent or comments posted), and so on. Taking these factors into consideration can lead to the capture of increasingly complex and hidden information from the data. The aim of this thesis is to define new models for graphs that take the two factors mentioned above into consideration, in order to develop the analysis of network structure and allow extraction of the hidden information from the data. These models aim at clustering the vertices of a network depending on their connection profiles and network structures, which are either static or dynamically evolving. The starting point of this work is the stochastic block model (SBM), a mixture model for graphs originally developed in social sciences. It assumes that the vertices of a network are spread over different classes, so that the probability of an edge between two vertices depends only on the classes they belong to.
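
    The SBM generative mechanism is easy to sketch: draw a hidden class for each vertex, then draw each edge with a probability that depends only on the two classes (function name and parameter values below are illustrative):

        import numpy as np

        def sample_sbm(n, pi, P, seed=0):
            """Sample an undirected graph from a stochastic block model.
            pi: class proportions (K,); P: (K, K) edge probabilities."""
            rng = np.random.default_rng(seed)
            z = rng.choice(len(pi), size=n, p=pi)    # hidden classes
            probs = P[np.ix_(z, z)]                  # pairwise edge probabilities
            A = (rng.random((n, n)) < probs).astype(int)
            A = np.triu(A, 1)
            return A + A.T, z                        # symmetric, no self-loops

        A, z = sample_sbm(100, pi=[0.5, 0.3, 0.2],
                          P=np.full((3, 3), 0.02) + 0.2 * np.eye(3))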

Friday 9 December 2016

  • [tel-01407529] Optimal control in discrete-time framework and in infinite horizon
    This thesis contains original contributions to optimal control theory in the discrete-time framework and in infinite horizon, following the viewpoint of Pontryagin. The thesis comprises five chapters. In Chapter 1, we recall preliminary results on sequence spaces and on differential calculus in normed linear spaces. In Chapter 2, we study a single-objective optimal control problem in the discrete-time framework and in infinite horizon, with an asymptotic constraint and an autonomous system. We use a functional-analytic approach, translating the problem into an optimization problem in Banach (sequence) spaces. A weak Pontryagin principle is then established for this problem by using a classical multiplier rule in Banach spaces. In Chapter 3, we establish a strong Pontryagin principle for the problems considered in Chapter 2, using a result of Ioffe and Tihomirov. Chapter 4 is devoted to more general optimal control problems, in the discrete-time framework and in infinite horizon, with several different criteria. The method used is the reduction to finite horizon initiated by J. Blot and H. Chebbi in 2000. The problems considered are governed by difference equations or difference inequalities. A new weak Pontryagin principle is established using a recent result of J. Blot on Fritz John multipliers. Chapter 5 deals with multicriteria optimal control problems in the discrete-time framework and infinite horizon. New weak and strong Pontryagin principles are established, again using recent optimization results, under lighter assumptions than existing ones.

Friday 25 November 2016

  • [hal-01399154] Modalities of an educational reform: can digital technology contribute to changes in teaching practices?
    Digital technology is simultaneously changing teachers' working environment, the way they work with one another and institutional expectations, in terms of both technical and pedagogical skills (Baron & Bruillard 2000, Maroy 2006). At the end of 2014, 261 French secondary-school teachers answered an online questionnaire on the impact of ICT on the teaching profession, considering that pedagogy, and thus the teaching profession, is profoundly transformed by this change of context (Feyfant 2009). This communication aims to show that the appropriation of digital tools has been highly discipline-specific and contextual. Thus, despite notable overall pedagogical evolutions, in particular regarding the form of schooling, digital technology risks durably entrenching siloed ways of working, in contradiction with recent calls for interdisciplinarity.

Thursday 17 November 2016

  • [tel-01395290] Offline evaluation of a predictive model: application to recommendation algorithms and to mean relative error minimization
    Offline evaluation estimates the quality of a predictive model from historical data. In practice, this approach assesses a model before it goes into production, without interacting with customers or users. For an offline evaluation to be relevant, the data used must be unbiased, that is, representative of the behaviors that will be observed once the model is in production. In this thesis, we deal with the case where the available data are biased. Based on experiments carried out at Viadeo, we propose a new offline evaluation procedure for a recommendation algorithm; this new approach reduces the influence of the bias on the results of the offline evaluation. We then introduce the Explanatory Shift setting, in which the bias lies in the distribution of the target variable. Experiments on data from the e-commerce site Cdiscount and on the Newsgroup database show that, under certain assumptions, the distribution of the target variable can be inferred so as to correct the non-representativeness of the available training sample. On the more theoretical side, we then study the role of the loss function used to select a model by empirical risk minimization. More precisely, we detail the particular case of mean relative error minimization and introduce the concept of MAPE (Mean Absolute Percentage Error) regression. The work carried out in this setting concerns the consistency of the empirical risk minimization estimator for MAPE regression, and regularized MAPE regression in practice. Experiments on simulated data and on data from the professional social network Viadeo show the advantages of MAPE regression and illustrate theoretical properties of the resulting estimator.
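
    The criterion at stake is standard: MAPE(f) = (1/n) * sum_i |y_i - f(x_i)| / |y_i|, i.e. a mean absolute error weighted by 1/|y_i|. A sketch of a linear MAPE regression by direct minimization (not the thesis' regularized estimator; y must be nonzero):

        import numpy as np
        from scipy.optimize import minimize

        def fit_mape_linear(X, y):
            """Linear regression under the MAPE loss (illustrative sketch)."""
            Xb = np.column_stack([np.ones(len(X)), X])       # add intercept
            def mape(beta):
                return np.mean(np.abs(y - Xb @ beta) / np.abs(y))
            beta0 = np.linalg.lstsq(Xb, y, rcond=None)[0]    # OLS warm start
            return minimize(mape, beta0, method="Nelder-Mead").x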

Thursday 22 September 2016

  • [hal-01310409] Bayesian Variable Selection for Globally Sparse Probabilistic PCA
    Sparse versions of principal component analysis (PCA) have imposed themselves as simple, yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables is difficult, since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure called globally sparse probabilistic PCA (GSPPCA) that obtains several sparse components with the same sparsity pattern, which allows the practitioner to identify the original variables that are relevant to describe the data. To this end, using Roweis' probabilistic interpretation of PCA and a Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. To avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented, which makes it possible to find a path of models using a variational expectation-maximization algorithm. The exact marginal likelihood is then maximized over this path. This approach is illustrated on real and synthetic data sets. In particular, using unlabeled microarray data, GSPPCA infers much more relevant gene subsets than traditional sparse PCA algorithms.
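
    As a naive illustration of what a shared (global) sparsity pattern means, not the GSPPCA procedure itself, variables can be ranked by the row norms of a loading matrix so that all components keep the same selected variables:

        import numpy as np
        from sklearn.decomposition import PCA

        def global_pattern(X, n_components=5, keep=20):
            """Rank variables by loading row norms so that every component
            shares the same support (naive proxy, not GSPPCA)."""
            W = PCA(n_components=n_components).fit(X).components_.T  # (p, d)
            row_norms = np.linalg.norm(W, axis=1)
            return np.argsort(row_norms)[-keep:]    # retained variable indices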

Tuesday 20 September 2016

  • [hal-01207009] Weighted interpolation inequalities: a perturbation approach
    We study optimal functions in a family of Caffarelli-Kohn-Nirenberg inequalities with a power-law weight, in a regime for which standard symmetrization techniques fail. We establish the existence of optimal functions, study their properties and prove that they are radial when the power in the weight is small enough. Radial symmetry up to translations is true for the limiting case where the weight vanishes, a case which corresponds to a well-known subfamily of Gagliardo-Nirenberg inequalities. Our approach is based on a concentration-compactness analysis and on a perturbation method which uses a spectral gap inequality. As a consequence, we prove that optimal functions are explicit and given by Barenblatt-type profiles in the perturbative regime.

Saturday 17 September 2016

  • [hal-01367308] A Class of Random Field Memory Models for Mortality Forecasting
    This article proposes a parsimonious alternative approach for modeling the stochastic dynamics of mortality rates. Instead of the commonly used factor-based decomposition framework, we consider modeling mortality improvements using a random field specification with a given causal structure. This class of models introduces dependencies among adjacent cohorts, aiming to capture, among other things, cohort effects and cross-generation correlations. It also describes the conditional heteroskedasticity of mortality. The proposed model is a generalization of the now widely used AR-ARCH models for random processes. For this class of models, we propose an estimation procedure for the parameters. Formally, we use the quasi-maximum likelihood estimator (QMLE) and show its statistical consistency and the asymptotic normality of the estimated parameters. The framework being general, we investigate and illustrate a simple variant, called the three-level memory model, in order to fully understand and assess the effectiveness of the approach for modeling mortality dynamics.
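
    The building block being generalized is the AR-ARCH process. As a reminder (a plain AR(1)-ARCH(1) simulation, not the paper's random-field specification):

        import numpy as np

        def simulate_ar_arch(n, a=0.6, omega=0.1, alpha=0.3, seed=0):
            """Simulate x_t = a*x_{t-1} + e_t, with e_t = sigma_t * z_t and
            sigma_t^2 = omega + alpha * e_{t-1}^2 (AR(1)-ARCH(1) sketch)."""
            rng = np.random.default_rng(seed)
            x, e = np.zeros(n), np.zeros(n)
            for t in range(1, n):
                sigma2 = omega + alpha * e[t - 1] ** 2
                e[t] = np.sqrt(sigma2) * rng.standard_normal()
                x[t] = a * x[t - 1] + e[t]
            return x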

Tuesday 30 August 2016

  • [hal-01356993] Discovering Patterns in Time-Varying Graphs: A Triclustering Approach
    This paper introduces a novel technique to track structures in time-varying graphs. The method uses a maximum a posteriori approach for adjusting a three-dimensional co-clustering of the source vertices, the destination vertices and the time to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clusters of source vertices, destination vertices and time segments in which the edge distributions across clusters of vertices follow the same evolution over the time segments. The main novelty of this approach is that the time segments are directly inferred from the evolution of the edge distribution between the vertices, so the user is not required to perform any a priori quantization. Experiments conducted on artificial data illustrate the good behavior of the technique, and a study of a real-life dataset shows the potential of the proposed approach for exploratory data analysis.
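
    The input of such a triclustering is simply a list of (source, destination, time) edge events. As a sketch of the pre-processing (the method itself infers the time segmentation; the fixed binning below is only illustrative):

        import numpy as np

        # rows of (source id, destination id, timestamp); invented data
        events = np.array([[0, 1, 0.2], [0, 1, 0.9], [2, 3, 1.5]])
        tensor, _ = np.histogramdd(events, bins=(4, 4, 8),
                                   range=((0, 4), (0, 4), (0.0, 2.0)))
        # tensor[s, d, b] counts edges from s to d in time bin b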

Saturday 20 August 2016

  • [hal-01354235] Modeling the Influence of Local Environmental Factors on Malaria Transmission in Benin and Its Implications for Cohort Study
    Malaria remains endemic in tropical areas, especially in Africa. For the evaluation of new tools and to further our understanding of host-parasite interactions, knowing the environmental risk of transmission, even at a very local scale, is essential. The aim of this study was to assess how malaria transmission is influenced by, and can be predicted from, local climatic and environmental factors. As the entomological part of a cohort study of 650 newborn babies in nine villages of the Tori Bossito district in Southern Benin between June 2007 and February 2010, human landing catches were performed to assess the density of malaria vectors and transmission intensity. Climatic factors as well as household characteristics were recorded throughout the study. Statistical correlations between Anopheles density and environmental and climatic factors were tested using a three-level Poisson mixed regression model. The results showed both temporal variations in vector density (related to season and rainfall) and spatial variations at the level of both village and house. These spatial variations could be largely explained by factors associated with the house's immediate surroundings, namely soil type, vegetation index and the proximity of a watercourse. Based on these results, a predictive regression model was developed using a leave-one-out method to predict the spatiotemporal variability of malaria transmission in the nine villages. This study highlights the importance of local environmental factors in malaria transmission and describes a model to predict the transmission risk of individual children, based on environmental and behavioral characteristics.
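
    As a sketch of the kind of count regression used here (a plain Poisson GLM via statsmodels, not the authors' three-level mixed model; the covariates are invented):

        import numpy as np
        import statsmodels.api as sm

        # invented design: rainfall, vegetation index, watercourse proximity
        rng = np.random.default_rng(0)
        X = rng.random((200, 3))
        y = rng.poisson(np.exp(0.5 + X @ np.array([1.0, 0.8, -0.5])))

        model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
        print(model.summary())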

Wednesday 10 August 2016

  • [hal-01352438] Variables selection by the LASSO method. Application to malaria data of Tori-Bossito (Benin)
    This work deals with the prediction of the number of Anopheles mosquitoes from environmental and climate variables. Variable selection is performed by a GLMM (generalized linear mixed model) combined with the Lasso method and simple cross-validation. The selected variables are debiased, and the prediction is generated by a simple GLMM. The results prove to be qualitatively better than those obtained by the reference method, from both the selection and the prediction points of view.
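
    As a sketch of the select-then-debias pattern (a plain Lasso with cross-validated penalty followed by an unpenalized refit, standing in for the paper's GLMM machinery):

        import numpy as np
        from sklearn.linear_model import LassoCV, LinearRegression

        def select_and_debias(X, y, cv=5):
            """Lasso selection with cross-validated penalty, then an
            unpenalized refit on the selected variables (debiasing step)."""
            lasso = LassoCV(cv=cv).fit(X, y)
            selected = np.flatnonzero(lasso.coef_)    # indices kept by the Lasso
            refit = LinearRegression().fit(X[:, selected], y)
            return selected, refit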
