{"Title":"Spatial Durbin Model for Poverty Mapping and Analysis","Author":"Thomas Achia and Atinuke Adebanji and John Owino and Anne Wangombe","Session":"kal-ts-1-4","Keywords":"spatial, econometrics","Abstract":"The use of spatial regression models for describing and explaining\nspatial data variation in poverty mapping has become an increasingly\nimportant tool. This study considered the spatial Durbin model (SDM) in\nidentifying possible causes of poverty in Bari region of Somalia using\nSomalia settlement census data. Data properties were identified using\nexploratory spatial data analysis (ESDA) and the output ESDA provided\ninput into the spatial Durbin model. Parameter estimation and hypotheses\ntesting and assessment of goodness of fit were carried out for the\nspecified model. Dissimilarity of neighbouring settlements in North West\nSomalia and similarity of neighbouring settlements in North East and\nSouth Central Somalia with respect to the variables of interest were\nobserved using the Global and Local Moran's I test statistic. The\nproportion of families who cannot afford two meals per day was taken as\na proxy indicator for poverty level and the implication of the findings\non policy decision making for development planning are discussed."} {"Title":"Large atomic data in R: package 'ff'","Author":"Daniel Adler and Jens Oehlschlägel and Oleg Nenadic and Walter Zucchini","Session":"kal-highperf_con-1-1","Keywords":"high performance computing-large memory","Abstract":"A proof of concept for the ff package has won the large data competition\nat useR!2007 with its C++ core implementing fast memory mapped access to\nflat files. In the meantime we have complemented memory mapping with other\ntechniques that allow fast and convenient access to large atomic data\nresiding on disk. ff stores index information efficiently in a packed\nformat, but only if packing saves RAM. HIP (hybrid index preprocessing)\ntransparently converts random access into sorted access thereby avoiding\nunnecessary page swapping and HD head movements. The subscript C-code\ndirectly works on the hybrid index and takes care of mixed\npacked/unpacked/negative indices in ff objects; ff also supports\ncharacter and logical indices. Several techniques allow performance\nimprovements in special situations. ff arrays support optimized physical\nlayout for quicker access along desired dimensions: while matrices in\nthe R standard have faster access to columns than to rows, ff can create\nmatrices with a row-wise layout and arbitrary ‘dimorder’ in the general\narray case. Thus one can for example quickly extract bootstrap samples\nof matrix rows. In addition to the usual [ subscript and assignment [<-\noperators, ff supports a swap method that assigns new values and returns\nthe corresponding old values in one access operation - saving a separate\nsecond one. Beyond assignment of values, the [<- and swap methods allow\nadding values (instead of replacing them). This again saves a second\naccess in applications like bagging which need to accumulate votes. ff\nobjects can be created, stored, used and removed, almost like standard R\nram objects, but with hybrid copying semantics, which allows virtual\nviews on a single ff object. This can be exploitet for dramatic\nperformance improvements, for example when a matrix multiplication\ninvolves a matrix and it’s (virtual) transpose. The exact behavior of ff\ncan be customized through global and local options, finalizers and\nmore. 
The supported range of storage types has been extended since the first\nrelease of ff, now including support for the atomic types raw, logical,\ninteger and double, and the ff data structures vector and array. A C++\ntemplate framework has been developed to map a broader range of signed\nand unsigned types to R storage types and provide handling of\noverflow-checked operations and NAs. Using this we will support the packed types\n‘boolean’ (1 bit), ‘quad’ (2 bit), ‘nibble’ (4 bit), ‘byte’ and\n‘unsigned byte’ (8 bit), ‘short’, ‘unsigned short’ (16 bit) and ‘single’\n(32-bit float) as well as support for (dense) symmetric matrices with free\nand fixed diagonals. These extensions should be of some practical use,\ne.g. for efficient storage of genomic data (AGCT as.quad) or for working\nwith large distance matrices (i.e. symmetric matrices with the diagonal fixed\nat zero)."} {"Title":"Robust Inference in Generalized Linear Models","Author":"Claudio Agostinelli","Session":"foc-rob-1-1","Keywords":"robust","Abstract":"The weighted likelihood approach is used to perform robust inference on\nthe parameters in generalized linear models. We distinguish the case\nof replicated observations of the dependent variable for each\ncombination of the explanatory variables, common in the design of\nexperiments framework, and the case of one observation for each\ncombination of the explanatory variables, very common in observational\nstudies. We provide some theoretical results on the behavior of the\nintroduced estimators and we evaluate their performance by Monte Carlo\nexperiments. A non-exhaustive comparison with the methods already\npresented in the literature is presented. Illustration of the proposed\nmethods in R is provided by examples on real datasets."} {"Title":"mvna, an R package for the Multivariate Nelson-Aalen Estimator in Multistate Models","Author":"Arthur Allignol and Jan Beyersmann and Martin Schumacher","Session":"foc-biostat_surv-1-3","Keywords":"biostatistics-survival","Abstract":"The multivariate Nelson–Aalen estimator of cumulative transition hazards\nis the fundamental nonparametric estimator in event history analysis\n(Andersen et al., 1993, chap. IV). However, and to the best of our\nknowledge, there is not yet a multivariate Nelson–Aalen R package (R\nDevelopment Core Team, 2007) available, and the same appears to be true\nfor SAS and Stata. Therefore, we have developed the mvna package\n(Allignol et al.) with convenient functions for estimating and plotting\nthe Nelson–Aalen estimates in any multistate model, possibly subject to\nindependent right-censoring and left-truncation. The usefulness of this\npackage is illustrated with two important data examples from event\nhistory analysis: competing risks and time-dependent covariates, in\nwhich displaying estimates of the cumulative transition hazards provides\nuseful insights and straightforwardly illustrates results from standard\nCox analyses."} {"Title":"BARD: Better Automated Redistricting","Author":"Micah Altman and Michael McDonald","Session":"kal-app-1-2","Keywords":"political science","Abstract":"BARD is the first (and currently only) open source software package for\nredistricting and redistricting analysis. BARD is a program that makes political\nredistricting more accessible and understandable by providing methods to create,\ndisplay, compare, edit, automatically refine, evaluate, and profile political\ndistricting plans. 
BARD supports both scientific analysis of existing\nredistricting plans and citizen participation in creating new plans. BARD\nfacilitates map creation and refinement through command-line, GUI, and automatic\nmethods. Since redistricting is a computationally complex partitioning problem\nnot amenable to an exact optimization solution, BARD makes use of a variety of\nselectable metaheuristics, including genetic algorithms, GRASP, and simulated\nannealing, that can be used to refine existing or randomly generated\nredistricting plans based on user-determined criteria.\n\nFurthermore, BARD supports the ability to randomly generate redistricting plans,\nand to generate profiles of plans for different scoring weights. This\nfunctionality can be used both to explore trade-offs among criteria and to make\ninferences regarding the intent behind existing redistricting plans. \n\nBecause of the computational intensity of these methods, performance is an\nimportant criterion in the design and implementation of BARD. The program\nimplements performance enhancements such as evaluation caching, explicit memory\nmanagement, and distributed computing across snow clusters."} {"Title":"The 'deltaR' package: a flexible way to compare regression models on independent samples using a bootstrap approach","Author":"Gianmarco Altoè","Session":"foc-mod_ext-1-3","Keywords":"psychometrics","Abstract":"A frequently asked question in the social and behavioral sciences\nconcerns the statistical comparison of different regression models\nperformed on independent samples. This comparison may be useful to: (1)\ndirectly compare the goodness of fit of one or more models in\nindependent samples; (2) explore the behavior of different models in\ndifferent groups in order to conduct more complex analyses (e.g.,\nmultigroup analyses). The aim of this paper is to present a flexible\nmethod to test the difference between the explained variance (∆ R-squares)\nof two multiple linear regression models in two independent samples. The\nmethod is based upon a stratified, non-parametric bootstrap\napproach. The consistency and efficiency of “deltaR” are illustrated via\nMonte Carlo simulations, and a case study based on real data will be\npresented. The discussion will focus on the usefulness of this method,\nwith a special emphasis on its applications in the social and behavioral\nsciences."} {"Title":"Automatic construction of graphical outputs of common multivariate analyses with a special reference to predictive biplots","Author":"M. Rui Alves and M. Beatriz Oliveira","Session":"foc-multi-1-2","Keywords":"multivariate, chemometrics","Abstract":"Predictive biplots [1] have several aspects which are very important in\nmultivariate analyses, mainly because they enable an easier\ninterpretation of multivariate analysis outputs by relating sample\nconfigurations to the initial, declared variables, without losing the\nmodulation aspects so characteristic of these statistical\nmethodologies. The disadvantages of biplots are mostly related to\nsoftware limitations causing difficulties in obtaining the final\ngraphical solutions, and to difficulties in deciding how relevant a given\nbiplot axis is and how many plots are necessary to accurately describe\nthe available data. The latter difficulties are also experienced by users of\nnormal statistical methodologies when it comes to interpreting multivariate\noutputs, i.e., how important the initial variables are to explain latent\nvariables and how many dimensions are necessary to describe the data. 
One\nproblem is that there is a huge degree of freedom in the way this can be\ndone, and easy methods to help take these decisions are very\nimportant, mainly for inexperienced statisticians. This presentation\nfollows our previous work on the subject [e.g., 2,3] and demonstrates\nhow the decision on whether or not a biplot axis is drawn in two-way plots\ncan be made automatically [4]. The method, based on the\npredictive power of variables and on a specially defined tolerance\nvalue, enables R to evaluate the interest of each of the initial\nvariables and draw the predictive biplots automatically. The final\nnumber of plots is also automatically decided by R. Since the method may\nbe made fully automatic, inexperienced users can take advantage of all the\nR facilities, carrying out multivariate analyses and final\ninterpretations, and at the same time being protected from common\nover-fitting problems, difficulties in the interpretation of multivariate\noutputs, etc. The methods can be applied to several statistical\nmethodologies, and examples are provided for principal components\nanalysis and canonical variate analysis (in chemistry) and three-way\nTucker-1 common loadings analysis (in the field of sensory analysis). It\nis also shown that the method devised to produce biplots in an automatic\nway can be adapted to common outputs, enabling R to provide users of\nmultivariate analyses with automatic interpretations of results,\nincluding the decision on the number of dimensions to retain and their\nrespective interpretations."} {"Title":"R in Automation: Accessing Real-time-data","Author":"Thomas Baier","Session":"foc-highperf-3-3","Keywords":"high performance computing","Abstract":"Industrial Automation in general, and PLCs (Programmable\nLogic Controllers) and embedded devices in particular, are a rapidly growing\nmarket. Embedded devices are found in small devices such as watches\nor mobile phones, and are used in everyday life, for example in ABS systems\nor engine-monitoring systems in cars. In larger applications these\nsystems are typically called PLCs and are used to control assembly lines,\nrolling mills or power plants. Depending on the requirements on the\navailability of automation systems on the one hand, or safety\nconsiderations on the other hand, more and more effort is put into\nmonitoring the system during its whole life time. Typical approaches for\nmonitoring are either rule-based systems or open-loop control\nscenarios. In rule-based systems data is collected and processed\naccording to statically defined rules (e.g., issuing an emergency\nshutdown if a safety-related device fails). Open-loop systems are designed\nto collect data and present the results to an operator. The operator\nthen has to decide on further actions (or if the operator fails to\nacknowledge an alarm message, an automatic procedure brings the whole\nautomation system into a fail-stop or fail-safe operation mode). In\naddition to these methods, we suggest an alternative method to\ncapture the “big picture” of the automation device and allow\nstatistical methods to be applied to the process data. These analyses will be the input\nfor further optimization of the operation of the automation system and for better\nplanning of device/system maintenance (so-called “predictive\nmaintenance”). Fortunately, the automation industry has decided on\nstandard means for data acquisition, which are typically used by\nvisualization and data collection software. 
OPC (formerly known as OLE\nfor Process Control; nowadays OPC is marketed as Openness, Productivity\nand Collaboration) provides standardized mechanisms for accessing\nreal-time data. This data can either be a PLC’s or embedded device’s\ninternal state or “real” process data from the sensor/actor level (e.g.,\nthe state of switches, valves or drives). In our short presentation we will\nshow how R can access this data using OPC DA (OPC Data Access),\nwhich allows connecting to nearly every PLC or embedded device used in the\nautomation industry. OPC DA is based on Microsoft’s COM technology,\nwhich currently is easiest to use with R for Windows. In addition to OPC\nDA we will briefly discuss current developments in the field of OPC which\nwill also enable access to OPC data from non-Windows systems."} {"Title":"RSTAR: A Package for Smooth Transition Autoregressive Modeling Using R","Author":"Mehmet Balcilar","Session":"foc-econom-2-2","Keywords":"econometrics","Abstract":"In the last few years, numerous improvements have been made for\nstatistical inference in threshold autoregressive models. In particular,\nnew tests have been developed and methods proposed for diagnostic control,\nforecasting, and impulse response analysis. These developments are\nexamined in Granger and Terasvirta (1993), Terasvirta (1998), Potter\n(1999), and van Dijk, Terasvirta and Franses (2002). This study develops\na comprehensive R package for testing, estimating, diagnostic checking,\nforecasting, and further analysis of smooth transition autoregressive\nmodels (STAR). The package is designed around the empirical modeling\ncycle for STAR models devised by Terasvirta (1994) and van Dijk et\nal. (2002). This modeling approach consists of specification, estimation\nand evaluation stages and, thus, is similar to the modeling cycle for\nlinear models of Box and Jenkins (1976). In the testing stage, the\npackage emphasizes LM-type tests and implements all tests proposed in the\nliterature (see Luukkonen, Saikkonen and Terasvirta (1988), Granger and\nTerasvirta (1993), van Dijk et al. (2002)). The package allows\nestimation of logistic and exponential STAR models using analytical\ngradients. Very extensive diagnostic control techniques are implemented\nin the package. All aspects of the diagnostic tests discussed\nin Eitrheim and Terasvirta (1996), van Dijk and Franses (1999) and\nLundbergh, Terasvirta and van Dijk (2000) are fully implemented. The\nRSTAR package allows robust estimation methods for all tests in order\nto guard against the influence of possible outliers. RSTAR emphasizes aspects\nsuch as model evaluation by means of out-of-sample forecasting and\nimpulse response analysis, and the influence of possible outliers on the\nanalysis of smooth transition type nonlinearity. Forecasts and impulse\nresponses are calculated using Monte Carlo or bootstrap methods with\ncode highly optimized for speed. We also incorporate recently introduced\nextensions of the basic smooth transition model. On the programming\nside, RSTAR needs no programming experience. Although all commands have\ncontrol over all aspects, default values are provided and only one or\ntwo options need to be passed, if needed at all. 
We take advantage of\nobject-oriented programming, S4 methods, and the vectorization provided by\nthe R environment."} {"Title":"Tree-based and GA tools for optimal sampling design","Author":"Marco Ballin and Giulio Barcaroli","Session":"foc-social-1-1","Keywords":"social sciences","Abstract":"The optimality of a sample design can be defined in terms of costs\n(associated with fieldwork: the number of units to be interviewed) and\naccuracy (sampling variance related to target estimates). Bethel\nproposed an algorithm (Bethel, 1985) able to determine the total sample size\nand the allocation of units in strata, so as to minimise costs under the\nconstraints of defined precision levels of estimates, in the\nmultivariate case (more than one estimate). Input to this algorithm is\ngiven by the information on distributional characteristics (total and\nvariance) of target variables in the population strata. Under this\napproach, population stratification, i.e. the partition of the sampling\nframe obtained by cross-classifying units by means of potential\nstratification variables, is given. But stratification has a great\nimpact on the optimal solution determined by the Bethel algorithm and, in\ngeneral, it must be defined in the first steps of survey planning. If\na frame with a set of potential variables for stratification is\navailable, the survey planner has to choose the “best” auxiliary\nvariable cross product (partition of the frame). Among the possible\npartitions, the one with the maximum number of strata, given by the\nCartesian product of all auxiliary variables, does not always yield the\noptimal sample size. In fact, organisational considerations, and the\nnecessity to define a minimum number of units per stratum, oblige the planner not to\nincrease the number of strata beyond a certain limit. In that case, how\nto determine the best partition among all partitions obtainable by\ncombining the auxiliary variables (which auxiliary variables? which values\nfor each of them to take into consideration?) has to be considered as a\npart of the whole problem. Until recently, on the contrary, the problem\nof determining the optimal size and allocation of units in strata has\nbeen solved considering the stratification of the population as given; and,\nconversely, the definition of an optimal stratification has been\ninvestigated independently of the optimisation problem of sampling size\nand allocation. An interesting proposal has been advanced in the recent\npast (Benedetti et al., 2005), offering a joint solution to both\nproblems: it is based on a tree search in the space of possible strata\nconfigurations, solving for each visited node the corresponding\nmultivariate allocation problem according to the Bethel algorithm. At each\nlevel, the node that is best in terms of sample size reduction is\nchosen as the branching node. This tree-based approach is deterministic\nand very fast, but it may suffer heavily from the presence of local\nminima and, consequently, solutions can be far from optimality. Together\nwith this tree-based approach, we propose a non-deterministic\nevolutionary approach, based on the genetic algorithm (GA)\nparadigm. Under the GA approach, each solution (i.e. a particular\npartition in strata of the sampling frame) is an individual in a\npopulation, whose fitness is evaluated by calculating the sampling size\nsatisfying the accuracy constraints on the target estimates; crossover and\nmutation carried out at each iteration ensure an increase in average\nfitness. 
In general, the characteristics of the GA are such that the risk of\nlocal minima is lower than in the tree search, though processing time is\nnoticeably higher. Our proposal is the following: in complex situations\n(characterised by a high number of alternative stratification\nconfigurations and/or a high number of target variables and domains),\nthe tree-based algorithm is applied first, in order to identify a\nsolution. This solution is then introduced into the GA initial population,\nin order to speed up its convergence to a better solution. Our experiments\nshow an improvement over the tree-based solution, and encourage the\nadoption of this procedure. The whole system can be thought of as a\n“toolkit”, composed of a series of instruments, all implemented and\noperating in the R environment. The main scripts are: 1. strataTree.R,\nimplementing the tree-based algorithm; 2. strataGenalg.R, which implements\nthe GA approach, making use of the “genalg” package (Willighagen, 2002) in a\nslightly modified version; 3. Bethel.R, implementing the Bethel\nalgorithm. These programs can be run directly in the R environment but,\nas an additional facility, a simple web interface has been developed\nusing Rwui that enables the user to carry out the processing without\nbeing acquainted with the R language or even the R environment."} {"Title":"ArDec: Autoregressive-based time series decomposition in R","Author":"Susana Barbosa","Session":"foc-ts-1-2","Keywords":"time series","Abstract":"The extraction of trend and periodic components from an observed time\nseries is a topic of considerable practical importance. Most time series\nmethods require the assumption of stationarity to be met, and therefore\nthe removal of any trend-like or seasonal signals from the\ndata. Furthermore, in many applications such signals are often of\ninterest in themselves. Flexible methods are therefore required for the\ndecomposition of a time series into physically relevant components. The\nR package ArDec implements the autoregressive-based time series\ndecomposition of West (1997). The method is based on the dynamic linear\nrepresentation of an autoregressive process, from which results a\nconstructive approach for the decomposition of an observed time series\ninto latent constituent sub-series. The approach and the usage of the\npackage ArDec are illustrated through an example of decomposition of a\nsea-level time series."} {"Title":"Visualizing multivariate categorical and continuous data from epidemiologic studies: An expanded scatter plot matrix","Author":"Benjamin Barnes and Karen Steindorf","Session":"kal-visual-1-4","Keywords":"visualization","Abstract":"Epidemiologic datasets often contain a mix of categorical and continuous\nvariables. Understanding the interrelationships among these variables\nis vital for subsequent analysis and can be aided by graphical\npresentation. Scatter plot matrices, produced in R using functions such\nas pairs() and splom(), are useful for graphically displaying\nmultivariate continuous data. For displaying multivariate categorical\ndata, the Visualizing Categorical Data (vcd) package offers many\nflexible options, including the pairs.table() function. However, these\nfunctions are not readily compatible with one another, making visual\npresentation of mixed epidemiologic data difficult. With this in mind,\nthe scope of the splom() function was expanded to include visualization\nof categorical data. 
Furthermore, a novel panel function compatible\nwith splom() was created to visualize categorical-categorical data using\na mosaic plot. Continuous-continuous data was plotted using existing\nscatter plot and level plot panel functions. Existing panel functions\nwere also used to produce box-and-whisker plots for\ncategorical-continuous data as well as stacked bar charts for\ncategorical-categorical data. With these modifications and the new\nmosaic function, categorical and continuous data can be viewed in a\nunified plot matrix. An example of such a plot matrix was created using\nsimulated data inspired by a study investigating the effects of\nlifestyle and anthropometric factors on insulin-like growth factor\n(IGF)-I and IGF binding protein (IGFBP)-3. These two proteins are\nsuspected of playing a role in breast cancer development, and current\nresearch focuses on identifying modifiable lifestyle factors that\ninfluence their concentrations in blood. The expanded scatter plot\nmatrix described here improves visualization of mixed datasets and can\nbe further enhanced to visualize bivariate linear models, chi-squared\ntests, and other bivariate statistical test results."} {"Title":"Understanding product integration","Author":"Jan Beyersmann and Arthur Allignol and Martin Schumacher","Session":"foc-biostat_surv-1-2","Keywords":"biostatistics-survival","Abstract":"Product integration is a very powerful, but somewhat neglected topic in\napplied survival analysis: Survival data are usually incompletely\nobserved, the most important example being independent\nright-censoring. This leads to survival analysis being based on hazards,\nbecause the hazard of seeing an event is undisturbed by censoring. The\nKaplan-Meier estimator of the survival function is a finite product over\none minus empirical hazards, and it approaches e to the negative true\ncumulative hazard. This result is not very intuitive, but it is much\nbetter understood using product integration: A product integral is a\n‘continuous time product’, like a usual integral is a ‘continuous time\nsum’. The product integral over one minus the true hazard is a ‘product’\nover infinitesimal conditional survival probabilities, and therefore\nequal to the survival probability. The product integral over one minus\nthe empirical hazard equals the Kaplan-Meier estimator. The ‘e to the\nnegative true cumulative hazard’-formula is then seen to simply be the\nsolution of a product integral. We explore these connections in R, where\none function prodint both approximates the true survival function\narbitrarily close and results in the Kaplan-Meier estimate when applied\nto data based empirical hazards. Both theory and the R implementation\ngeneralize to the matrix-valued case that is important for multistate\nmodels: Here one individual may experience a possibly random number of\nevents, e.g. transitions between ‘healthy’ and ‘ill’ before\ndying. Closed formulae for the transition probabilities will in general\nnot be available anymore, but a matrix-valued version of prodint may\nstill be used for numerical approximation. It also results in the\nAalen-Johansen estimator of the matrix of transition probabilities, a\ngeneralization of the Kaplan-Meier estimator, when applied to empirical\ntransition hazards. 
Empirical transition hazards can be obtained in R\nusing the mvna-package."} {"Title":"FluxEs: An 'R' Framework for Parameter Estimation in Biological Networks","Author":"Thomas Binsl and Jaap Heringa and David Alders and Hans van Beek","Session":"foc-bioinf_systems-1-1","Keywords":"bioinformatics-systems","Abstract":"Parameter estimation in biological networks is a difficult task and many\ncomputer programs were developed for this purpose. However, available\ncomputer methods suffer from lack of easy implementation of new\nbiological pathways. Hence, we have designed a new framework called\nFluxEs using an object-oriented programming approach, implemented using\nS4 classes in R. It particularly addresses distribution of isotopes\nbetween metabolites in a carbon-transition network useful for\nquantifying metabolic fluxes. The developed package provides a simple\nway to specify the topological information of the network as well as the\nprecise transitions of carbon atoms between molecules in plain text\nfiles, and guides the user through the optimization process. For the\npurpose of parameter estimation, FluxEs automatically derives the\nmathematical representation of the formulated network, and assembles a\nset of ordinary differential equations (ODEs). Afterwards, it fits\nexperimentally measured Nuclear Magnetic Resonance (NMR) multiplet\nintensities with the metabolic model result, by continuously solving the\nODEs numerically, scanning parameter space to obtain optimal parameter\nestimates. A test was performed by applying FluxEs to fit a model of the\ntricarboxylic acid (TCA) cycle to simulated 13C NMR data, including\nrealistic measurement noise. Flux values could be re-estimated with\nsignificant precision. Subsequent flux estimation on experimental NMR\ndata of animal heart biopsies showed good correspondence with\nindependent chemical measurements."} {"Title":"Towards a Java Framework for Rapid Development of Graphical User Interfaces for Statistical Applications based on R","Author":"Bernd Bischl and Kornelius Rohmeyer","Session":"foc-gui_build-1-3","Keywords":"user interfaces-java","Abstract":"Many users from a non-statistical background are not programmers and\noften are not up to the task of using R for their statistical\nproblems. Therefore, specific and intuitive applications need to be\nprovided, which hide much of the complexity of the underlying R system,\nin order to enable the users to solve their problems at hand. While it\nis possible to either control R from different programming languages or\nto interface Java or C++ from R, it is not very efficient to create the\nabove mentioned applications from scratch by any of these\nalternatives. Because many characteristics are shared, these should also\nbe encapsulated in shared code. Hence we believe that a toolbox is\nnecessary which helps the developers of statistical software to\nconveniently and flexibly design graphical applications in their area of\nexpertise. Our open source and platform independent framework, which is\nimplemented in Java and builds upon JRI (http://www.rforge.net/JRI) and\nRserve (http://www.rforge.net/Rserve), aims to achieve that goal by\ninserting an abstraction layer between the business logic of the\napplication and these two packages. 
Thereby we can create the same\napplication as a local variant (employing the user’s already installed\nversion of R) or as a web-oriented application with minimal local\nrequirements (which is automatically installed via Java Web Start and\nperforms all computations on an R server, thus not forcing the user to\nhave R installed at all). Currently, there are utility classes and\nrespective GUI elements to import data from XLS or CSV files, create\ndialogs to perform statistical analyses, generate, display and save\nplots, and print output to PDF files. We also provide basic LaTeX\nsupport for tables. Our framework has evolved from two major projects:\none application to estimate dose-response models for the University of\nCopenhagen and one for the Leibniz Universität Hannover to do quality\nassessment and novel statistical analysis for toxicological data. The\nlatter also includes convenience classes to generate dialogs regarding\ndifferent toxicological assays. These elements either act as a tutorial\nor provide a guided walk through the analysis of the user’s own\ndataset. In our presentation we will compare our own approach to\nexisting frameworks for building statistical tools based on R by\nhighlighting their general advantages and disadvantages. A short demo of\nthe framework and the look-and-feel of an implemented application will be\ngiven. We are looking forward to receiving feedback and discussing\nfurther features and improvements with attending researchers, users and\ndevelopers from the field."} {"Title":"Simulating Games on Networks with R. Application to coordination in dynamic social networks under heterogeneity.","Author":"Michal Bojanowski","Session":"foc-networks-1-2","Keywords":"networks","Abstract":"Most of the existing theoretical contributions to understanding\nmechanisms of coevolution of social networks and individual behavior\nassume that actors are homogeneous (e.g. Buskens et al., 2008; Jackson\nand Watts, 2002). The consequences of relaxing this assumption (Galeotti\net al., 2006) are not yet fully understood. Under which conditions will\nthe differences between actors result in higher segregation levels than\nin the homogeneous case? In this paper we study the interrelated dynamics\nof social networks and behavior when actors’ interests differ. As a\nframework for analysis we propose a baseline model in which actors\nsimultaneously choose their behavior and manage their personal relations\nwith others. The population of actors is composed of two types, and\ninteractions are modeled with asymmetric two-person games. The\nheterogeneity is represented by three elements: 1. The degree to which\nactors’ interests in behavioral options differ. 2. The severity of\n“mis-coordinating”. 3. The complementary or substitutable character of\nrelations with actors of the other type. To address the posed problems\nand evaluate the role of the three above-mentioned components we employ\nboth analytical and computer simulation methods. This paper presents the\nresults of a computer simulation study prepared and executed in R. The\nimplementation relies on the framework proposed in the package simecol\n(Petzoldt and Rinke, 2007), which was fine-tuned for use in our setting.\nThe results identify stable network architectures that emerge if actors\nactively try to improve their position by making behavioral and\nrelational choices. We also investigate the dynamics of selected\nstructural characteristics which, among others, include network\nsegregation, centralization and transitivity. 
Examples of the dynamics of\nthe system are shown with the network visualizer SoNIA\n(http://www.stanford.edu/group/sonia/)."} {"Title":"Using R for time series analysis and spatial-temporal distribution of global burnt surface multi-year product","Author":"Jedrzej Bojanowski and César Carmona-Moreno","Session":"foc-environ-1-3","Keywords":"environmetrics","Abstract":"Fires are one of the most significant components in the workings of the global\necosystem. There is no doubt that the global fire regime has a major\ninfluence on climate, the carbon cycle, pollution, etc. Modeling those phenomena\nhas been complicated because of the lack of exhaustive databases concerning past\nfire distributions. For this reason, the JRC is working on the concatenation of two\nexisting independent global multi-year burnt area products: GBS (1982-1999) and\nL3JRC (2000-2007). Since both time series are produced using different\nsatellite data with different spatial and temporal resolutions and algorithms,\nthe main objective is to develop a statistically coherent database. Combining\nthe GBS and L3JRC products requires dissecting both of them - analyzing their\nvariations in time and space. In this paper, we present a few R applications\nwhich were applied in our research. The RNetCDF package was used to handle\nthe large amount of spatial data. Afterwards, we applied different predefined\nmethods for time series analysis, using those from the stats package, as well as\nfrom the zoo, tseries or lmtest packages. We introduce several decomposition\ntheorems, tests, stochastic models as well as some graphics dedicated to time\nseries objects. Principal components analysis allows us to describe\nthe differences in fire regimes derived from the GBS and L3JRC algorithms. Based\non this, we propose a visualization technique to evaluate the spatial-temporal\ncoherence of the 26 years of these global burnt area products. The rgl package was\nused to present the data in principal components space. We also applied some\nnew methods to present the spatial-temporal distribution of the data. We used\nvariogram analysis and kriging based on the spatial package."} {"Title":"A Maximum Likelihood estimator of a Markov model for disease activity in chronic diseases that alternate between relapse and remission, for\nannually aggregated partial observations","Author":"Sixten Borg","Session":"foc-biostat_model-1-4","Keywords":"biostatistics-modeling","Abstract":"Background: Crohn's disease (CD) and ulcerative colitis (UC) are chronic\ninflammatory bowel diseases that have a remitting, relapsing nature. Relapses\nare treated with drugs or surgery. No drug can be considered a curative\ntreatment. In CD, surgery is not curative and may need to be performed many\ntimes, since the disease may reappear. For UC, curative surgery is possible,\nafter which the disease cannot relapse again. We needed a discrete-time Markov model for\nthe disease activity of relapse and remission with a cycle length of one month,\nin order to study the effect of shortening or postponing relapses. Our data\nconsisted of yearly observations of the individual patients. Each year, the\nnumbers of relapses and surgical operations were recorded. There were no data on\nthe time points at which relapses started or ended.\n\nMethod: The disease activity model is a Markov chain with four states: 1) first\nmonth of remission, 2) subsequent months of remission, 3) first month of\nrelapse, and 4) subsequent months of relapse. 
A period of remission is defined\nas an unbroken sequence of cycles spent in states 1 and/or 2. A relapse is\ndefined as an unbroken sequence of cycles spent in states 3 and/or 4. Surgery can\noccur in states 3 and 4. An exact maximum likelihood estimator was used that\ntranslated the yearly observations into monthly probabilities of transition\nbetween remission and relapse, and of surgery. The probability of remission depends\non the time since the start of the relapse, as does the probability of relapse on the time since the\nstart of remission, due to the model structure. The parameters themselves do not\nchange over time in our context. The initial implementation of the estimator was\nslow, counting through all possible pathways of the model. Many paths have a\nzero likelihood, and not all are unique in how their likelihoods depend on the\nparameters. We created a list of profiles, with the values necessary to evaluate\nthe likelihood of each unique pathway, given the parameter values. We thus\noptimized our estimator.\n\nResults: The maximum likelihood estimator appears to work well. Simulated\ntraining datasets result in reasonable estimates. The estimator initially took\nover three hours to complete. Optimization reduced this time to around one\nminute. The estimated disease activity model fits the observed data well and has\ngood face validity, in the absence of curative surgery. The presence of curative\nsurgery imposes a transient nature on the disease, which makes the disease\nactivity model unsuitable.\n\nConclusions: The disease activity model and its estimator work well. The presence of\ncurative surgery calls for further development of the model, the estimator and\nits use of profiles."} {"Title":"MCPMod - An R Package for the Design and Analysis of Dose-Finding Studies","Author":"Björn Bornkamp and José Pinheiro and Frank Bretz","Session":"kal-bio-1-1","Keywords":"pharmacokinetics","Abstract":"In this presentation the MCPMod package for the R programming environment will\nbe introduced. It implements a recently developed methodology for dose-response\nanalysis that combines aspects of multiple comparison procedures and modeling\napproaches (Bretz et al., 2005, Pinheiro et al., 2006). The MCPMod package\nprovides tools for the analysis of dose-finding trials as well as a variety of\ntools necessary to plan an experiment to be analysed using the MCP-Mod\nmethodology. Both the design and analysis capabilities of the package will be\nillustrated with examples."} {"Title":"Use R! for estimating forest parameters based on Airborne Laser Scanner Data","Author":"Johannes Breidenbach","Session":"kal-environ-1-2","Keywords":"environmetrics-forests","Abstract":"Forest parameters such as timber volume, diameter distributions, tree height and\ntree species are important information for sustainable forest management and\nplanning issues in the wood-working industry. Additionally, the amount of carbon\nstocks in woody biomass has become a crucial parameter due to international\nreporting commitments (e.g., the Kyoto protocol). Conventionally, this\ninformation is surveyed in sample plot inventories. However, terrestrial sample\nplot inventories usually cannot provide estimates at the stand scale (a typical\nforest stand in southern Germany has an area of 1-3 ha and comprises trees\nwith more or less the same species and age). Furthermore, as a result of their\nhigh costs, they are repeated in a decennial cycle. Therefore, one aim of the\nresearch project MatchWood (www.matchwood.de) 
is to develop methods to\nregionalize forest parameters based on remotely sensed data. Since many\nvariables of interest are correlated with the structural characteristics of the\ncanopy, airborne laser scanning data were used as auxiliary variables. Airborne\nlaser scanning (ALS) or light detection and ranging (lidar) is an active remote\nsensing technique that comprises scanning and navigation units. In an ALS\nsystem, a laser pulse is projected on a scanning mirror and sent to the surface.\nSince the position and orientation of the aircraft are known, the time-of-flight\nof the laser pulse can be used to determine the position of the reflection on\nthe earth’s surface. ALS provides a high-resolution 3D representation of the\ncanopy and the terrain surface in one overflight. R was used to derive height\nand density metrics of the lidar-derived vegetation height for inventory plots\nand to develop statistical models for the response variables. The presentation\nwill show the application of different R methods and libraries for estimating\nand regionalizing the above-mentioned forest parameters. For example:\n• Calling external command-line tools (FUSION) for handling the huge amount\n(about 500,000 returns per km²) of raw lidar data.\n• Mixed-effects models (library nlme) for estimating timber volume and biomass\nby accounting for the spatial correlation of the inventory plots and for\nheteroscedasticity. \n• Generalized additive models for location, scale and shape (library GAMLSS)\nfor estimation of the Weibull-distributed response variable diameter. \n• RandomForests for non-parametric estimation of timber volume by species. \n• Generating maps using the maptools library."} {"Title":"Tricks and Traps for Young Players","Author":"Ray Brownrigg","Session":"foc-teach-1-3","Keywords":"teaching","Abstract":"This presentation will illustrate for new users of R some of its very useful\nfeatures that are frequently overlooked, and some frequently misunderstood\nfeatures. Emphasis will be on achieving results efficiently, so there may be\nsome value for (moderately) seasoned users as well as beginners. Many of the\nfeatures discussed will be illustrated by following the development of an actual\nsimulation project. Issues to be discussed include:\n• Using a matrix to index an array\n• Vectorisation\n– user-defined functions (using curve(), optimi[zs]e())\n– pseudo vectorisation\n– multi-dimensional\n• Matrices, lists and dataframes: which are most efficient?\n• Local versions of standard functions\n• Resolution of pdf graphs\n• .Rhistory\n• get()\n• file.choose()\n• sort(), order() and rank()"} {"Title":"Exploring Financial System Convergence in 8 OECD countries by means of the plm package","Author":"Giuseppe Bruno","Session":"foc-econom-1-1","Keywords":"econometrics","Abstract":"The relevance of financial systems for collecting resources from saving\nhouseholds to funds-constrained firms is widely recognized. Taking advantage of\na dataset covering the financial accounts for 8 of the main OECD economies, we\nrun some experiments on β- and σ-convergence for the main components of\nhousehold financial assets. These experiments have been carried out in\nR equipped with the plm package for panel data estimation. 
The\nempirical literature on β- and σ-convergence is typically based on\nregression models where the average growth rate of per capita income is assumed\nto depend on its initial level and possibly on other exogenous variables used to\ncontrol for country idiosyncrasies: (1/T) log(y_{i,t+T}/y_{i,t}) = α + β log(y_{i,t}) +\nγ x_{i,t} + ε_{i,t}. In this model we say there is conditional β-convergence if we\nfind β < 0. In other words, in the presence of β-convergence poor economies\ntend to grow faster, and therefore to catch up with richer countries. Sala-i-Martin\n(1996) proposed the concept of σ-convergence, defined as follows: a group of\neconomies satisfies σ-convergence if the dispersion of their per capita income\nlevels decreases over time: σ_{t+T} < σ_t, where σ_t = Σ_{i=1}^{N} (log(y_{i,t}) − ȳ_t)².\nUsing a dataset on financial accounts produced in 2007 by the OECD,\nPioneer G.A.M. and some National Central Banks, we have carried out a thorough\nconvergence analysis in the R environment. In this paper we explored the\nbehaviour of the total financial assets held by households and four of their\nmain components: currency and deposits, securities other than shares, shares and\nother equities, and insurance technical reserves. The main economic conclusions\ndrawn from the analysis are: a) evidence of β- and σ-convergence\nis found for household total financial assets, shares and other equity,\nand insurance products; b) often no convergence is found for currency and\ndeposits and securities other than shares; c) the intensity of banking\ndisintermediation for deposits shows marked differences among the OECD\ncountries. These kinds of empirical applications are usually carried out with\ncommercial econometric packages. In this work we compared the numerical results\nproduced by R with those achieved with three well-known packages: E-Views,\nLIMDEP and Stata. Some interesting results can be drawn from this comparison: a)\nthe coefficient estimates always agree among the different packages for the\npooled OLS and the fixed effects model; b) some differences arise among the\nnumerical values of the standard errors for the fixed effects model; c) in\nsome particular situations the random effects model generates the same estimates\nas the pooled OLS model; d) model definition might be improved with the help of\nan inline symbolic lag/lead operation."} {"Title":"Computationally Tractable Methods for High-Dimensional Data","Author":"Peter Bühlmann","Session":"invited","Keywords":"invited","Abstract":"Many applications nowadays involve high-dimensional data with p variables (or\ncovariates), sample size n and the relation that p ≫ n. We focus on penalty-based\nestimation methods which are computationally feasible and have provable\nstatistical and numerical properties. The Lasso (Tibshirani, 1996), an\nℓ1-penalty method, became very popular in recent years for estimation in\nhigh-dimensional generalized linear models. Extensions to other models or\ndata types call for more flexible convex penalty functions, for example to\nhandle categorical data or for improved control of smoothness in additive\nmodels. The Group-Lasso (Yuan and Lin, 2006) and a new sparsity-smoothness\npenalty are general and useful penalty functions for many high-dimensional\nmodels beyond GLMs. Fast coordinatewise descent algorithms can be used for\nsolving the corresponding convex optimization problems, which allows one to easily\ndeal with large dimensionality p (e.g. p ≈ 10^6, n ≈ 10^3). 
The talk\nincludes: (i) a review of Lasso-type methods; (ii) new flexible penalty\nfunctions, fast algorithms (R package grplasso) and some comparisons with\nboosting; and (iii) some illustrations for bio-molecular data."} {"Title":"An Automatic Recommendation System using R: Project Thank You eMail","Author":"Christopher Byrd","Session":"kal-app-1-1","Keywords":"business","Abstract":"Throughout the guest experience at an IHG brand hotel, chances are you will\nreceive an email communication that contains an offer. For example, “Earn\n5,000 miles on your next stay”. Using only Open Source software, resident\nstatisticians begin the task of building an automatic recommendation system,\nwith R as a core component. The pilot project for this endeavor is named\n“Thank You eMail”. The name is fitting since the plan is to deliver offer\nrecommendations through the Thank You emails, which are distributed 24-72 hours\nafter a guest has checked out. Given the high volume of email transactions, the\nteam must meet standard IT constraints, e.g. > 1 msec response time. This\npresentation will show the role of R in the enterprise-wide solution, and how\nJava is used to integrate it with other very popular and easily accessible\ntools. Packages: R Sessions, FlexMix (latent class regression), Bayesm, and more."} {"Title":"washAlign: a GC-MS Data Alignment Tool Using Iterative Block-Shifting of Peak Retention Times Based on Mass-Spectral Data","Author":"Minho Chae and John Thaden and Steven Jennings and Robert Shmookler Reis","Session":"foc-chemo-1-1","Keywords":"chemometrics","Abstract":"In GC-MS, a gas chromatograph (GC) resolves chemicals by time of elution from a\ncoated capillary through which gas flows; a mass spectrometer (MS) resolves ions\n(produced upon fragmentation of eluates) by mass/charge (m/z) ratio; and an\nacquisition program records ion intensity as a function of m/z and elution,\nyielding spectra and chromatograms, respectively. A problem when comparing\nrecords in an experiment is that elution times will vary. washAlign has been\ndeveloped in R to address this problem. It warps regions between peaks that it\nhas shifted, thereby aligning those peaks to spectrally matched peaks in a\nreference chromatogram while preserving their shape and area. Through pair-wise\ncomparisons of all records to one arbitrarily selected reference record, all\nrecords in a large experiment can be aligned for subsequent processing, e.g., by\nthree-way methods, including those such as PARAFAC that assume mathematical\ntrilinearity. In washAlign, (a) ion chromatograms are extracted for a subset of\nthose m/z channels with the five highest ion intensities in any of the\nconsecutive MS scans that define “a region of the sample and reference\nchromatograms that exhibit a peak on the total intensity chromatogram”; (b)\npeaks are detected in them, and key peaks are matched between sample and\nreference through a procedure involving iterative localization with spectral\ncorrelation, to produce for each sample and the reference a peak list for\nalignment; and (c) the key sample peaks are shifted toward the matching peaks in\nthe reference run, and non-peak regions are warped, i.e., linearly interpolated,\nto join the shifted peak regions. Users can visually inspect the chromatograms\nbefore and after alignment of a pair of chromatograms, through an interactive\nselection of matched peaks. 
Taking an iterative block-shift approach makes it\npossible not only to reveal strongly matching peaks at early stages but also to\nreduce the risk of mismatching chemically different peaks."} {"Title":"Scaling and Robustification of ARMA Models with GARCH/APARCH Errors Using R/Rmetrics","Author":"Yohan Chalabi and Michal Miklovic and Diethelm Würtz","Session":"foc-finance-1-1","Keywords":"finance","Abstract":"This presentation explores concepts and methods to implement extensions of the ARMA\nmodels with GARCH/APARCH errors introduced by Ding, Granger and Engle. It is\nnowadays common to estimate GARCH/APARCH models of financial time series. They\nplay an essential role in risk management and volatility forecasting. Although\nthese models are well studied, numerical problems may arise in the estimation of\nseries with extreme events. In this talk, we present how to explore the\ndifferent behavior in the upper and lower tails of the financial return series\ndistribution. Generalized hyperbolic skew Student’s t-distributions can explain\nextreme polynomial losses and exponentially decaying gains. We follow ideas from\nrobust estimation and appropriate parameter scaling from optimization, and present\ntheir implementation in R/Rmetrics."} {"Title":"tdm - A Tool for Therapeutic Drug Monitoring in R","Author":"Miao-ting Chen and Yung-jin Lee","Session":"foc-pharma-1-2","Keywords":"pharmacokinetics","Abstract":"Introduction: Therapeutic drug monitoring (TDM) aims to optimize an individual\npatient’s drug therapy through monitoring the plasma/serum concentrations of\nthe target drug, as well as the observed clinical responses. However, there are\nusually only a few blood samples that can be collected and analyzed. Often there\nis only a single blood sample available. Therefore, it becomes very\nimportant to accurately estimate individual pharmacokinetic (PK) parameters with\nlimited observations. Bayesian estimation is a very suitable algorithm for this\nsituation. In contrast to minimizing an objective function, Bayesian estimation\nwith Markov-chain Monte-Carlo (MCMC) simulation (integration) using the Gibbs\nsampler technique (BUGS) might be worth implementing and applying. Hence, the\nobjective of this study was to develop a TDM tool using BUGS for R. Methods and\nMaterials: We chose OpenBUGS, an open-source version of BUGS for Windows (through\nits R interface package BRugs), to develop this tool under R. Each drug model\nwas divided into two parts: the probability distribution of the population PK\nparameters (as priors), and the probability distribution of the observed drug\nserum/plasma concentration or observed clinical response (as the conditional\nprobability or the likelihood function). This tool was validated with simulated\ndata obtained from the published PK parameters within the range of 2*s.d., and\nthe accuracy of the PK parameters was evaluated with the percent prediction error (PE\n%). Results and Discussion: We named this tool tdm. Seventeen drug models,\nincluding one PK/PD model (warfarin) and sixteen PK models, were built into tdm. It\ncan be used to estimate individual PK/PD parameters with one or more\nobservations obtained from a single subject, as well as from multiple subjects at the\nsame time. Except for one drug, imatinib, the PK or PD parameters of all other drugs\nare estimated at steady state. Furthermore, tdm also provides a dosage\nadjustment function. 
Based on the results of the estimation validation, we found that the PEs\nof the PK parameters of the built-in drug models were similar to those obtained\nusing nonlinear regression in other computer software, JPKD. Conclusion: tdm was released in\nNov. 2006 and can be downloaded and installed from R mirror websites. The latest\nversion is 2.2.1. Currently tdm is only available for Windows, because BRugs has\nnot been available for other platforms yet."} {"Title":"The Virtual R Workbench, towards an open platform for R based e-Science","Author":"Karim Chine","Session":"foc-gui_build-1-2","Keywords":"user interfaces-java","Abstract":"Biocep frameworks and tools make it possible to use R as a Java object-oriented\ntoolkit or as an RMI server. Calls to R functions from Java, locally or remotely,\ncope with local and distributed R objects. Stateless and stateful JAX-WS web\nservices can be generated and deployed on demand for R packages. An\ninfrastructure with a large number of R servers running on a heterogeneous set\nof machines can be deployed and used for multithreaded web applications and web\nservices, for distributed and parallel computing, for dynamic content generation\nfor thin web clients, including graphics, and for R virtualization in a shared\ncomputation resources context. The virtualization is based on a universal\nadvanced GUI for R (the virtual R workbench) that can also be used to control\nself-managed R servers. A dedicated HTTP gateway enables the control of R\nservers running behind firewalls. The workbench includes a powerful and\neasy-to-use docking framework, advanced script editors, spreadsheet views fully\nconnected to R, R object inspector views, data storage views, a highly\ninteractive zooming system for exploring complex visual data and several new R\ngraphics interactors. It can run as an applet, via Java Web Start or as a\ncross-platform desktop application. The virtual workbench is capable of creating\nR servers on any remote machine having R accessible from the command line,\nwithout any extra pre-installation/pre-configuration. It enables collaborative R\nsessions (one session, multiple simultaneous users, console and device content\nbroadcasting). It has built-in distributed computing facilities accessible via\nthe API or directly from the R console. The functions available are similar to\nwhat has been defined within the snow package (makeCluster, clusterEvalQ,\nclusterExport, clusterApply, stopCluster..) and do not require any\nconfiguration. Biocep has built-in Python scripting facilities both on the server\nand on the client side. The bridging of R and Python is bidirectional: R objects\ncan be exported to Python and Python objects imported to R. Scripting with R as\na component becomes easier than ever by using the Biocep API or from within the\nworkbench via the R / Python consoles and via the embedded jEdit-based script\neditor. The virtual workbench is designed to be an open platform: on the one hand,\nit allows users to acquire an R computational resource in different ways, either\nby creating an R server on intranet machines or by connecting to public grids\nexposing a virtualized infrastructure via an HTTP or a SOAP front-end. On the\nother hand, it has a plugin architecture that enables the integration of new\nGUIs designed for end users as new views and perspectives. The creation of those\nviews can be done programmatically (Java/Swing) or visually via a bean builder\n(NetBeans Matisse), and various Java beans are available as GUI components\nmapping standard R objects and devices. 
The plugin architecture handles the\nnotification and the synchronization of the views with the R objects: changes\nmade to the data in the views become effective within the R session, and changes\nmade to R objects are visible in real time in the different views. Several\nexisting interactive statistical software packages for data analysis (KLIMT, iPlots,\nMondrian, ...) could in the future become plugins, among others, available on a\nweb-accessible central repository. The virtual workbench would enhance the user\nexperience and the productivity of anyone working with R directly or indirectly.\nThe openness would expand the range of software available for statistical\ncomputing and statistical data visualization/exploration. The interoperability\ncoupled with a large-scale deployment of virtualization infrastructures on\nvarious grids would democratise R-based HPC and enable users from within their\nbrowsers to compute with R and visualize data with unprecedented flexibility and\nperformance.\n\nBiocep is a project hosted by R-Forge and it is released under the Apache 2.0\nLicense. Biocep home: www.biocep.co.uk"} {"Title":"Statistical Cartoons","Author":"Ewan Crawford and Adrian Bowman","Session":"kal-gui_teach-1-3","Keywords":"teaching","Abstract":"The advent of interactive controls in R has allowed researchers to construct\nvery convenient mechanisms to explore data and models but has also allowed\nlecturers to produce animated graphics to explain ideas. This talk focusses on\nthe latter activity and aims to draw associations with both meanings of the word\n‘cartoon’. From one perspective these are sketches of the real thing which\naid experimentation and understanding while, from another, animated drawings\nhave the potential to raise a smile in the classroom. Both of these are helpful.\nIn between the command line interface of R and the GUI interfaces such as R\nCommander (Fox, 2005), panel controls offer a very useful mode of control in\nboth classroom and laboratory settings. This talk aims to illustrate and discuss\n‘cartoons’ of this type, using a variety of illustrations from the context\nof teaching and learning at elementary, advanced and practical levels. The\nillustrations have been built using the rpanel package which was designed to\nprovide a quick and easy means of building control panels. The basic design of\nthe package, which is based on the extensive set of GUI tools in tcltk (Dalgaard, 2001),\nwill also be outlined."} {"Title":"An Alternative Package for Estimating Multivariate Generalised Linear Mixed Models in R","Author":"Robert Crouchley and Damon Berridge and Dan Grose","Session":"foc-mod_mixed-1-2","Keywords":"modeling-mixed","Abstract":"There are several packages at [1] that have been specially written for\nestimating Generalised Linear Mixed Models in R; these include lme4 [2] and\nnpmlreg [3]. There are also commercial systems that have algorithms for the same\nclass of models, see e.g. Stata [4], gllamm [5] and SAS [6]. In this\npresentation we compare the performance of these systems with our alternative\n(sabreR, to be available from [7]) on some standard small to medium sized data\nsets and show that our alternative is very much faster. We also present a grid\nenabled version of the software (SabreRgrid); this shows how easy it has become\nto submit grid jobs from the desktop PC and the extra speed-up that can be\nobtained by going parallel on a High Performance Computer on the grid or\notherwise.
This extra speed-up is particularly important for estimating complex\nmodels on large and very large data sets. SabreR is a program for the\nstatistical analysis of multi-process event/response sequences. These responses\ncan take the form of binary, ordinal, count and linear recurrent events. The\nresponse sequences can also be of different types (e.g. linear (wages) and\nbinary (trade union membership)). Such multi-process data is common in many\nresearch areas, e.g. in the analysis of work and life histories from the British\nHousehold Panel Survey or the German Socio-Economic Panel Study where\nresearchers often want to disentangle state dependence (the effect of previous\nresponses or related outcomes) from any omitted effects that might be present in\nrecurrent behaviour (e.g. unemployment). Understanding of the need to\ndisentangle these generic substantive issues dates back to the study of accident\nproneness (Bates and Neyman, 1952) and has been discussed in many applied areas,\nincluding consumer behaviour (Massy et al., 1980) and voting behaviour (Davies\nand Crouchley, 1985). SabreR can also be used to model collections of single\nsequences such as may occur in medical trials, e.g. headaches and epileptic\nseizures (Crouchley and Davies, 1999, 2001), or in single equation descriptions\nof cross sectional clustered data such as the educational attainment of children\nin schools. We call the class of models that can be estimated by sabreR\nMultivariate Generalised Linear Mixed Models. These models have special features\nadded to the basic models to help them disentangle state dependence from the\nincidental parameters (omitted or unobserved effects). The incidental parameters\ncan be treated as random or fixed, the random effects models being estimated\nusing normal Gaussian quadrature or Adaptive Gaussian quadrature. ‘End\neffects’ can also be added to the models to accommodate ‘stayers’ or ‘non\nsusceptibles’. The fixed effects algorithm we have developed uses code for\nlarge sparse matrices from the Harwell Subroutine Library, see [8]. SabreR and\nSabreRgrid also include the option to undertake all of the calculations using\nincreased accuracy. This is important because numerical underflow and overflow\noften occur in the estimation process for models with incidental parameters.\nThis feature does not seem to be available in other similar software [2, 3, 4,\n5, 6].\n\nLinks:\n[1] http://cran.r-project.org/\n[2] http://cran.r-project.org/web/packages/lme4/index.html\n[3] http://cran.r-project.org/web/packages/npmlreg/index.html\n[4] http://www.stata.com/\n[5] http://www.gllamm.org/\n[6] http://www.sas.com/\n[7] http://sabre.lancs.ac.uk/\n[8] http://www.cse.scitech.ac.uk/nag/hsl/"} {"Title":"igraph - a package for network analysis","Author":"Gabor Csardi","Session":"foc-networks-1-1","Keywords":"networks","Abstract":"The igraph R package is an interface to the C library with the same name,\ndeveloped for implementing graph algorithms. As many graph algorithms are\nalready included in igraph, it is also a handy tool for (exploratory) network\nanalysis.
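(A hedged sketch of the kind of igraph workflow described in the abstract above: building a graph from a data frame and computing a few properties. Function names follow the current igraph API, which differs slightly from the dot-separated names used around 2008; the edge list is made up.)

library(igraph)
edges <- data.frame(from = c("a", "a", "b", "c", "d"),
                    to   = c("b", "c", "c", "d", "a"))   # toy edge list
g <- graph_from_data_frame(edges, directed = FALSE)
degree(g)       # vertex degrees
betweenness(g)  # betweenness centrality
diameter(g)     # length of the longest shortest path
plot(g)         # quick visualization with traditional R graphics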
Main igraph features:\n• igraph uses a simple, flat data structure for graph representation; this\nallows handling graphs with millions of edges and/or vertices.\n• It is possible to assign attributes to the vertices or edges of the graph,\nor to the graph itself; the attributes can be arbitrary R objects.\n• Graph visualization, both interactive and non-interactive, using 1)\ntraditional R graphics, 2) Tcl/Tk or 3) OpenGL via rgl.\n• A variety of classic and recent graph algorithms are implemented in igraph:\n◦ Shortest paths and shortest path based measures, e.g. diameter. ◦ Weakly\nand strongly connected components, biconnected components and articulation\npoints. ◦ Maximum flows and minimum cuts, edge and vertex connectivity. ◦\nVarious centrality measures: degree, closeness, betweenness, Burt’s\nconstraints, PageRank, eigenvector centrality, Kleinberg’s hub and authority\nscores. ◦ Fast graph and subgraph isomorphism algorithms. ◦ Cliques and\nindependent vertex sets. ◦ Graph motifs. ◦ Community structure detection\nbased on many recently published heuristics. ◦ K-cores, transitivity, minimum\nspanning trees, topological sorting, etc.\n• Graphs can be created in various ways: ◦ From data frames, edge lists,\nadjacency matrices, from a simple R formula notation. ◦ From a list of famous\ngraphs, predefined structures like rings, stars, trees, etc. or from the Graph\nAtlas. ◦ Using random graph models, like preferential attachment, or the\nsmall-world model.\n• igraph supports many commonly used file formats for storing graphs, like\nGraphML, GML or the format used by Pajek.\nIn this lecture I will show several practical examples on how to turn data into\nigraph graphs, how to calculate various graph properties: vertex centrality and\ncommunity structure, and graph visualization."} {"Title":"Quantitative approach to Entropy weighting methodology in MADM","Author":"Mohammad Ali Dashti","Session":"foc-business-2-1","Keywords":"business","Abstract":"It seems that those MADM matrices whose scattering of the alternative values\ndistribution have different importance. The Entropy technique would give\ncompletely irrelevant weights in comparison to other techniques such as external\nweighting. We propose a method to resolve this apparent disagreement. In this\nmethod we shall first convert all quantitative values to qualitative values\nusing DM judgment and then we will apply the Entropy technique as before. An\nexample is presented to illustrate the proposal."} {"Title":"RiDMC: an R package for the numerical analysis of dynamical systems","Author":"Antonio Di Narzo and Marji Lines","Session":"kal-ts-1-5","Keywords":"numerics","Abstract":"RiDMC is an R package for the numerical analysis of discrete- and continuous-time\ndynamical systems. With RiDMC the user can easily encode a model in the simple,\ninterpreted LUA language, and immediately perform numerical analysis with a\nvariety of algorithms. The LUA language gives maximum flexibility in model\nspecification, and allows for the introduction of stochastic components in a\nvery natural way. Alternatively, if the user wants to work with an existing model, he may\nchoose from a large number of well-known dynamical systems already available in\nthe package. Once a model is loaded, the user can compute trajectories,\nbifurcation diagrams, Lyapunov spectra, basins of attraction and periodic\ncycles.
For each analytical routine there is an associated plotting function,\nwith reasonable default settings (axes labelling, font sizes, etc.), so that\npublication-quality plots can be produced directly with almost no additional\neffort. Moreover, plots are based on the grid system, so that full plot\ncustomization, manipulation and reuse are possible for more expert R users. RiDMC\nuses the idmclib C library for interpreting user-supplied models and for doing\ncore numerical computing. The idmclib library, released with sources under the\nGPL-v2 license, is small, well-documented and easy to understand for anyone\ndesiring a closer look at the internal numerical algorithms. A set of\ninteresting case studies is presented as a demonstration of the package\ncapabilities."} {"Title":"Cracking the Nut: Introducing R to a Department","Author":"Will Dubyak","Session":"kal-gui_teach-1-5","Keywords":"teaching","Abstract":"Making the best computing and most sophisticated methods accessible to the\nSocial Science Undergrad. This paper presents a strategy for introducing R to\na social science department not necessarily ready to embrace it. It is about not\nsimply teaching R, but finding a mechanism to insert it into the core of a\ndepartment. It suggests the naïve assumption of R’s ‘self-sellability’\ninvites enormous frustration and almost certain failure. It argues that\nspringing R on a department should be a campaign based on principles of military\nplanning. It draws its occasionally offbeat lessons from the generally\nsuccessful effort to integrate R in the political science department at the\nUnited States Naval Academy. The key point is this: a successful R introduction\ndoes not resemble the entry of conquering heroes to the exuberant welcome of a\nliberated population. It is far more like an insurgency, fought fiercely behind\nthe scenes. For accomplished R users this is mystifying. We enthusiastically\ncelebrate R’s computational power, its magnificent graphics, and its explosive\nincrease in functionality through tailored add-on libraries. Sometimes, though,\nwe forget how intimidating it all looks to a novice uncertain about even loading\nthe data. Social science generally has been revolutionized by advances in\nsophisticated methods, and the disciplines have been forever changed. However,\nthese developments have not altered the type of undergraduate selecting social\nscience as a discipline. Nor have they necessarily spread to established faculty\nwhose training predates this revolution, or whose interests are not easily\naddressed by a data set. As a result, scholars selling R are greeted not by\nadoring throngs seeking leadership into the world of cutting edge social\nscience, but rather by intensely skeptical stakeholders with strong interest in\npreserving the status quo. Frontal assaults on this position are futile; the\npush-back is overwhelming. As a model for a successful introduction, this paper\nsuggests following guidelines for military planning. Generations of the best\nmilitary minds have devised strategies for asymmetrical matchups. Students in\nwar colleges and service academies are challenged to examine lessons of the past\nwhen facing contemporary threat environments. They are encouraged to consider\ntheir strengths, and those of the enemy, and to bring maximum force to bear on a\nproblem with supreme economy of effort.
This paper suggests that the problems of\nintroducing R are partly tactical (using graphs and simulated quantities of\ninterest to make complex findings accessible), and partly strategic (building\nalliances, scoring and exploiting visible public victories, and when all else\nfails, deceiving) until victory belongs not to the entrenched, but to the\ndeserving. This paper tells how it was done here, and offers it as a model for\nothers facing similar struggles."} {"Title":"dynGraph: interactive visualization of 'factorial planes' integrating numerical indicators","Author":"Julien Durand and Julie Josse and François Husson and Sébastien Lê","Session":"kal-visual-1-5","Keywords":"user interfaces, visualization","Abstract":"dynGraph is visualization software that was initially developed for the\nFactoMineR package, an R package dedicated to multivariate exploratory methods\nsuch as principal components analysis, (multiple) correspondence analysis and\nmultiple factor analysis (http://factominer.free.fr); dynGraph has been extended\nto allow the visualisation of data frames. The main objective of dynGraph is to\nallow the user to interactively explore graphical outputs provided by\nmultidimensional methods by visually integrating numerical indicators. The first\nbasic feature of dynGraph is the connecting line that appears whenever the user\nmoves a label associated with an object, i.e. an individual, a variable or a\ncategory. Labelling of the different objects displayed on the graph can be\neasily set. Colours can be assigned to individuals according to a categorical\nvariable of interest. One of the main features of dynGraph is the way objects\nare displayed. Objects are displayed according to their quality of\nrepresentation, by default above the threshold of 0.8 with respect to a maximum\nof 1. Of course the amount of information to be displayed can be easily set by\nthe user with a cursor: graphical outputs can be analyzed interactively from the\nmost general piece of information to the most relevant one. Moreover, the font\nsize of each label associated with an object is proportional to the importance\nof the object in the analysis, which greatly facilitates the interpretation\nof the results. Besides, different criteria can be used to assess the importance\nof an object and this information is calculated via R and the FactoMineR\npackage. Finally, by clicking on one of the dimensions provided by the analysis,\nthe user gets a list of the variables that significantly explain the dimension,\nwhich helps him to interpret the data."} {"Title":"A Graphical User Interface for Environmental Statistics","Author":"Rudolf Dutter","Session":"foc-gui_special-1-3","Keywords":"user interfaces-tcltk","Abstract":"We report on a package called DAS+R under development using a graphical user\ninterface which should ease the application of more or less sophisticated\nmethods. The basis of the graphical user interface comes from the R Commander\n(see Fox, 2004). It uses Tcl/Tk programming tools (Welch and Jones, 2003). The\nemphasis is on the analysis of spatially dependent uni- or multivariate data,\nparticularly on problems of geochemical data. Three special properties of DAS+R\nshould be stressed: • Interactive definition of data subsets (numerically or\ngraphically) together with set operations. Usage of these subsets in almost all\ngraphics and computations.
• Intensive use of possible relations between the\ngeographical information and the values of the data in the statistical and\ngraphical analysis. • The strong requirement of fast reproducibility and\nrepeatability with small variations in the analysis. For specified subsets many\nsimple graphics can be generated in an easy way by a few mouse clicks\n(histograms, boxplots, xy-, ternary plots, scatterplot matrices). These\nnevertheless can become very sophisticated by using the provided advanced\noptions where almost all options of the usual R commands can be specified by\nclicking graphical icons. The geographical information is used by generating\ndifferent kinds of maps. Different symbol sets can be used for representing\nthe values in space. Surface maps may be produced by simple interpolation\nalgorithms or by sophisticated geostatistical methods such as kriging. All these\ngraphical displays may be produced in any specified scale on a user-defined\nworksheet which can be interactively split into arbitrary frames which are\nprovided for the different graphics. Finally, many multivariate methods like\nprincipal component and factor analysis, cluster and discriminant analysis, are\navailable. The package is also meant as a companion to the book recently\npublished by Reimann et al. (2008). We describe in short the system and\nillustrate the usability on some geochemical data sets."} {"Title":"Scripting with R in high-performance computing: An Example using littler","Author":"Dirk Eddelbuettel","Session":"foc-highperf-1-3","Keywords":"high performance computing","Abstract":"High-Performance Computing with R often involves distributed computing. Here,\nthe MPI toolkit is a popular choice, as it is well supported in R by the Rmpi\nand snow packages. In addition, resource and queue managers like slurm help\nin allocating and managing computational jobs across compute nodes and clusters.\nIn order to actually execute tasks, we can take advantage of a scripting\nfrontend to R such as r (from the littler package) or Rscript. By discussing a\nstylized yet complete example, we will provide details about how to organise a\ntask for R by showing how to take advantage of automated execution across a\nnumber of compute nodes while being able to monitor and control its resource\nallocation."} {"Title":"Management and Analysis of Large Survey Data Sets Using the 'memisc' Package","Author":"Martin Elff","Session":"kal-app-1-3","Keywords":"social sciences","Abstract":"One of the aims of the memisc package is to make life easier for useRs who have\nto work with (large) survey data sets. It provides an infrastructure for the\nmanagement of survey data including value labels, definable missing values,\nrecoding of variables, production of code books, and import of (subsets of) SPSS\nand Stata files. Further, it provides functionality to produce tables and data\nframes of arbitrary descriptive statistics and (almost) publication-ready tables\nof regression model estimates. Also some convenience tools for programming and\nsimulation are provided, as well as some miscellaneous probability\ndistributions, statistical models, and graphics. Based on an example analysis of\nthe cumulated ALLBUS 1980–2004 data set (ZA-No. 4243), it is demonstrated how\neven large data sets can be handled without much pain using the memisc package.\nThe cumulated ALLBUS comprises data of 44,526 respondents and 1,141 (!)\nvariables.
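(Relating back to the littler abstract above: a minimal, hedged sketch of a worker script run by the r frontend, for example once per compute node under a resource manager such as slurm. littler supplies the command-line arguments in 'argv'; the file name and workload are hypothetical.)

#!/usr/bin/env r
# simulate.r -- hypothetical worker script; arguments: <seed> <replications>
seed <- as.integer(argv[1])
n    <- as.integer(argv[2])
set.seed(seed)
res <- replicate(n, mean(rnorm(1000)))   # stand-in for the real computation
saveRDS(res, file = sprintf("result_%03d.rds", seed))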
The proposed presentation shows the workflow of analysis of such a\nlarge data set: First, variables that are relevant for the analysis are loaded\nselectively into the workspace, thus minimizing the overall memory footprint.\nSecond, attributes of variables in such a data set, like variable labels, value\nlabels and user-defined missing values are retained and used for data\nmanagement conducive for typical social science data analysis. Third, tables of\ndescriptive statistics are produced for preliminary or exploratory analyses\nusing the genTable function of the package. Fourth, estimates of statistical\nmodels are formatted in a way suitable for publication in social science\njournals using the mtable function."} {"Title":"The bigmemoRy package: handling large data sets in R using RAM and shared memory","Author":"John Emerson and Michael Kane","Session":"foc-highperf-1-2","Keywords":"high performance computing-large memory","Abstract":"Multi-gigabyte data sets challenge and frustrate R users even on well-equipped\nhardware. C programming provides memory efficiency and speed improvements, but\nis cumbersome for interactive data analysis and lacks R’s flexibility and\npower. The new package bigmemoRy bridges this gap, implementing massive matrices\nin memory (managed in R but implemented in C) and supporting their basic\nmanipulation and exploration. It is ideal for problems involving the analysis in\nR of manageable subsets of the data, or when an analysis is conducted mostly in\nC. In a Unix environment, the data structure may be allocated to shared memory,\nallowing separate R processes on the same computer to share access to a single\ncopy of the data set; mutual exclusions (mutexes) are provided to avoid\nconflicts. This opens the door for more powerful parallel analyses and data\nmining of massive data sets."} {"Title":"Exploratory and Inferential Analysis of Benchmark Experiments","Author":"Manuel Eugster and Friedrich Leisch","Session":"foc-mod_man-1-3","Keywords":"machine learning","Abstract":"Benchmark experiments produce data in a very specific format. The observations\nare drawn from the performance distributions of the candidate algorithms on\nresampled data sets. benchmark is the comprehensive R toolbox for the setup,\nexecution and exploratory and inferential analysis of these experiments. The\npackage introduces an additional layer of abstraction (using S4 mechanisms)\nrepresenting the elements of benchmark experiments. This allows the integration\nof all statistical learning algorithms available in the R system and a\nconsistent way for developing new ones. The consequence of this slight extra\nwork is a standardized setup and analysis of benchmark experiments. The package\nprovides wrapper methods for common learning algorithms available in R. In this\npresentation we introduce the elements of benchmark experiments and show how to\ncombine them into a flexible framework. The usage is illustrated with exemplary\nbenchmark studies based on common learning algorithms on one or several popular\ndata sets, respectively. 
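(A hedged sketch of the memisc workflow described above. The SPSS file name and variable names are hypothetical; spss.system.file is memisc's SPSS importer, and genTable and mtable are the functions named in the abstract.)

library(memisc)
allbus <- spss.system.file("allbus_1980_2004.sav")    # hypothetical file
ds <- subset(allbus, select = c(sex, educ, income))   # load only the needed variables
df <- as.data.frame(ds)
genTable(percent(educ) ~ sex, data = ds)              # descriptive table
m1 <- lm(income ~ educ, data = df)
m2 <- lm(income ~ educ + sex, data = df)
mtable(m1, m2)                                        # publication-style model table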
We present new visualisation techniques, show how\nformal test procedures can be used to evaluate the results, and, finally, how to\nsum up to an overall ranking."} {"Title":"Hedging interest rate risk with the dynamic Nelson/Siegel model","Author":"Robert Ferstl and Josef Hayden","Session":"foc-finance_risk-1-3","Keywords":"finance, econometrics","Abstract":"An accurate forecast of the yield curve is an important input for the pricing\nand hedging interest-rate-sensitive securities. Diebold and Li (2006) formulate\nthe widely-used Nelson and Siegel (1987) model in a dynamic context and provide\na factor interpretation of the estimated parameters as level, slope and\ncurvature. This model can be used to forecast the future yield curve. We\nimplement the dynamic Nelson/Siegel model in R by extending the CRAN package\ntermstrc, which allows us to efficiently use market data from coupon bonds (see\nFerstl and Hayden, 2008). Further, we test the performance of bond portfolio and\ninterest rate risk management problems, where the dynamic Nelson/Siegel yield\ncurve is used for pricing and hedging the underlying securities. We compare our\nresults to common strategies in practice, e.g. duration hedging, duration vector\nmodels."} {"Title":"Sweave or how to make 286 customized reports in two clicks","Author":"Delphine Fontaine","Session":"foc-report-1-3","Keywords":"reporting","Abstract":"The R environment is normally used to perform statistical analyses and then\npeople making statistics usually make a report. To make this report, we can\neither copy and paste the R output in a text editor or use Sweave. Sweave is an\nR tool created by Friedrich Leisch which allows the insertion of the R code in\nLaTeX code in such a way that statistical analysis and statistical reports are\ncompiled at the same time. The purpose of Sweave is to create dynamic reports\nwhich are automatically modified when data or analysis change (Leisch, 2002).\nSweave allows one to quickly update a report if data changed. It also can be\nused to make several reports with different data but all having the same\nstructure (same sections, same text, same graphs, same tables...). With some\nautomation and two clicks, it is possible to use Sweave to make a vast number of\nreports, each with a different data subset. In clinical development, doctors\nparticipating in a study usually receive a report with the general results of\nthe study. Sweave can be used to make this report. But one can go beyond this.\nSweave allows the customisation of a report for each doctor using the data\ncollected in his site and to compare the resulting customized statistical\nanalyses with the overall study results."} {"Title":"The Past, Present, and Future of the R Project - Social Organization of the R Project","Author":"John Fox","Session":"invited","Keywords":"invited","Abstract":"Much of the work in the social sciences on the development of open-source\nsoftware focuses on the issue of motivation: Why do individuals or organizations\nparticipate in open-source projects? Is their participation rational? Voluntary\nactivity is, however, a natural part of social life, and I find it more\ninteresting to ask how an open-source project, such as the R Project for\nStatistical Computing, is organized, and how its social organization contributes\nto the success – or lack of success – of the project. After all, anarchic\nvoluntary cooperation does not, on the face of it, seem a promising approach to\ndeveloping a complex product such as statistical software. 
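(A hedged sketch of the per-site reporting loop described in the Sweave abstract above. The template report.Rnw, the data frame trial_data and its site variable are hypothetical; the chunks inside report.Rnw are assumed to read the global object site_data.)

sites <- unique(trial_data$site)
for (s in sites) {
  site_data <- subset(trial_data, site == s)   # data subset used by the .Rnw chunks
  out <- sprintf("report_%s.tex", s)
  Sweave("report.Rnw", output = out)           # weave the customized report
  tools::texi2dvi(out, pdf = TRUE)             # compile it to PDF
}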
My investigation,\nbased partly on interviews with members of the R Core team and with other\nindividuals closely associated with the R Project, suggests that the success of\nR is due to a number of factors. Some of these factors, such as the\nimplementation of a division of labour (albeit an informal one), are common to\nmost organizations; other factors, such as the clever social use of technology\n(e.g., version control and package systems), are specifically adapted to the\ndevelopment of software; and still others, such as the adoption of the S\nlanguage, which was already in wide use prior to the introduction of R, are\nparticular to R."} {"Title":"XML-based Reporting Application","Author":"Romain Francois and David Ilsley","Session":"foc-report-1-2","Keywords":"reporting","Abstract":"Several software projects recently developed at Mango Solutions require the\nproduction of fully styled reports in several output formats, mainly HTML, PDF\nand RTF. Many existing systems were considered by Mango Consultants but were\ndiscounted due to restrictive licences, inflexible input data formats, and\noverly complex or overly simplistic feature sets. The Mango Report Generator is a software\ncomponent written in Java that has been developed to respond to the demands of\nproducing flexible reports from multiple data sources. The system is based on\nXML descriptions of the content of the report — currently covering graphics,\ntables and styled text report items — and the XML description of the actual\nlayout of the report. The report layout is associated with the report items, and\nstyled using Cascading Style Sheets (CSS)[2] to produce fully styled reports\nsuitable for browsing using XHTML, printing or further editing in mainstream\nword processors using XSL-FO [1] and Apache FOP. The input and output streams of\nthe Report Generator are XML-based which makes it straightforward to create\nreport items and layouts via any third party application. A proof-of-concept R\npackage has been created as part of the project to demonstrate the ease of\nintegration of content from other systems. This presentation will highlight the\nchallenges that occurred during the development of the component and\ndemonstrate the typical workflow of the system by creating reports that\namalgamate content from R as well as from a commercial implementation of the S\nlanguage."} {"Title":"R4X: Simple XML Manipulation for R","Author":"Romain Francois","Session":"foc-conn-1-1","Keywords":"connectivity","Abstract":"Data transfer is an important component in many multi-technology applications.\nThe eXtensible Markup Language (XML) is a medium of choice for exchanging\nvarious sources of data. Recent developments at Mango Solutions have\njustified the production of an R package to provide convenient manipulation of\nXML structures. Based on the powerful parsing facilities of the XML package[4]\nand templating abilities of the brew[3] package, R4X gives R users a simple\nmechanism to create, read and manipulate XML structures. The functionality of\nthe package is conceptually based on the E4X[2] standard which promotes XML as a\ncore data type of the JavaScript language. In order to create a seamless\nintegration of XML into R, much of the functionality of E4X has been ported to\nR4X. R4X provides a convenient environment for the creation of XML structures,\nthrough the single generic xml function.
R4X also features simple manipulation\nof XML structures via the usual R slicing operators ([ and [[) combined with a\nsyntax close to XPATH in order to extract arbitrarily nested content from an XML\nstructure. This presentation will describe key features of the R4X package and\ndiscuss anticipated extensions of the functionality. Examples will be used to\ndemonstrate the use of R4X to build a simple Rich Site Summary (RSS) reader,\ngenerate a tag cloud of the description of current CRAN packages in xHTML,\ncreate Scalable Vector Graphics (SVG) and a custom RUnit[1] protocol report\nbased on the Mozilla XML User Interface Language (XUL)."} {"Title":"SIMSURVEY - a tool for (geo-) statistical analyses with R on the web","Author":"Mario Gellrich and Rudolf Gubler and Andreas Schönborn and Andreas Papritz","Session":"foc-spatial-1-3","Keywords":"user interfaces, spatial, connectivity","Abstract":"Geostatistical methods are used in many branches of environmental research and\napplications for the statistical analyses of spatially referenced measurements\nand for the interpolation and mapping of data measured at a limited number of\nlocations in a study domain. Courses in geostatistics are therefore part of the\ncurriculum in environmental sciences and engineering at many universities.\nHowever, experience shows that geostatistics is a rather difficult subject to\nteach. Apart from the mostly limited prior knowledge in statistics, a lack of\nflexible, but at the same time easy-to-use software adds to the problems many\nstudents have with this topic. Commercially available statistics and GIS\nsoftware either offers no or only limited geostatistical functionality, or it is\nexpensive (and in addition often quite demanding to use). R includes several\npowerful packages for geostatistical analyses, but as a script-based programming\nlanguage, R is difficult to use in introductory courses. To mend this deficiency\nwe developed SIMSURVEY, a graphical user interface (GUI) for geostatistical\nanalyses with R. Unlike other R GUIs no software (apart from a browser) is\nrequired as SIMSURVEY runs on a web server\n(http://bolmen.ethz.ch/~simsurvey/simsurvey/simProto.html). Currently, SIMSURVEY\noffers the following functionality: • Data transformation and management, •\nexploratory analysis of spatial data, • linear regression analysis of spatial\ndata and analysis of variance, • estimation and modelling of variograms, and\n• universal kriging. Various kinds of graphical tools are available for all\nthese tasks. All these analyses can be run by using the GUI. For experienced R\nusers, SIMSURVEY contains in addition a command window. For educational\npurposes, SIMSURVEY allows one to sample and to analyse simulated soil pollution\ndata. SIMSURVEY was implemented by an interplay of Adobe Flash, PHP and R. The\nGUI by which a user interacts with R is a Flash animation in a browser window.\nThe dynamically changing structure of the GUI is largely controlled by XML code.\nThe actions of a user are passed to R by PHP. Based on template R code PHP\ndynamically generates complete R scripts that are processed by R processes\nrunning permanently on the web server. To improve the performance PHP and R\ncommunicate with each other by a socket connection. The output that R generates\n(text and graphic files) are then routed back to the Flash animation by PHP and\nare then presented in the browser to the user. Thanks to its modular\narchitecture, SIMSURVEY can be easily modified and extended. 
To this aim the\nfollowing steps are required: • Define the new items of the GUI by adding to\nthe XML code. To facilitate this task predefined elements for text input fields,\nradio buttons, check boxes etc. can be used. • Write the template R code for\nthe new tasks. • Extend the PHP-R interface to pass the required information\nfrom the GUI to R (by dynamically generating R scripts) and route the R output\nback to the GUI. This architecture provides a novel and flexible framework for\ngeneral computations with R on a web server. In our presentation we shall\ndemonstrate the use of SIMSURVEY, and we shall show by an example how SIMSURVEY\ncan be extended for new tasks."} {"Title":"Bayesian generalized linear models and an appropriate default prior","Author":"Andrew Gelman","Session":"invited","Keywords":"invited","Abstract":"Many statistical methods of all sorts have tuning parameters. How can default\nsettings for such parameters be chosen in a general-purpose computing\nenvironment such as R? We consider the example of prior distributions for\nlogistic regression. Logistic regression is an important statistical method in\nits own right and also is commonly used as a tool for classification and\nimputation. The standard implementation of logistic regression in R, glm(), uses\nmaximum likelihood and breaks down under separation, a problem that occurs often\nenough in practice to be a serious concern. Bayesian methods can be used to\nregularize (stabilize) the estimates, but then the user must choose a prior\ndistribution. We illustrate a new idea, the “weakly informative prior”, and\nimplement it in bayesglm(), a slight alteration of the existing R function. We\nalso perform a cross-validation to compare the performance of different prior\ndistributions using a corpus of datasets."} {"Title":"ChainLadder: Reserving insurance claims with R","Author":"Markus Gesmann","Session":"foc-business-2-2","Keywords":"actuarial","Abstract":"One of the biggest liability items on an insurance company’s balance sheet is\nthe reserves for future claims payments. This reserve is an estimate of the\namount an insurance company expects to pay for reported and unreported claims.\nBased on historical incurred claims and payment patterns, methods have been\ndeveloped to forecast future payments. The ChainLadder package provides the\nMack-chain-ladder and Munichchain-ladder methods to estimate reserves. The\nimplementation in R allows both methods to be seen in a linear model context and\ntherefore makes heavy use of the lm function in R. The ChainLadder package grew\nout of presentations the author gave at the Stochastic reserving and modelling\nseminar, 29 - 30 November 2007 at the Institute of Actuaries. "} {"Title":"Time Series Database Interface","Author":"Paul Gilbert","Session":"foc-ts-1-3","Keywords":"connectivity, time series","Abstract":"This presentation describes a package that abstracts an interface to time series\ndatabases, and a related group of packages that implement interfaces to SQL\ndatabases and to Fame through the PADI protocol. The TSdbi package, which\nimplements the abstraction, imports the DBI namespace and DBI functions that\nsupport many SQL databases. For these cases there is limited need for code\nspecialized to the specific database. This has been implemented in packages\nTSMySQL and TSSQLite, which require packages RMySQL and RSQLite respectively. It\nshould also be possible to use the abstraction with RODBC (in progress but\nuntested at the moment). 
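(The abstract above on Bayesian generalized linear models mentions bayesglm() from the arm package as a regularized drop-in for glm(). A small hedged sketch with made-up, completely separated data:)

library(arm)
x <- c(-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)   # toy predictor
y <- c( 0,    0,  0,    0,   1, 1,   1, 1)   # outcome perfectly separated by x
glm(y ~ x, family = binomial)        # maximum likelihood: estimates diverge, with warnings
bayesglm(y ~ x, family = binomial)   # weakly informative prior keeps estimates finite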
TSdbi can also be used to interface to other time\nseries databases, but this will typically require more database-specific code\nbelow the abstraction. A working interface to Fame is implemented in the package\nTSpadi. It should also be possible to implement a more direct connection using\nthe fame package. Time series databases are typically simple in the sense that\nseries are named with a unique identifier, and queries are limited to lookups\nusing this key. From this perspective an SQL database is hardly needed. Apart\nfrom the abstraction, which is useful to make other code independent of the\ndatabase implementation, the main advantages are the use of the database’s\nclient/server protocol, the ability to handle endian issues, and security\nfeatures. However, when an SQL database is used, additional features can be\nadded: it is possible to have vintage and panel datasets, with the same\nidentifier used for different release dates and/or different panel members.\nThe package also (potentially) allows a choice of the R representation to use for\nthe time index of the series data points. The default is ts where applicable, and zoo\notherwise. (At the moment, only the default is working.) The structure of the\nback-end SQL databases and some utilities for implementing them will also be\ndiscussed."} {"Title":"The BLCOP package: an R implementation of the Black-Litterman and copula opinion pooling models","Author":"Francisco Gochez","Session":"foc-finance_risk-1-1","Keywords":"finance","Abstract":"In the early 1990s Fischer Black and Robert Litterman devised a framework for\nsmoothly blending analyst views on the mean of the distribution of financial\nasset returns with a market “official equilibrium” distribution. The model\nhas generated substantial interest since then, though it is limited by its\nassumptions of normality in market and analyst view distributions, as well as by\nvagueness in the meaning of certain parameters and their determination. In late\n2005 Attilio Meucci of Lehman Brothers proposed the “copula opinion pooling”\n(COP) method as a generalization that overcomes all of these limitations, though\nat the cost of greater complexity. The BLCOP package is an implementation of\nboth of these models. The emphasis of the package is on ease of use,\nflexibility, and allowing the user to easily analyze the impact of his or her\nviews on the market posterior distribution."} {"Title":"FAiR: A Package for Factor Analysis in R","Author":"Ben Goodrich","Session":"foc-multi-1-1","Keywords":"multivariate","Abstract":"The primary objective of FAiR is to provide functionality that is not available\nin closed-source factor analysis software, but FAiR also strives to integrate\nthe various tools for factor analysis that are already available in R packages\nand to provide a reasonably user-friendly GUI (based on gWidgets) so that people\nwho have more experience with factor analysis than with R can readily estimate\ntheir models. The first version of FAiR was released in February 2008, and the\nsecond version will have been released in April 2008. FAiR is unique in that it\nutilizes a genetic algorithm (rgenoud) for constrained optimization, which\npermits new approaches to exploratory and confirmatory factor analysis (EFA and\nCFA) and also straightforwardly leads to a new estimator of the common factor\nanalysis model called semi-exploratory factor analysis (SEFA).
The common factor\nanalysis model in the population can be written as Σ = ΛΦΛ′ + Ψ, where Σ is\na covariance matrix among n observable variables, Φ is a correlation matrix\namong r common factors, Λ is an n × r matrix of factor loadings, and Ψ is a\n(typically diagonal) covariance matrix among n unique variances. However, Λ and\nΦ are not separately identified unless additional restrictions are imposed.\nFor example, CFA requires the analyst to specify which cells of Λ are zero a\npriori. SEFA differs by requiring the analyst to specify the number of zeros in\neach column of Λ but does not require the analyst to specify where the zeros\noccur. SEFA thus uses a genetic algorithm to maximize the fit to the data over\nthe locations of these exact zeros in Λ and the values of the corresponding\nnon-zero parameters. FAiR also differs from all other factor analysis software\nin that the analyst can impose a wide variety of (non-linear) inequality\nrestrictions on (functions of) parameters in SEFA and CFA models and also during\nthe transformation stage of EFA models. For example, Louis Thurstone — who was\nthe father of exploratory factor analysis with multiple factors — proposed a\ncriterion for factor transformation in 1935 that had never been implemented by\nany factor analysis software, largely due to its perceived computational\ndifficulty. Optimizing with respect to Thurstone’s criterion is implemented\nin FAiR, which is fairly easy and very reliable due to the power of the underlying\ngenetic algorithm. FAiR utilizes S4 classes, which not only facilitates\npost-estimation analysis but also provides a framework that allows rapid\ndevelopment and easy integration of other R packages with FAiR. My goals for\nuseR are to attract the interest of additional developers and to expose\nattendees (and their colleagues at home) to the new ways of thinking that are\nembodied in FAiR but are unavailable in traditional factor analysis software."} {"Title":"Statistical Modeling of Loss Distributions Using actuar","Author":"Vincent Goulet","Session":"kal-ts-1-1","Keywords":"actuarial","Abstract":"actuar is a package providing additional Actuarial Science functionality to the\nR statistical system. The current version of the package contains functions for\nuse in the fields of loss distributions modeling, risk theory (including ruin\ntheory), simulation of compound hierarchical models and credibility theory. This\ntalk will present the features of the package most closely related to usual\nstatistical work, namely the modeling of loss distributions. Among other things,\nwe introduce a number of probability law functions, handling of grouped data,\nminimum distance estimation methods and functions to compute empirical moments.\nIf time allows, we will also present the function to simulate data from compound\nhierarchical models."} {"Title":"Distributed Computing using the multiR Package","Author":"Daniel Grose","Session":"foc-highperf-2-1","Keywords":"high performance computing-parallel","Abstract":"There exist a large number of computationally intensive statistical procedures\nthat can be implemented in a manner that is suitable for evaluation using a\nparallel computing environment. Within this number there exists a class of\nprocedures, often described as “coarse-grained parallel” or\n“embarrassingly parallel”.
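(A hedged sketch in the spirit of the actuar loss-modelling features listed above. rpareto, ppareto and emm are actuar functions for the Pareto family and empirical moments; the simulated severities and parameter values are arbitrary.)

library(actuar)
set.seed(42)
losses <- rpareto(1000, shape = 3, scale = 2000)   # simulated claim severities
emm(losses, order = 1:2)                           # empirical raw moments
mean(losses > 5000)                                # empirical exceedance probability
ppareto(5000, shape = 3, scale = 2000, lower.tail = FALSE)  # theoretical counterpart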
The defining characteristic of these procedures\nis that they can be reduced to a number of sub-procedures that are independent\nof each other and require little or no inter-procedure communication i.e. they\ncan be executed concurrently. Initially, it might be thought that this class is\ntoo small to warrant significant attention, however this is far from being the\ncase. For example, methodologies such as bootstrapping, cross-validation, many\ntypes of Markov Processes (including MCMC), and certain optimisation and search\nalgorithms are of this type. Importantly, the increase in availability of High\nThroughput Computing (HTC) environments, consisting of large numbers of\ninterconnected computers, has made employing such procedures particularly\nattractive, leading to a significant increase in the amount of research being\nundertaken using HTC, notably in the areas of biochemistry, genetics,\npharmaceuticals, economics, financial modelling and the social sciences. A High\nThroughput Computing environment provides a means for processing a large number\nof independent (non-interacting) tasks simultaneously. In the simplest case, the\nHTC environment may employ only a single multi-processor system. At the other\nextreme, the HTC environment might comprise a large number of systems with\ndifferent operating systems and hardware located across a number of different\ninstitutions and administrative domains. When this is the case the environment\nmay be said to provide High Throughput Distributed Computing (HTDC). HTC on a\nsingle multiprocessor system is relatively straightforward. Typically the user\nhas an account on the system (can be identified to the system by a user name\nand password) and can submit the tasks for processing by using the software\ntools available on that system. Higher level means of submitting tasks exist,\nsuch as the snow package for R [1]. This package allows functions defined in R\nor installed R packages to be invoked multiple times with varying argument\nsignatures and executed on a number of processors simultaneously. In [1] it is\nnoted that the functionality offered by snow could be extended to use the GRID,\nwhich by its nature provides a HTDC environment. Some of these extensions have\nbeen addressed within the GridR system [3], which is similar in principle to\nsnow but provides some of the technical requirements necessary for using GRID\nbased resources which it achieves by employing the COG toolkit [2]. However,\nthere are a number of important considerations which arise when using\ngeneralised HTDC (GRID based or otherwise) not all of which have been\nencapsulated in either snow or GridR. These considerations are 1. A client\nsession may terminate before all tasks have been processed. For instance, the\nresults of the completed tasks may need to be collected in a future client\nsession, possibly from a different system. 2. The systems employed to process\nthe tasks may be multi-fold and reside in different administrative domains,\nthus it is not practical for a client to have to obtain and manage accounts on\nall (potentially hundreds or even thousands) of these systems. Consequently, a\nsingle means of identifying the client is required. 3. The client system must\nemploy a secure channel for communication. 4. Host systems are typically shared\nby many clients and have scheduling systems to allocate resources, thus the\nexecution time and the order in which tasks are processed may vary. 
5.\nIndividual tasks may fail to complete (this is quite common on certain systems,\nsuch as Condor pools). 6. The client interface should be independent of the\nnature of the distributed systems used for undertaking the computation. All of\nthese considerations have been well studied in many varied contexts and the\ndesign pattern most associated with realising the above design criteria is a\nthree-tier client server employing the public key infrastructure for\nauthentication and security. The technologies required for implementing such an\narchitecture to host a HTDC service for R are readily available and have been\nused to develop servers which expose an interface for use within a client R\nsession. The multiR package contains an implementation of a client interface for\nuse in R which is similar in many respects to that of snow and GridR in that it\nextends the apply family of functions (available in the base package) for\nsubmitting multiple function invocations in a distributed environment. multiR\nalso provides the functionality required to generate certificate based proxy\ncredentials, manage active jobs and harvest results when they become available.\nImportantly, the interface provided by multiR is independent of the many\ndifferent types of hardware and software systems employed within a HTDC\nenvironment and requires no additional software components (Globus, CoG and so\non) to be installed before it can be used. The full presentation of this work\ndemonstrates how multiR is installed and used using several example applications\nwhich include bootstrapping, calculating multivariate expectation values and\nfunction optimisation. For each of the examples the benefits of using multiR\nare examined, with particular reference to the reduced time required to compute\nthem."} {"Title":"FlexMix: Flexible fitting of finite mixtures with the EM algorithm","Author":"Bettina Gruen and Friedrich Leisch","Session":"kal-model-1-5","Keywords":"modeling-mixed","Abstract":"Finite mixtures are a flexible model class for modelling unobserved\nheterogeneity or approximating general distribution functions. The R package\nflexmix provides infrastructure for fitting finite mixture models with the EM\nalgorithm or one of its variants. The main focus is on finite mixtures of\nregression models and it allows for multiple independent responses and repeated\nmeasurements. Concomitant variable models as well as varying and constant\nparameters for the component specific generalized linear regression models can\nbe fitted. The main design principles of the package are easy extensibility and\nfast prototyping for new types of mixture models. It uses S4 classes and methods\nand exploits features of R such as lexical scoping. The implementation of the\npackage is described and examples given to illustrate its application."} {"Title":"SimpleR: Taking on the 'Evil Empire' by Developing Applications for Non-statistical Users","Author":"Bert Gunter and Nicholas Lewin-Koh","Session":"kal-gui_teach-1-2","Keywords":"user interfaces","Abstract":"The “Standard Model” for R software development – the R package –\nassumes: 1. Many independent users; 2. Some degree of statistical understanding\nby users; 3. Access and functionality within the overall R environment; 4.\nPersistence over time; 5. (Usually) A command line interface. This model serves\nserious data analysts well, and R provides many tools to facilitate such\ndevelopment. 
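(A hedged sketch of fitting a two-component mixture of linear regressions with flexmix, as described in the FlexMix abstract above; the data are simulated purely for illustration.)

library(flexmix)
set.seed(1)
x <- runif(200)
y <- c(2 + 3 * x[1:100], 8 - 4 * x[101:200]) + rnorm(200, sd = 0.3)  # two latent regimes
d <- data.frame(x = x, y = y)
m <- flexmix(y ~ x, data = d, k = 2)   # EM fit with two components
summary(m)
parameters(m)                          # component-specific regression coefficients
table(clusters(m))                     # hard component assignments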
We argue here that R can also serve another and potentially much\nlarger community of engineers and scientists for whom Excel® (or similar\nsoftware) is now the primary tool for data analysis and statistical graphics.\nThese folks need: 1. Narrow, often single purpose “template” analyses; 2.\nRapid development/rapid discard – needs disappear as soon as a project is\ncompleted or change radically when technology changes; 3. Software customized\nfor a few – even one – users; 4. A simple GUI requiring minimal\ndocumentation and learning; 5. Graphs as the primary output. Excel is their\ndefault because (a) it’s there and they know it; (b) they don’t know about\nor can’t implement better methods. R can change this state of affairs. R is\nopen source, has superb graphics, and is easily embedded into web-served\napplications using R2HTML, Rpad, Rserve, Rzope, etc. Alternatively, it can be\nmodified for single-use applications through a GUI such as Rcmdr, gWidgets, or\nby simply modifying the R menu structure. The key is to present the user with a\nsimple interface and readily interpretable output, even if the underlying\nanalysis is complex. We discuss our strategy for developing such applications,\nwhich relies on the global workspace as the software environment, thus avoiding\nthe unnecessary (for us) overhead of packages. We give an example in actual use\nat Genentech and discuss the pros and cons of this approach."} {"Title":"Estimation in classic and adaptive group sequential trials","Author":"Niklas Hack and Werner Brannath","Session":"foc-biostat_study-1-1","Keywords":"biostatistics-inference","Abstract":"We present an R package for estimation in classic and adaptive group\nsequential trials. We will give an overview of classic and adaptive\ngroup sequential designs and will present two methods for the\ncalculation of p-values and confidence intervals. The first method is\nbased on the repeated approach of Jennison and Turnbull (1984), which was\nextended by Mehta, Bauer, Posch and Brannath (2006) to the adaptive\nsetting. The second method is based on the stage-wise ordering of\nTsiatis, Rosner and Mehta (1989), which was extended by Brannath, Mehta\nand Posch (2008) to the adaptive setting. The key idea of both methods,\nbased on the method of Müller and Schäfer (2001), is to preserve the\noverall type I error rate after a possible design adaptation, by\npreserving the null conditional rejection probability of the remainder\nof the trial at each time of an adaptive change. The implementation and\nthe application of these methods in R (available in package AGSDest)\nwill be illustrated."} {"Title":"Introducing BioPhysConnectoR","Author":"Kay Hamacher and Franziska Hoffgaard and Philipp Weil","Session":"foc-networks-1-3","Keywords":"bioinformatics-workflow","Abstract":"The biggest challenge for systems biology and bioinformatics in the\npost-genome era is the integration of countless experimental data such\nas sequence information, gene expression data, physio-chemical values,\nphylogenetic relationships, or physiological data. With R, researchers\ncommand an efficient framework for statistical modelling, and\naccordingly R became – with the advent of Bioconductor at the latest –\nthe major platform for analyzing biostatistical data. Up to now much\neffort has been invested in the statistical modelling and subsequent\nimplementation of information-driven packages and protocols. This\nallowed tremendous progress in understanding information contained\nwithin e.g.
biological sequence data. Experiments are nowadays guided to\na large extent by the knowledge gained from such protocols. The\ninformation contained within biological sequences reflects the whole\nevolutionary history of the organism under investigation (including\nexternal selective pressure such as drugs and resistance\ndevelopment). The selection step of every evolutionary process is,\nhowever, an event in the physical realm as selection tests the\nphysiochemical properties of molecules involved in relevant\nprocesses. Therefore, to construct molecular interaction networks [1],\nthere is a pressing need to connect information (the evolutionary\nmemory) with the physical realm, its forces, the molecular dynamics and\nmechanics (the selective “horizon”). We achieve this with our ongoing\nefforts [2] in integrating standard sequence/statistical-model-driven\nmethodologies with new reduced molecular models derived from biophysical\ninteraction theories [3,4], eventually bridging the gap between\nbioinformatics and molecular dynamics simulations/molecular\nbiophysics. We developed an R-package (BioPhysConnectoR) to this\nend. With this package we connect the information space and the physical\nspace – thus allowing for functional annotation of sequence data and\nsystematic in silico experiments. Additional useful functions for\ndealing with sequences and matrices are provided within the package. We\nintegrated C-code with R-routines and found that regarding the run-time\nefficiency our package compares well with our original code in\nC/FORTRAN. Due to the abstraction offered by R and leveraging the power\nof the packages Rmpi and papply, we were able to implement the package\nin a massively parallelized fashion. As it is possible in R to\ninteractively examine the results of the computations, this allows for\nboth large-scale screening and high-throughput scans on the one hand and\nonline, interactive method development and hypothesis testing on the\nother. We discuss future research directions."} {"Title":"JavaStat: a Java-based R Front-end","Author":"E. James Harner and Dajie Luo and Jun Tan","Session":"foc-gui_frontend-1-3","Keywords":"user interfaces-java","Abstract":"Architectures are described which allow a Java-based front-end to run R\ncode on a server. The front-end is called JavaStat\n(http://javastat.stat.wvu.edu), a Java application. JavaStat is a\nhighly interactive program for data analysis and dynamic visualization\nwith data management capabilities. The objective is to bring the\nhigh-level functions of R to JavaStat without excessive duplicative\ndevelopment work. Results returned from R are wrapped and then displayed\nusing dynamic graphics in JavaStat. The principal idea is to use RMI\n(Remote Method Invocation) to communicate with a Java server program\n(JRIServer), which in turn communicates with R using JRI (Java/R\nInterface). Two versions have been implemented. The first architecture\nmaintains a connection between the client and server in order to return\nthe results from R. This is suitable for small to moderate data sets in\nwhich statistical models are run. The second architecture queues the\nrequests and uses polling to fetch the results.
It is suitable for large\ndata sets and complex models, e.g., those encountered in genomic\nstudies."} {"Title":"Models for Replicated Discrimination Tests: A Synthesis of Latent Class Mixture Models and Generalized Linear Mixed Models","Author":"Rune Haubo Bojesen Christensen","Session":"foc-mod_mixed-1-3","Keywords":"modeling-mixed","Abstract":"Discrimination tests are often used to evaluate if individuals can\ndistinguish between two items. The tests are much used in sensory and\nconsumer science to test food and beverage products, and in\npsychophysics to investigate the cognitive strategies of the\nmind. Signal detection theory, experimental psychology and medical\ndecision making are other areas, where the tests are applied. The basic\nidea is to use humans as instruments to measure attributes or differences\nbetween products (eg. Lawless and Heymann, 1998). In sensory and\nconsumer science a panel of judges or a sample of consumers are\nemployed, but humans are difficult to calibrate and much variation remains\nbetween individuals. Often respondents perform the test several times\nand because subjects tend to have different discriminal abilities, this\nleads to overdispersion in the data. Traditionally this is handled by\nmarginal models where the amount of overdispersion is estimated in order\nto adjust standard errors. Commonly used discrimination tests can be\nidentified as generalized linear models (GLMs) with the so called\npsychometric functions (Frijters, 1979) as inverse link functions\n(Brockhoff and Christensen, 2008). This makes generalized linear mixed\nmodels (GLMMs) available to model the variation between subjects. The\ninverse psychometric functions maps the probability of a correct answer\nin the discrimination test to a measure of discriminal ability, which\nbecomes an intercept parameter in a GLM or GLMM. Since the discriminal\nability is a non-negative quantity, the random effect distribution in a\nGLMM consists of a point mass at zero and a continuous positive\npart. The resulting model can be seen as a synthesis of a latent class\nmixture model and a generalized linear mixed effect model. We have\nimplemented functions that will fit the proposed model in R. Interest is\noften in characterizing the variation between subjects and in obtaining\nestimates of individual discriminal abilities. Both sets of quantities\nare available from the proposed model as a variance component and\nposterior modes respectively. Also available is an estimate of the\nproportion of discriminators in the population as well as an estimate of\nthe probability that each individual is a discriminator. This\npresentation will introduce models for replicated discrimination tests,\nshow how to fit them in R and consider important properties of the\nmodels. We end with an example from sensory science showing how to\ninterpret the results."} {"Title":"SpRay - an R-based visual-analytics platform for large and high-dimensional datasets","Author":"Julian Heinrich and Janko Dietzsch and Dirk Bartz and Kay Nieselt","Session":"foc-bioinf-1-1","Keywords":"bioinformatics-workflow","Abstract":"Recently developed high-throughput methods produce increasingly large\nand complex datasets. For instance, microarray-based gene expression\nstudies generate data for several thousands of genes under numerous\ndifferent conditions, yielding large, heterogeneous, potentially\nincomplete or conflicting datasets. 
From both technical and analytical\npoints of view, extracting useful and relevant information - known as\nthe knowledge discovery process - from these large data sets is a\nchallenge. While the technical capacity to collect and store such data\ngrows rapidly, the ability to analyze it does not advance at the same\npace. The extraction of relevant information from large and\nhigh-dimensional data is very difficult and requires the support of\nautomated extraction algorithms based on statistical\ncomputing. Unfortunately, the unsupervised application of these\nstatistical measures does not guarantee the successful extraction of\nrelevant information, but requires critical consideration itself. Hence,\nthe use of interactive visualization methods for the simultaneous\nevaluation of the applied statistical models is of central relevance and\ntherefore plays a key role in the emerging field of visual analytics. The\naim of the work is to combine statistical methods with modern\nvisualization techniques in an extendable, hardware-accelerated\nvisual-analytics framework. We are currently developing SpRay (viSual\nexPloRation and AnalYsis of high-dimensional data), which provides for\nthe explorative analysis of large, high-dimensional datasets in\naccordance with the visual-analytics paradigm. Similar to GGobi\n[SLBC03], the statistical backend is provided through R, as a\nplugin. The performance-oriented design of SpRay, which uses\nhardware-accelerated graphics (OpenGL), C++ and Qt, also allows very\nlarge datasets to be explored with greatly reduced response times. The\nuse of modern GPUs (OpenGL) further accelerates the application of\ndifferent transparency modulations and color maps to the currently\nimplemented plugins, such as refined parallel coordinates and\nscatterplots. All plugins (currently: parallel coordinates,\nscatterplots, TableLens, TableView, Histogram, R-Console, Brushing) are\nlinked by means of a common data model which is particularly useful to\ntightly integrate R along with all its extensions via packages. Hence,\nadequate statistical values may be defined and interactively visualized\ntogether with the raw data, providing an iterative, interactive and\nintegrated approach to the analytical reasoning process as proposed by\nthe visual-analytics paradigm. The benefit of the currently implemented\nfeatures has successfully been demonstrated with different gene-expression\ndatasets [DHNB06, DHNB08]."} {"Title":"NADA for R: A contributed package for censored environmental data","Author":"Dennis Helsel and Lopaka Lee","Session":"kal-environ-1-4","Keywords":"environmetrics-misc, chemometrics","Abstract":"Trace contaminants in air, water, biota, soils, and rocks often contain\ndata recorded only as a \"nondetect\", or less than a detection\nthreshold. These left-censored values cause difficulties for\nenvironmental scientists, as no single number can be validly assigned to\nthem. The typical solution of substituting one-half the detection limit\nand proceeding with regression, t-tests, etc., has repeatedly been shown\nto be inaccurate. Instead, these data can be effectively interpreted\nusing survival analysis techniques more traditionally applied to\nright-censored data. Methods for calculating descriptive statistics,\ntesting hypotheses, and performing regression, both parametric and\nnonparametric, are available using the contributed package NADA.
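A minimal sketch of the left-censoring approach described in the NADA abstract above, using the survival package rather than NADA itself; the data frame, its column names and the lognormal assumption are invented for illustration.

## Not the NADA API: nondetects treated as left-censored observations and the
## mean estimated by censored maximum likelihood with the 'survival' package.
library(survival)
dat <- data.frame(conc     = c(0.5, 0.5, 1.2, 3.4, 0.8, 2.1),  # 0.5 = detection limit
                  detected = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
## status 0 marks a left-censored ("<DL") value, status 1 a detected value
fit <- survreg(Surv(conc, detected, type = "left") ~ 1,
               data = dat, dist = "lognormal")
exp(coef(fit) + fit$scale^2 / 2)   # ML estimate of the mean on the original scale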
Methods\ninclude censored maximum likelihood (ML), Kaplan-Meier, and the Akritas\nversion of Kendall’s robust line that is applicable (unlike ML) to\ndoubly-censored data. Methods such as censored boxplots and residuals\nplots that can graph data containing nondetects are also included. The\nNADA package complements the first author’s textbook, Nondetects And Data\nAnalysis: Statistics for censored environmental data (Wiley, 2005)."} {"Title":"High Performance Computing with NetWorkSpaces for R","Author":"David Henderson and Stephen Weston and Nicholas Carriero and Robert Bjornson","Session":"kal-highperf_con-1-3","Keywords":"high performance computing-parallel","Abstract":"Increasingly, R users have access to multiprocessor machines or\nmultiple-core CPUs. However, base R does not natively support parallel\nprocessing; this can force R users to wait while computationally\nintensive work is done on a single processor or core and other\nprocessors or cores lie idle. NetWorkSpaces for R (NWS-R) was developed\nat Scientific Computing Associates, the predecessor to REvolution\nComputing. It is a Python-based coordination system that is portable\nacross virtually all popular computing platforms. NWS-R includes a web\ninterface that displays the workspaces and their contents; this is\nhelpful when debugging or developing a program, or monitoring the\nprogress of an application. NWS-R is easy to learn, accessible from many\ndevelopment environments, and deployable on ad hoc collections of spare\nCPUs. The server and client for NWS-R are available at SourceForge\n(nws-r.sourceforge.net); the client is also available at CRAN\n(cran.r-project.org/web/packages/nws/). We will present NetWorkSpaces\nfor R and demonstrate the web interface."} {"Title":"Providing R functionality through the OGC Web Processing Service","Author":"Katharina Henneböhl and Edzer Pebesma","Session":"kal-highperf_con-1-5","Keywords":"spatial, connectivity","Abstract":"Providing R functionality through the OGC Web Processing Service\nKatharina Henneböhl and Edzer Pebesma Institute for Geoinformatics\n(IfGI) University of Münster The new Web Processing Service (WPS) 1.0\nstandard, recently released by the Open Geospatial Consortium (OGC,\nopengeospatial.org), specifies how GIS calculations can be made\navailable via the Internet, as web services. Operations can be as simple\nas adding two map layers or as complex as running a full hydrological\nmodel. This opens the possibility of providing the spatial analysis\nfunctionality available in R also through this interface. In this paper\nwe will show how a connection between the open-source Java-based WPS\nreference implementation from 52North (52north.org) and R can be\nestablished, and how R functionality can be exposed through an\nOGC-compliant web service. The question how to exchange spatial datasets\nbetween Java and R is of special interest."} {"Title":"Estimation of Theoretically Consistent Stochastic Frontier Functions in R","Author":"Arne Henningsen","Session":"foc-econom-2-1","Keywords":"econometrics","Abstract":"Conventional econometric analysis in the field of production economics\ngenerally assumes that all producers always manage to optimize their\nproduction process. Least squares-based regression techniques attribute\nall departures from the optimum exclusively to random statistical noise\n(Kumbhakar and Lovell, 2000). However, producers do not always succeed\nin optimizing their production. 
Therefore, the framework of “Stochastic\nFrontier Analysis” (SFA) has been developed that explicitly allows for\nfailures in producers’ efforts to optimize their production (Kumbhakar\nand Lovell, 2000). Stochastic frontier analysis is generally based on\nproduction, cost, distance, or profit functions. Microeconomic theory\nimplies several properties of these functions. Sauer et al. (2006) show\nthat consistency with microeconomic theory is important especially for\nestimating efficiency with frontier functions. Although theoretical\nconsistency is required for a reasonable interpretation of the results,\nthese conditions are not imposed in most empirical estimations of\nstochastic frontier models — probably because the proposed procedures to\nimpose these conditions are rather complex and laborious. Recently, a\nmuch simpler three-step procedure that is based on the two-step method\npublished by Koebel et al. (2003) has been proposed by Henningsen and\nHenning (2008). We show how theoretical consistent stochastic frontier\nfunctions can be estimated in R using this new procedure. This is\nillustrated by estimating a stochastic frontier production function with\nmonotonicity imposed at all data points."} {"Title":"Metabolome data mining of mass spectrometry measurements with random forests","Author":"Chihiro Higuchi and Shigeo Takenaka","Session":"foc-bioinf-2-2","Keywords":"bioinformatics-models","Abstract":"Metabolome analysis is expected to become a leading technology for rapid\ndiscovery of novel biomarkers, which are key components for successful\ndrug development. Nuclear magnetic resonance (NMR) and mass spectrometry\n(MS) are frequently employed as effective tools for metabolome\nmeasurements, and when it comes to analysis, the principal component\nanalysis and the partial least square methods have been the methods of\nchoice for mining of metabolome data. In the present study, we have\ninvestigated the application of the random forests machine learning\nmethod (Breiman 2001) for analysis of metabolme data. The data comprised\nFT-ICR-MS measurements of urine from rats, which have been administered\nthe antiarrhythmic agent amiodarone. Amiodarone treated rats will\nexhibit lipidosis and phenylacetylglycine (PAG) can be measured in the\nurine. Unsupervised classification applied to these data with the random\nforests approach clearly separated the groups, that is before and after\namiodarone treatment, and the separation was superior to that of the\nprincipal component analysis method. The supervised classification with\nthe random forests approach furthermore suggested several class\ndiscriminating MS peaks, which were selected by the importance value\ngenerated by the random forests machine learning method. These MS peaks\nwere assigned biomarker candidates and ranked by the loading values from\nthe principal component analysis. This analysis was carried out with the\nrandomForest, amap and Heatplus packages of R 2.4.1 on Linux (kernel\n2.6.21) operating system."} {"Title":"RcmdrPlugin.epack: A Time Series Plug-in for Rcmdr","Author":"Erin Hodgess and Carol Vobach","Session":"foc-gui_special-1-2","Keywords":"user interfaces-tcltk","Abstract":"In many statistics courses, R has excellent facilities but the learning\ncurve can be somewhat daunting for undergraduates. Fox(2005) has\novercome some of these hurdles with the Rcmdr package, which provides\nmenu-driven options for regression in particular. 
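The following sketch mirrors the random-forests workflow of the metabolome abstract above on simulated data; the objects peaks and group, and the choice of 40 samples and 100 MS peaks, are stand-ins rather than the authors' data.

## Unsupervised and supervised random forests for MS-peak data (toy example)
library(randomForest)
set.seed(1)
peaks <- matrix(rnorm(40 * 100), nrow = 40,
                dimnames = list(NULL, paste("mz", 1:100, sep = "")))
group <- factor(rep(c("before", "after"), each = 20))
## Unsupervised mode: proximities from an unlabelled forest, viewed by MDS
urf <- randomForest(peaks, proximity = TRUE)
mds <- cmdscale(1 - urf$proximity)          # 2-D map of sample similarity
## Supervised mode: variable importance ranks candidate biomarker peaks
srf <- randomForest(peaks, group, importance = TRUE)
imp <- importance(srf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ])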
Rcmdr also provides\noptions for most functions found in basic statistics classes and is\nsupplemented by Heiberger and Holland (2007), with their RcmdrPlugin.HH\npackage. R has nearly all of the typical functions used in undergraduate\ntime series courses. Even with these functions available from the\ncommand line, students still balk at command line use. This new package,\nRcmdrPlugin.epack, provides sets of submenus for student use in\nundergraduate time series courses. RcmdrPlugin.epack promotes ease of\nuse and permits students to devote their efforts to understanding concepts\nrather than programming. Students can develop models for both\nexplanatory and forecasting purposes."} {"Title":"Modelling and surveillance of infectious diseases - or why there is an R in SARS","Author":"Michael Höhle","Session":"kal-bio-1-4","Keywords":"biostatistics","Abstract":"This talk will focus on how R could assist in two aspects of the\ncontinuing efforts to better understand and control infectious diseases -\nbe it in human, plant or veterinary epidemiology. Firstly, stochastic\nmodelling is an important tool in order to better understand the\ndynamics of infectious diseases. A key epidemic model in this process is\nthe stochastic susceptible-exposed-infectious (SIR) model. The R package\nRLadyBug contains a set of functions for the simulation and parameter\nestimation in spatially heterogeneous SIR models. Simulation is based on\nthe Sellke construction or Ogata’s modified thinning algorithm, while\nestimation is based on maximum likelihood or - when the disease is only\npartially observed - Markov Chain Monte Carlo. Secondly, routine\nsurveillance of public health data often boils down to the on-line\ndetection of change-points in time series of counts. Surveillance has\nhence a close connection to problems from statistical process\ncontrol. The R package surveillance contains an implementation of some\nof the most common surveillance methods such as the Farrington procedure\nor cumulative sums. Data and results can be temporally and - in case of\nmultiple time series - spatio-temporally visualized. Both packages are\nintroduced and their use is illustrated by means of examples and R-code."} {"Title":"Variable Selection and Model Choice in Survival Models with Time-Varying Effects","Author":"Benjamin Hofner and Thomas Kneib and Torsten Hothorn","Session":"foc-biostat_surv-1-1","Keywords":"biostatistics-survival","Abstract":"Flexible hazard regression models based on penalised splines allow one to\nextend the classical Cox model via the inclusion of time-varying and\nnonparametric effects (Kneib & Fahrmeir 2007). Despite their immediate\nappeal in terms of flexibility, these models introduce additional\ndifficulties when performing model choice and variable selection. Boosting\n(cf. Bühlmann & Hothorn, 2008, and Tutz & Binder, 2006) supports model\nfitting for high-dimensional data. By using component-wise base-learners,\nvariable selection and model choice can be performed in the boosting\nframework. We introduce a boosting algorithm for survival data that\npermits the inclusion of time-varying effects in a parametric form or in\na flexible way, using P-splines. Thus we can fit flexible, additive hazard\nregression models and have a fully automated procedure for variable\nselection and model choice at hand. The properties and performance of\nthe algorithm are investigated in simulation studies.
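As a rough illustration of the cumulative-sum idea mentioned in the surveillance abstract above (this is not the package's own interface), a one-sided Poisson CUSUM in base R might look as follows; the counts, in-control mean, reference value and threshold are all invented.

## Toy count surveillance: flag weeks where the CUSUM of exceedances crosses h
set.seed(2)
counts <- c(rpois(100, lambda = 5), rpois(20, lambda = 9))  # shift after week 100
mu0 <- 5; k <- 1; h <- 5           # in-control mean, reference value, alarm threshold
s <- 0; alarm <- rep(FALSE, length(counts))
for (t in seq_along(counts)) {
  s <- max(0, s + counts[t] - mu0 - k)   # one-sided upper CUSUM
  alarm[t] <- s > h
}
which(alarm)[1]                          # first week flagged by the detector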
In an application,\nwe present the analysis of retrospective data of surgical patients with\nsevere sepsis were the aim was to build a flexible prognostic model."} {"Title":"Good Relations with R","Author":"Kurt Hornik and David Meyer","Session":"kal-mach_num_chem-1-1","Keywords":"numerics","Abstract":"Relations are a very fundamental mathematical concept: well-known\nexamples include the linear order defined on the set of integers, the\nequivalence relation, notions of preference relations used in economics\nand political sciences, etc. A k-ary (finite) relation is defined by its\ndomain, a k-tuple of sets, and its graph, a set of k-tuples. Package\nrelations provides data structures along with common basic operations\nfor relations and relation ensembles (collections of relations with the\nsame domain). In doing so, it builds on the infrastructure for sets and\ntuples provided by package sets. Package relations also features various\nrelational algebra-like operations, such as projection, selection, and\njoins. Finally, it contains algorithms for finding suitable consensus\nrelations for given relation ensembles, including the constructive\napproaches of Borda, Condorcet and Copeland, as well as\noptimization-based methods which minimize the aggregate symmetric\ndifference distance between the ensemble members and their consensus. We\nshow how relations can be obtained and manipulated, and how the\nfunctionality in the package can be employed to rank the results of\nbenchmarking experiments."} {"Title":"The Past, Present, and Future of the R Project - Development in the R Project","Author":"Kurt Hornik","Session":"invited","Keywords":"invited","Abstract":"The development of R is a multi-tiered process, with a core team\nproviding a base system only, and even key statistical functionality\navailable via contributed extension packages. We review some of the\nbasic milestones of this development process, discuss current patterns,\nand speculate on what the future might have in store. Particular\nemphasis is given to the fact that the number of available R packages\nkeeps growing at amazing speed, making it increasingly challenging for\nboth users and developers to deal with the size and the complexity of\nthe R project. Given the complexity of the social network underlying R,\nwe emphasize that no single information technology solution can satisfy\nthe R community’s desire for information, and discuss how and by whom\nthe community might be served."} {"Title":"An extension of the coin package for comparing interventions assigned by dynamic allocation","Author":"Johannes Hüsing","Session":"foc-biostat_study-1-2","Keywords":"biostatistics-inference","Abstract":"Restricted randomisation or algorithm-based allocation procedures enjoy\nsome popularity among clinical researchers, promising a lower variance\nof the treatment effect estimate and balanced subgroups for exploratory\nanalysis. They have met criticism because classical asymptotics don’t\nhold and the argument for a random distribution may be less soundly\nbased. This has led to a statement of mistrust in the form of a\nguideline issued by pharmaceutical regulators. Permutation tests give\nrise to analysis strategies which incorporate the allocation strategy in\norder to generate more realistic null distributions. The plethora of\npublished allocation algorithms calls for a common framework which can\nbe used regardless of the algorithm employed. 
The package coin currently\noffers complete randomisation and balanced block randomisation as\nalternative procedures. An extension of coin is introduced which allows\nusers to consistently write new allocation procedures. The interface of\nthe coin extension is defined so that algorithms can be used to be used\nboth in treatment allocation service programs and in the reallocation\nprocedure. Following this requirement, algorithms should be formulated\nin an incremental way, returning only the next allocation instead of the\nwhole vector. Algorithms should accept as parameters all previously\nallocated treatments and the common distribution of all factors the\nallocation decision is based on. It is passed as a data frame which\ncontains all factors and the treatments. Treatment is null for the last\nobservation, which is subject to the current allocation. The completed\ndata frame is returned. The interface to coin is confined to the\nApproxNullDistribution method. Two additional arguments, algorithm\n(defaulting to “full permutation”) and shuffle (sampling from\nalternative accrual sequences, defaulting to “identity”) are passed. an\nengine is started which applies the (incrementally formulated) algorithm\nsequentially to the set of patients, ie. the (possibly shuffled) x slot of\nthe IndependenceTestStatistic object. It is hoped that the introduction\nof a common interface may encourage the use of dynamic allocation\nmethods, and increase the acceptance for the results gained from\nappropriate analyses of data obtained this way."} {"Title":"R for climate research","Author":"Thomas Jagger and James Elsner","Session":"kal-environ-1-1","Keywords":"environmetrics-climate","Abstract":"We demonstrate using examples from our recent research papers that the R\nstatistical language and its packages are excellent tools for climate\nresearch. The development of our expertise in R is based on the need to\nperform statistical analysis on climate data in research and\nindustry. We show examples based on our work with hurricane activity and\nclimate. Each example uses analytical and graphical functions. We\ndemonstrate the use of 1. glm and associated functions for exploring the\nrelationship between climate and hurricane activity. 2. analysis and\ngraphing functions from the ismev package for exploring the role of\nclimate on hurricane intensity. 3. graphical functions developed for\nselecting hurricane tracks and local wind maximums. 4. quantile\nregression functions from the quantreg package for exploring the\nrelationship between increased sea surface temperature and global\ntropical storm intensity. 5. functions from the BRugs package for\naccessing OpenBugs used for analyzing the relationship of insured losses\ndue to hurricanes to global climate covariates. In each case we explain\nthe advancements made in understanding the role that climate plays in\nthe nature of tropical storm activity and insured losses from these\nstorms. All demonstrations will be available on our Hurricane Climate\nwebsite at: http://garnet.acns.fsu.edu/~jelsner/www/"} {"Title":"The Execution Engine: Client-server mechanism for remote calling of R and other systems","Author":"John James and Fan Shao","Session":"foc-conn-2-3","Keywords":"connectivity","Abstract":"A recent project highlighted the need to execute and manage remote jobs\nover a range of servers (including via a Grid API). The software systems\non which jobs were to be managed included R, NONMEM and openBUGS (as\nwell as other internal systems). 
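Two of the analyses listed in the hurricane-climate abstract above can be sketched with simulated stand-ins for the climate covariates and storm data; the variables sst, nao and maxwind are hypothetical placeholders.

## 1. Poisson GLM for annual storm counts; 4. quantile regression (quantreg::rq)
## for the upper tail of storm intensity
library(quantreg)
set.seed(3)
clim <- data.frame(sst = rnorm(50), nao = rnorm(50))        # hypothetical covariates
clim$counts  <- rpois(50, exp(1.5 + 0.3 * clim$sst))        # storm counts per season
clim$maxwind <- 40 + 10 * clim$sst + rnorm(50, sd = 8)      # intensity proxy
summary(glm(counts ~ sst + nao, family = poisson, data = clim))
summary(rq(maxwind ~ sst, tau = 0.9, data = clim))          # 90th-percentile response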
In order to meet this demand, Mango\ncreated a software component we call the Execution Engine. This\npresentation will discuss the design challenges in the development of\nthe execution engine, with particular focus on the execution of R for\nmodel-based reporting. This will include interfaces for allowing control\nover R command options, and a developer interface in order to fully\ncustomize the application for other users. This developed component can\nbe thought of as a general client-server framework for R and other\ntools. It has now been used in a number of projects and is continuing to\nbe developed."} {"Title":"ROMP - an OpenMP binding for R","Author":"Ferdinand Jamitzky","Session":"foc-highperf-2-2","Keywords":"high performance computing-parallel","Abstract":"A binding of the parallel application programming interface OpenMP for\nthe RInterpreter is presented. Fortran code is generated and compiled on\nthe fly by the toolkit and the OpenMP directives are inserted. The\ntoolkit consists of a family of special apply routines together with\nreduction routines like sum, mean, product which generate parallel\nOpenMP code. The toolkit can be used for easy parallelization of parts\nof an R program without a steep learning curve for the user. Examples\nare presented which implement a systolic loop in RMPI and in the\nROMP-toolkit."} {"Title":"Rapid Application Deployment with R","Author":"Wayne Jones and Marco Giannitrapani","Session":"kal-highperf_con-1-4","Keywords":"connectivity","Abstract":"In this presentation we discuss our experiences of deploying R client\nbased solutions. We demonstrate that R packages such as “R-(D)Com” and\n“RODBC” that allow communication with other applications (e.g. Excel)\ntogether with R-Gui packages such as “rpanel” and “tcltk” makes for a\nvery powerful combination of tools for building customised statistical\napplications. Thanks, in part, to the very concise nature of\nRprogramming such applications can be very quickly developed with\nextreme ease making previously unviable consultancy projects due to\ncost/benefit or manpower constraints achievable. The scope for using R\nbased applications within Shell is enormous and we have already deployed\nnumerous diverse solutions across all areas of the business including:\nMonte Carlo simulation tools, Customised Data Visualisation Tools,\nForecasting toolbox, Groundwater monitoring application, automatic\nreport generation and curve fitting to name but a few."} {"Title":"EpiR: a graphic user interface oriented to epidemiological data analysis","Author":"Washington Junger and Antonio Ponce de Leon and Elizabeth de Albuquerque and Reinaldo Marques and Leonardo Costa","Session":"foc-gui_special-1-1","Keywords":"user interfaces-workflow","Abstract":"The use of R is rapidly growing among Brazilian graduate students as\nwell as academic researchers, especially in public health\nsciences. Several post graduate programs have recently replaced\nproprietary software by R. In addition, the personnel from official\nepidemiological surveillance services is being trained to use R in the\nroutine analyses, together with other software, like Epi-Info, as part\nof the current Brazilian government effort towards open source\nsolutions. Despite R’s power and flexibility, the absence of a graphic\nuser interface (GUI) still refrains from adopting R as the main\nenvironment for data analysis, hence creating a demand for the\ndevelopment of an R GUI oriented to some applied statistical\nanalysts. 
The aim of the Epi-R project is to fill this gap in public\nhealth.\n\nThe interface is being developed as both a package and a standalone\napplication. The functions library is separated from the GUI, so\ncommands can be issued either by point and click or command line. The\nGUI is developed over RGtk2 package and built with libglade. GTk widgets\nlook nice and it is fairly stable running on any operating system. There\nare four main modules designed for data management (which also include a\nfront end for ODBC connections and a recycle bin), data description and\nstatistical modelling, graphical display and Epidemiology specific\nanalyses. The library core relies on the functions available from\nseveral existing R packages as well as some homemade ones. A plug-in API\nis also being developed so the GUI may be easily extended and to keep\nthe code light and clean. Besides the usual R help pages for the\nfunctions, an alternative help system for the GUI is available and\ninformation about the resources available in an open window can be\nobtained directly from such a window. The development of a Portuguese\nversion of EpiR is supported by the Brazilian Ministry of Health. The\nEpiR package will be submitted to CRAN as of the acceptance of this\npaper."} {"Title":"A Toolbox for Bicluster Analysis in R","Author":"Sebastian Kaiser and Friedrich Leisch","Session":"foc-machlearn-1-3","Keywords":"machine learning","Abstract":"Over the last decade, bicluster methods have become more and more\npopular in different fields of two way data analysis and a wide variety of\nalgorithms and analysis methods have been published. In this paper we\nintroduce the R package biclust, which contains a collection of\nbicluster algorithms, preprocessing methods for two way data, and\nvalidation and visualization techniques for bicluster results. For the\nfirst time, such a package is provided on a platform like R, where data\nanalysts can easily add new bicluster algorithms and adapt them to their\nspecial needs."} {"Title":"Survival Models Built from Gene Expression Data Using Gene Groups as Covariates","Author":"Kai Kammers and Jörg Rahnenführer","Session":"foc-bioinf-1-3","Keywords":"bioinformatics-models","Abstract":"We present prediction models for survival times built from high\ndimensional gene expression data. The challenge is to construct models\nthat are complex enough to have high prediction accuracy but that at the\nsame time are simple enough to allow biological interpretation. Typical\nunivariate approaches use single genes as covariates in survival time\nmodels, multivariate models perform dimension reduction through gene\nselection. Analysis of timedependent ROC curves and the area under the\ncurves (AUC) can be used to assess the predictive performance (Gui and\nLi, 2005). We present models with higher interpretability by combining\ngenes to gene groups (biological processes or molecular functions) and\nthen using these groups as covariates in the survival models. The\nhierarchically ordered ”GO groups” (Gene Ontology) are particularly\nsuitable. Cox models are used for detecting covariates that are\nsignificantly correlated with survival times. Based on these models\nstatistical shrinkage procedures like Lasso-Regression are applied for\nvariable selection. We make use of the R package penalized (Goeman,\n2008) that provides algorithms for penalized estimation in generalized\nlinear models, including linear regression, logistic regression and the\nCox proportional hazards model. 
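A minimal sketch of the gene-group idea described in the survival-model abstract above: genes annotated to a group are averaged into a single covariate, which then enters a Cox model. The expression matrix, the GO annotation list and the survival outcome are all simulated, and survival::coxph is used here instead of the penalized fit.

library(survival)
set.seed(4)
expr <- matrix(rnorm(60 * 200), nrow = 60)               # 60 patients x 200 genes
groups <- list(GO_A = 1:20, GO_B = 21:45, GO_C = 46:80)  # hypothetical GO annotations
score <- sapply(groups, function(g) rowMeans(expr[, g])) # one summary score per group
surv <- Surv(time = rexp(60, 0.1), event = rbinom(60, 1, 0.7))
coxph(surv ~ score)                                      # gene groups as covariates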
Our aim is the combination of methods\nfor survival prediction with biological a priori knowledge. First, we\ncompare the prediction performance of models using single genes as\ncovariates with models using gene groups as covariates on several real\ngene expression datasets. First results indicate that models built with\ngene groups alone have decreased prediction accuracy since many genes\nare not yet annotated to their corresponding functions. However, adding\ngene groups as covariates to models built from single genes improves\ninterpretability while prediction performance remains stable. In a next\nstep, we integrate the GO graph structure into the models (Alexa, Rahnenführer\nand Lengauer, 2006) in order to cope with the high correlations between\nneighboring GO groups."} {"Title":"Agreement analysis method in case of continuous variable","Author":"Kulwant Singh Kapoor","Session":"foc-biostat_model-1-3","Keywords":"biostatistics-modeling","Abstract":"In clinical and epidemiological studies researchers are very much\ninterested in knowing the inter-observer variation in a continuous\nvariable or between two measurement techniques. For example, measurement of blood\npressure with pulse oximetry and the auscultatory method, or measurement of\nPEFR in respiratory diseases by the Wright peak flow meter and the mini Wright meter;\nin another case, the pulse rate of a patient is measured by two nurses or doctors. The\nconventional statistical method applied for studying the agreement\nbetween two methods of measuring a continuous variable is computing the\ncorrelation coefficient (r), but this is often misleading for this\npurpose. A change of scale of measurement does not alter r but does affect\nthe agreement. In order to overcome this difficulty we apply five\ntests, and if three of them hold we can say that\ngood agreement exists between the two raters or techniques: 1. r – should be\nvery high [r > .80]; 2. r – should be very low [r < .20]; 3. ICC – should\nbe very high [ICC > .80]; 4. b – should not be different from 1; 5. d –\nbias should not be different from zero, and the limits of agreement and their\n95% C.I. should be within an acceptable range."} {"Title":"Using R as enterprise-wide data analysis platform","Author":"Zivan Karaman","Session":"foc-misc-1-2","Keywords":"platform","Abstract":"In this paper we consider the suitability of R to serve as the core tool\nfor an enterprise-wide data analysis platform that would be used by both\nexpert and occasional users. Different requirements for such a tool are\nexamined, including:\n* Scope of built-in data analysis functions\n* Graphics (including interactive plots)\n* Extendibility\n* Development environment\n* Deployment facilities\n* Database and file system connectivity\n* Integration with other software\n* User interface\n* Web deployment capabilities\n\nOverall, R provides an excellent platform for delivering data analytical\nfunctions enterprise-wide, including some quite unique features: the\nbroad spectrum of statistical methods that are included, highly flexible\ngraphics, ease of extending existing code with algorithms developed in\nboth R/S language and in other languages (Fortran or C), great database\nand file system connectivity and nice built-in facilities for package\nupdates. However, we have also identified some aspects where significant\nimprovements could be made.
These include standard, multi-platform IDE\n(Integrated Development Environment), at least some form of Graphical\nUser Interface for standard data analysis procedures (for mid-level\nusers - expert user can use command-line interface, low-level users need\ncompletely packaged applications) to be part of R core system, and some\nenhanced features for Web-based deployment."} {"Title":"Design and analysis of follow-up studies with genetic component","Author":"Juha Karvanen","Session":"foc-biostat_study-1-3","Keywords":"biostatistics-inference","Abstract":"In gene-disease association studies, the cost of genotyping makes it\neconomical to use a two-stage design where only a subset of the cohort\nis genotyped. At the firststage, the follow-up data along with some risk\nfactors or non-genetic covariates are collected for the cohort and a\nsubset of the cohort is then selected for genotyping at the\nsecond-stage. The case-cohort design and the nested case-control design\nare examples of two-stage designs that are commonly used in\nepidemiological follow-up studies. The data from a two-stage study can\nbe analyzed as a missing data problem where the genotype data are\nmissing by design for the majority of the cohort. The parameters of the\ndata model, typically logistic model or proportional hazards model, can\nbe estimated by maximizing the full likelihood of the data, which in\ngeneral case becomes an integral over the missing observations. When\ndealing with single nucleotide polymorphism (SNP) data, the integrals\nare replaced by sums over the possible genotypes. As a consequence, the\nlikelihood can be directly maximized by numerical optimization, e.g. by\nR function optim. The straightforward implementation of full likelihood\nanalysis makes it possible to consider alternative designs for the\nsecond stage. One such alternative is the extreme selection where cases\nand non-cases are selected for genotyping starting from those with\nlargest and smallest covariate values. Another alternative is the\nD-optimal design, which maximizes the determinant of the Fisher\ninformation matrix of the parameters. The determination of the Doptimal\ndesign requires the use of heuristic algorithms, which is illustrated in\nFigure 1."} {"Title":"Variable Selection in Regression Using R","Author":"Dattatraya Kashid","Session":"foc-rob-1-2","Keywords":"robust","Abstract":"Variable selection problem is one of the important problems in\nregression analysis. Over the years, several variable selection methods\nare proposed in the literature and some frequently used methods are\nMallow’s Cp-statistic, Forward and Backward, Stepwise selection method\netc. All these methods assume that the error distribution is normal and\npresent software packages offer some of these methods for variable\nselection in regression. It is well known that in the absence of\nnormality or absence of linearity assumption or outlier(s) presence in\nthe data, the classical subset selection methods perform poorly. Such\nsituations demand alternative approaches. In the last decade, a few\nmethods are developed in the literature based on different situation\nmentioned above. Ronchetti and Staudte (1994) have proposed robust\nversion of Mallow’s Cp called RCp for outlier data. Kashid and Kulkarni\n(2002, 2003) suggested variable selection techniques to deal the\nsituation mentioned above. Since these methods are computationally\nintensive, so it is difficult to select a set of variables without using\nthe software. 
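A toy version of the full-likelihood computation described in the two-stage design abstract above; the allele frequency, sample sizes and logistic model are invented, and the genotype is summed out for subjects who were not selected for genotyping before maximizing with optim.

## Full likelihood with genotype missing by design (toy SNP example)
set.seed(5)
p  <- 0.3                               # assumed known minor allele frequency
pg <- dbinom(0:2, 2, p)                 # Hardy-Weinberg genotype probabilities
n  <- 500
g  <- rbinom(n, 2, p)
y  <- rbinom(n, 1, plogis(-1 + 0.5 * g))
g[sample(n, 300)] <- NA                 # most of the cohort is not genotyped
negloglik <- function(theta) {
  eta <- function(gg) plogis(theta[1] + theta[2] * gg)
  li <- ifelse(is.na(g),
               ## marginal likelihood: sum over the three possible genotypes
               sapply(y, function(yy) sum(pg * dbinom(yy, 1, eta(0:2)))),
               dbinom(y, 1, eta(g)))
  -sum(log(li))
}
optim(c(0, 0), negloglik)$par           # ML estimates of intercept and log-odds ratio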
The implementation of these methods is possible by using\nR-software. In this article, we exploit use of R in variable selection\nproblem in regression."} {"Title":"Specification of Landmarks and Forecasting Water Temperature","Author":"Goeran Kauermann and Thomas Mestekemper","Session":"kal-environ-1-3","Keywords":"environmetrics-climate","Abstract":"We present and analyse a data set containg water and air temperature in\nthe river Wupper in the northern part of Germany. The analysis pursues\ntwo concrete aspects. First, it is of interest to find so called\nlandmarks, these are regularly occuring timepoints at which the\ntemperature follows particular pattern. These landmarks will be used to\nassess whether the current year is running ahead or behind the ”average”\nseasonal course of a year. Secondly, we focus on forecasting water\ntemperature using smooth principal components. The latter approach is\nalso used for bootstrapping temperatur data, which allows to assess the\nvariability of the specified landmarks. The implications of our modelling\nexercise are purely economic. The data trace from a larger project which\naims to develop a temperature management tool for two power plants along\nthe river Wupper. These use river water for cooling purposes and to\npreserve natural wild life in the river there is a strict limit of the\nmaximal temperature of the water. The latter constraints the possible\nproduction range of the power plant. More accurate forecasts therefore\nmean a higher potential of energy production."} {"Title":"Toward Fully Bayesian Computing: Manipulating and Summarizing Posterior Simulations Using Random Variable Objects","Author":"Jouni Kerman and Andrew Gelman","Session":"foc-bayes-1-4","Keywords":"bayesian","Abstract":"Bayesian data analysis involves Bayesian inference (model fitting), but\nalso requires post-fitting tasks that include summarizing and\nmanipulating inferences numerically and graphically, and doing\nmodel-checking tasks and forecasting using predictive inference. Since\nBayesian inference is based on computing and summarizing probability\ndistributions, to do Bayesian data analysis efficiently and conveniently,\nwe need a computing environment that enables us to work with random\nvariables as easily as we do with numerical variables. We propose a\ncomputing environment that defines random variables as natural extensions\nof traditional numerical objects, which can be regarded as random\nvariables with zero variance. Each numeric vector variable in this\nenvironment has a hidden dimension of uncertainty, which is represented\nby a number of simulation draws from the joint distribution of its\ncomponents. The random variables can be manipulated transparently, in\nthe same fashion as we do numeric vectors and arrays. We present an R\npackage, ‘rv’, that implements this new computing paradigm in R by\nintroducing a new simulation-based random variable class, along with\nnumerous mathematical, statistical, and graphical functions. By\nconverting posterior simulations into random variable objects, they can\nbe manipulated and summarized intuitively and efficiently. We illustrate\nthis by several practical examples."} {"Title":"The Dataverse Network","Author":"Gary King","Session":"invited","Keywords":"invited","Abstract":"We introduce the Dataverse Network project for data archiving,\ndistribution, and statistical analysis. 
Via web application software,\ndata citation standards, and integration with R, the Dataverse Network\nproject increases scholarly recognition and distributed control for\nauthors, journals, archives, teachers, and others who organize, produce,\nor analyze data; facilitates data access and analysis; and ensures\nlong-term preservation whether or not the data are in the public\ndomain. With a few minutes work, you can put a “dataverse” (a full\nservice virtual archive and data analysis engine with your view of the\nuniverse of data) on your web page, branded completely as yours, without\nany local installations or need for maintenance or backups. In addition,\nany R statistical package can be automatically included in the Dataverse\nstatistical analysis GUI by writing a few simple bridge functions (for\nthe R package Zelig) that describes your package and methods. See the\nproject homepage at http://TheData.org/."} {"Title":"Generalized count data regression in R","Author":"Christian Kleiber and Achim Zeileis","Session":"foc-mod_ext-1-1","Keywords":"modeling-extensions, econometrics","Abstract":"Fitting functions for the basic Poisson and negative binomial regression\nmodels have long been available in base R and in the well-known MASS\npackage, respectively. More recently, a number of modified or generalized\nregression models for count data have become available. Specifically,\nthere now exist functions for fitting hurdle and zero-inflation models in\npackage pscl (see Zeileis, Kleiber and Jackman, forthcoming), for fitting\nPoisson-inverse Gaussian mixtures in package gamlss (Stasinopoulos and\nRigby, 2007) and for fitting finite Poisson mixtures in package flexmix\n(Leisch, 2004). The talk will present an overview of the available\nmethods along with empirical illustrations. It will also present\nfunctions for some further generalized count data models of recent\ninterest that are not yet publicly available, and suggest directions for\nfuture work."} {"Title":"Using R as an environment for automatic extraction of forest growth parameters form terrestrial laser scanning data","Author":"Hans-Joachim Klemmt","Session":"foc-environ-1-2","Keywords":"environmetrics-forests","Abstract":"Laser scanning becomes a more and more important measurement technology\nin forests. Meanwhile the applicability of airborne laser scanning\nsystems (ALS) for forestry measurement purposes is far advanced [3,\n7]. So far ALS-systems mainly concentrate on the extraction of tree\nheight parameters. To describe the structure of forests additional\nparameters are needed. Terrestrial laser scanning provides very quick\ninformation on the structure of forests in form of 3D-point clouds,\nwhich are processed to gain such taxation features as the number of\ntrees in a stand, geoposition of individual trunks, diameters at breast\nheight (DBH), crown base height and height of trees [2, 9]. So far\nunfortunately no software solution exists which extracts the requested\nparameters automatically from terrestrial 3D data. To eradicate this\nflaw at the Chair of Forest Growth and Yield at Technische Universität\nMünchen a working group has been installed which is concerned with this\ntopic. This working group uses R [8] to extract the relevant parameters\nfrom 3D-data. R is used because it is a GPL-licensed, Open Source\nsolution for statistical computing which is well-resourced with various\npackages for clustering-purposes as well as for image processing and\nvisualization. 
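The model families surveyed in the count-data abstract above can be sketched on a simulated zero-heavy outcome as follows; the data and formulas are illustrative only, assuming the MASS function glm.nb and the pscl fitting functions hurdle() and zeroinfl().

library(MASS)    # glm.nb
library(pscl)    # hurdle(), zeroinfl()
set.seed(6)
x <- rnorm(250)
y <- ifelse(runif(250) < 0.3, 0,
            rnbinom(250, mu = exp(0.5 + 0.8 * x), size = 1.5))  # excess zeros
fm_pois <- glm(y ~ x, family = poisson)        # basic Poisson regression
fm_nb   <- glm.nb(y ~ x)                       # negative binomial (MASS)
fm_hur  <- hurdle(y ~ x, dist = "negbin")      # hurdle model (pscl)
fm_zin  <- zeroinfl(y ~ x, dist = "negbin")    # zero-inflated model (pscl)
AIC(fm_pois, fm_nb, fm_hur, fm_zin)            # quick comparison of the fits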
One big advantage of R is also the connectivity with\nother software like WEKA [5]. To this day a system is developed which\nseparates automatically data sets, which belong to the ground or soil\nlayer, from potential vegetation points. Trees are detected in\nvegetation point cloud by application of several cluster\nalgorithms. Forest parameters like DBH are calculated by application of\nHough-Transformation. Visualization in R is done by the use of standard\noutput as well as by the use of the OpenGL-extension in the package RGL\n[1]. Although R is an interpreted computer language, which seems to be a\nbig disadvantage for this aim because of the huge number of data sets to\nprocess [6], the promising results of the development have shown that it\nis possible to extract automatically forest growth parameters with a\nhigh accuracy and a high level of accordance to manual measurements by\nthe use of this language. Further work aims on the development of a\nR-based software framework in combination with a JAVA visualization\ncomponent for the automatic extraction of forest growth parameters from\nterrestrial laser scanning data."} {"Title":"Rfit: An R Package for Rank Estimates","Author":"John Kloke","Session":"foc-rob-2-1","Keywords":"robust","Abstract":"In the nineteen seventies, Jureckova and Jaeckel proposed rank\nestimation for linear models. Since that time, several authors have\ndeveloped inference and diagnostic methods for these estimators. These\nestimators and their associated inference are robust to outliers in\nresponse space. The methods include estimation of standard errors, tests\nof general linear hypotheses, confidence intervals, studentized\nresiduals, and measures of influential cases. Unfortunately, these\nmethods are not implemented in main stream software, and hence are not\nwidely used. For this presentation I will highlight the main features of\nan R package I am developing which implements these methods. The package\nuses standard linear models syntax and includes many of the main\ninference and diagnostics functions (e.g. anova, summary, rstudent,\ninfluence.measures)."} {"Title":"sfCluster/snowfall: Managing parallel execution of R programs on a compute cluster","Author":"Jochen Knaus","Session":"foc-highperf-2-3","Keywords":"high performance computing-parallel","Abstract":"Modern bioinformatics applications require a huge amount of computing\nresources. To adress these, techniques such as MPI, available through\nthe R packages Rmpi and snow, allow the bundling of single machines into\ncompute clusters. However, management of cluster resources has to be\nperformed manually, resulting in problems, when several, potentially\nunexperienced users access the same cluster pool. As a solution for this\nproblem we developed sfCluster and the snowfall R package based on the\nsnow package and LAM/MPI. Both are designed for easy and safe usage,\nhiding cluster setup and internals from end users, who only see a clean\nsnow-like API. sfCluster is a Unix tool for management of parallel R\nprograms, which assigns resources dynamically in a reasonable way, sets\nup the LAM cluster and monitors the execution of the parallel R program\nas well as controlling the cluster itself. sfCluster features various\nexecution modes: it can run an R interactive shell, raw batch mode or a\nvisual monitoring mode, which allows process and logfile control during\nruntime directly on the terminal using Curses. 
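A minimal usage sketch for the snowfall interface described above, assuming the exported functions sfInit, sfExport, sfLapply and sfStop; the bootstrap toy example and the choice of 2 CPUs are placeholders.

library(snowfall)
set.seed(7)
x <- rnorm(200)                          # toy data to resample
boot_once <- function(i) mean(sample(x, replace = TRUE))
sfInit(parallel = TRUE, cpus = 2)        # parallel = FALSE gives sequential execution
sfExport("x")                            # ship the data to the worker processes
res <- unlist(sfLapply(1:1000, boot_once))
sfStop()
quantile(res, c(0.025, 0.975))           # bootstrap interval for the mean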
Memory observation,\nprocess control and cluster session shutdowns even work if the LAM\ncluster itself died or some machines went offline or network problems\noccurred. snowfall is the corresponding R package, which connects to\nsfCluster, but can also be used without it. In contrast to the snow\npackage it provides easy switching between sequential and parallel\nexecution, which eases development on machines without cluster\nenvironment. The package also features basic intermediate saving of\nresults (with restore), so not all results are lost in case of a cluster\nstop. The use of these advanced tools will be illustrated with\napplication scenarios from our department, where several users can now\nperform demanding bioinformatics simulation studies at the same time."} {"Title":"mboost - Componentwise Boosting for Generalised Regression Models","Author":"Thomas Kneib and Torsten Hothorn","Session":"kal-model-1-4","Keywords":"modeling-extensions","Abstract":"In recent years, boosting has emerged into a widely applied technique\nfor fitting various types of generalised regression models. The main\nreason for its popularity is that it is surprisingly simple in requiring\nonly iterative fitting of some (potentially simple) base-learning\nprocedure such as (penalised) least-squares to working\nresiduals. Moreover, boosting allows to define various types of\nregression situations by formulating them in terms of a suitable loss\nfunction. From a theoretical perspective, boosting then equals a\nfunctional gradient descent algorithm for solving the empirical risk\nminimisation problem and the working residuals are given by the negative\ngradient of the loss function. While boosting has mainly been used to fit\ncompletely nonparametric black box models in a prediction-oriented\nframework first, recent research has shown that it can actually be used\nto estimate structured regression models. Therefore the base-learning\nprocedure is separated into several components and only the best-fitting\ncomponent is updated in each iteration. For example, when fitting a\ngeneralised linear model, each base-learner might correspond to a single\ncovariate and only the effect of the best-fitting covariate is updated in\neach boosting iteration. Applying a suitable stopping rule to the\nboosting iterations yields an adaptively regularised model fit that also\nprovides a means of variable selection and model choice. The package\nmboost provides implementations for the most common types of univariate\nexponential family responses where the negative log-likelihood provides\nthe loss-function, but also for other types of regression situations\nsuch as robust regression based on Huber‘s loss function. Further\nextensions, for example to survival modelling are currently being\ninvestigated. A wide range of componentwise base-learning procedures is\navailable based on (penalised) least squares fits, for example •\nparametric linear effects as in generalised linear models, • penalised\nsplines for nonparametric effects and varying coefficient terms, •\npenalised tensor product splines for interaction surfaces and spatial\neffects, • ridge regression for random intercepts and slopes, • stumps\nfor piecewise constant functions. 
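A short usage sketch for mboost's componentwise boosting on simulated data; the base-learners bols() and bbs() correspond to the first two bullet points above, while the data, formula and mstop value are illustrative only.

library(mboost)
set.seed(8)
d <- data.frame(x1 = runif(200), x2 = runif(200), x3 = runif(200))
d$y <- 2 * d$x1 + sin(6 * d$x2) + rnorm(200, sd = 0.3)   # x3 is uninformative
## bols(): parametric linear effect; bbs(): penalised-spline nonparametric effect
fm <- gamboost(y ~ bols(x1) + bbs(x2) + bbs(x3), data = d,
               control = boost_control(mstop = 200))
## the componentwise updates perform variable selection and model choice;
## the fit can be inspected further with coef(), plot() and predict()
head(predict(fm))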
Through its modular formulation,\nboosting allows to define models consisting of arbitrary combinations of\nthese effects and we will illustrate the versatility of the resulting\nmodel class in a spatio-temporal regression model for the analysis of\nforest health."} {"Title":"Believing by Seeing before Seeing by Believing: Visualizing the Gaussian Regression Model by the SIM.REG package for intuitive teaching","Author":"Ruya Gokhan Kocer","Session":"foc-teach-1-2","Keywords":"teaching","Abstract":"The Gaussian Regression Model is one of the fundamental techniques in\neconometric theory. There are several crucial assumptions of this model\nsuch as homoskedastic error variance, no-autocorrelation between error\ndistributions, no high multicolinearity between independent variables,\nand identification of correct functional form. When the model is used\nfor hypothesis testing the normality of error distributions too should\nbe added to the list. These fundamental assumptions are of course always\nmentioned in the regression courses and several statistical tests are\nintroduced to diagnose possible violations. Unfortunately, students,\nespecially those with social science background, quite often fail to\ninternalize the importance of these assumptions neither do they really\nappreciate the BLUE (Best Linear Unbiased Estimator) property of The\nGaussian Regression Model. However, it is of crucial importance for the\nstudents to develop an intuitive understanding of the weaknesses and\nstrengths of this basic approach in order to be able comprehend the\npremises of and need for more sophisticated modeling techniques. In\nother words students first need to ‘believe’ by ‘seeing’ to be able to\n‘see’ more advanced theorems and models by ‘believing’. In this paper, a\npackage programmed by the author in R language to conduct Monte Carlo\nsimulations, the SIM.REG (Simulated Regression) is introduced and its\nvisual and analytical strength in depicting ins and outs of The Gaussian\nRegression Model is demonstrated. The SIM.REG allows creating, testing\nand visualizing data sets which contain desired degrees of\nheteroskedasticity, autocorrelation, multicolinearity and non-normality\nin order to reveal the isolated or simultaneous impact of these\nviolations on the Gaussian regression model. Similarly, the SIM.REG also\nallows revealing the impact of wrong functional forms on the\nmodel. Indeed SIM.REG contains an option which allows seeing, for\nexample, the impact of increasing level of autocorrelation on inference\nstructure as an animation. Moreover under the SIM.REG one can also\nvisually depict the BLUE property of the Gaussian Regression Model by\nmaking all assumptions hold. In this way students can be visually shown\nthe degree to which violation of assumptions damage the capacity of the\nmodel to make reliable estimations and correct inference. Moreover, it\nis also possible to use the SIM.REG to generate data sets which allow\ntesting the sensitivity of statistical tests and indicators, such as\nrank correlation test, Durbin-Watson test, condition index etc... In\nother words by using the SIM.REG one can create data sets which contain\nan isolated problem such as heteroskedasticity and then scrutinize the\nability of various statistical tests to identify the problem. In this\nway one can enable the students to develop a critical approach to\nvarious tests, that is, when and why to disregard the positive or\nnegative test outcomes about particular problems. 
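The SIM.REG package itself is not shown here; the base-R Monte Carlo below conveys the kind of demonstration described above, namely that heteroskedastic errors leave the OLS slope estimate unbiased while the usual confidence interval undercovers. All numbers are arbitrary.

## Coverage of the nominal 95% interval for the slope under heteroskedasticity
set.seed(9)
cover <- replicate(2000, {
  x <- runif(100)
  y <- 1 + 2 * x + rnorm(100, sd = 0.2 + 2 * x)   # error variance grows with x
  ci <- confint(lm(y ~ x))["x", ]
  ci[1] < 2 && 2 < ci[2]
})
mean(cover)    # typically falls noticeably below 0.95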
More importantly, one\ncan also create several problems simultaneously in order to reveal how\nthese simultaneous violations may render some instruments of diagnosis\nineffective. The main purpose of the paper is, by sketching out various\napplications of the SIM.REG package, to show how it can be effectively\nused for teaching purposes in MA-level regression courses."} {"Title":"R-Packages for Robust Asymptotic Statistics","Author":"Matthias Kohl and Peter Ruckdeschel","Session":"foc-rob-2-2","Keywords":"robust","Abstract":"We present a family of R-packages designed for a conceptual adaptation\nof an asymptotic theory of robustness. Package RobAStBase provides the\nbasic S4 classes and methods for optimally robust estimation in the\nsense of Rieder (1994). That is, we consider L2 differentiable parametric\nmodels in the framework of infinitesimal (shrinking at a rate of √n)\nneighborhoods. The combination of RobAStBase with our R packages distr,\ndistrEx and RandVar enables us to implement one algorithm which works\nfor a whole class of various models, thus avoiding redundancy and\nsimplifying maintenance of the algorithm. Package ROptEst so far covers\nthe computation of optimally robust influence curves for all(!) L2\ndifferentiable parametric families which are based on a univariate\ndistribution. With the Kolmogorov and the Cramér-von Mises minimum\ndistance estimators which are implemented in our R package distrMod\nand which serve as starting estimators, we are able to provide optimally\nrobust estimators by means of k-step constructions (k ≥ 1). Package\nRobLox includes functions for the determination of influence curves for\nseveral classes of robust estimators in case of normal location with\nunknown scale; cf. Kohl (2005). In particular, the function roblox\ncomputes the optimally robust estimator for normal location and scale as\ndescribed in Kohl (2005). In contrast to package ROptEst, in which we\naim for generality, the function roblox is optimized for speed. Package\nROptRegTS contains the extension of the asymptotic theory of robustness\nto regression-type models like the linear model and certain time series\nmodels (e.g., ARMA and ARCH). Finally, package RobRex provides functions\nfor the determination of optimally robust influence curves in case of\nlinear regression with unknown scale and standard normal errors where\nthe regressor is random. Analogously to package RobLox the functions in\npackage RobRex are optimized for speed."} {"Title":"Customer Heterogeneity in Purchasing Habit of Variety Seeking Based on Hierarchical Bayesian Model","Author":"Fumiyo Kondo and Teppei Kuroda","Session":"foc-business-1-2","Keywords":"business, bayesian","Abstract":"This research presents a model which expresses product choice behavior\nin terms of ’inertia’ or ’variety seeking’ for each customer by using a\nmixture normal-multinomial logit model in a hierarchical Bayesian\nframework. A product choice behavior is called ’inertia’ if a\ncustomer chooses the same product as previously purchased and\n’variety seeking’ if it is a different product from the previous\none. These kinds of behaviors are frequently observed in the product\ncategory of ’low involvement’ (Dick and Basu (1994), Peter and Olson\n(1999)). Consumers tend to purchase a ’low involvement’ product such as\na beverage or cake based solely on experience, inertia, or atmosphere. In\naddition to ’inertia’ or ’variety seeking’, Bawa (1990) proposed a model\nfor segmentation purposes.
It has an additional segment of ’hybrid’\ncustomer, of which purchasing tendency changes from ’inertia’ to\n’variety seeking’ or vice versa. Moreover, it is getting increasingly\nimportant to understand the heterogeneity of customers in recent years,\nparticularly from the view point of the category attribute. A comparison\nwas made between a hierarchical Bayesian model and a finite mixture model\non the product category of Japanese tea and Chinese tea. The result\nshows that the hierarchical Bayesian model is superior to the finite\nmixture model in terms of ’hit rate’. Further, the model with the\nvariables of ’inertia’ or ’variety seeking’ was superior to the one\nwithout them in terms of Deviance Information Criterion, DIC. In\naddition, we extended the Bawa’s formula on ’inertia’, ’variety seeking’\nor ’hybrid’ behavior by considering the influence of purchasing\nintervals. Our proposed model that considers a timing of customer’s\nbrand switching was superior to the Bawa’s formula. We obtained the\nresults that each customer has a tendency of ’inertia’ or ’variety\nseeking’ or ’hybrid’ in product choice, which is different between the\ncategory of Japanese tea and that of Chinese tea. Finally, we proposed a\nCRM related strategy that calculates a necessary discount rate for\nindividual brand switching and offering the brand according to the brand\nswitching timing of each consumer."} {"Title":"Profiling the parameters of models with linear predictors","Author":"Ioannis Kosmidis","Session":"foc-mod_ext-1-2","Keywords":"modeling-diagnostics","Abstract":"Profiles of the likelihood can be used for the construction of confidence\nintervals for parameters, as well as to assess features of the\nlikelihood surface such as local maxima, asymptotes, etc., which can\naffect the performance of asymptotic procedures. The profile methods of\nthe R language (stats and MASS packages) can be used for profiling the\nlikelihood function for several classes of fitted objects, such as glm\nand polr. However, the methods are limited to cases where the profiles\nare almost quadratic in shape and can fail, for example, in cases where\nthe profiles have an asymptote. Furthermore, often the likelihood is\nreplaced by an alternative objective for either the improvement of the\nproperties of the estimator, or for computational efficiency when the\nlikelihood has a complicated form (see, for example, Firth (1993) for a\nmaximum penalized likelihood approach to bias-reduction, and Lindsay\n(1988) for composite likelihood methods, respectively). Alternatively,\nestimation might be performed by using a set of estimating equations\nwhich do not necessarily correspond to a unique objective to be\noptimized, as in quasi-likelihood estimation (Wedderburn, 1974;\nMcCullagh, 1983) and in generalized estimating equations for models for\nclustered and longitudinal data (Liang & Zeger, 1986). In all of the\nabove cases, the construction of confidence intervals can be done using\nthe profiles of appropriate objective functions in the same way as the\nlikelihood profiles. For example, in the case of bias-reduction in\nlogistic regression via maximum penalized likelihood, Heinze & Schemper\n(2002) suggest to use the profiles of the penalized likelihood, and when\nestimation is via a set of estimating equations Lindsay & Qu (2003)\nsuggest the use of profiles of appropriate quadratic score functions. 
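Before the profileModel package itself is introduced below, here is a hedged base-R illustration of what profiling a linear-predictor parameter amounts to in practice (a brute-force deviance profile computed via offset(), not the package's interface):

  # Hedged sketch: profile the deviance of a logistic regression over one
  # coefficient by fixing it at a grid of values via offset() and refitting
  # the remaining parameters each time.
  set.seed(2)
  x1 <- rnorm(100); x2 <- rnorm(100)
  y  <- rbinom(100, 1, plogis(-0.5 + 1 * x1 + 0.5 * x2))
  full  <- glm(y ~ x1 + x2, family = binomial)
  beta1 <- seq(0, 2, length.out = 41)
  prof  <- sapply(beta1, function(b) {
    # hold x1's coefficient at b; re-estimate the intercept and x2
    deviance(glm(y ~ x2 + offset(b * x1), family = binomial))
  })
  plot(beta1, prof - deviance(full), type = "l",
       xlab = expression(beta[1]), ylab = "profile deviance difference")
  abline(h = qchisq(0.95, 1), lty = 2)  # cut-off for a 95% profile interval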
In\nthis presentation we introduce the profileModel package, which\ngeneralizes the capabilities of the current profile methods to\narbitrary, user-specified objectives and, also, covers a variety of\ncurrent and potentially future implementations of fitting procedures that\nrelate to models with linear predictors. We give examples of how the\npackage can be used to calculate, evaluate and plot the profiles of the\nobjectives, as well as to construct profile-based confidence\nintervals. The presentation focuses on the following: • Generality of\napplication: The profileModel package has been designed to support\nclasses of fitted objects with linear predictors that are constructed\naccording to the specifications given by Chambers & Hastie (1991, Chapter\n2). Such generality of application stems from the appropriate use of\ngeneric R methods such as model.frame, model.matrix and formula. •\nEmbedding: The developers of current and new fitting procedures as well\nas the endusers can have direct access to profiling capabilities. The\nonly requirement is authoring a simple function that calculates the\nvalue of the appropriate — for their specific application — objective to\nbe profiled. • Computational stability: All the facilities have been\ndeveloped with computational stability in mind, in order to provide an\nalternative which improves and extends the capabilities of already\navailable profile methods."} {"Title":"Graphical Functions for Prior Selection","Author":"Stephanie Kovalchik","Session":"foc-bayes-1-1","Keywords":"bayesian","Abstract":"In a Bayesian analysis the choice of prior distributions for model\nparameters reflects the analyst’s a priori belief. Most discussion and\npresentation of prior densities are in terms of the model parameter\nvalues using the BUGS squiggle notation. With the exception of the\nUniform and Normal distributions, using this notation alone can make it\ndifficult to immediately assess the beliefs represented by the\npriors. Thus, the squiggle notation presentation creates\ninterpretational difficulty in that it does not reflect the process by\nwhich analysts will choose priors. In most cases, particularly when\nexperts outside of the statistical field are asked to give information to\nelicit priors, the construction will be in terms of the moments, mode\nand/or coverage probabilities of the parameters. It would be useful then\nto have a set of functions that will take these quantities as arguments\nand translate them into the corresponding prior density, returning the\nparameter values and providing a density plot. This presentation will\ndemonstrate a set of graphical functions written in R which allow the\nuser flexibility in specifying the desired moments, mode or coverage\nprobabilities when deciding on the appropriate prior. Examples from the\nliterature are given showing how these functions can facilitate prior\ndetermination when eliciting priors from experts as well as reveal\nmisspecification of a priori beliefs. The graphical functions are based\non the base graphics system which enables the user to easily annotate\nand customize the display. The tools are available for commonly used\ndensities of the stats package including the Normal, Student’s t, Beta\nand Gamma. Current work is being done to expand these plotting functions\nso as to allow the specification of mixture priors. It is the goal of\nthis work to provide prior selection tools in the R language comparable\nto those of Tony O’Hagan’s First Bayes. 
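As a hedged illustration of the kind of translation such functions perform (the helper below is made up for this sketch and is not part of the toolkit described above), moment-matching a Beta prior from an elicited mean and variance takes only a few lines of base R:

  # Hedged sketch: turn an elicited prior mean and variance into Beta(a, b)
  # parameters by moment matching, then draw the implied density.
  beta_from_moments <- function(m, v) {      # illustrative helper name
    stopifnot(v < m * (1 - m))               # variance must be attainable for a Beta
    k <- m * (1 - m) / v - 1
    c(shape1 = m * k, shape2 = (1 - m) * k)
  }
  par_ab <- beta_from_moments(m = 0.3, v = 0.02)
  curve(dbeta(x, par_ab["shape1"], par_ab["shape2"]), 0, 1,
        xlab = expression(theta), ylab = "prior density")
  # central 95% coverage implied by this elicitation:
  qbeta(c(0.025, 0.975), par_ab["shape1"], par_ab["shape2"])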
With these simple extensions of\nR’s standard statistical and graphical facilities Bayesian statisticians\nworking in R will be able to more efficiently select and present prior\ndistributions."} {"Title":"The BayHaz package for Bayesian estimation of smooth hazard rates in R","Author":"Luca La Rocca","Session":"foc-bayes-1-3","Keywords":"bayesian","Abstract":"Package BayHaz (La Rocca, 2007) for R (R Development Core Team, 2008) consists\nof a suite of functions for Bayesian estimation of smooth hazard rates using\ncompound Poisson process priors, introduced by La Rocca (in press), and first\norder autoregressive Bayesian penalized spline priors, based on Hennerfeind et\nal. (2006). Prior elicitation, posterior computation, and visualization are\ndealt with. For illustrative purposes, a data set in the field of earthquake\nstatistics is supplied. An interface to package coda (Plummer et al., 2007)\nfacilitates output diagnostics. Future plans are to implement other Bayesian\nmethods for hazard rate estimation, and to make available an extension to the\nproportional hazards model."} {"Title":"New possibilities for interactive specification and validation of models for Fluorescence Lifetime Imaging Microscopy (FLIM) data with the\nTIMP package","Author":"Sergey Laptenok and Katharine Mullen and Jan Willem Borst and Herbert van Amerongen and Antonie Visser","Session":"kal-mach_num_chem-1-5","Keywords":"chemometrics","Abstract":"The detection of protein-protein interactions in a biological cell is required\nto enhance our knowledge about mechanisms that regulate intracellular processes.\nFörster Resonance Energy Transfer (FRET) between donor and acceptor molecules\nis a widely used technique to monitor protein-protein interactions. As FRET is a\nfluorescence quenching process, it can be detected by the shortening of the\nfluorescence lifetime of the donor molecule. Fluorescence Lifetime Imaging\nMicroscopy (FLIM) allows the mapping of fluorescence lifetimes with (sub-)\nnanosecond time resolution and a spatial resolution of 250 nm. FRET phenomena\nmeasured with the FLIM technique provides temporal and spatial information about\nmolecular interaction in living cells. For accurate and quantitative FLIM data\nanalysis well-designed analysis protocols are required. The dynamical features\nof a FRET system are often well described by a small number of kinetic\nprocesses, in which the associated fluorescence lifetimes in all pixels have\nsimilar values, but the relative amplitudes may vary from pixel to pixel. In\nthis case significant advantages and accuracy in analysis can be achieved by\nglobal analysis of the image. Global analysis uses fluorescence decay traces\nfrom all pixels to estimate both kinetic parameters (lifetimes) and relative\namplitudes of components in each pixel. The TIMP package has been shown to be\neffective at performing global analysis of FLIM images [1]. A typical FLIM image\nrepresents in the order of 103 pixels with 103 time points per pixel.\nPresentation of the data analysis results needs to be well-organized and\ninteractive, allowing the user to obtain a detailed graphical presentation of\nthe fit at any pixel selected. Here we present new options for the TIMP package\nallowing interactive presentation of global analysis results, as well as\nimporting and preprocessing FLIM data. 
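The core idea of global analysis, lifetimes shared across all pixels with amplitudes free to vary per pixel, can be sketched with a deliberately simplified single-lifetime toy in base R (this is not the TIMP interface, and the simulated "pixels" are illustrative):

  # Hedged toy sketch of global analysis: one decay lifetime shared by all
  # pixels, a separate amplitude per pixel, fitted by separable least squares.
  set.seed(3)
  t_grid   <- seq(0, 10, by = 0.05)
  tau_true <- 2.5
  amps <- runif(20, 0.5, 2)                  # 20 simulated "pixels"
  D <- sapply(amps, function(a) a * exp(-t_grid / tau_true) +
                                 rnorm(length(t_grid), sd = 0.02))
  rss_for_tau <- function(log_tau) {
    decay <- exp(-t_grid / exp(log_tau))
    # conditionally linear amplitudes: per-pixel least squares given the decay
    ahat <- colSums(decay * D) / sum(decay^2)
    sum((D - outer(decay, ahat))^2)
  }
  fit <- optimize(rss_for_tau, interval = log(c(0.1, 20)))
  exp(fit$minimum)                           # recovered shared lifetime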
The analysis of FLIM images of\ntranscription factors fused with either cyan fluorescence protein (CFP) or\nyellow fluorescence protein (YFP) in plant cells will be given as an example.\nThe novel data analysis methodology could reveal molecular interactions among\ndifferent transcription factors in the nucleus of a plant cell."} {"Title":"ivivc - A Tool for in vitro-in vivo Correlation Exploration with R","Author":"Hsin-ya Lee and Pao-chu Wu and Yung-jin Lee","Session":"foc-biostat_model-1-2","Keywords":"biostatistics-modeling","Abstract":"In vitro-in vivo correlation (IVIVC) is defined as the correlation between in\nvitro drug dissolution and in vivo drug absorption. The main purpose of an IVIVC\nmodel is to utilize in vitro dissolution profiles as a surrogate for in vivo\nbioequivalence and to support biowaivers. In order to prove the validity of a\nnew formulation, which is bioequivalent with a target formulation, a\nconsiderable amount of efforts is required to study\nbioequivalence/bioavailability. Thus, data analysis of IVIVC attracts attention\nfrom the pharmaceutical industry. The purpose of this study is to develop an\nIVIVC tool (ivivc) in R. Methods Development and validation are 2 critical\nstages in the evaluation of an IVIVC model. In the first stage, the development\nof level A IVIVC model is usually estimated by a two-stage process. (1)\nDeconvolution: the observed fraction of the drug absorbed is based on the\nWagner-Nelson method. IV, IR or oral solution was attempted as the reference.\nThen, the pharmacokinetic parameters will be estimated using a nonlinear\nregression tool or be attempted from literatures reported previously. The IVIVC\nmodel is developed using the observed fraction of the drug absorbed and that of\nthe drug dissolved. Based on the IVIVC model, the predicted fraction of the drug\nabsorbed is calculated from the observed fraction of the drug dissolved. (2)\nConvolution: the predicted fraction of the drug absorbed is then convolved to\nthe predicted plasma concentrations by using the convolution method. In the\nsecond stage, evaluating the predictability of a level A correlation focuses on\nestimating the percent prediction error (%PE) between the observed and predicted\nplasma concentration profiles, such as the difference in pharmacokinetic\nparameters (Cmax, and the area under the curve from time zero to infinity,\nAUC∞). Results and Discussion We call this tool as ivivc. It can be used to\ncalculate the observed fraction of the drug absorbed in different pH media and\nformulations with multiple subjects at the same time. Based on the linear\nregression, the predicted fraction of the drug absorbed is calculated from the\nobserved fraction of the drug dissolved. Furthermore, the percent prediction\nerror (%PE) between the observed and predicted plasma concentration profiles,\nsuch as Cmax and AUC∞ are also calculated. Conclusion and Future Work In this\nstudy, we have successfully created the package, ivivc. ivivc will be released\nto public soon. In the future, we will include more methods that have been\npublished and frequently used to develop IVIVC."} {"Title":"PKfit - A Pharmacokinetic Data Analysis Tool on R","Author":"Chun-ying Lee and Yung-jin Lee","Session":"foc-pharma-1-1","Keywords":"pharmacokinetics","Abstract":"Pharmacokinetic (PK) data analysis heavily depends on computer calculation\npower. In this study, we tried to create a nonlinear regressions tool on R using\nits available packages and functions. 
Methods and Materials: The design goal of this\ntool was to be easy to use, so a menu-driven interface on RGui was\ndeveloped. We used the lsoda function (in the odesolve package) to solve all\ndifferential equations used to define PK models. As for data fitting algorithms,\nthe Gauss-Newton algorithm (nls function in the stats package) for non-linear\nregression, the Nelder-Mead simplex method (optim function in the stats package)\nfor minimization of the weighted sum of squares, as well as the genetic algorithm\n(genoud function in the rgenoud package) were applied. Users just follow the menu\nstep by step, and then will get the job done. Fourteen pharmacokinetic models were built:\nintravenous drug administrations with i.v. bolus or i.v. infusion, and extravascular\ndrug administrations, either linear with first-order absorption/elimination or\nnonlinear (Michaelis-Menten). Two weighting schemes, 1/Cp(obs)\nand 1/Cp²(obs), were also included. The output information included a summary\ntable (consisting of time, observed and calculated drug plasma/serum\nconcentrations, weighted residuals, the area under the plasma concentration curve (AUC),\nand the area under the first moment curve (AUMC)), goodness-of-fit statistics, final PK parameter\nvalues, and plots such as linear plots, semi-log plots, and residual plots. For\nsimulation, the runif and rnorm functions from the stats package provide the\ngeneration of random uniform and normal\ndeviates for the PK parameters, respectively. Further, we also provide a\nMonte Carlo simulation function. Results and Discussion: We call this tool PKfit.\nIt has been announced publicly, and can be downloaded from mirror sites of CRAN\n(package name: pkfit). With only a few examples, most results obtained from\nPKfit were comparable to those obtained from two other pharmacokinetic programs,\nWinNonlin and Boomer. Conclusion and Future Work: PKfit, running on R, has been\nbuilt and has been shown to provide efficiency and accuracy in data\nfitting. Multiple dosing models or algorithms may be required for\nfurther development of PKfit."} {"Title":"Small groups and questionnaires","Author":"Lucien Lemmens","Session":"foc-social-1-2","Keywords":"social sciences","Abstract":"Most administrations want to have surveys on the quality of services provided by\ntheir officials. The questionnaire technique is often used. For a number of\nitems one asks the respondents to indicate how strongly they agree or disagree\nwith a given statement. Usually several items form a dimension – a name given\nto an essential part of the service – and the survey of the dimensions reports\na summary of the attitude of the respondents. This summary is used to evaluate\nthe performance of the official and can have consequences for promotion. Because\nthese surveys can have consequences, they are contested when the groups are\nsmall: a classical analysis rarely avoids the use of the central limit theorem\nor the law of large numbers. In Bayesian statistics, however, the inverse\nprobability problem is readily solved given the likelihood function of the\nproblem and a prior density, and the evidence follows as usual from normalization.\nAssume that a respondent can take 6 attitudes for an item; the information we\nwant to obtain is then {Ni, i ∈ [1, · · · , 6]} with Σi Ni = N, where N is the\nnumber of possible respondents in the complete group.
The information we obtain\nin the questionnaire is {ni, i ∈ [1, · · · , 6]} with Σi ni = n, where n is\nthe number of respondents for that item. The knowledge about {ni} will be used\nto guess the {Ni}. This model has a multivariate hypergeometric density with the\nNi as parameters. The prior starts from an educated guess that predicts the {Ni}\nwithout using results from the questionnaire on that item. The most convenient\ndensity is a multinomial with given {pi}, where the pi indicate the plausibility\nthat a respondent takes the attitude i. Combining this setting for an item with a\nmodel for a dimension, we can use the posterior of the first item as a prior for\nthe second item and so on. This leads, for the dimension, to a Dirichlet model\nthat belongs to the exponential family. Hence the results are obtained by\nupdating, avoiding numerical integration for the calculation of the evidence.\nAlthough the statistical analysis is computationally simple, there are a lot of\nsurveys to be analyzed and communicated to decision makers. A relatively simple\nR code was written to automate the analysis, and decision-theoretic arguments\nare used to implement a graphical representation of the uncertainties on the\ndata."} {"Title":"Comparison of spatial interpolation methods using a simulation experiment based on Australian seabed sediment data","Author":"Jin Li and Andrew Heap","Session":"foc-spatial-1-1","Keywords":"spatial","Abstract":"Spatial distribution data of environmental variables are increasingly required\nas geographic information systems (GIS) and modelling techniques become powerful\ntools in natural resource management and biological conservation. However, the\nspatial distribution data are usually not available and the data available are\noften collected from point sources. This is particularly true of seabed data for\nthe world’s oceans, especially the deep ocean. A typical example is\nAustralia’s marine region. Here, Geoscience Australia has to derive spatial\ndistribution data of seabed sediment texture and composition for 8.9 million km²\nof Australia’s marine region from about 14,000 sparsely and unevenly\ndistributed samples. The need for these data comes from seabed habitat\nclassifications and predictions of marine biodiversity as key information\nsources supporting ecosystem-based management. Spatial interpolation techniques\nprovide essential tools to generate such spatial distribution data by estimating\nthe values for the unknown locations using the point samples, but they are often\ndata- or even variable-specific. The estimation of a spatial interpolator is\nusually affected by many factors including the nature of the data and sample\ndensity. There are no consistent findings about how these factors affect the\nperformance of spatial interpolators. Therefore, it is difficult to select an\nappropriate interpolator for a given input dataset. In this study, we aim to\nselect appropriate spatial interpolation methods by comparing their respective\nperformance using a simulation experiment based on Australian seabed sediment\ndata in R. Three factors affecting the accuracy and precision of the\ninterpolations are considered: the spatial interpolation method, spatial\nvariation in the data, and sample density. Stratification based on geomorphic\nfeatures is also used to improve estimation. Bathymetry data are considered as\nsecondary information in the experiment. Cross-validation is used to assess the\nperformance of spatial interpolation methods.
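A hedged sketch of the cross-validation logic used to compare interpolators, on synthetic points and with a hand-rolled inverse-distance-weighting predictor rather than the actual seabed data or the full set of methods in the study:

  # Hedged sketch: leave-one-out cross-validation of a simple inverse-distance
  # weighting (IDW) interpolator against a global-mean baseline.
  set.seed(4)
  n  <- 150
  xy <- cbind(runif(n), runif(n))
  z  <- sin(3 * xy[, 1]) + cos(3 * xy[, 2]) + rnorm(n, sd = 0.2)
  idw_pred <- function(i, power = 2) {
    d <- sqrt(colSums((t(xy[-i, , drop = FALSE]) - xy[i, ])^2))
    w <- 1 / d^power
    sum(w * z[-i]) / sum(w)
  }
  pred_idw  <- sapply(seq_len(n), idw_pred)
  pred_mean <- sapply(seq_len(n), function(i) mean(z[-i]))
  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
  c(IDW = rmse(z, pred_idw), global_mean = rmse(z, pred_mean))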
Results of this experiment provide\nsuggestions and guidelines for improving the spatial interpolations of marine\nenvironmental data, which have application in using seabed mapping and habitat\ncharacterisations to achieve management and conservation goals."} {"Title":"A Closer Examination of Extreme Value Theory Modeling in Value-at-Risk Estimation","Author":"Wei-han Liu","Session":"foc-finance_risk-1-2","Keywords":"finance","Abstract":"Extreme value theory has been widely used for modeling the tails of return\ndistributions. The generalized Pareto distribution (GPD) is popularly acknowledged as\none of the major tools in Value-at-Risk (VaR) estimation. As Basel II tightens\nthe significance level for VaR estimation from the previous 5% quantile level to\na more extremal quantile level of 1%, it demands a more accurate estimation\napproach. It is imperative to take a closer look at GPD modeling\nperformance at more extremal quantile levels. Empirical analysis outcomes show that\nthe acknowledged outperformance of GPD is sustained at the 5% quantile level but not\nat the 1% level. Alternative methods are introduced, and the empirical outcomes\nindicate that both penalized spline smoothing in semiparametric regression\nand the maximum entropy density based dependent data bootstrap outperform GPD in\nmodeling extremal quantile levels lower than 1%."} {"Title":"R and Stata for Building Regression Models","Author":"Andras Low","Session":"foc-teach-2-2","Keywords":"teaching","Abstract":"The Stata system provides many routines for data manipulation and data analysis.\nStata has excellent capabilities for developing models on count data. For a\nspecific task, a user interface can help the researcher focus on the\ntopic of his/her research instead of on syntax. Creating a web browser based user\ninterface for a special purpose is easier in R with Tcl/Tk, R2HTML and Rpad. With\nthis interface the user can • specify different models, • compare the models\nwith each other, • diagnose the residuals, • update the models, • verify\nthe assumptions derived from the models, • document them."} {"Title":"Surface and Sprinkle Irrigation Analysis with R","Author":"Vilas Boas Marcio Antonio and Uribe-Opazo Miguel Angel and Alves da Silva Edson Antonio","Session":"foc-environ-2-4","Keywords":"environmetrics-misc","Abstract":"The application of water to agricultural lands for the purpose of irrigation is\none of the alternate uses of this natural resource in many areas. It is\nessential that water be used effectively and efficiently, whether the supply is\nlimited or excessive. Irrigation efficiency is a concept used extensively in\nsystem design and management. It can be divided into two components, uniformity\nof application and losses. If either uniformity is poor or losses are large,\nefficiency will be low. This paper will present Irrigation-R for basic\nirrigation analysis. Functions for the uniformity and efficiency of surface\nand sprinkle irrigation were created. The program is not intended for experts and gives\ndirect access only to a very limited set of R functionality. Strengths and\nweaknesses of the approach and possible further development steps will be\ndiscussed. Also, the results of an empirical investigation will be presented\nthat tests whether the Windows look and feel really can lower the entry\nbarriers for novice users.
The visualization is implemented using R under GLIB\nover Windows environment."} {"Title":"The statistical evaluation of DNA crime stains in R","Author":"Miriam Marusiakova","Session":"kal-bio-1-3","Keywords":"bioinformatics-models","Abstract":"Suppose a crime has been committed and a blood stain was collected from the\ncrime scene. It is believed the stain was left by an offender. A suspect is\narrested and it is found out that his DNA profile matches the DNA profile of\nthe crime stain. In forensic science, it is common to consider DNA profile\nmatch probabilities under the hypothesis that the offender was someone else\nthan the suspect. The problem was investigated, e.g., in [1], under general\nassumptions allowing for population substructure and relatedness. In case of DNA\nmixtures (from more than one person), the weight of the DNA evidence is assigned\nin terms of likelihood ratio of match probabilities, comparing two hypotheses\nabout origin of the mixture. Authors in [2] obtained a general formula for\ncalculation of match probabilities under assumption of independent alleles in\nDNA profiles. The result was further extended by [3] and [4] to allow for\npopulation substructure and dependence. The DNA mixture problems with presence\nof relatives were discussed, e.g., in [5]. The aim of this talk is to introduce\nan R package called forensic where the calculations of match probabilities\nmentioned above are implemented. The functionality of the package will be\ndemonstrated using data from real situations."} {"Title":"R packages from a Fedora perspective","Author":"José Matos","Session":"foc-misc-1-3","Keywords":"platform","Abstract":"Fedora is a Linux distribution that showcases the latest in free and open source\nsoftware. It serves as the common root where other Linux distributions branch,\nwith the best known being Red Hat Enterprise Linux, CentOS and Scientific\nLinux. Both Fedora, R and most R packages are free software as defined by the\nFree Software Foundation. Although Fedora and R share such an important feature\ntechnically they have different goals. The purpose of R is to work in the\nlargest possible set of platforms (assuming that there are interested\ndevelopers). The purpose of Fedora is to package the largest possible set of\nfree software packages and have them smoothly integrated into a single set. When\npackaging R packages in Fedora the chalenge becomes then how to bring together\nthese different goals. This talk deals with some of these issues. It should be\nnoted that these challenges are common to other free software projects like\nPerl, Python, TEX ( and others languages) on one side and Linux distributions on\nthe other."} {"Title":"Desirabilitiy functions in multicriteria optimization - Observations made while implementing desiRe","Author":"Olaf Mersmann and Heike Trautmann and Detlef Steuer and Claus Weihs and Uwe Ligges","Session":"foc-numerics-1-1","Keywords":"numerics","Abstract":"Desirability functions and desirability indices are powerful tools for\nmulticriteria optimization und multicriteria quality control purposes. 
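As a hedged illustration of what a desirability function and a desirability index are (illustrative helpers only; these are not the functions exported by desiRe), a Derringer/Suich-type "larger is better" transformation and a geometric-mean index can be written as:

  # Hedged sketch of a Derringer/Suich-type "larger is better" desirability
  # function and of the overall desirability index (geometric mean).
  d_larger <- function(y, lower, target, r = 1) {
    pmin(pmax((y - lower) / (target - lower), 0), 1)^r  # 0 below lower, 1 at/above target
  }
  d_index <- function(...) exp(mean(log(c(...))))       # geometric mean of desirabilities
  d1 <- d_larger(78,   lower = 60,   target = 90)          # hypothetical response 1
  d2 <- d_larger(0.92, lower = 0.80, target = 1.00, r = 2) # hypothetical response 2
  d_index(d1, d2)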
The\npackage desiRe not only provides functions for computing desirability functions\nof Harrington- (Harrington, 1965) and Derringer/Suich-type (Derringer and Suich,\n1980) but also allows the specification of functions in an interactive manner.\nDensity and distribution functions of the desirability functions and the\ndesirability index are integrated including the possibility of random number\ngeneration (Steuer, 2005), (Trautmann and Weihs, 2006). Optimization procedures\nfor the desirability index and a method for determining the uncertainty of the\noptimum influence factor levels (Trautmann and Weihs, 2004) as wells as a\ncontrol chart for the desirability index with analysis of out-of control-signals\nare implemented (Trautmann, 2004). The Desirability Pareto-Concept allows\nfocussing on relevant parts of the Pareto-front by integrating\na-priori-expert-knowledge in the multicriteria optimization process (Mehnen et\nal., 2007). We will focus on the implementation of the Desirability\nPareto-Concept in R. First we will give a short review of the traditional\noptimization strategy using desirability indices. Then, after showcasing NSGA-II\n(Deb et al., 2002), we will briefly talk about how desirability functions can\nbe integrated into optimization procedures that estimate the pareto front.\nFinally some of the problems faced during the development will be discussed.\nThese include interfacing R and C code and using functions as first class\nobjects. In addition a short overview of the package will be given."} {"Title":"An automated R tool for identifying individuals with difficulties in a large pool of raters","Author":"Pete Meyer and Shaun Lysen","Session":"kal-app-1-5","Keywords":"connectivity","Abstract":"R is used extensively by the analysts at Google for analyzing everything from\nvery small to very large datasets, from one-off analyses to regular production\nruns. In this talk we describe the use of R in flagging raters involved in the\nassessment of ad quality, who appear to be having difficulty performing their\nrating tasks. The use of this R script has resulted in an increase in system\nefficiency, improved timeliness of responding to rater needs, and decreased\nburden on those managing the raters. The package RMySQL allows R to seamlessly\nintegrate with MySQL databases, enabling data access directly to the production\ndatabases containing rater scores. Likewise, the R2HTML package provides output\nin a browser supported format, enabling report generation that can display web\ncontent and which enables movement between summary tables and supporting\ndocumentation using hyperlinks. Leveraging these features of R, we describe\ngenerating flags for three warning signs of rater difficulty: 1. excessive run\nlengths of repeated values, 2. the repetitive use of identical values for two\ndistinct measures, and 3. identifying sequences of scores that appear to be\nassigned randomly rather than specific to the ads involved. These tests could\nnot be done by eye, either because of the large number of tasks involved or\nbecause they depend upon comparisons to reference distributions that are not\nvisually apparent. However, those managing the raters easily grasp the\nconceptual basis for the tests and the summary tables contain hyperlinks to\ndocumentation that enables them to quickly find, cut and paste constructive\nfeedback to the raters into emails in a simple and efficient manner. 
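A hedged base-R sketch of the first warning sign, flagging excessive run lengths of repeated scores; the threshold and the toy score vector are illustrative, not the production rules used for the rater data:

  # Hedged sketch: flag suspiciously long runs of identical scores with rle().
  flag_long_runs <- function(scores, max_run = 8) {
    r <- rle(scores)
    data.frame(value = r$values, run = r$lengths)[r$lengths > max_run, ]
  }
  set.seed(5)
  scores <- c(sample(1:5, 40, replace = TRUE), rep(3, 12),
              sample(1:5, 40, replace = TRUE))
  flag_long_runs(scores)   # returns the runs worth a closer look (here the long run of 3s)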
While these\nflags would be difficult to program in SQL, they are straightforward in R."} {"Title":"The strucplot framework for Visualizing Categorical Data","Author":"David Meyer and Achim Zeileis and Kurt Hornik","Session":"kal-visual-1-3","Keywords":"visualization","Abstract":"The vcd package (‘Visualizing Categorical Data’) has been around for quite a\nwhile now. This talk demonstrates the capabilities of a major part in this\npackage: the strucplot framework. We give an overview on how state of the art\ndisplays like mosaic and association plots can be produced, both for exploratory\nvisualization and model-based analysis. Exploratory techniques will include\nspecialized displays for the bivariate case, as well as pairs plotlike displays\nfor higher-dimensional tables. As for the model-based tools, particular emphasis\nwill be given to methods suitable for the visualization of conditional\nindependence tests (including permutation tests), as well as for the\nvisualization of particular GLMs (such as log-linear models)."} {"Title":"Random Forests for eQTL Analysis: A Performance Comparison","Author":"Jacob Michaelson and Andreas Beyer","Session":"foc-bioinf-2-1","Keywords":"bioinformatics-systems","Abstract":"In recent years quantitative trait locus (QTL) methods have been combined with\nmicroarrays, using gene expression as a quantitative trait for genetic linkage\nanalysis. Finding genetic loci significantly linked to the expression of a gene\ncan help to identify regulators of the expressed gene. Traditional QTL methods\nused to find expression quantitative trait loci (eQTL) typically apply a\nunivariate model to each genotyped locus in order to assess linkage to the\nquantitative trait. This univariate approach makes it difficult to uncover the\ninteracting genes in the upstream regulatory pathway of the target. As has been\npreviously suggested [1], in this work we view the eQTL problem as one of\nmultivariate model selection: finding the genotyped loci which together best\nexplain the variability of target gene expression in a population. We performed\nregression with Random Forests using the genotyped loci as predictor variables\nand the gene expression as the response. Measures of variable importance\nreturned by Random Forests were used in locating eQTL. To assess whether this\nwas a valid approach to eQTL, we determined eQTL for transcriptional targets of\nseveral canonical regulatory pathways using both Random Forests and several\nconventional QTL methods provided by the R qtl package. Gene expressions derived\nfrom several tissues of recombinant inbred mouse strains were used, and each\neQTL method was evaluated for its ability to recapitulate known members of the\ncanonical regulatory pathways. The results of our work demonstrate the\nbiological validity and performance advantages of using Random Forests as a tool\nfor finding eQTL."} {"Title":"Cross-sectional and spatial dependence in panels","Author":"Giovanni Millo","Session":"kal-ts-1-3","Keywords":"econometrics","Abstract":"Econometricians have recently turned towards the problems posed by\ncrosssectional dependence across individuals, which may range from inefficiency\nof the standard estimators and invalid inference to inconsistency. Panel data\nare especially useful in this respect, as their double dimensionality allows\nrobust approaches to general cross-sectional dependence. 
A general object\noriented approach to robust inference is available in the R system (Zeileis,\n2004), for which all that is needed are coefficients β̂ and robust\nestimators for vcov(β̂). A useful implementation is, e.g., in linear hypothesis\ntesting (see Fox, package car). The plm package for panel data econometrics\nalready has features for heteroskedasticity- and serial correlation-robust\ninference (Croissant and Millo, forthcoming). If cross-sectional dependence is\ndetected, using a robust covariance estimator allows valid inference. I describe\nthe implementation in the plm package for panel data econometrics of: • tests\nfor detecting cross-sectional dependence in the errors of a panel model\n(Friedman 1928, Frees 1995, Pesaran 2004) • robust estimators of covariance\nmatrices for doing valid inference in the presence of cross-sectional dependence\n(White 1980, Beck and Katz 1995, Driscoll and Kraay 1998) If a particular\nspatial structure is assumed, this allows a parsimonious characterization of\nspatial dependence but, conversely, the resulting models are\ncomputationally expensive to estimate, all the more so in the panel case.\nEfficient ML estimators for spatial models on a cross-section (Anselin 1988)\nare implemented in the spdep package (Bivand et al.). I describe the implementation\nin a forthcoming package of • marginal and conditional LM tests for spatial\ncorrelation, serial correlation and random effects (Baltagi, Song, Jung and Koh\n2007) • ML estimators for panel models including spatial lags, spatial errors\nand possibly serial correlation (Anselin 1988, Elhorst 2003, Baltagi, Song, Jung\nand Koh 2007) I illustrate the functionalities by application to Munnell’s\n(1990) data on 48 US states observed over 17 years. On an ordinary desktop\nmachine, the estimators and tests all take under one minute (a few seconds for the\nbasic ones). The ML approach is nevertheless structurally limited to a few\nhundred cross-sectional observations, so further work is warranted to implement\nKapoor, Kelejian and Prucha (2007)’s GM approach, which promises to handle\nproblems with n in the thousands."} {"Title":"Resolving components in mass spectrometry data: parametric and non-parametric approaches","Author":"Katharine Mullen and Ivo van Stokkum","Session":"foc-chemo-1-2","Keywords":"chemometrics","Abstract":"A fundamental problem in mass spectrometry data analysis is the decomposition of a\nmatrix of measurements D, the rows of which represent times and the columns of\nwhich represent mass-to-charge ratio, into two matrices C and S, so that D = CS^T\nand column i of C represents a component contributing to the data with respect\nto time (called an elution profile), and column i of S represents the mass\nspectrum of that component. This decomposition allows the compounds in a complex\nsample to be identified by taking the maximum of the elution profile of a\ncomponent (that is, its retention time) and its mass spectrum and matching these\nproperties to those of a known compound stored in a database. A popular\nnonparametric means of resolving C and S given D is multivariate curve\nresolution alternating least squares (MCR-ALS), which combines the alternating\nleast squares algorithm with constraints to impose nonnegativity, unimodality,\nselectivity, etc. MCR-ALS also allows the resolution of components in many\ndatasets D1, . . . , DK simultaneously. We present a package ALS to perform\nMCR-ALS in R.
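A stripped-down illustration of the alternating least squares idea with a nonnegativity constraint follows (a toy with simulated elution profiles and spectra, not the ALS package's interface):

  # Hedged toy sketch of MCR-ALS: alternate least-squares updates of C and S
  # under a nonnegativity constraint so that D is approximated by C %*% t(S).
  set.seed(6)
  ncomp  <- 2
  C_true <- cbind(dnorm(1:50, 15, 4), dnorm(1:50, 30, 6))   # elution profiles
  S_true <- matrix(runif(40 * ncomp), ncol = ncomp)          # "mass spectra"
  D <- C_true %*% t(S_true) + rnorm(50 * 40, sd = 0.001)
  C <- abs(matrix(rnorm(50 * ncomp), ncol = ncomp))          # random start
  for (it in 1:100) {
    S <- pmax(t(qr.solve(C, D)), 0)       # least squares for S given C, clipped at 0
    C <- pmax(t(qr.solve(S, t(D))), 0)    # least squares for C given S, clipped at 0
  }
  sum((D - C %*% t(S))^2)                  # residual sum of squares after ALS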
While the package can be applied to any kind of data, it includes\nfunctions to plot mass spectra in particular. A new methodology for resolving C\nand S given D currently in development uses a parametric description for C (in\nwhich components are usually described by functions based on a exponentially\nmodified Gaussian), and optimizes the resulting separable nonlinear least\nsquares problem to improve estimates for nonlinear parameters, while treating\nthe mass spectra S as conditionally linear parameters. Like MCR-ALS, the\nmethodology is well-suited to resolving components in many datasets\nsimultaneously. We present options for the package TIMP that implement this\nparametric model-based methodology, and address issues such as outliers, a\nbaseline, and instrument saturation."} {"Title":"Package Development in Windows","Author":"Duncan Murdoch","Session":"invited","Keywords":"invited","Abstract":"Developing R packages in Windows is much like developing them on Unix/Linux\nsystems, except that most Windows users don’t have the necessary tools\ninstalled. In this talk I will describe how to get the tools (which is much\neasier now than it was even two years ago), and give an overview of how to use\nthem. I will follow this with a demonstration of how to put together a simple\npackage including external C code. The issues here are common to all platforms:\nhow to set up the package, how to install and test it, and how to package it for\ndistribution to others."} {"Title":"Speeding up R by using ISM-like calls","Author":"Junji Nakano and Ei-ji Nakama","Session":"foc-highperf-1-1","Keywords":"high performance computing-large memory","Abstract":"R sometimes analyzes huge amount of data and requires huge size of memory\noperation for them. Many operating system have calls to help handling such huge\nmemory. For example, Solaris has ‘ISM (Intimate Shared Memory)’ mechanism,\nLinux has ‘Huge TLB (Translation Look aside Buffer)’ and AIX has ‘Large\nPage’. OS usually translates 4-8 KB logical addresses to physical addresses at\na time. These ISM-like mechanisms can change this size to much larger, such as\n2-256 MB to speed up handling large memory. However, the cost of translation\nbetween logical addresses and physical addresses is called ‘TLB miss’ and\nsometimes becomes a bottle-neck. We introduce the use of ISM-like mechanisms in\nR by adding a wrapper program on the memory allocation function of R and\ninvestigate the performance of them."} {"Title":"ccgarch: An R package for modelling multivariate GARCH models with conditional correlations","Author":"Tomoaki Nakatani","Session":"foc-finance-1-2","Keywords":"finance","Abstract":"The multivariate GARCH models with explicit modelling of conditional\ncorrelations (the CC-GARCH models) have been widely used in modelling\nhigh-frequency financial time series. Examples include the Constant Conditional\nCorrelation GARCH, the Dynamic Conditional Correlation GARCH, and the Smooth\nTransition GARCH models and their extensions to allow for volatility spillovers.\nThe package ccgarch provides functionality for estimating the major variants of\nthe CC-GARCH models in arbitral dimensions. Both normal and robust standard\nerrors for the parameter estimates are calculated through analytical\nderivatives. Numerical optimisations are carried out in such a way that negative\nvolatility spillovers are allowed. The package is capable of simulating data\nfrom the major family of the CC-GARCH models with multivariate normal or\nstudent’s t innovations. 
Procedures for misspecification diagnostics such as\na test for volatility interactions are also included in ccgarch. In the\npresentation, we will discuss the usefulness, limitations and directions for\nmodification of the package."} {"Title":"R meets the Workplace - Embedding R into Excel and making it more accessible","Author":"Erich Neuwirth","Session":"foc-gui_frontend-1-1","Keywords":"user interfaces-embedding","Abstract":"One of the problems standing in the way of a more widespread use of R by nonspecialists\n(i.e. users with neither a high proficiency in software controlled by a\nclassical programming language paradigm nor a deeper knowledge of\nstatistical methods) is the difficulty of starting to use R. Most statistical\ndata become analyzable data by being entered into Excel. Therefore, being able\nto transfer data from Excel to R is a key issue for wider use of R. There are\ntechnical answers like the packages RODBC or xlsReadWrite which essentially\nallow transfer of data frames. RExcel, an add-in for Excel, offers very similar\nfacilities, but also brings the sequential programming paradigm of R and the\ndependency-based automatic recalculation model of Excel closer together. It\nallows the use of R expressions as Excel formulas, combining the power of R’s\ncomputational engine with the dependency tracking mechanisms of Excel. An\nadditional problem is the syntactic complexity of R formulas. A very nice tool\nfor “guided discovery learning” of how to build R expressions is the R\nCommander, which gives the user a menu-driven interface to statistical methods\ncomparable to, say, SPSS, but at the same time displays the R expressions\nneeded to produce the requested results. The user then has the option of\nmodifying these expressions to adapt the result (data or graphs) to his needs.\nThe latest incarnation of RExcel embeds R Commander within Excel. The R\nCommander becomes an Excel menu, and in this way the naive user is presented\nwith an already well-established interface to R as an extension to Excel. R\nCommander allows developers to write plugins, i.e. their own extensions to the\nmethods offered by the menu interface, and thereby becomes a hub for making any\nR method available through a menu-driven interface. RExcel is compatible with\nthis plugin mechanism, so any extension to R Commander also becomes an extension of\nExcel. The embedding mechanism of R Commander into Excel does not directly use\nthe automatic recalculation engine, but R Commander can be used to support\n“production” of R formulas which can then be turned into Excel formulas.\nCombining the power of Excel’s dependency tracking mechanism and spatial\nparadigm for establishing relationships, R’s powerful programming paradigm and\ncomputational engine, and the ease with which R Commander’s menu system creates R\nformulas seems to offer a very powerful combination of methodologies to make R\nmore accessible to a much wider class of users than R alone."} {"Title":"Retrieving old data using 'read.isi'","Author":"Rense Nieuwenhuis","Session":"foc-conn-1-2","Keywords":"connectivity","Abstract":"Background: Due to technological and software development, it sometimes is no\nlonger possible to automatically read older data-files into statistical\nsoftware. Especially data-files that originate from the times when magnetic tapes\nwere used to store data are often distributed as raw (ASCII) data, without\nproper means to read those data into statistical packages.
However, for those\ninterested in using data to perform longitudinal analyses, these older sets of\ndata are very valuable. In the Netherlands, the national archive for data\nstorage (DANS) is currently organizing conferences on a unified and time-proof\nmanner of storing data-files. But what to do with those data that already have\nbecome difficult to access?\n\nThe Problem: In a research project on fertility issues, it was found that the\n‘World Fertility Surveys’1 are stored in a format that is no longer\n(directly) accessible to commonly used statistical software. Only data converted\nto ASCII directly from magnetic tape and a code-book are provided. The\ncode-books are in a format specific by the ‘International Statistical\nInstitute’ (ISI) and provides for each variable information on starting and\nending positions in the data-file, valueand variable labels and information on\nmissing values. However, no statistical software package presently used is known\nto be able to automatically read data based on this type of code-book. It was\nrequired to read all variables into the statistical software manually. Variable\nnames and value labels have to be assigned manually as well. This is not an\ninviting process and a highly laborious when many variables are needed.\n\nThe Solution: This problem may however be solved – in select cases – by\nusing R-Project. Applying the flexible data-structure provided by R-Project, it\nwas possible to read and interpret the code-books (meant for the human eye) and\nto use this to automatically read the data, add value and variable labels,\nassign missing values, and to do this for whole data-sets at once. The resulting\nsyntax was transformed to the function called ‘READ.ISI’. V106 141 2 0 1 88\nRemariee 0 Non 1 Oui 88 Non rompue 99 Etat actuel Above, a small fragment of one\nof the code-books is shown. The function READ.ISI reads these fixed-width ASCII\nfile twice. Once to read the variable names, labels, starting- and ending\npositions, and missing values (on the first and last row of the example above).\nThe second time to read the value labels (in the middle rows of this example).\nAs is illustrated on the last row of the fragment above, the value labels of\nvariable ‘V107’ are identical to that of ‘V104’. This is taken into\naccount as well. Based on this automatic interpretation of this code-book,\neither the ASCII data-file is read, or a SPSS-syntax is created illustrating\nthat people using other statistical packages can benefit from this function as\nwell.\n\nProposal: Applicable to a select number of R users, but highly valuable for\nthose who want to use (some) old data, this approach will help and inspire those\nwho are interested in longitudinal analysis. Possibly, this approach can be\ntransferred to the code-books of other collections of data. Therefore, I feel\nthat this would make an excellent poster presentation on the userR! conference.\nOn this poster the problem could be clearly illustrated and the steps needed to\nread this type of data automatically will be identified.\n"} {"Title":"Invariant coordinate selection for multivariate data analysis - the package ICS","Author":"Klaus Nordhausen and Hannu Oja and David Tyler","Session":"kal-app-1-4","Keywords":"multivariate","Abstract":"Invariant coordinate selection (ICS) has recently been introduced by Tyler et\nal. (2008) as a method for exploring multivariate data. It includes as shown in\nOja et al. 
(2006) as a special case a method for recovering the unmixing matrix\nin independent components analysis (ICA). It also serves as a basis for classes\nof multivariate nonparametric tests. The aim of this paper is to briefly\nexplain the ICS method and to illustrate how various applications can be\nimplemented using the R-package ICS. Several examples are used to show how the\nICS method and ICS package can be used in analyzing a multivariate data set."} {"Title":"Automating Business Modeling with the AutoModelR package","Author":"Derek Norton","Session":"foc-business-1-1","Keywords":"business","Abstract":"Many issues arise in the business environment, like other fields, which must be\naddressed. Many of these issues have been addressed in other fields separately,\nbut need to be jointly addressed in a business environment. The objective of\nAutoModelR is to attempt to address these issues in an automatic manner. The\nissues addressed are: Exploratory Data Analysis, Dimension Reduction, and\nAutomatic Initial Modeling. To address these issues, a data set is passed to\nAutoModelR consisting of a dependant variable and one or more (often many more)\nindependent variables. The first step is to remove variables which have zero\nvariation, missing value percentages above a threshold, and variables with one\nunique value accounting for more than some threshold percentage. A report is\nthen generated using Sweave which gives tables of descriptive statistics for\nnumeric and factor data separately as well as graphical displays of the data.\nThe next step is an application of a filter type dimension reduction to arrive\na smaller data subset for initial modeling. The last step is to automatically\nfit various simple models determined by the type of dependent variable and\nreport on those fits. AutoModelR is an attempt to automate some of the\nrepetitive steps in modeling, so that more time can be spent on advanced\nmodeling."} {"Title":"rPorta - An R Package for Analyzing Polytopes and Polyhedra","Author":"Robin Nunkesser and Silke Straatmann and Simone Wenzel","Session":"foc-numerics-1-2","Keywords":"numerics","Abstract":"In application fields like mechanics, economics and operations research the\noptimization of linear inequalities is of interest. There are algorithms\nhandling such problems by utilizing the theory of polyhedral convex cones (PCCs)\nin particular that PCCs can be defined by a span form or a face form. These two\nrepresentations are called double description pair and represent the link\nbetween PCCs and linear inequalities. In practice, the transformation from one\nform into the other is often useful. Here, we present an R package called rPorta\nproviding a set of functions for polytopes and polyhedra mainly intended for the\ndouble description pair. The underlying algorithms used are part of a program\nnamed PORTA (Polyhedron Representation Transformation Algorithm) that comprises\na collection of routines for analyzing polytopes and polyhedra in general. In\nparticular, it supports both representations of PCCs, i.e. the representation as\na set of vectors and as a system of linear equations and inequalities. The main\nfunctionality of PORTA is the transformation from one representation to the\nother, but PORTA also provides other handy routines, e.g. to check whether\npoints are contained in a PCC or not. All functions of PORTA read and write data\nfrom text files containing one of the two representations, i.e. the user\ninterface of PORTA communicates through text files. 
Our package rPorta provides\nan interface to use the routines of PORTA in R by enwrapping the text file\ninformation in S4 Objects. This ensures an easy-to-use and R friendly way to run\nthe functions of PORTA. In addition, an application of rPorta in design of\nexperiments is presented. In engineering processes the parameter space often\ncontains parameter settings that produce missing values in the design since the\nproduced workpiece fails. The goal is to create a design where the design points\nconcentrate in the feasible area, although the boundaries where missing values\nare liable to occur are not known. The design is created sequentially and\nemerging missing values are used to update the excluded failure regions, which\nis done with the help of PCCs determined with rPorta."} {"Title":"A first glimpse into 'R.ff', a package that virtually removes R's memory limit","Author":"Jens Oehlschlägel and Daniel Adler and Oleg Nenadic and Walter Zucchini","Session":"kal-highperf_con-1-2","Keywords":"high performance computing-large memory","Abstract":"The availability of large atomic objects through package ’ff’ can be used to\ncreate packages implementing statistical methods specifically addressing large\ndata sets (like subbagging or package biglm). However, wouldn’t it be great if\nwe could apply all of R’s functionality to large atomic data? Package\n’R.ff’ is an experiment to provide as much as possible of R’s basic\nfunctionality as ’ff-methods’. We report first experiences with porting\nstandard R functions to versions operating on ff objects and we discuss\nimplications for package authors (and maybe also R core). Instead of a summary,\nhere we just quicken your appetite through the list of functions and operators\nwhere we have first experimental ports: ! != \\%\\% \\%*\\% \\% /\\% \\& | * + - / <\n<= == > >= ^ abs acos acosh asin asinh atan atanh besselI besselJ besselK\nbesselY beta ceiling choose colMeans colSums cos cosh crossprod cummax cummin\ncumprod cumsum dbeta dbinom dcauchy dchisq dexp df dgamma dgeom dhyper digamma\ndlnorm dlogis dnbinom dnorm dpois dsignrank dt dunif dweibull dwilcox exp expm1\nfactorial fivenum floor gamma gammaCody IQR is.na is.nan jitter lbeta lchoose\nlfactorial lgamma log log10 log1p log2 logb mad order pbeta pbinom pcauchy\npchisq pexp pf pgamma pgeom phyper plnorm plogis pnbinom pnorm ppois psigamma\npsignrank pt punif pweibull pwilcox qbeta qbinom qcauchy qchisq qexp qf qgamma\nqgeom qhyper qlnorm qlogis qnbinom qnorm qpois qsignrank qt quantile qunif\nqweibull qwilcox range range rbeta rbinom rcauchy rchisq rexp rf rgamma rgeom\nrhyper rlnorm rlogis rnbinom rnorm round rowMeans rowSums rpois rsignrank rt\nrunif rweibull rwilcox sample sd sign signif sin sinh sort sqrt summary t\ntabulate tan tanh trigamma trunc var."} {"Title":"BMDS: A Collection of R Functions for Bayesian Multidimensional Scaling","Author":"Kensuke Okada and Kazuo Shigemasu","Session":"foc-bayes-1-2","Keywords":"bayesian","Abstract":"Bayesian MDS has recently attracted a great deal of researchers’ attention\nbecause (1) it provides a better fit than classical MDS and ALSCAL, (2) it\nprovides estimation errors of the distances, and (3) the Bayesian dimension\nselection criterion, MDSIC, provides a direct indication of optimal\ndimensionality; see the original paper by Oh & Raftery (2001). However, Bayesian\nMDS is not yet widely applied in practice. 
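For orientation, the classical-MDS baseline that such comparisons refer to is available in base R; the sketch below (illustrative data, and only a raw stress-type measure, not the package's own evaluation functions) shows the kind of configuration-plus-stress output against which a Bayesian fit can be judged:

  # Hedged sketch: classical MDS with cmdscale() and a raw stress-type value.
  d_obs <- dist(scale(USArrests))            # observed dissimilarities
  conf2 <- cmdscale(d_obs, k = 2)            # 2-dimensional configuration
  d_fit <- dist(conf2)                       # fitted inter-point distances
  stress <- sqrt(sum((d_obs - d_fit)^2) / sum(d_obs^2))
  stress
  plot(conf2, xlab = "Dimension 1", ylab = "Dimension 2",
       main = "Classical MDS configuration")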
One of the reasons can be attributed\nto the apparent lack of software: there is none except for the original Oh &\nRaftery’s code, which requires good experience in Fortran programming and the\nIMSL library, which is a commercial library for numerical calculation. It may be\ndifficult to require such environment for many researchers. Considering this\nsituation, we propose a set of R functions, BMDS, to perform Bayesian MDS and to\nevaluate the results. Using BMDS, researchers can (1) perform Bayesian\nestimation in MDS, (2) check the convergence of Markov chain Monte Carlo (MCMC)\nestimation, (3) evaluate the optimal number of dimensions, (4) evaluate the\nestimation errors and (5) plot the resultant configurations. Also, using BMDS\nusers can comparatively evaluate the result of Bayesian and classical MDS in\nterms of the value of stress and the plot of observed and estimated distances.\nIn our functions, we made use of WinBUGS (Spiegelhalter, Thomas, Best, & Lunn,\n2007) via R2WinBUGS package (Sturtz, Ligges, & Gelman, 2005) for MCMC\nestimation. Because the Bayesian MDS model is rather complex and it is\nimpossible to use single WinBUGS script for any model, our bmds() function\nautomatically produces a BUGS script that is adequate for the current data every\ntime we run the R function. By using WinBUGS in this way we can speed-up the\nMCMC estimation while maintaining the readability of the code, which tends to be\ncomplex in Bayesian estimation."} {"Title":"Forecasting species range shifts: a Hierarchical Bayesian framework for estimating process-based models of range dynamics","Author":"Joern Pagel and Frank Schurr","Session":"foc-environ-2-3","Keywords":"environmetrics-climate","Abstract":"Shifts of species ranges have been widely observed as ‘fingerprints’ of\nclimate change and more drastic shifts are expected in the coming decades.\nCurrent studies projecting range shifts in response to climate change are\npredominantly based on phenomenological models of potential climate space\n(climate envelope models ). These models assume that species distributions are\nat equilibrium with climate, both at present and in the future. A more reliable\nprojection of range dynamics under environmental change requires process-based\nmodels that can be fitted to distribution data and permit a more comprehensive\nassessment of forecast uncertainties [1]. To achieve this goal, we develop a\nHierarchical Bayesian framework [2] that utilizes models of local population\ndynamics and regional dispersal to link data on species distribution and\nabundance to explanatory environmental variables. In a simulation study we\ninvestigate the performance of this approach in relation to the biological\ncharacteristics of the target species and the quantity and quality of biological\ninformation available. We use R to implement an integrated routine that combines\na grid-based ecological simulation model and a ‘virtual ecologist’ with\nefficient MCMC algorithms ( e.g. DRAM [3]) for sampling from the full posterior\ndistribution of model parameters and derived predictions of spatially\ndistributed abundances under prescribed climatic changes. 
This enables us to run\na range of virtual scenarios differing in both ecological assumptions and\nsampling design in order to examine how forecast uncertainty depends on a\nspecies' ecology as well as on data quality and quantity."} {"Title":"Random Forests and Nearest Shrunken Centroids for the Classification of eNose data","Author":"Matteo Pardo and Giorgio Sberveglieri","Session":"kal-mach_num_chem-1-3","Keywords":"machine learning","Abstract":"Artificial Olfactory Systems or eNoses are instruments that analyze gaseous\nmixtures for discriminating between different (but similar) mixtures and, in the\ncase of simple mixtures, quantify the concentration of the constituents. eNoses\nconsist of a gas sampling system (for a reproducible collection of the mixture),\nan array of chemical sensors, electronic circuitry and data analysis software\n(Pearce, 2003). Random Forests (RF) and Nearest Shrunken Centroids (NSC) are state of the art\nclassification and feature selection methodologies and have never been applied\nto eNose data. RFs are ensembles of trees, where each tree is constructed using\na different bootstrap sample of the data and each node is split using the best\namong a subset of features randomly chosen at that node. RF has only two\nhyperparameters (the number of variables in the random subset at each node and the\nnumber of trees in the forest) (Breiman, 2001). NSC classification makes one\nimportant modification to standard nearest centroid classification. It "shrinks"\neach of the class centroids toward the overall centroid (for all classes) by an\namount called the threshold (Tibshirani et al., 2003). In this paper we compare\nthe classification rate of RF, NSC and Support Vector Machines (SVM), which we\nconsider as a top-level reference method, on three eNose datasets for food\nquality control applications. Classifiers’ parameters are optimized in an\ninner cross-validation cycle and the error is calculated by outer\ncross-validation in order to avoid any bias. To carry out computations we used\nthe R package MCRestimate (Ruschhaupt et al., 2004). MCRestimate is built on\ntop of a number of R packages, e.g. the randomForest package (Liaw and Wiener,\n2002). We were interested in three computational aspects: 1. Relative\nperformance of the three classifiers. 2. Since nested cross-validation is\ncomputationally expensive we also investigate the dependence of the error on the\nnumber of inner and outer folds. We considered a grid of 25 outer/inner fold\nnumbers: outer CV folds: 2, 4, 6, 8, 10; inner CV folds: 2, 4, 6, 8, 10.\nAltogether this means training e.g. 45,050 SVMs. 3. Feature rankings produced by\nRF and NSC. We find that: 1. SVM and RF perform similarly (each classifier does\nbetter on one problem), while NSC consistently performs worse. NSC is by far the\nsimpler (and faster) classifier. 2. There is a slight dependence on the number\nof outer CV folds (particularly in the fungi dataset, where four outer CV\nfolds produce a consistently higher classification rate across inner CV fold numbers),\nwhile the number of inner CV folds seems to be immaterial.\n2x2 nested CV is often enough for a good result. With respect to 10x10 CV, 2x2\nCV requires 4% of the training time, so this result may save quite some time in\nfuture computational studies. 3. Of the 30 original features, RF and NSC have\nthe same two top positions. Further, they share another four features in the top\nten. A further four features have quite different, or even very different, rankings. 
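For the random forest part of the eNose comparison above, variable importance can be obtained from the randomForest package cited there. A small sketch on invented stand-in data (30 synthetic "sensor" features, two of them informative); the real eNose data and the MCRestimate wrapper are not reproduced here.

    library(randomForest)                       # Liaw & Wiener (2002)
    set.seed(1)
    X <- matrix(rnorm(60 * 30), 60, 30)         # 60 samples x 30 fake sensor features
    colnames(X) <- paste("s", 1:30, sep = "")
    y <- factor(rep(c("good", "spoiled"), each = 30))
    X[y == "spoiled", 1:2] <- X[y == "spoiled", 1:2] + 2   # make s1, s2 informative
    rf  <- randomForest(X, y, ntree = 500, importance = TRUE)
    imp <- importance(rf)
    head(imp[order(-imp[, "MeanDecreaseAccuracy"]), ], 10) # top-ranked features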
In fact,\nNSC – differently from RF- ranks features individually and independently in\nthe classifier construction process. In this way, on the one hand it cannot\nconsider the joint discrimination capabilities of features groups and on the\nother hand it does not exclude correlated features."} {"Title":"TCL Expect: Yet another way to develop GUI for R","Author":"Ivailo Partchev","Session":"foc-gui_build-1-4","Keywords":"user interfaces-tcltk","Abstract":"Point-and-click interfaces, as opposed to a sensible command language, will\nalways have their supporters and critics. We present an approach to developing a\npoint-and-click interface based on the TCL/TK extension, Expect. Compared to the\ninternal support of TCL/TK through packages like tcltk, rpanel, tkrplot, and\nothers, the Expect approach is slower and less appropriate for animated\ndisplays. On the positive side, applications are very easy to develop, the very\nexistence of R is hidden from the user, and the interactive nature of R is\nexploited fully. As an example, we present a toy application (a t-test or a\nregression), and a larger application for the analysis of treatment effects in\nexperimental and quasi-experimental research."} {"Title":"Dynamic Linear Models in R","Author":"Giovanni Petris","Session":"kal-model-1-1","Keywords":"modeling, time series","Abstract":"Dynamic Linear Models (DLMs) are a very flexible tool for time series analysis.\nIn this talk we introduce an R package for the analysis of DLMs. The design goal\nwas to give the user maximum flexibility in the specification of the model.\nThe package allows to create standard DLMs, such as seasonal components,\nstochastic polynomial trends, regression models, autoregressive moving average\nprocesses and more, and it also provides functions to combine in different ways\nelementary DLMs models as building blocks of more complex univariate or\nmultivariate models. For added flexibility, completely general constant or\ntime-varying DLMs can be defined as well. The drawback of allowing so general\nmodels to be used is that for many DLMs the standard algorithms for Kalman\nfiltering and smoothing are not numerically stable. The issue has been\naddressed in the package by using filtering and smoothing algorithms that are\nbased on the recursive calculation of the relevant variance matrices in terms of\ntheir singular value decomposition (SVD). The same SVD-based algorithm employed\nfor Kalman filter is also used to find maximum likelihood estimates of unknown\nmodel parameters. In addition to filtering, smoothing and maximum likelihood\nestimation, the package provides some functionality for simulation-based\nBayesian analysis of DLMs. A function that generates the unobservable states\nfrom their posterior distribution is available, as well as a multivariate\nversion of adaptive rejection Metropolis sampling, which can be used to generate\nrandom vectors having an essentially arbitrary continuous distribution. Both\ngenerators can be fruitfully employed within a Gibbs sampler or other Markov\nchain Monte Carlo algorithm. In the talk I will give an overview of the most\nimportant features of the package, illustrating them with practical examples."} {"Title":"Objects, clones and collections: ecological models and scenario analysis with simecol","Author":"Thomas Petzoldt","Session":"foc-environ-2-1","Keywords":"environmetrics","Abstract":"R is increasingly accepted as one of the standard environments for ecological\ndata analysis and ecological modeling. 
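To illustrate the building-block style of model specification described in the dlm abstract above, here is a hedged sketch that combines a local level with a quarterly seasonal component and estimates the variances by maximum likelihood; the series (UKgas) and the log-variance parametrisation are arbitrary choices, not taken from the talk.

    library(dlm)
    build <- function(p) {                      # p holds log-variances
      dlmModPoly(order = 1, dV = exp(p[1]), dW = exp(p[2])) +
        dlmModSeas(frequency = 4, dV = 0, dW = c(exp(p[3]), 0, 0))
    }
    y   <- log(as.numeric(UKgas))               # quarterly series shipped with R
    fit <- dlmMLE(y, parm = rep(0, 3), build = build)
    mod <- build(fit$par)
    smo <- dlmSmooth(y, mod)                    # SVD-based smoothing recursions
    plot(y, type = "l")
    lines(dropFirst(smo$s[, 1]), col = 2)       # smoothed level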
An increasing collection of packages\nexplicitly developed for ecological applications (Kneib and Petzoldt, 2007) and\na number of textbooks that use R to teach ecological modeling (Ellner and\nGuckenheimer, 2006; Bolker, 2007) are just indicators of this trend. In this\ncontext, the package simecol (simulation of ecological models) was developed in\norder to facilitate the implementation, analysis and sharing of simulation models by\nmeans of object oriented programming (OOP) with S4 classes. The idea behind\nsimecol is an object model of ecological models, i.e. to put everything needed\n(state variables, parameters, inputs, equations) to define an ecological model\ntogether in one code object (an instance of a subclass of simObj), that can be\nhandled by appropriate generic functions. Because all essentials of a particular\nmodel are encapsulated in one code object, individual instances can simply be\ncopied with the assignment operator <-. These clones can be modified with\naccessing functions to derive variants and scenarios without copying and pasting\nsource code. This way, it is also possible to interactively enable or disable\nonline-visualisation (using observer-slots), to adapt numerical accuracy or to\ncompare scenarios with different structure, e.g. ecological models with\ndifferent types of functional response. After a short overview the presentation\nwill concentrate on examples of how to clone, modify and extend simecol objects and\nhow to organize scenario analyses. From the user’s perspective these are: 1.\nImplement a model-prototype by filling out a pre-defined structure or by\nmodifying existing examples, 2. Simulate and test your model with existing\nsolvers or develop your own algorithms, 3. Clone your prototype object and\nmodify data and/or code of individual clones to generate scenarios, 4. Simulate,\nanalyse, compare scenarios, fit parameters, supply observer functions for\nrun-time visualisation. 5. Save your model object and share it with your\ncolleagues, students and readers of your papers. Individual simecol objects can\nbe stored persistently as binaries or as a human-readable list representation which\ncan be distributed in reproducible and fully functional form. In addition, it is\nalso possible to assemble collections of models as separate R packages, together\nwith necessary documentation and examples, e.g. to reproduce the figures of a\npaper, or with additional classes and functions extending simecol."} {"Title":"Bayesian Modelling in R with rjags","Author":"Martyn Plummer","Session":"kal-visual-1-2","Keywords":"bayesian","Abstract":"JAGS (Just Another Gibbs Sampler) is a portable engine for the BUGS language,\nwhich allows the user to build complex Bayesian probability models and generates\nrandom samples from the posterior distribution of the model parameters using\nMarkov Chain Monte Carlo (MCMC) simulation. The rjags package currently provides\na small library that permits a direct interface from R to the main JAGS library.\nFuture versions of rjags should provide additional Bayesian modelling tools.\nHowever, there are outstanding problems, such as the choice of R class for\nrepresenting MCMC output, that still need to be resolved. This talk will discuss\nsome of the issues involved in creating a portable interface package for R. 
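The clone-and-modify workflow of simecol described above can be sketched as follows; the logistic growth model is an invented stand-in, and the slots follow the documented odeModel constructor.

    library(simecol)
    logist <- new("odeModel",
      main   = function(time, init, parms) {    # minimal stand-in model
        with(as.list(c(init, parms)), list(c(dN = r * N * (1 - N / K))))
      },
      parms  = c(r = 0.5, K = 100),
      init   = c(N = 5),
      times  = c(from = 0, to = 40, by = 0.5),
      solver = "lsoda")
    scenario <- logist                          # a 'clone' is a plain assignment
    parms(scenario)["r"] <- 0.1                 # modify the clone via accessor functions
    plot(sim(logist)); plot(sim(scenario))      # compare the two scenarios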
I\nwill illustrate the way that R and JAGS can be combined to provide tools for\nBayesian modelling, such as the deviance information criterion (DIC) and related\npenalized loss functions for model comparison."} {"Title":"Direct Marketing Analytics with R","Author":"Jim Porzak","Session":"foc-business-1-3","Keywords":"business","Abstract":"Direct marketing has traditionally claimed to be a quantified discipline.\nMarketing campaigns are measured by actual results. Testing is routinely done to\nimprove performance. The concept of a control group is ingrained in the direct\nmarketing culture. Modern marketing is based on relevant messaging. Segmentation\nis a practical way to deliver relevant communication to individuals. That said,\nin practice there is a lot of confusion as to exact definition and\nimplementation of specific metrics, methods, and statistical procedures. And,\nunfortunately, rigor is often more hype than actuality. The goal of the dma\npackage is to provide the direct marketing community with a well defined set of\nprocedures to easily do direct marketing analytics. Since the source code is\navailable, there is total visibility into the methods. No black box. No\nproprietary secrets. At the end of the day, all needed analytics could be done\nmanually in R, or most any other computing platform. The dma package wraps and\nbundles appropriate R procedures in a way direct marketers will, hopefully, find\nintuitive and natural to use. Having a single code base should eliminate errors\nand make results reproducible. We also leverage the graphics power of R to add\nvisualizations that make the findings more intuitive than just presenting\nnumerical results. Direct marketing has certain characteristics which influence\nthe design. These include: • a huge N. • very small proportions. • users\ntypically work in a Windows/Office environment. • the paradox of the need for\nrigor and client comprehensibility. There are three main modules in the dma\npackage. Basic Metrics: Straightforward definitions of key direct marketing\nperformance indicators. These include response, profitability, and LTV metrics.\nThe point is to implement these metrics based on accepted best practices in an\nopen and transparent way. Testing: In addition to single and multiple tests\nagainst a control (A/B and A/BCD... tests) visualizations are generated that\nmake marketing campaign test results obvious to non-statisticians. Methods are\nwrapped to make them accessible to direct marketing staff without requiring any\nunderstanding of R. Segmentation: The classic direct marketing segmentation\nmethod is RFM (Recency, Frequency, and Monetary). Using customer order history\ndata, loaded into an orders object, recency and frequency are used to create\nactionable customer life-stage segments. Adding product purchase details allows\ncreation of prospect relevant messaging tactics."} {"Title":"R for the Masses: Lessons learnt from delivering R training courses","Author":"Richard Pugh and Matt Aldridge","Session":"kal-gui_teach-1-4","Keywords":"teaching","Abstract":"Over the last 5 years, Mango have delivered R and S training courses to\napproximately 1,100 people at organizations around the world. To ensure the\nstandard of training material is of the highest quality, regular reviews are\nheld based on feedback and lessons learnt from the courses delivered. This\npresentation describes the manner in which our training package has evolved\nbased on the experience of Mango trainers and feedback from course attendees. 
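A minimal sketch of the R/JAGS combination described in the rjags abstract above, using jags.model(), coda.samples() and dic.samples(); the toy model and iteration counts are arbitrary, and the 2008-era interface may have differed in detail.

    library(rjags)                              # assumes JAGS itself is installed
    mstr <- "model {
      for (i in 1:N) { y[i] ~ dnorm(mu, tau) }
      mu  ~ dnorm(0, 1.0E-3)
      tau ~ dgamma(1.0E-3, 1.0E-3)
    }"
    jm   <- jags.model(textConnection(mstr),
                       data = list(y = rnorm(25, 3), N = 25), n.chains = 2)
    samp <- coda.samples(jm, variable.names = c("mu", "tau"), n.iter = 5000)
    summary(samp)                               # posterior summaries via coda
    dic.samples(jm, n.iter = 5000)              # penalized deviance, cf. DIC above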
We\nwill also present our view on how a complex software language such as R is best\ntaught, including the use of Mango tip sheets and continued support beyond the\ntraining period. We will also discuss the challenges of training non-statistical\nend users in languages like R."} {"Title":"Indicators of Least Absolute Deviation's sensibility","Author":"Soumaya Rekaia","Session":"foc-rob-2-3","Keywords":"robust","Abstract":"This paper proposes indicators that make it possible to monitor the\nsensitivity of the robust least absolute deviation (LAD) method to a single\nobservation. Indeed, recent studies have noted that this estimator has gained\nrelatively little favour in robustness terms when the data contain outliers\n(leverage points) in X. Relying on the sensitivity curve, we focus our study on\nmeasuring the sensitivity of LAD through the bias of the estimate and through\nthe importance of the leverage point in this bias, expressed by its\ncontribution to the total inertia of the sample. Monte Carlo simulations led us\nto retain two models: a sigmoid model and a threshold model. The estimation\nresults show that least absolute deviation is sensitive to points whose\ncontribution exceeds 80%. We also verify this result for real data sets.\nKeywords: outliers, robust estimation, singularity, sensitivity, LAD."} {"Title":"Equilibrium Model Selection","Author":"Tomas Radivoyevitch","Session":"foc-bioinf-2-3","Keywords":"bioinformatics-systems","Abstract":"Ribonucleotide reductase (RNR) is precisely controlled to meet the\ndNTP demands of scheduled (replication driven) and unscheduled (repair\ndriven) DNA synthesis. It has a small subunit R2 that exists almost\nexclusively as a dimer, and a large subunit R1 (R) that dimerizes when\ndTTP (t), dGTP, dATP, or ATP binds to its specificity site, and\nhexamerizes when dATP or ATP binds to its activity site. In general,\nRNR is modeled as a pre-equilibrium of proteins, ligands, and\nsubstrates whose parameters of interest are dissociation constants K,\nand a set of turnover rate parameters k that map distributions of\nactive enzyme complexes into expected k measurements of\nmixtures. Because the masses of R1 and R2 are known, it is logical to\nfocus first on K estimation from protein oligomer mass measurements,\nand later on k estimation from enzyme activity measurements. Further,\nit is also logical to begin with the simplicity of ligand-induced R1\ndimerization. The full model for dTTP-induced R1 dimerization couples total\nconcentration constraints for R1 and dTTP, written in terms of the free\nconcentrations [R] and [t] and the complexes [Rt], [RR], [RRt] and [RRtt]\nexpressed through mass-action terms in the dissociation constants K_Rt, K_RR,\nK_RRt and K_RRtt (note that [R]^2[t]^2/K_RRtt = [RRtt]), with an average mass\noutput model in which monomeric R1 contributes 90 kDa and dimeric R1\ncontributes 180 kDa; the subscript T denotes totals, and p is the probability\nthat an R molecule is undamaged and capable of dimerizing. The 58 models\nderived from this full model were fitted to the available data.
The top 6 models based on AICc (initial value -> final estimate, CI in\nparentheses; colours refer to the corresponding curves in the figure) are: 3M\n(violet): RRtt 1.000 -> 18.697 (4.807, 72.966), p fixed at 1; SSE 0.064 ->\n0.034; AICc -48.066 -> -54.448. 3Mp (blue): RRtt 1.000 -> 5.558 (0.370, 84), p\n1.000 -> 0.907 (0.787, 1.044); SSE 0.064 -> 0.027; AICc -44.852 -> -53.308. 3Rp\n(black): p 1.000 -> 0.822 (0.736, 0.918), RRtt fixed at 0.000; SSE 0.106 ->\n0.041; AICc -42.954 -> -52.590. 3I (green): RRt 1.000 -> 49.568 (5.755, 428),\nRRtt 1.000 -> 37.930 (5.003, 290), p fixed; SSE 0.165 -> 0.030; AICc -35.303 ->\n-52.218. 2M (yellow): R_R 75.000 -> 685.986 (2.801, 162755), RR_t 0.550 ->\n0.142 (0.005, 3.975), RRt_t 0.550 -> 0.142 (constrained), p fixed; SSE 0.041 ->\n0.032; AICc -49.222 -> -51.815. 3F (orange): Rt 1.000 -> 91.059 (1.557, 5324),\nRRtt 1.000 -> 14.612 (2.545, 84), p fixed; SSE 0.221 -> 0.032; AICc -32.422 ->\n-51.627. This full model generates 58 a priori\nplausible equilibrium models/hypotheses as follows. Firstly, K=∞\nassumptions are used to remove specific terms one at a time, two at a\ntime, and so on, to yield 2^4 = 16 models, each hypothesizing that the\ndeleted complexes are not detectable above noise. Secondly, of these\nmodels, the 4 single-K models yield 4 additional models via K=0\nassumptions, each alleging that the free concentration of the reactant\nthat is not in excess is indistinguishable from zero. Thirdly, after\nexpanding K into products of strictly binary K, nine additional models\nthat allege that some Ks equal others also arise; these nine models\ncorrespond to hypotheses of independence between the R and t binding\nsites on R. Finally, for each model it can be hypothesized that the\ndata are not rich enough to discriminate p close to one from p = 1,\nand this expands the model space by an additional factor of two to\n58. Of these, model 3Rp differs substantially from the other five in\nits predictions over physiological values of [tT] = 0.1 to 50 µM and\n[RT] = 0.005 to 1 µM. If 3Rp is rejected by a measurement of 95 kDa at\n[tT] = 1 µM and [RT] = 0.2 µM, the best next 10 measurements for\ndiscrimination between the remaining 5 models are the points on the hill where\ntheir predictions differ most (figure: predicted average mass in kDa versus\n[R1] from 0.001 to 1 µM at [dTTP] = 1 µM, and over [dTTP] from 0.05 to 50 µM).\nThe methods, data, R functions\nand R scripts used to fit the model space can be found in\nRadivoyevitch T: Equilibrium model selection: dTTP induced R1\ndimerization. BMC Systems Biology 2008, 2(1):15."} {"Title":"RReportGenerator: Automatic reports from routine statistical analysis using R","Author":"Wolfgang Raffelsberger and Luc Moulinier and David Kieffer and Olivier Poch","Session":"foc-report-1-1","Keywords":"reporting","Abstract":"With RReportGenerator we have developed a tool dedicated to performing\nautomatic routine statistical analysis using R via a graphical user\ninterface (GUI) in a highly user-friendly way that can be run on\nWindows and Linux platforms. The program is freely available under\nhttp://www-bio3d-igbmc.u-strasbg.fr/~wraff . 
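The Equilibrium Model Selection abstract above ranks candidate equilibrium models by AICc. A toy sketch of that kind of ranking with nls() and a hand-rolled AICc correction (the data and the two candidate binding models below are invented, not the RNR measurements):

    aicc <- function(fit) {                     # small-sample corrected AIC
      k <- length(coef(fit)) + 1                # parameters, plus the error variance
      n <- length(residuals(fit))
      AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
    }
    set.seed(2)
    conc <- c(0.1, 0.5, 1, 2, 5, 10, 20, 50)    # fake ligand concentrations (uM)
    mass <- 90 + 90 * conc / (conc + 3) + rnorm(8, 0, 2)   # fake average masses (kDa)
    f1 <- nls(mass ~ 90 + 90 * conc / (conc + K), start = list(K = 1))
    f2 <- nls(mass ~ 90 + A  * conc / (conc + K), start = list(A = 80, K = 1))
    sort(c(oneK = aicc(f1), twoPar = aicc(f2))) # smaller AICc is preferred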
Since the command-line\nsyntax of R is very powerful but difficult to access for\nnon-statisticians, we have developed a simple graphical interface\ndesigned for routine execution of predefined \"analysis scenarios\" for\na given problem (written as Sweave code). The key function of\nRReportGenerator consists in automatically generating a pdf-report\ncombining results from statistical analysis, tables and\nfigures. Depending on the analysis scenario chosen, reports can be\naccompanied by supplemental data-sets for exporting results to other\nprograms, too. At this point several applications (“analysis\nscenarios”) for quality control and lowlevel data analysis in the\nfields of transcription profiling (e.g. extensive QC for Affymetrix\nGeneChips or QC & data normalization of printed arrays), CGH-analysis\n(simultaneous comparison using multiple segmentation approaches) and\ntransfected cell array (TCA) platforms have been developed and are\ngetting further enhanced. For example, use of RReportGenerator may\nhelp technology platforms considerably as it produces automatically\nwell documented analysis reports in a standardized format for\ntransferring QC results and assay data to other research teams. A\nspecial function of this GUI allows accessing directly the most recent\nversions of the distributed analysis scenarios and facilitates using\nautomatically the most recent analysis scenarios. Finally, our\nautomatic analysis platform is open to distribute contributed and\ndocumented analysis scenarios from the R community."} {"Title":"Estimating evolutionary pathways and genetic progression scores with Rtreemix","Author":"Jörg Rahnenführer and Jasmina Bogojeska and Adrian Alexa and André Altmann and Thomas Lengauer","Session":"foc-bioinf_systems-1-2","Keywords":"bioinformatics-systems","Abstract":"In genetics, many evolutionary pathways can be modeled on the\nmolecular level by the ordered accumulation of permanent changes. We\nhave developed the class of mixture models of mutagenetic trees\n(Beerenwinkel et al., 2005a) that provides a suitable statistical\nframework for describing these processes. These models have been\nsuccessfully applied to describe disease progression in cancer and in\nHIV. In cancer, progression is modeled by the accumulation of lesions\nin tumor cells such as chromosomal losses or gains (Ketter et al.,\n2007). In HIV, the accumulation of drug resistance-associated\nmutations in the genome is known to be associated with disease\nprogression. Mutations in the genome of the dominant strain in the\ninfecting virus population arise when a patient receives a specific\nmedication. From such evolutionary models, genetic progression scores\ncan be derived that assign measures for the disease state to single\npatients (Rahnenführer et al., 2005). Progression of a single patient\nalong such a model is typically correlated with increasingly poor\nprognosis. In the cancer application, we showed that higher genetic\nprogression scores are significantly associated with shorter expected\nsurvival times in glioblastoma patients (Rahnenführer et al., 2005)\nand times until recurrence in meningioma patients (Ketter et al.,\n2007). We present applications in this framework as well as the\neasy-to-use and compute-efficient R package Rtreemix for estimating\nsuch mixtures of evolutionary models from cross-sectional\ndata. Rtreemix builds up on efficient C/C++ code provided in the\nMtreemix package (Beerenwinkel et al., 2005b) for estimating mixture\nmodels. 
It contains additional functions for estimating genetic\nprogression scores with corresponding bootstrap confidence intervals\nfor estimated model parameters. Furthermore, the stability of the\nestimated evolutionary mixture models can be analyzed (Bogojeska et\nal., 2008). "} {"Title":"Estimation of Standard Errors in Non-Linear Regression Models: Spatial Variation in Risk Around Putative Sources","Author":"Rebeca Ramis and Peter Diggle and Gonzalo López-Abente","Session":"foc-spatial-1-2","Keywords":"spatial","Abstract":"Background We consider the problem of investigating spatial variation\nin the risk of non-infectious diseases in populations exposed to\npollution from one or more point sources. The data most commonly\navailable to study this question include case-counts (O_i) in each of a\nset of areas that partition the geographical region of interest,\nsuitable denominators, E_i, proportional to the expected number of\ncases in each area, and the locations of the relevant point sources,\nfrom which we can compute distances d_ij between the j-th focus and a\nreference location, typically the centroid, within the i-th area. Also\navailable in most applications are covariates relating to\nsocio-economic status or other risk-factors associated with each area,\nwhich we denote by Z_k. The standard approach to the analysis of data\nof this kind is a log-linear regression of the case-counts on the\ncovariates, with log-transformed denominators as an offset\nvariable. To model distance-related point source effects, a log-linear\nformulation is unrealistic because of the need to combine an elevated\nrisk close to the source with a neutral long-distance effect. We\ntherefore extend the model by including a non-linear distance\nfunction, f(d_ij), hence [1,2]: μ_i = ρ ∑_k (ϑ_k Z_ik) ∏_j f(d_ij), with O_i\n~ Po(E_i μ_i) and f(d_ij) = 1 + α_j exp(−(d_ij / β_j)^2). Parameter\nestimation and standard error calculations Generic functions available\nin R to fit non-linear regression models include the “gnlm” library by\nJ. K. Lindsey [3], which in turn uses the nlm function of Bates and\nPinheiro to estimate the parameters. These functions use a numerical\nestimate of the Hessian matrix evaluated at the parameter estimate to\ncalculate standard errors. We have found that, for point source models\nlike the one described above, even when numerically accurate values\nare returned for the maximum likelihood parameter estimates, the\nassociated standard errors derived by inverting the estimated Hessian\ncan be unreliable. As an alternative strategy, we obtain standard\nerrors by combining an R function for direct maximisation of the\nlikelihood with replicated Monte Carlo simulations of the fitted\nmodel. Results We have carried out a simulation study to compare the\nestimators yielded by the two methodologies and to assess the\nperformance of the Hessian and Monte Carlo methods for calculating\napproximate standard errors. As expected, parameter estimates obtained\nfrom the two methods are almost identical. However, standard errors\nfor the non-linear parameters (α_j, β_j) are estimated more reliably by\nMonte Carlo than by inversion of the estimated Hessian. "} {"Title":"Functional regression analysis using R","Author":"Christian Ritz","Session":"kal-model-1-2","Keywords":"modeling","Abstract":"Functional data consist of observations which can be treated as\nfunctions rather than just numeric vectors. 
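A hedged sketch of the direct likelihood maximisation strategy described in the Ramis et al. abstract above, for a stripped-down point-source model with a single source and no covariates (simulated toy data; the authors' own code is not reproduced here):

    set.seed(3)
    d <- runif(200, 0, 20)                      # distance from the putative source
    E <- runif(200, 5, 15)                      # expected counts
    O <- rpois(200, E * 1.2 * (1 + 2 * exp(-(d / 4)^2)))
    negll <- function(p) {                      # p = log(rho), log(alpha), log(beta)
      mu <- exp(p[1]) * (1 + exp(p[2]) * exp(-(d / exp(p[3]))^2))
      -sum(dpois(O, E * mu, log = TRUE))
    }
    fit <- optim(c(0, 0, 1), negll, hessian = TRUE)
    exp(fit$par)                                # estimates of rho, alpha, beta
    sqrt(diag(solve(fit$hessian)))              # Hessian-based SEs (log scale), which
                                                # the abstract reports can be unreliable
    ## Monte Carlo alternative: simulate O* ~ Po(E * mu_hat), refit repeatedly,
    ## and take the standard deviation of the replicated estimates.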
One example is fluorescence\ncurves commonly used in photosynthesis research: The curve reflects the\nbiological processes taking place in a plant in the first second of\nexposure to sunlight. As several (mostly unexplained) processes are\ninvolved, the resulting curves can have several local minima and maxima\nand are not easily described using parametric models. Another example\nis repeated measurements over time on the same subject, frequently\nencountered in dietary and growth studies. The resulting curves may\nfluctuate as a consequence of daily patterns or seasonal trends. A\nvariety of models for describing functional data exists. Functional\nregression models are particularly appealing as they strike a balance\nbetween flexible non-parametric modelling of the unknown average curve\nand semi-parametric modelling of effects due to explanatory variables\nin much the same way as for ordinary ANOVA models. This presentation\nshows how to use R for estimation, hypothesis testing and graphical\nmodel checking of functional regression models of the form: y(t) =\nφ(z)µ(t) + ε(t), with y(t), µ(t) and ε(t) denoting the functional\nobservation, the average curve and the error process,\nrespectively. The term φ(z) is a multiplicative effect modifying the\naverage curve according to the explanatory variable z. The functions µ\nand φ are estimated nonparametrically in a two-step procedure. A\nquasi-likelihood or glm approach can be used to estimate the effects of\nthe explanatory variables."} {"Title":"Item Response Theory Using the ltm Package","Author":"Dimitris Rizopoulos","Session":"foc-numerics-1-3","Keywords":"psychometrics","Abstract":"Item Response Theory has been steadily evolving in the past few\ndecades and is starting to become one of the standard tools in modern\npsychometrics. The R package ltm has been developed to fit various\nlatent trait models useful for Item Response Theory analyses. In\nparticular, for dichotomous data the Rasch, the Two-Parameter\nLogistic, and Birnbaum’s Three-Parameter models have been implemented,\nwhereas for polytomous data Samejima’s Graded Response model is\navailable. In this talk the capabilities of ltm will be illustrated\nusing real data examples."} {"Title":"Patient teenagers? A comparison of the sexual behavior of virginity pledgers and matched non-pledgers","Author":"Janet Rosenbaum","Session":"kal-bio-1-5","Keywords":"biostatistics","Abstract":"Objective: The US government spends over $200 million annually on\nabstinence-promotion programs, including virginity pledges, and\nmeasures abstinence program effectiveness as the proportion of\nparticipants who take a virginity pledge. Past research used\nnon-robust regression methods. This paper examines whether adolescents\nwho take virginity pledges are less sexually active than matched\nnon-pledgers. Patients and Methods: National Longitudinal Study of\nAdolescent Health respondents, a nationally representative sample of\nmiddle and high school students who, when surveyed in 1995, had never had sex\nor taken a virginity pledge, and were over age 15 (n=3440). Adolescents\nreporting a virginity pledge on the 1996 survey (n=289) were matched\nwith non-pledgers (n=645) using exact and nearest-neighbor matching\nwithin propensity score calipers on factors including pre-pledge\nreligiosity and attitudes towards sex and birth control. Pledgers and\nmatched non-pledgers were compared five years post-pledge on\nself-reported sexual behaviors and positive test result for\nC. trachomatis, N. gonorrhoeae, and T. 
vaginalis; and safe sex outside\nof marriage by use of birth control and condoms in the past year and at\nlast sex. Results: Five years post-pledge, 84% of pledgers denied\nhaving ever pledged. Pledgers and matched non-pledgers did not differ\nin premarital sex, STDs, anal, and oral sex. Pledgers had 0.1 fewer\npast-year partners, but the same number of lifetime sexual partners\nand age of first sex. Pledgers were 10 percentage points less likely\nthan matched non-pledgers to use condoms in the last year, and also\nless likely to use birth control in the past year and at last\nsex. Conclusions: Virginity pledgers and closely-matched non-pledgers\nhave virtually identical sexual behavior, but pledgers are less likely\nto protect themselves from pregnancy and disease before marriage than\nmatched non-pledgers. Abstinence programs may not affect sexual\nbehavior, but may increase unsafe sex. Federal abstinence education\nfunds should be directed to programs which teach birth control, and do\nso accurately. Virginity pledges should not be used as a measure of\nabstinence program effectiveness. "} {"Title":"distrMod - an S4-class based package for statistical models","Author":"Peter Ruckdeschel and Matthias Kohl","Session":"foc-mod_new-1-2","Keywords":"modeling-diagnostics","Abstract":"The S4 concept ([1]) is a strong tool for writing unified\nalgorithms. As an example for this in R ([3]), we present a new\npackage distrMod for a conceptual implementation of statistical models\nbased on these S4 classes. It is part of the distrXXX family of\npackages ([4]), which has been available on CRAN for quite a while, and\nwhich is developed under the infrastructure of R-Forge ([6]) in\nproject distr ([5]). The infrastructure for package distrMod is laid in\npackages distr and distrEx. In package distr, we introduce S4 classes\nfor distributions with slots for a parameter and for functions r, d,\np, and q corresponding to functions like rnorm, dnorm, pnorm and\nqnorm. We have made available quite general arithmetical operations on\nour distribution objects, generating new image distributions\nautomatically, including affine transformations, standard mathematical\nunivariate transformations like sin, abs, and convolution. Package\ndistrEx provides additional features like evaluation of certain\nfunctionals on distributions like expectation, variance, median, and\nalso distances between distributions like the total variation,\nHellinger, Kolmogorov, and Cramér-von-Mises distances. Also,\n(factorized) conditional distributions and expectations are\nimplemented. Package distrMod then implements parametric resp. L2\ndifferentiable models, introducing S4 classes ParamFamily and\nL2ParamFamily. Based on these, quite general “Minimum Criterion”\nestimators such as Maximum Likelihood and Minimum Distance estimators\nare implemented. This implementation goes\nbeyond the scope of fitdistr from MASS ([7]), as we may work with\ndistribution objects themselves and have available quite general\nexpectation operators. In short, we are able to implement one\nstatic algorithm which, by S4 method dispatch, may dynamically take care\nof various models, thus avoiding redundancy and\nsimplifying maintenance. 
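The distribution arithmetic and functionals described in the distrMod abstract above can be sketched with the distr and distrEx packages; the particular distributions below are arbitrary examples.

    library(distr)                              # S4 distribution classes with r, d, p, q slots
    library(distrEx)                            # functionals such as E() and var()
    X <- Norm(mean = 2, sd = 1)
    Y <- Exp(rate = 0.5)
    Z <- 3 * X + Y                              # image/convolution distribution, built automatically
    r(Z)(5)                                     # five random draws from Z
    p(Z)(10)                                    # P(Z <= 10)
    E(Z); var(Z)                                # expectation and variance via distrEx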
This approach is also taken up to implement\noptimally robust estimation in the infinitesimal setup of ([4]) and its\nrefinements in ([2]); this will be the topic of a contribution to this\nconference by the second author."} {"Title":"Analysis of CGH arrays using MCMC with Reversible Jump: detecting gains and losses of DNA and common regions of alteration among subjects","Author":"Oscar Rueda and Ramon Diaz-Uriarte","Session":"kal-bio-1-2","Keywords":"bioinformatics-models","Abstract":"Copy number variation (CNV) in genomic DNA is linked to a variety of\nhuman diseases (including cancer, HIV acquisition and progression,\nautoimmune diseases, and neurodegenerative diseases), and array-based\nCGH (aCGH) is currently the main technology to locate CNVs. To be\nimmediately useful in both clinical and basic research scenarios, aCGH\ndata analysis requires accurate methods that do not impose unrealistic\nbiological assumptions and that provide direct answers to the key\nquestion “What is the probability that this gene/region has\nCNAs?”. Current approaches fail, however, to meet these\nrequirements. We have developed RJaCGH, a method for identifying CNAs\nfrom aCGH. We use a nonhomogeneous Hidden Markov Model fitted via\nReversible Jump Markov Chain Monte Carlo, and we incorporate model\nuncertainty through Bayesian Model Averaging. RJaCGH provides an\nestimate of the probability that a gene/region has CNAs while\nincorporating inter-probe distance. Using Reversible Jump we do not\nneed to fix in advance the number of hidden states, nor do we need to\nuse AIC or BIC for model selection. We presented a first version of our\nmodel at useR! two years ago. Since then, we have explored different\napproaches to improve convergence and speed up computations, including\nusage of Gibbs sampling vs. Metropolis-Hastings, delayed rejection,\nand coupled parallel chains. Based on the output from RJaCGH, we have\nalso developed two probabilistically-based methods for the\nidentification of regions of alteration that are common among\nsamples. Our methods are unique and qualitatively different from\nexisting approaches, not only because of the use of probabilities, but\nalso because they incorporate both within- and among-array variability\nand can detect small subgroups of samples with respect to common\nalterations. The two methods emphasize different features of the\nrecurrence (sample heterogeneity, minimal required evidence for\ncalling a common region) and, thus, will be instrumental in the\ncurrent efforts to standardize definitions of recurrent or common CNV\nregions, cluster samples with respect to patterns of CNV, and\nultimately in the search for genomic regions harboring\ndisease-critical genes. We will discuss the statistical features of\nour models, as well as the implementation of RJaCGH, including the\ncombined usage of R and C, and different approaches for improving speed\nand decreasing memory consumption."} {"Title":"CXXR: Refactoring the R Interpreter into C++","Author":"Andrew Runnalls","Session":"foc-conn-2-1","Keywords":"connectivity","Abstract":"CXXR (www.cs.kent.ac.uk/projects/cxxr) is a project to refactor\n(reengineer) the interpreter of the R language, currently written for\nthe most part in C, into C++, whilst as far as possible retaining full\nfunctionality. 
It is hoped that by reorganising the code along\nobject-oriented lines, by deploying the tighter code encapsulation\nthat is possible in C++, and by improving the internal documentation,\nthe project will make it easier for researchers to develop\nexperimental versions of the R interpreter. The author’s own\nmedium-term objective is to create a variant of R with built-in\nfacilities for provenance tracking, so that for any R data object it\nwill be possible to determine exactly which original data files it was\nderived from, and exactly which sequence of operations was used to\nproduce it. (In other words, an enhanced version of the old S AUDIT\nfacility.) At the time of this abstract: • Memory allocation and\ngarbage collection have now been decoupled from each other and from\nR-specific functionality, and encapsulated within C++ classes. Classes\nCellPool, MemoryBank and Allocator look after memory allocation;\nGCManager, GCNode, GCRoot and WeakRef look after garbage\ncollection. (All CXXR classes are within the namespace CXXR.) Class\nGCRoot provides C++ programmers with a mechanism for protecting\nobjects from the garbage collector, as a more user-friendly (and\nprobably less error-prone) alternative to the PROTECT/UNPROTECT\nmechanism used in standard R. • The SEXPREC union of CR is being\nprogressively converted into an extensible hierarchy of classes rooted\nat a class RObject (which inherits from GCNode). This has already\nhappened for vector objects and CONS-cell type objects, and it is now\nstraightforward to introduce new types of R object simply by\ninheriting from RObject. The proposed paper will: 1. Describe the\nmotivation behind CXXR; 2. Report on progress to date; 3. Illustrate\nsome of the simplified coding practices that CXXR enables; 4. Describe\nthe measures taken to keep CXXR in synch with successive releases of\nstandard R; 5. Outline future plans. The paper will assume some\nfamiliarity with C programming and with concepts of object-oriented\nprogramming (e.g. in R or in Java), but C++-specific concepts will be\nexplained as required.\n"} {"Title":"'robande': An R package for Robust ANOVA","Author":"Majid Sarmad and Peter S. Craig","Session":"foc-rob-1-3","Keywords":"robust","Abstract":"The R package ‘robande’ will be described and demonstrated. The\npackage is based on the methodology in Sarmad (2006) which\ngeneralises the ideas introduced by Seheult and Tukey (2001) for\nperforming a robust analysis of variance for a factorial experimental\ndesign, those ideas being based on earlier work by Tukey and\ncollaborators on median polish for two-way tables. The method may be\napplied to any type of factorial design including full and fractional\nfactorial designs with and without replication. A version of\nsequential ANOVA is proposed for non-orthogonal designs. The package\nincludes functions to decompose the data using a specified sweep\nfunction, to present the resulting decomposition, to detect and\nhighlight possible outliers and to compute the robust ANOVA table."} {"Title":"Spatial Analysis and Visualization of Climate Data Using R","Author":"David Sathiaraj","Session":"foc-environ-1-1","Keywords":"environmetrics-climate","Abstract":"R’s spatial libraries and its efficient data handling abilities make it\na very effective tool for spatial analysis and visualization of climate\ndata. At the NOAA Southern Regional Climate Center, R is being used\nfor the spatial analysis and visualization of climate data. 
This talk\nwill demonstrate how climate maps are generated using the spatial and\ninterpolation packages in R. The talk will also outline how R serves\nas an effective GIS tool in the development of climate data driven map\nlayers. R-generated maps that visualize a number of climatological\nelements will be demonstrated. Techniques used in developing the maps\nwill be discussed."} {"Title":"RLRsim: Testing for Random Effects or Nonparametric Regression Functions in Additive Mixed Models","Author":"Fabian Scheipl and Sonja Greven and Helmut Küchenhoff","Session":"foc-mod_mixed-1-1","Keywords":"modeling-mixed","Abstract":"Testing for a zero random effects variance is an important and common\ntesting problem. Special cases include testing for a random intercept,\nand testing for polynomial regression versus a general smooth\nalternative based on penalized splines. The problem is non-regular,\nhowever, due to the tested parameter on the boundary of the parameter\nspace. Our package RLRsim uses the approximate null distribution for\nthe Restricted Likelihood Ratio Test proposed in Greven et al. (2008)\nto provide a rapid, powerful and reliable test for this problem. This\nmethod extends the exact distribution derived for models with one\nrandom effect (Crainiceanu & Ruppert, 2004) to obtain a good\napproximation for models with several random effects. The test\nperformed better than a number of competitors in an extensive\nsimulation study covering a variety of typical settings (Scheipl et\nal. , 2008). RLRsim also proved to be an equivalent and fast\nalternative to computationally intensive parametric bootstrap\nprocedures. Our package can be used in a variety of settings,\nproviding convenient wrapper functions to test terms in models fitted\nusing nlme::lme, lme4::lmer, mgcv::gamm or SemiPar::spm."} {"Title":"robfilter: An R Package for Robust Time Series Filters","Author":"Karen Schettlinger and Roland Fried and Ursula Gather","Session":"kal-mach_num_chem-1-2","Keywords":"robust, time series","Abstract":"robfilter is a package of R functions for robust extraction of an\nunderlying signal from a time series. Assuming a standard signal plus\nnoise model for the series, the general idea is to approximate the\nsignal in a moving time window by a local parametric model like a\nlocally constant level, i.e. location-based methods (Fried, Bernholt,\nGather, 2006), or a local linear trend, so-called regression-based\nmethods (Davies, Fried, Gather, 2004; Fried, Einbeck, Gather,\n2007). We present several filters which differ with respect to the\nsignal characteristics and the outlier patterns they can deal with\n(Schettlinger, Fried, Gather, 2006). In particular, some filters are\nespecially designed to preserve sudden shifts and local extremes\n(turning points) even if patches of subsequent outliers may occur\n(Fried, 2004). Furthermore, most of the filters are available both for\nretrospective filtering as well as online filtering without time\ndelay. Estimation of the signal in the centre of a time window\ngenerally leads to better signal approximations. This approach is\nreasonable for retrospective data analysis, since the estimation\nalways takes place with a time delay of half a window width. The\nproposed online filters estimate the signal value at the end of each\ntime window without time delay, but the resulting signal estimates\nhave a larger variability than their retrospective counterparts. 
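A minimal sketch of the RLRsim test described above, for a zero random-intercept variance in a model with a single random effect; it assumes that the installed lme4 and RLRsim versions accept this model object (the abstract also mentions wrappers for nlme::lme, mgcv::gamm and SemiPar::spm).

    library(lme4)
    library(RLRsim)
    m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = TRUE)
    exactRLRT(m)                                # simulated null distribution of the RLRT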
We\npresent filters which are applicable to time series containing\noutliers, trends, trend changes or shifts in the signal level and give\nrecommendations which filter is suitable for which data\nstructure. "} {"Title":"Local Classification Methods for Heterogeneous Classes","Author":"Julia Schiffner and Claus Weihs","Session":"foc-machlearn-1-1","Keywords":"machine learning","Abstract":"Many classification methods, for example LDA, QDA or Fisher\ndiscriminant analysis (FDA), assume the classes to form homogeneous\ngroups. But in practical applications heterogeneous classes that\nconsist of multiple subclasses can often be observed. In such cases\nlocal classification methods that take the local class structure,\ni. e. the subclasses, into account can be beneficial. In package klaR\n(Weihs et al., 2005) the function loclda that performs localized\nlinear discriminant analysis (Czogiel et al., 2007) is already\navailable. Now, three more local classification methods are added. The\nfirst two methods, the common components classifier and the hierarchical\nmixture classifier (Titsias and Likas, 2002), rely on modeling the\nclass conditional densities by means of gaussian mixtures. The third\nmethod, local Fisher discriminant analysis (LFDA), was proposed by\nSugiyama (2007). FDA seeks for a projection of the data into a\nsubspace such that the between-class scatter is maximized and the\nwithin-class scatter is minimized. In LFDA the projection additionally\nhas to fulfill the condition that nearby data points in the same class\nare kept close to each other and thus the local class structure is\npreserved. The three local methods and their implementations in R are\npresented and their usefulness is demonstrated in several examples."} {"Title":"Parallelized preprocessing algorithms for high-density oligonucleotide array data","Author":"Markus Schmidberger and Ulrich Mansmann","Session":"foc-highperf-2-4","Keywords":"high performance computing-parallel","Abstract":"Studies of gene expression using high-density oligonucleotide\nmicroarrays have become standard in a variety of biological\ncontexts. The data recorded using the microarray technique are\ncharacterized by high levels of noise and bias. These failures have to\nbe removed, therefore preprocessing of raw-data has been a research\ntopic of high priority over the past few years. Actual research and\ncomputations are limited by the available computer hardware. For many\nresearchers the available main memory limits the number of arrays that\nmay be processed. Furthermore most of the existing preprocessing\nmethods are very time consuming and therefore not useful for first and\nfast checks in laboratories. To solve these problems, the potential of\nparallel computing should be used. In microarray technologies and\nstatistical computing parallel computing does not appear to have been\nused extensively. For parallelization on multicomputers, message\npassing (MPI) methods and the R language will be used. Ideas for\nparallelization of VSN and FARMS as well as a large project in applied\nbioinformatics (> 5000 microarrays) will be discussed. Furthermore\nthis presentation proposes the new BioConductor package affyPara for\nparallelized preprocessing of high-density oligonucleotide microarray\ndata. Partition of data could be done on arrays and therefore\nparallelization of algorithms gets intuitive possible. In view of\nmachine accuracy, the same results as serialized methods will be\nachieved. 
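The loclda() function named in the local-classification abstract above could be tried as follows, assuming it follows klaR's usual formula and predict() conventions (the iris data and hold-out split are arbitrary illustrations, not from the talk).

    library(klaR)
    set.seed(4)
    idx  <- sample(nrow(iris), 100)             # arbitrary training/hold-out split
    fit  <- loclda(Species ~ ., data = iris[idx, ])   # assumed formula interface
    pred <- predict(fit, iris[-idx, ])          # assumed to return $class and $posterior
    mean(pred$class == iris$Species[-idx])      # hold-out accuracy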
The partition of data and distribution to several nodes\nsolves the main memory problems and accelerates the methods by up to\nthe factor ten."} {"Title":"MORET - A Software For Model Management","Author":"Ralf Seger and Antony Unwin","Session":"foc-mod_man-1-1","Keywords":"modeling-diagnostics, user interfaces","Abstract":"Administrating sets of models is a cumbersome task, since the number\nof models which can be fit can be very large. The project MORET[1] is\ndesigned to facilitate this task. The first version introduced at\nuseR!2006 was capable of handling lm, glm, gam and rpart models. There\nare many other model commands and new ones are continually being\ndeveloped, so a more dynamic concept is needed. MORET now provides the\nuser with a graphical interface that allows the management of\npreviously unknown models. The purpose of this presentation is to\ndemonstrate how to incorporate new model commands and to describe\nother additions to MORET."} {"Title":"MSToolkit: Distributed R for the creation and analysis of simulated clinical trial data","Author":"Mike Smith and Richard Pugh and Romain Francois","Session":"foc-highperf-3-2","Keywords":"high performance computing","Abstract":"A key tool in the area of pharmaceutical research is\nsimulation-based modeling. This requires the generation and analysis\nof clinical trial datasets, which can be both complex and\ncomputationally intensive. Pfizer and Mango Solutions collaborated on\nthe development of an R package which allows the generation and\nanalysis of simulated clinical trial data. In order to leverage\nexisting IT infrastructure and programming skillsets, the package was\nintegrated closely with the internal Pfizer Linux Grid cluster. The\npackage was designed to allow SAS to be called as well as R in order\nto perform the required analyses. This presentation will discuss the\ndesign and implementation of the MSToolkit package, before giving a\ndemonstration of its use."} {"Title":"TIMPGUI: A graphical user interface for the package TIMP","Author":"Joris J. Snellenburg and Katherine M. Mullen and Ivo H. M. van Stokkum","Session":"foc-chemo-1-3","Keywords":"chemometrics","Abstract":"The package TIMP is in use by biophysicists who seek to discover\nmodels for (photo)-physical processes in complex systems. The\nmeasurements under consideration most often represent some\nspectroscopic property resolved with respect to time, and the goal is\ntypically to discover a nonlinear model for the kinetics. This problem\nis approached by postulating an initial model, in which the spectra\nassociated with the system are obtained as conditionally linear\nparameters, then optimizing the nonlinear parameters and finally\nvalidating the resulting model for physical interpretability. We have\nbeen motivated to use Java to develop an interface to TIMP for several\nreasons. One reason is that many of the scientists using TIMP prefer a\ngraphical user interface (GUI) to a command line interface. Another\nreason is that Java, and the JFreeChart plotting library we are using,\nalong with the JRI library (part of the rJava package), allows for\nmore possibilities for interacting with plots than is currently\npossible in R alone. This facilitates interactive data exploration,\nwhich can greatly improve the rate at which models can be formulated\nand tested. 
A third reason to use Java is that it allows the GUI to be\nprogrammed with a GUI builder (we use the Netbeans Integrated\nDevelopment Environment (IDE)) as opposed to manually specifying the\nparameters of widgets in R code. We feel this allows for a flexible\nmodular design which is easily extended by other developers. Finally,\nwe require a fully cross-platform interface, for which Java is\nwell-suited. Here we showcase the current capabilities of the\ninterface and demonstrate its usability through several case\nstudies, fitting kinetic models to time-resolved fluorescence and\nabsorption data."} {"Title":"SQLiteMap: package to manage vector graphical maps using SQLite","Author":"Norbert Solymosi and Andrea Harnos and Jenõ Reiczigel","Session":"foc-conn-2-2","Keywords":"connectivity","Abstract":"Some server-based database management systems have implemented the OpenGIS\n“Simple Features Specification for SQL” [1]. The OpenGIS specification\ndefines two standard ways of expressing spatial features: the\nWell-Known Text (WKT) form and the Well-Known Binary (WKB) form. Both\nWKT and WKB include information about the type of the feature and the\ncoordinates which form the feature [2]. These systems\n(e.g. PostgreSQL/PostGIS, MySQL, ORACLE, MSSQL) allow storing the\ntopological features and the descriptive data in the same\ndatabase. This makes it possible to connect the spatial and\ndescriptive tables without any interface and to access the spatial\ndata by a large number of users in a secure way. But these systems\nrequire the user to have access to a running database service, or to\ninstall a server, in order to use the spatial data. In some cases, it is useful\nif the user can use the database-stored maps on different computers and\nplatforms. SQLite is a good choice for a portable database: it is\nplatform-independent and there are several R packages to manage SQLite\ndatabases. Unfortunately, it has no spatial extension, but there is an\nSQLite extension for the SharpMap library [3]. Following the idea of this\nsolution we developed a package that may help the user read and write\nspatial features from and to an SQLite database. Each table with a\ngeometry field is treated as a layer. The tables contain the\ntopological features (polygon, linestring, point etc.) in one geometry\nfield in WKT form. [1] http://www.opengeospatial.org/standards\n[2] http://postgis.refractions.net/ [3] http://www.codeplex.com/SharpMap"} {"Title":"Approximate Conditional-mean Type Filtering for State-space Models","Author":"Bernhard Spangl and Peter Ruckdeschel and Rudolf Dutter","Session":"foc-ts-1-1","Keywords":"time series","Abstract":"We consider in the following the problem of recursive filtering in\nlinear state-space models. The classically optimal Kalman filter\n(Kalman, 1960; Kalman and Bucy, 1961) is well known to be prone to\noutliers, so robustness is an issue. For an implementation in R (R\nDevelopment Core Team, 2005), the first two authors have been working\non an R package robKalman (Ruckdeschel and Spangl, 2007), where a\ngeneral infrastructure is provided for robust recursive filters. In\nthis framework the rLS (Ruckdeschel, 2001) and the ACM (Martin, 1979)\nfilter have already been implemented, the latter as an equivalent\nrealization of the filter implemented in Splus. While this ACM filter is\nbound to the univariate setting, based on Masreliez’s result\n(Masreliez, 1975) the first and the third author propose a generalized\nACM type filter for multivariate observations (Spangl and Dutter,\n2008). 
This new filter is implemented in R within the robKalman package\nand has been compared to the rLS filter by extensive simulations."} {"Title":"Regression Model Development and Yet Another Regression Function","Author":"Werner Stahel","Session":"foc-mod_man-1-2","Keywords":"modeling-diagnostics","Abstract":"A strategy to develop a regression model involves many steps and\ndecisions which are based on pertinent numeric tables and thorough\nanalysis of residual plots. With the standard regression functions\navailable in R, such an assessment consists of several function calls\nand informed settings of their arguments, depending also on the type\nof target variable (continuous, count, binary, multinomial,\nmultivariate, ...). The examination of a logistic regression fit, e.g.,\ninvolves calling glm, summary, drop1, influence, plot, and termplot,\nand selecting the useful information from what is obtained from\nthem. This contribution presents a user-oriented function that makes\nthe sensible choices for the different models. It produces an object\nwhich gives the useful information for judging the model fit when\nprinted and plotted. More specifically, the function accepts the same\narguments as lm or glm, and some more. It also accepts ordered,\nmultinomial, and multivariate responses. Of course, calculations are\ndone by calling the available fitting functions. The function stores\nresults that are produced by the fitting function and by calling\nsummary on the object, as well as some additional ones, like the\nleverage values. If printed, it gives a table that contains, for\ncontinuous or binary explanatory variables, the coefficients, their\nP-values, the collinearity measure R_j^2 and a new measure of\nsignificance that additionally characterizes the confidence\ninterval. For factors, the P-value is given, since individual\ncoefficients and their P-values are of limited information content. The\ncoefficients of factor levels are reproduced separately. The last part\nof the print output is very similar to the usual output of printing the\nsummary, but includes, in the case of a glm, an\noverdispersion test if applicable. The strength of the new function\nlies in its plotting method. All residual plots use a plotting scale\nthat is not affected by outliers in the residuals, but outliers are\nstill shown in a marginal region of the plot. Most plots are\ncomplemented by a smooth by default. In order to judge the significance\nof any curvature shown by this line, 19 such lines are simulated from\nrandom data corresponding to the model. Reference lines indicate\ncontours of equal response values and help to identify suitable\ntransformations of the explanatory variables. In summary, the function\nregr and its printing and plotting methods have made my life much\neasier when developing regression models and have led to higher\nquality of analyses obtained by students."} {"Title":"A pipeline based on multivariate correspondence analysis with supplementary variables for cancer genomics","Author":"Christine Steinhoff and Matteo Pardo and Martin Vingron","Session":"foc-multi-1-3","Keywords":"multivariate","Abstract":"The development of several high-throughput gene profiling methods,\nsuch as comparative genomic hybridization (CGH) and gene expression\nmicroarrays, enables the study of specific disease patterns in\nparallel. 
The underlying assumption for studying both genomic aberrations and gene expression is that genomic aberration might affect gene expression either directly or indirectly. In cancer research, in particular, there have been a number of attempts to improve cancer subtype classification or to study the relationship between chromosomal region and expression aberrations. The intuitive way to analyze different data sources is separately and consecutively, e.g. first determine regions with copy number aberrations (possibly tissue- or patient-specific) and then look for differentially expressed (onco)genes inside these regions [1]. There is a natural reason for integrating results rather than data: strong heterogeneity does not allow sensible alignments of the source data. Still, integrative approaches, where data are fused before their analysis, are preferable. Only recently have a few integrative methods been published [2]. Nevertheless, these approaches do not integrate covariate data like tumor grading, mutation status and other disease features. These features are frequently available and of interest for an integrative analysis. We address these two problems, namely jointly analyzing different data sources and integrating supplementary categorical data. Furthermore, our approach can easily be applied to diverse data sources, even more than two, with and without supplementary patients' information. We established a new data analysis pipeline for the joint visualization of microarray expression and arrayCGH data (aCGH), and the corresponding categorical patients' information. All computational analysis steps are programmed using R and Bioconductor. The pipeline comprises four parts: (a) data discretization, (b) binary mapping, (c) gene filtering, (d) multiple correspondence analysis. The first two steps transform the data to a common binary format, a necessary step for jointly analyzing them. Filtering removes noise and redundancy by reducing the number of features (genes). We considered variance filtering, expression-aCGH correlation filtering and PCA loadings on the first two principal components. In the last pipeline step, we apply a method based on correspondence analysis, namely multivariate correspondence analysis with supplementary variables (MCASV) [3]. MCASV has been applied in the context of the social sciences but to our knowledge has not been used in the context of biological high-throughput data analysis. Features (expression and aCGH) and covariates (patients' information) are transformed into a common space. Vicinity between features and covariates can then be visualized and quantified. We can, for example, determine genes that are correlated with covariates, possibly for interesting subsets of patients. In MCASV vicinity is measured by the angle between covariate and feature. We applied our approach to a published dataset on breast cancer. Pollack et al. [4] studied genomic DNA copy number alterations and mRNA levels in primary human breast tumors. We were able to retrieve candidate genes that show strong association with grade 3 tumors and p53 mutant status. Candidate genes display significant enrichment of cancer-related GO terms. Moreover, there are interesting differences between genes selected starting from aCGH and expression data alone and genes selected by integrating the datasets.
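A minimal sketch of the first and last steps of such a pipeline follows: median-split discretization of two data sources and a plain correspondence analysis of the stacked indicator matrix computed via SVD. It is only meant to illustrate the idea of mapping features and a covariate into a common space; it is not the authors' MCASV implementation, and the grade covariate is simply appended as two extra indicator rows.

```r
## Sketch: median-split discretization of two data sources and a plain
## correspondence analysis of the stacked indicator matrix via SVD.
## (Illustrative only, not the MCASV pipeline.)
set.seed(1)
expr   <- matrix(rnorm(50 * 20), 50, 20)     # 50 genes x 20 patients
acgh   <- matrix(rnorm(50 * 20), 50, 20)     # matched aCGH log-ratios
grade3 <- rbinom(20, 1, 0.4)                 # binary covariate per patient

disc <- function(m) (m > apply(m, 1, median)) * 1   # gene-wise median split
N <- rbind(disc(expr), disc(acgh),
           grade3 = grade3, not_grade3 = 1 - grade3)

ca_svd <- function(N) {                      # classical correspondence analysis
  P  <- N / sum(N)
  rw <- rowSums(P); cl <- colSums(P)
  S  <- diag(1 / sqrt(rw)) %*% (P - outer(rw, cl)) %*% diag(1 / sqrt(cl))
  dec <- svd(S)
  list(rows = diag(1 / sqrt(rw)) %*% dec$u %*% diag(dec$d),
       cols = diag(1 / sqrt(cl)) %*% dec$v %*% diag(dec$d))
}
cc <- ca_svd(N)
plot(cc$rows[, 1:2], xlab = "dimension 1", ylab = "dimension 2",
     main = "features and covariate rows in a common space")
```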
"} {"Title":"Why and how to use random forest variable importance measures (and how you shouldn't)","Author":"Carolin Strobl and Achim Zeileis","Session":"kal-mach_num_chem-1-4","Keywords":"machine learning","Abstract":"Random forests are becoming increasingly popular in many scientific\nfields, especially in genetics and bioinformatics, for assessing the\nimportance of predictor variables in high dimensional\nsettings. Advantages of random forests in these areas are that they\ncan cope with “small n large p” problems, complex interactions and\neven highly correlated predictor variables. The talk gives a short\nintroduction to the rationale of random forests and the their variable\nimportance measures as well as the two random forest implementations\noffered in the R system for statistical computing: randomForest in the\npackage of the same name by Breiman et al. (2006) and cforest in the\npackage party by Hothorn et al. (2008). Moreover, recent research\nissues are addressed: • Solutions are presented for bias in random\nforest variable importance measures towards, e.g., predictor variables\nwith many categories (Strobl, Boulesteix, Zeileis, and Hothorn 2007)\nand correlated predictor variables (Archer and Kimes 2008). •\nCurrently suggested tests for random forest variable importance\nmeasures (Breiman and Cutler 2008; Rodenburg et al. 2008) are\ncritically discussed in an outlook."} {"Title":"R AnalyticFlow: A flowchart-style GUI for R","Author":"Ryota Suzuki","Session":"kal-gui_teach-1-1","Keywords":"user interfaces-workflow","Abstract":"R AnalyticFlow is a new flowchart-style GUI for R. A user draw an\n”analysis flow”, which contains R expression nodes connected by\ndirected edges. By ”running” a series of nodes, the corresponding R\nexpressions are executed by R engine. There are two main advantages:\n(1) flowchart-style visualization helps us to overview the processes of\ndata analysis, and (2) ”branching” the processes enables flexible\nanalysis strategies. R AnalyticFlow is written in Java, with the help\nof several opensource Java libraries. Our source code is also\nopen-sourced and available under the BSD license. It depends on Java\n(≥ 5), R (≥ 2.5.0), rJava and JavaGD. It currently runs on Windows,\nLinux and Mac OS X. Ef-prime, Inc. URL: http://www.ef-prime.com/"} {"Title":"Some Aspects on Classification, Variable Selection and Categorical Clustering","Author":"Gero Szepannek and Uwe Ligges and Claus Weihs","Session":"foc-machlearn-1-2","Keywords":"machine learning","Abstract":"The package klaR contains several utilities to handle\nclassification problems, e.g. Friedman’s RDA, an interface to svmlight\n(Joachims, 1999) as well as variable selection procedures like the\nstepclass algorithm or Wilk’s Λ, a visualization tool for SOMs or\nseveral classification performance measures (see Weihs et al.,\n2006). This poster presents recent extensions towards classification on\nminimal variable subspaces for multi class problems by performing\nclass pair wise variable selection (see Szepannek and Weihs,\n2006). Examples of situations are presented where this approach may be\nhighly beneficial in terms of misclassification rates. 
Furthermore, the k-modes algorithm (Huang, 1998) is implemented, allowing a k-means-like clustering of categorical data."} {"Title":"rsm: An R package for Response Surface Methodology","Author":"Ewa Sztendur and Neil Diamond","Session":"foc-mod_new-1-1","Keywords":"modeling","Abstract":"Introduction: rsm is an R package for Response Surface Methodology. For first-order response surfaces rsm provides calculation of the path of steepest ascent and the precision of the path. For second-order response surfaces rsm provides ridge analysis, maximum or minimum plots, canonical analysis, and the precision of the canonical analysis based on double linear regression. First-order response surfaces: The model is yi = β0 + β1 x1 + β2 x2 + . . . + βk xk + εi. The path of steepest ascent is given by xj = r bj / sqrt(b1^2 + . . . + bk^2), j = 1, . . . , k, where b1, . . . , bk are the estimates of β1, . . . , βk and r is the radius, the distance to the centre of the design region. The estimated response on the path is yhat = b0 + r sqrt(b1^2 + . . . + bk^2). Precision of the path of steepest ascent: Box (1955) and Box and Draper (1987, pp. 190-194) gave a method for computing a confidence cone for the direction of steepest ascent. The proportion of directions included in the confidence cone gives a measure of the precision of the path, obtained as the ratio of the surface area of the cap of the sphere within the confidence cone to the surface area of the sphere. See also Sztendur and Diamond (2002). Implementation: rsm provides first-order objects with print, summary and plot methods. fit1 <- firstorder(X, y) creates a firstorder fit object. print(fit1) gives the coefficients, (Intercept) 57.175, X1 -3.350, X2 -2.162, X3 0.275, X4 4.638, X5 -4.725, and adds the path of steepest ascent: X1 = -0.433r, X2 = -0.280r, X3 = 0.036r, X4 = 0.598r, X5 = -0.611r. The estimated response on the path is yhat = 57.175 + 7.734r. summary(fit1) gives, in addition, the percentage of directions excluded from the 95% confidence cone; here the 95% confidence cone for the path of steepest ascent excludes 99.03% of possible directions. plot(fit1) gives the co-ordinates of the path of steepest ascent and the predicted response on the path. Second-degree response surfaces: The equation is yhat = b0 + x'b + x'Bx, where x = (x1, . . . , xk)', b is the (k x 1) vector of the first-order regression coefficients and B is the (k x k) symmetric matrix whose diagonal elements are the pure quadratic coefficients and whose off-diagonal elements are one-half the mixed quadratic coefficients. Ridge analysis: Ridge analysis is equivalent to the path of steepest ascent applied to second-order response surfaces and was developed by A.W. Hoerl (1959) and R.W. Hoerl (1985). In ridge analysis, stationary points of the response surface subject to x'x = r^2 are found, resulting in xS = -(1/2)(B - µI)^(-1) b for various values of µ. For maximisation of the response, only values of µ greater than the largest eigenvalue of B are used; for minimisation, only values of µ less than the smallest eigenvalue are used. A ridge plot gives the radius of the stationary points of the response against the value of the Lagrangian multiplier µ. Canonical analysis: Canonical analysis of the second-degree response surface allows investigation of the underlying nature of the surface: whether it is a maximum, minimum, saddle, rising ridge, or stationary ridge. In the A canonical form, the axes are rotated so that the cross-product terms are removed, resulting in the model yhat = b0 + X'θ + X'ΛX, where Λ = diag(λ1, . . . , λk). In the B canonical form, both cross-product and linear terms are removed by shifting the origin and rotating the axes, resulting in the model yhat = yS + Xtilde'ΛXtilde. The values of the λs show the nature of the surface. If all the λs are negative, the surface is a maximum; if all the λs are positive, the surface is a minimum; if the λs are of mixed sign, the surface is a saddle; while if some of the λs are zero, the surface is a stationary ridge. The latter is particularly important, as it indicates a linear or planar maximum or minimum, rather than a point maximum or minimum. Double linear regression method: In practice, because of experimental error and mild lack of fit, λs exactly equal to 0 will not occur. However, small λs indicate that the surface can be approximated by a ridge system. The standard errors of the λs are determined using the double linear regression method, due to Bisgaard and Ankenman (1996). Implementation: rsm provides second-order objects with print, summary and plot methods. fit2 <- secondorder(X, y) creates a secondorder fit object. print(fit2) gives the coefficients, (Intercept) 59.140, x1 2.006, x2 1.004, x3 0.670, x1sq -1.999, x2sq -0.731, x3sq -0.998, x1x2 -2.801, x1x3 -2.179, x2x3 -1.154, and adds the A and B canonical forms. A canonical form: y = 59.140 + 0.273 X1 - 0.341 X2 + 2.300 X3 + 0.188 X1sq - 0.411 X2sq - 3.505 X3sq, with X1 = 0.585 x1 - 0.797 x2 - 0.149 x3, X2 = -0.280 x1 - 0.371 x2 + 0.885 x3, X3 = 0.761 x1 + 0.476 x2 + 0.440 x3; location of the stationary point: (-0.058, 0.888, -0.114); distance of the stationary point from the origin: 0.898. B canonical form: y = 59.490 + 0.188 XX1sq - 0.411 XX2sq - 3.505 XX3sq, with XX1 = 0.585 (x1 + 0.058) - 0.797 (x2 - 0.888) - 0.149 (x3 + 0.114), XX2 = -0.280 (x1 + 0.058) - 0.371 (x2 - 0.888) + 0.885 (x3 + 0.114), XX3 = 0.761 (x1 + 0.058) + 0.476 (x2 - 0.888) + 0.440 (x3 + 0.114). summary(fit2) adds the standard errors of the λs based on the double linear regression method: (Intercept) estimate 5.949e+01, std. error 8.307e-01, t value 71.610, Pr(>|t|) 1.02e-13; XX1sq 1.880e-01, 1.188e-01, 1.583, 0.148; XX2sq -4.114e-01, 3.775e-01, -1.090, 0.304; XX3sq -3.505e+00, 4.473e-01, -7.835, 2.61e-05. plot(fit2) gives the ridge plot and the maximum or minimum plot. [Poster figures, not reproduced here: path-of-steepest-ascent co-ordinates and maximum response against radius; ridge co-ordinates and minimum response against the Lagrangian multiplier and radius; the confidence cone shown as a cap of the sphere.]"} {"Title":"Collaborative Development Using R-Forge","Author":"Stefan Theussl and Achim Zeileis and Kurt Hornik","Session":"kal-visual-1-1","Keywords":"community services","Abstract":"A key factor in open source software development is the rapid creation of solutions within an open, collaborative environment. The open source model had its major breakthrough with the increasing usage of the internet. Online communities successfully combined not only their programming efforts but also their knowledge, work and even their social lives.
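The first-order calculations summarised in the rsm abstract above reduce to a few lines of base R. The sketch below is only the underlying arithmetic for the path of steepest ascent, not the rsm package's firstorder()/print methods.

```r
## Sketch: path of steepest ascent from a first-order fit in coded units.
## (Illustrative arithmetic only, not the rsm implementation.)
steepest <- function(fit, radius = seq(0, 5, by = 0.5)) {
  b    <- coef(fit)[-1]                     # drop the intercept
  u    <- b / sqrt(sum(b^2))                # unit direction of steepest ascent
  path <- outer(radius, u)                  # coded coordinates x_j = r * b_j / ||b||
  yhat <- coef(fit)[1] + radius * sqrt(sum(b^2))  # predicted response on the path
  cbind(radius, path, yhat)
}

## toy first-order design in coded variables
set.seed(3)
d <- data.frame(X1 = rep(c(-1, 1), each = 8), X2 = rep(c(-1, 1), 8))
d$y <- 57 - 3.3 * d$X1 - 2.2 * d$X2 + rnorm(16)
steepest(lm(y ~ X1 + X2, data = d))
```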
The consequence was an increasing demand for centralized resources, e.g., to manage projects or source code. The most famous such platform, the world's largest open source software development web site, is SourceForge.net. For a decade, the R Development Core Team as well as many R package developeRs have been using development tools like Subversion (SVN) or Concurrent Versions System (CVS) for managing their source code. A central repository is hosted by ETH Zürich, mainly for managing the development of the base R system. Now, the R project wants to provide infrastructure for the entire R community. R-Forge (http://R-Forge.R-project.org) is a set of tools based on the open source software GForge, a fork of the open source version of SourceForge.net. It aims to provide a platform for collaborative development of R packages, R-related software or other projects which are somehow related to R. It offers source code management facilities through SVN and a wide variety of web-based services. Furthermore, packages hosted on R-Forge are built daily for various operating systems, i.e., Linux, Mac OS X and Windows. These package builds are downloadable from the project's website on R-Forge as well as installable directly in R via install.packages(). In our talk we show how package developeRs can get started with R-Forge. In particular we show how people can register a project, use R-Forge's source code management facilities, distribute their packages via R-Forge, host a project-specific website, and finally submit a package to CRAN."} {"Title":"Multivariate Data Analysis in Microbial Ecology - New Skin for the old Ceremony","Author":"Jean Thioulouse","Session":"invited","Keywords":"invited","Abstract":"The molecular biology revolution has a particularly strong impact on microbial ecology, as molecular methods are now giving access to data that were previously impossible to obtain. Soil microbial ecology studies are a good example of this situation. Knowledge of soil bacterial diversity is of great interest, both from an applied agronomical perspective, and in the framework of theoretical ecological models like the species-area relationship. Previously, only culturable species could be studied, which represent an extremely small fraction of the total bacterial community.
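As the R-Forge abstract above notes, the daily package builds can be installed directly from within R; the one-liner below shows the usual pattern, with a placeholder package name.

```r
## install a package from its daily R-Forge build ("somePackage" is a placeholder)
install.packages("somePackage", repos = "http://R-Forge.R-project.org")
```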
Lastly, graphical user interfaces are also needed to help\nbiologists master the intricacies of some R functions."} {"Title":"Chipster: A graphical user interface to DNA microarray data analysis using R and Bioconductor","Author":"Jarno Tuimala","Session":"foc-bioinf-1-2","Keywords":"bioinformatics-workflow","Abstract":"In order to enable more researchers to benefit from the method\ndevelopment in the Bioconductor-project, we have created analysis\nsoftware Chipster for microarray data. Chipster offers an intuitive\ngraphical user interface to a comprehensive collection of up-to-date\nanalysis methods. Chipster supports all major DNA microarray platforms\nand, being a Java program, it is compatible with Windows, Linux and\nMacOS X. The basic analysis features such as preprocessing,\nstatistical tests, clustering, and annotation are complemented with,\ne.g., linear (mixed) modeling, bootstrapping hierarchical clustering\nresults, and finding periodically expressed genes from time series\ndata. Analysis history is automatically recorded, and the analysis\nscripts can be viewed at the source code level. Chipster can not only\ndisplay images produced by R and Bioconductor, but also produce\ninteractive visualizations for various clustering results, 2D and 3D\nscatter plots, histograms and time series plots. Users can freely\nchoose different features of datasets to be plotted, such as log\ntransformations of expression values. Graphical client software runs\non the user’s computer, and connects to a remote server environment\nthrough a front-end server. Chipster can also connect to external Web\nServices. There is a possibility to set up a stand-alone version of\nthe analysis environment on a Linux system, and an open source version\nwill be available through SourceForge. The technical implementation is\ndesigned to maximize flexibility and minimize memory usage and data\ntransfer between components. New tools can be added using a simple\nannotation system, and no modifications or wrappers are\nneeded. Analyzer instances are pooled so that analysis requests can be\nprocessed as fast as possible. For more information about Chipster,\nplease see: http://www.csc.fi/molbio/microarrays/nami\nhttp://chipster.csc.fi http://www.sourceforge.org/projects/chipster"} {"Title":"Custom Functions for Specifying Nonlinear Terms to gnm","Author":"Heather Turner and David Firth and Andy Batchelor","Session":"kal-model-1-3","Keywords":"modeling-extensions","Abstract":"gnm is a function provided by the gnm package for fitting generalized\nnonlinear models. These models extend the class of generalized linear\nmodels by allowing nonlinear terms in the predictor. Nonlinear terms\ncan be specified in the model formula passed to gnm by functions of\nclass nonlin. A number of these functions are provided by the gnm\npackage. Some specify basic mathematical functions, such as Exp for\nspecifying an exponentiated term, whilst others are more specialized,\nsuch as the Dref function for specifying diagonal reference terms as\nproposed by Sobel (1981, 1985). Users are able to nest the nonlin\nfunctions provided by gnm in order to specify more complex nonlinear\nterms. However this functionality is limited in the terms that can be\nspecified and can result in rather long-winded model descriptions. The\nalternative is to write a custom nonlin function to fit the desired\nterm. 
Turner and Firth (2007) explain how to write such a function\nusing a standard example of a logistic model; whilst this provides a\nuseful illustration, that particular model would be more simply\nhandled in practice using nls. In this talk we demonstrate how to\nwrite a custom nonlin function in the context of a novel application\nof generalized nonlinear models. Our application is modelling the\nhazard of entry into marriage for women in Ireland, based on data from\nthe Living in Ireland Survey conducted in 1994-2001 by the Economic\nand Social Research Institute. We propose a nonlinear discrete-time\nhazard model, extending the approach of Blossfeld and Huinink\n(1991). This model may be fitted as a generalized nonlinear model, but\nrequires a custom nonlin function to specify the terms. We show how to\nwrite such a function, exploring the different options available and\nconsidering the difficulties that can arise.\n"} {"Title":"Using R to test Bayesian adaptive discrete choice designs","Author":"Boris Vaillant","Session":"foc-misc-1-1","Keywords":"business, bayesian","Abstract":"We present a proof of concept in R for the implementation of truly\nadaptive discrete choice designs. These algorithms use MC methods to\nupdate the posterior probability after each new answer and generate\nnew product comparisons based on a variety of possible target measures\n(A / D-criterion, minimal expected entropy of the posterior or maximal\nentropy of the next question). We provide results comparing different\nadaptive strategies with fixed MNL- and linear designs based on a\nsimulation study performed in R. Compared to well-known industrial\nsolutions for adaptive question generation our methods are\nconsistently based on discrete choice theory and should therefore lead\nto more reliable results."} {"Title":"Refactoring R Programs","Author":"Tobias Verbeke","Session":"foc-conn-1-3","Keywords":"connectivity","Abstract":"Refactoring code has been daily bread for developers since the advent\nof programming languages and is given a central role in modern\nprogramming methodologies such as eXtreme programming. Automation of\nrefactoring operations is therefore supported by many professional\nIDEs for common programming languages. For the R language, there has\nnot yet been an in-depth study of refactoring operations and the\ncurrent IDEs have no or limited support for it. In this presentation\nwe determine how the specificities of the R language (as a functional\nlanguage with object orientation) impact R software change and\nrefactoring. In a first part, the common refactoring operations are\nreviewed and a typology of the operations is proposed. The typology is\nconfronted with other refactoring categorizations and frameworks\npublished in the software engineering literature. Special attention\nwill be given to the possibilities the R package concept offers to keep\nR code and other software artifacts (documentation, tests, etc.) in\nsync. In a second part, a reflection is offered on user interfaces for\nautomated refactoring (refactoring browsers etc.). This reflection will\nbe based on studying interfaces for other programming languages in\ncomparative perspective. 
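The core of the adaptive procedure described in the Vaillant abstract above, updating a Monte Carlo posterior over part-worths after each binary choice and scoring candidate comparisons, can be sketched in a few lines. This is a toy illustration under a logit choice model with a crude expected-information score; it is not the author's implementation, and the score is only a heuristic stand-in for the A/D- and entropy-based criteria mentioned there.

```r
## Sketch: Monte Carlo update of the posterior over part-worths after one
## binary choice under a logit model, plus a crude expected-information
## score for ranking candidate next comparisons. Toy illustration only.
set.seed(11)
K     <- 3                                   # number of attributes
draws <- matrix(rnorm(5000 * K), ncol = K)   # prior draws of part-worths
w     <- rep(1 / nrow(draws), nrow(draws))   # importance weights

update_posterior <- function(draws, w, xdiff, choice) {
  p   <- plogis(draws %*% xdiff)             # P(choose alternative A | beta)
  lik <- if (choice == 1) p else 1 - p
  as.vector(w * lik / sum(w * lik))
}

info_score <- function(draws, w, xdiff) {    # heuristic expected logit information
  p <- plogis(draws %*% xdiff)
  sum(w * p * (1 - p)) * sum(xdiff^2)
}

## respondent chooses alternative A for attribute difference xdiff
xdiff <- c(1, -1, 0)
w <- update_posterior(draws, w, xdiff, choice = 1)
## rank two candidate next questions by the heuristic score
sapply(list(c(1, 0, 0), c(0, 1, -1)), function(x) info_score(draws, w, x))
```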
The resulting refactoring framework and interface are planned to be integrated into the StatET eclipse plugin for R, though it is hoped that other IDEs will benefit from our results as well."} {"Title":"Segmented Poisson Models","Author":"Enrique Vidal and Roberto Pastor-Barriuso and Marina Pollan and Gonzalo Lopez-Abente","Session":"foc-pharma-1-3","Keywords":"pharmacokinetics","Abstract":"Standard dose-response analyses (such as categorical, spline, or nonparametric regression) provide flexible tools to describe the overall shape of the dose-response relation across the entire exposure range, but the identification of trend changes with these methods is subjective. Specific methods are needed to formally test for the existence of change-points in risk trends. We propose a log-linear model for aggregated data with Poisson variance and a free dispersion parameter, in which the predictor function consists of two intersecting straight lines connected at an unknown change-point through a hyperbolic transition function that allows for abrupt changes or more gradual transitions between the linear trends. The model, which was implemented as an R function, provides a p-value for the existence of a change-point, as well as point and interval estimates for its location and the slopes below and above it. An application to two different scenarios is presented. First, the relationship between Spanish renal cancer mortality (period 1994-2003) and distance to metallurgical facilities (provided by the EPER register) at the municipal level was analysed, adjusting for age group, sex and socio-economic indexes. Second, we looked for changes in the time trend of breast cancer incidence (adjusted for age) taken from Spanish registries covering 16 of the 50 Spanish provinces over the last 30 years. The results are as follows. In the first scenario, we found a significant change point for men (at 5 km from the point source, 95% CI 3 to 13 km). Below this point the relative risk decreased with distance, and above it the trend stabilized. No change point was found for women. In the second, breast cancer incidence increased in Spain during the 1970s, 80s and 90s (at a rate of 2.4 per year) and levelled off in the 21st century (change point at 1999, 95% CI 1996 to 2001). In conclusion, change-point models appear to offer a good alternative to linear dose-response relationships when using regression in a range of epidemiological situations. "} {"Title":"RGG: An XML-based GUI Generator for R Scripts","Author":"Ilhami Visne and Klemens Vierlinger and Friedrich Leisch and Albert Kriegner","Session":"foc-gui_build-1-1","Keywords":"user interfaces-java","Abstract":"R is the leading open source statistics software with many analysis packages developed by the user community. However, the use of R requires programming skills. We have developed a software tool, called R GUI Generator (RGG), which enables the generation of Graphical User Interfaces (GUIs) for R scripts by adding a few simple XML tags. An RGG file (.rgg), which contains R code and GUI elements, serves as a template for the GUI engine. The GUI engine loads the RGG file and at runtime creates and arranges GUI elements from the XML tags. User-GUI interactions are converted into the corresponding R code, which replaces the XML tags. As a result, a new R script is generated from the template.
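The change-point model described in the Segmented Poisson Models abstract above can be profiled with standard glm machinery. The sketch below uses a bent-hyperbola term as one possible smooth transition between two linear trends and a grid search over the change-point; it is an illustration of the idea, not the authors' R function, and their exact transition function may differ.

```r
## Sketch: two intersecting linear trends joined at a change-point tau,
## with a hyperbolic transition of half-width gamma (profile grid search).
## Illustrative only; quasipoisson gives a free dispersion parameter.
bent <- function(x, tau, gamma) ((x - tau) + sqrt((x - tau)^2 + gamma^2)) / 2

fit_cp <- function(y, x, offset, taus, gamma = diff(range(x)) / 20) {
  dev <- sapply(taus, function(tau)
    deviance(glm(y ~ x + bent(x, tau, gamma), family = quasipoisson,
                 offset = offset)))
  tau <- taus[which.min(dev)]
  list(tau = tau,
       fit = glm(y ~ x + bent(x, tau, gamma), family = quasipoisson,
                 offset = offset))
}

## toy data: log-rate slope changes at x = 5
set.seed(2)
x   <- runif(300, 0, 10)
pop <- rpois(300, 5000)
mu  <- exp(log(pop) - 7 - 0.15 * x + 0.15 * pmax(x - 5, 0))
y   <- rpois(300, mu)
res <- fit_cp(y, x, offset = log(pop), taus = seq(1, 9, by = 0.1))
res$tau
summary(res$fit)$coefficients
```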
The project's aim is to provide R developers with a tool to make R-based statistical computing available to a wider audience less familiar with script-based programming. The project further includes the development of a repository and documentation system for R GUIs being developed by the community. The project's website is at http://rgg.r-forge.r-project.org."} {"Title":"GridR: Distributed Data Analysis using R","Author":"Dennis Wegener and Stefan Rüping and Michael Mock","Session":"foc-highperf-3-1","Keywords":"high performance computing","Abstract":"In the last couple of years, the amount of data to be analyzed in many areas has grown rapidly. Examples range from the natural sciences (e.g. astronomy or particle physics), business data (e.g. a large increase in data volume is expected from the use of RFID technology), and the life sciences (such as high-throughput genomics and post-genomics technologies) to data generated by ordinary users on the internet (see Google, Youtube, etc.). The enormous growth in the amount of data is complemented by advances in distributed computing technology enabling the data analyst to handle this amount of data in reasonable time. Two main streams of current distributed-technology development and research are particularly useful in this respect: grid technology aims at making geographically widely spread data stores and computing facilities available for common, global data analysis, while cluster-based computing transforms large numbers of standard computers into high-performance computing platforms. However, even if the above-mentioned advances in distributed computing technology make available the computing and storage resources for handling large amounts of data, they introduce another level of complexity into the system, such that the traditional data analyst, with a strong background in statistics and application domain knowledge, might be overwhelmed by the complexity of the underlying distributed technology. For instance, an application developer using R might not be interested in any details of how web services are built. Therefore, ongoing research aims at bridging the gap between advanced distributed computing technology and traditional statistical software. The Advancing Clinico-Genomics Trials on Cancer project (ACGT) aims at providing a data analysis environment that allows the exploitation of an enormous pool of data collected in European cancer treatments. In the context of this project, the GridR package was developed, which was one of the first attempts to connect R to a grid environment - to grid-enable R. "} {"Title":"Commercial meets Open Source - Tuning STATISTICA with R","Author":"Christian Weiß","Session":"foc-gui_frontend-1-2","Keywords":"user interfaces-embedding","Abstract":"R is an extremely powerful environment for statistical computing: it provides packages designed for different areas such as data mining, econometrics, epidemiology and biostatistics, and it offers methods from different statistical disciplines like time series analysis, statistical process control, bootstrapping, cluster analysis, and others. Besides its mere extent, R differs from competing statistics environments in that it also reflects the state of the art in statistical science. And not to forget: R is freely available.
On the other hand, R is not particularly user-friendly: it does not offer a graphical user interface in which the repertoire of methods is fully integrated and available also to users who have not learnt the R language, and it does not offer a powerful spreadsheet environment that would enable intuitive data manipulation. Therefore, (potential) users from the applied sciences and industry are often reluctant to work with R. In this talk, I propose to combine the power of R with the comfort of a commercial package like STATISTICA. STATISTICA can be used as an easily operated interface with a respectable basic set of statistical procedures (see Weiß (2006)). But if required, one can easily integrate specialised statistical procedures and sophisticated techniques offered by R into the user interface of STATISTICA. Besides the base version of STATISTICA with its Visual Basic development environment, and besides R together with the required packages, the user only needs to install the R DCOM Server of Baier & Neuwirth (2007). The necessary procedure and essential commands to access R from STATISTICA are explained; see also StatSoft (2003). A number of examples highlight situations where R can be used to extend the functionality of STATISTICA. Among others, we explain how an ARL calculator for computing average run lengths of EWMA and CUSUM control charts can be programmed using the spc package of Knoth (2007). The ARL calculator supports the design of these control charts, which are themselves available through STATISTICA. "} {"Title":"A Compendium Platform for Reproducible, R-based Research with a focus on Statistics Education","Author":"Patrick Wessa","Session":"foc-teach-2-1","Keywords":"teaching","Abstract":"This paper discusses a new Compendium Platform (CP) that allows us to create Reproducible Research in R that is easily accessible to anyone with internet access (freestatistics.org). The platform is based on the R Framework (wessa.net) and primarily focuses on ICT-based Statistics Education within a pedagogical paradigm of individual and social constructivism, which has received a great deal of interest in the academic community (Von Glasersfeld (1987), Erick Smith (1999), Eggen and Kauchak (2001), and Nyaradzo Mvududu (2003)). The basic idea is to create an environment where students are allowed to interact with each other (and the tutor) about a series of research-related activities (such as assignments or workshops) based on the R language and the R Framework. The novelty of this approach lies in the fact that the newly developed CP empowers students to easily archive, exchange, reproduce, and reuse R computations. The underlying technology facilitates the creation of a learning environment that supports social constructivism and is very similar to the real world of applied statistical research. More importantly, the CP allows us to obtain physical measurements of the actual learning process of students, based on detailed information about the use of the statistical software and the socially constructivist learning activities (based on peer review of statistical analyses in R). The CP was thoroughly tested in two undergraduate statistics courses with large student populations. During these courses a large number of physical and survey-based measurements were obtained and studied.
The preliminary analysis of the relationships between learning attitudes, social interaction (through group work and peer review), learning experiences, software usability, usage of archived R computations, and exam scores (which relate to statistical competences rather than knowledge) is presented. One of the most interesting results is that social interaction through peer review based on Reproducible Research (which is used as a “learning activity” rather than an “evaluation tool”) is very beneficial for students' learning experiences and exam scores. Also, there is a strong, positive relationship between the use of the CP and exam performance, even when other important factors are taken into account. Another interesting result is that a large majority of students have a positive perception of the new system as a learning tool and prefer the constructivist approach based on Compendia over traditional learning methods. In addition, it is (very) briefly illustrated how Compendia of Reproducible Research can be used to: • write Compendium-based course materials, • detect plagiarism and free-riding, • quickly identify (and find solutions for) bugs and computing-related problems, • estimate the workload of an assignment, and • support new forms of collaboration that lead to improved solutions in R. Finally, some important aspects of the near and distant future of the CP and the underlying R Framework (for the purpose of education, scientific research, and publishing) are illustrated and discussed."} {"Title":"Analyzing paired-comparison data in R using probabilistic choice models","Author":"Florian Wickelmaier","Session":"foc-numerics-1-4","Keywords":"psychometrics","Abstract":"When human subjects are required to evaluate a set of options or stimuli with respect to some attribute, the simplest data that can be obtained are binary paired-comparison judgments. Such data might result from so-called sensory evaluation studies, where participants are asked to judge which of two audio samples sounds brighter or more natural, or which of two coffee brands tastes better, or from surveys where subjects are to indicate which political party they would vote for or which insurance package they would rather buy. The goal of the analysis is to arrive at a scaling of the options involved. A well-known model for paired-comparison data is the Bradley-Terry-Luce (BTL) model, which relates the pairwise choice probabilities to scale values representing the weight or strength of each option. Often in empirical studies, however, it is found that the data do not meet the restrictions imposed by the BTL model, one of them being that the choices are made independently of the context introduced by a given pair. In psychology, more general models have been developed, the most prominent one being the elimination-by-aspects (EBA) model (Tversky, 1972; Tversky & Sattath, 1979), which does not require context independence of the judgments. Although these general models seem to be promising alternatives to the BTL model, they have not been frequently applied, presumably due to the lack of easy-to-use software for their fitting and testing. The presentation will illustrate the analysis of paired-comparison data using the eba package in R (Wickelmaier & Schmid, 2004).
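A BTL fit of the kind discussed above can also be obtained without special software via the standard logit construction, modelling each pairwise count as binomial with a +1/-1 design matrix. The sketch below is this textbook construction on a small made-up table of wins, not the eba package interface.

```r
## Sketch: Bradley-Terry-Luce scale values from a paired-comparison table
## via logistic regression (textbook construction, not the eba interface).
## wins[i, j] = number of times option i was preferred over option j (toy data).
wins <- matrix(c( 0, 26, 32,
                 14,  0, 29,
                  8, 11,  0), 3, 3, byrow = TRUE,
               dimnames = list(LETTERS[1:3], LETTERS[1:3]))

pairs <- t(combn(nrow(wins), 2))
X <- matrix(0, nrow(pairs), nrow(wins),
            dimnames = list(NULL, rownames(wins)))
X[cbind(seq_len(nrow(pairs)), pairs[, 1])] <-  1
X[cbind(seq_len(nrow(pairs)), pairs[, 2])] <- -1
yes <- wins[pairs]                       # times i was preferred over j
no  <- t(wins)[pairs]                    # times j was preferred over i

fit <- glm(cbind(yes, no) ~ X[, -1] - 1, family = binomial)  # option A as reference
exp(c(A = 0, coef(fit)))                 # BTL scale values, up to normalisation
```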
It will be demonstrated with examples from empirical research that, whenever similarity among the options of a choice set plays a role, the modeling is more successful when more complex choice models, such as EBA, are employed."} {"Title":"Deploying Data Mining in Government - Experiences With R/Rattle","Author":"Graham J. Williams","Session":"invited","Keywords":"invited","Abstract":"Whilst R and its many packages provide an incredibly broad and comprehensive environment for data mining in practice, there are many challenges in bringing its power to the common data mining practitioner. It is a sad fact that many analysts today only feel comfortable with the usually limiting graphical user interfaces. Yet, we can unleash the full power of analytics only through languages like R. In this presentation I will reflect on how we are bringing the power of R to a large community of data analysts and new data miners through the development of the award-winning open source Rattle package for R. I will present some case examples of using R from the Australian Taxation Office and discuss how we tackled various problems in using data mining tools in practice."} {"Title":"Real-Time Market Data Interfaces in R","Author":"Rory Winston","Session":"foc-finance-1-3","Keywords":"finance","Abstract":"Historically, R usage tends to centre more on offline than on real-time data analysis. However, there are some reasons why a real-time market data interface can be of benefit. In this talk, I will describe an interface to the Reuters market data system that I built whilst working on a real-time foreign exchange algorithmic trading project. This proved to offer some surprising benefits, and the addition of a market data query interface into R, combined with its vast library of analysis functions and easily extensible native interface, makes it an incredibly powerful tool."} {"Title":"Computational Finance and Financial Engineering: The R/Rmetrics Software Environment","Author":"Diethelm Würtz and Yohan Chalabi","Session":"kal-ts-1-2","Keywords":"finance","Abstract":"R/Rmetrics has become the premier open source solution for teaching financial market analysis and the valuation of financial instruments. With hundreds of functions built on modern methods, R/Rmetrics combines explorative data analysis and statistical modeling. Rmetrics is embedded in R; together they build an environment that gives students a first-class system for applications in statistics and finance. At the heart of the software environment are powerful time/date and time series management tools, functions for analyzing financial time series, functions for forecasting, decision making and trading, functions for the valuation of financial instruments, and functions for portfolio design, optimization and risk management. In this talk we give an overview of R/Rmetrics and present new directions and recent developments. "} {"Title":"Statistical Animations Using R","Author":"Yihui Xie","Session":"foc-teach-1-1","Keywords":"teaching","Abstract":"Animated graphs that demonstrate statistical ideas and methods can both attract interest and assist understanding. This paper describes approaches that may be used to create animations, and gives a brief overview of the R package animation. It gives examples of the use of animations in teaching statistics and in the presentation of statistical reports.
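The basic mechanism behind such animations, redrawing a plot in a loop with a short pause between frames, needs only base R. The sketch below animates the sampling distribution of the mean; it does not rely on the animation package's own helpers (which wrap loops like this one and can export them, e.g. to file-based formats).

```r
## Sketch: a simple statistical animation in base R -- the sampling
## distribution of the mean builds up as more samples are drawn.
set.seed(123)
means <- numeric(0)
for (i in 1:100) {
  means <- c(means, mean(rexp(30)))          # mean of a fresh Exp(1) sample
  hist(means, breaks = 20, col = "grey", xlim = c(0.4, 1.8),
       main = sprintf("Sampling distribution of the mean after %d samples", i),
       xlab = "sample mean")
  abline(v = 1, col = "red", lwd = 2)        # true mean of Exp(1)
  Sys.sleep(0.05)
}
```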
Animations can add insight and interest to traditional static approaches to teaching statistics, making statistics a more interesting and appealing subject. "} {"Title":"Using R for Spatial Shift-Share Analysis","Author":"Gian Pietro Zaccomer and Luca Grassetti","Session":"foc-econom-1-2","Keywords":"econometrics, spatial","Abstract":"During the second half of the 20th century, Shift-Share Analysis (SSA) has been widely applied in economic growth studies. Starting from the formulation adopted by Dunn (1960), the literature proposed various decomposition procedures based on the identification of three or more components. SSA has always been considered a spatial statistics tool, but only with Nazara and Hewings (2004) was the spatial dimension actually incorporated into the model specification. The authors, in fact, introduced the effect of interaction between territorial units by means of a spatial weights matrix. The proposed model is based on a generic row-standardized weighting matrix. Consequently, the authors did not address the problem of weight construction. Zaccomer (2006) proposed a solution based on the variables deriving from the Italian register of businesses. The information derived from this register can be used to define two important decomposition factors: the economic activity in the NACE-ATECO classification and the firms' legal status. In the cited article, instead of the well-known spatial weighting systems based on contiguity or on generic distance functions, the author proposed an economic concept of neighborhood. In fact, the considered matrices are based on a given economic subdivision such as, for example, the Local Labor Systems (LLS) or the Industrial Districts (ID). The neighborhood defined by the “economic contiguity” can be considered the best choice if the units' partition is based on supplementary information about the studied phenomenon. For example, the IDs are based on the observation of firms' productive networks and can be used to study labour growth rates. In this work we study the flexibility of the spatial shift-share model applied to the analysis of labour growth rates observed in the local system of Friuli Venezia Giulia and its LLS. All computational routines and the plotting and printing functions are developed using R (R Development Core Team, 2007). "} {"Title":"Some Perspectives of Graphical Methods for Genetic Data","Author":"Jing Hua Zhao and Qihua Tan and Shengxu Li and Jian'an Luan","Session":"foc-biostat_model-1-1","Keywords":"biostatistics-modeling","Abstract":"Recent initiatives have made genetic data on single-nucleotide polymorphisms (SNPs) in humans widely available. Association studies between these SNPs and a host of measures in humans and other species have led to a vigorous development of analytical tools, as well as a greater understanding of the genetic basis of common diseases. Among the many aspects of the data analysis, there is a need to synthesise the graphical methods involved. I give a brief account of the background, provide examples from recent analyses, and draw attention to further work. Specifically, the examples cover a number of aspects: 1. Phenotypic data. While this includes the usual summary statistics, it may also be specific to the genetic context, such as pedigree drawing. 2. Genotypic data. This includes plotting missing data, Hardy-Weinberg equilibrium (HWE), and the correlation between neighbouring SNPs (LD). 3.
Assessment of population substructure and genotype-phenotype association. This includes scree plots, Manhattan plots, Q-Q plots, SNP-based summary plots and regional association plots. 5. Representation of pathways. Other examples may arise from studies of power, meta-analysis and interactions. In addition, a comparison will be made between graphics from CRAN packages and popular standalone programs such as LD plot."} {"Title":"Modelling biodiversity in R: the untb package","Author":"Robin Hankin","Session":"foc-environ-2-2","Keywords":"environmetrics-misc","Abstract":"The distribution of abundance amongst species with similar ways of life is a classical problem in ecology. The Unified Neutral Theory of Biodiversity (UNTB), due to Hubbell, states that observed population dynamics may be explained on the assumption of per capita equivalence amongst individuals. One can thus dispense with differences between species, and differences between abundant and rare species: all individuals behave alike in respect of their probabilities of reproduction and death. It is a striking fact that such a parsimonious theory results in a non-trivial dominance-diversity curve (that is, the simultaneous existence of both abundant and rare species), and even more striking that the theory predicts abundance curves that match observations across a wide range of ecologies. The UNTB, being a statistical hypothesis, is well-suited to simulation using the R computer language. Here I discuss the untb package for numerical simulation of ecological drift under the unified neutral theory. A range of visualization, analytical, and simulation tools are provided in the package and these are presented with examples and discussion."}
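To illustrate the kind of ecological drift the untb package simulates, here is a toy neutral-model loop written in plain R: at each step a random individual dies and is replaced either by a local birth or by an individual of an entirely new species. It is a sketch of the mechanism only, not the untb API, and the parameter values are arbitrary.

```r
## Sketch: neutral ecological drift for a single local community.
## Each time step one individual dies and is replaced either by the offspring
## of a randomly chosen survivor or, with probability nu, by an individual of
## an entirely new species (speciation/immigration).
neutral_drift <- function(J = 500, nu = 0.01, steps = 20000) {
  community    <- rep(1L, J)               # start as a monoculture
  next_species <- 2L
  for (t in seq_len(steps)) {
    dead <- sample.int(J, 1)
    if (runif(1) < nu) {
      community[dead] <- next_species      # new species enters
      next_species <- next_species + 1L
    } else {
      community[dead] <- sample(community[-dead], 1)   # local replacement
    }
  }
  sort(table(community), decreasing = TRUE)            # species abundances
}

set.seed(99)
abund <- neutral_drift()
plot(as.numeric(abund), log = "y", type = "b",
     xlab = "species rank", ylab = "abundance",
     main = "Dominance-diversity curve under neutral drift")
```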