# STATISTICS AND IT (INFORMATION SCIENCE AND TECHNOLOGY)

## Learning outcomes of the course unit

The Statistics and Computer Science course aims to teach students to use statistical software: to choose the methods best suited, in terms of robustness and power, to the problem and to the characteristics of the data; to interpret the output correctly; to explain the methods and the logical steps on which they are founded; and to justify the choice of tests.

The course covers descriptive statistics, the main theoretical distributions, and the inferential tests most frequently used in research and in professional practice (chi-square, Student's t, crossed and nested ANOVA with two or more factors or levels and possible interaction, linear regression and correlation).

In addition to parametric methods, the course discusses the most common nonparametric methods available in statistical software, which are particularly useful when the variability of the data is large and/or the measurements are approximate.

By the end of the course, students should be able to process data collected in the field and in the laboratory, to present them correctly in company reports and/or publications in international journals, and to understand and evaluate the univariate and bivariate statistical analyses reported in international journals.

In particular, the student should be able to:

D1. Knowledge and understanding.

Understand the concepts of inferential statistics, formulate the null hypothesis and the alternative hypothesis, and interpret the probability derived from the data.

D2. Ability to apply knowledge and understanding.

Use computer programs, choosing the descriptive methods and inferential tests appropriate to the scientific problem that motivated the data collection.

D3. Independent judgment.

For the same scientific question, various tests are often possible; it is important to choose the most appropriate parametric or nonparametric test and to justify the choice in terms of robustness and power.

D4. Communication skills.

Explain the reasons for the choice of test, illustrate the concepts on which the method is based, and correctly interpret the probability obtained and its meaning for the discipline and for the problem that motivated the research.

D5. Learning ability.

The course provides a knowledge of statistical methods sufficient to use the most complete and widely adopted international textbooks and to understand and evaluate the inferential analyses published in international journals.

## Prerequisites

The course presents concepts and methods starting from an elementary level, for which the mathematical knowledge acquired in the mathematics, physics and chemistry courses is more than sufficient.

## Course contents summary

The first part of the course presents and discusses descriptive statistics, from tabular and graphical representations to the estimation of indices (statistics). The second part illustrates theoretical distributions with examples, starting from combinatorics and describing the binomial, Poisson, hypergeometric and normal distributions. The third and largest part explains inferential tests, with illustrations of the theory and various applications to the professional practice and research of biotechnologists: chi-square, Student's t, ANOVA in particular (in its various experimental designs), and linear regression and correlation. Finally, several nonparametric tests are presented and applied for situations with high data variability and the presence of outliers.

## Course contents

1 - Types of measurement scale. Descriptive statistics for univariate distributions. Construction of tables and graphical representations for quantitative and qualitative variables: histograms, frequency polygons, bar charts, pie charts. Pictograms and the lie factor. Indices of central tendency, dispersion, skewness and kurtosis. Number of decimals and significant digits.

Descriptive statistics exercises using the PAST program.
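The course exercises use PAST; purely as an illustration, the indices listed in point 1 can also be sketched in a few lines of Python using only the standard library. The function name `describe` and the sample data are hypothetical, and the skewness and kurtosis shown are the simple moment-based versions (other programs may apply small-sample corrections).

```python
import statistics

def describe(data):
    """Return common descriptive indices for a sequence of numbers."""
    n = len(data)
    mean = statistics.fmean(data)
    # Central moments with denominator n (population form).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(data),
        "sample_variance": statistics.variance(data),  # denominator n - 1
        "sample_sd": statistics.stdev(data),
        "skewness": m3 / m2 ** 1.5,           # 0 for symmetric data
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
    }

# Hypothetical example data, not taken from the course.
summary = describe([1, 2, 3, 4, 5])
print(summary)
```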

2 - Combinatorics; binomial, Poisson and hypergeometric distributions. Normal distribution and standard normal distribution. Exercises using the standard normal distribution and z tables.
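As a rough sketch of the distributions named in point 2, the probability functions can be written directly from their textbook definitions with the Python standard library. The function names and parameter names are illustrative assumptions, not part of the course material; `statistics.NormalDist` stands in for a printed z table.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def hypergeom_pmf(k, N, K, n):
    """P(k successes in n draws without replacement from N items, K of them successes)."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def z_score(x, mean, sd):
    """Standardize x to the standard normal scale."""
    return (x - mean) / sd

def upper_tail(z):
    """Area to the right of z under the standard normal curve."""
    return 1 - NormalDist().cdf(z)

print(binomial_pmf(2, 4, 0.5))
print(upper_tail(z_score(130, 100, 15)))
```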

3 - Comparisons between rates and probabilities. The chi-square distribution. Goodness-of-fit test; conditions of validity and Yates's correction. 2 x 2 and R x C contingency tables, for small and large samples: Fisher's exact test and the z test in 2 x 2 tables.

The G method, or log-likelihood ratio, in goodness-of-fit tests and in contingency tables.

Exercises on the chi-square test for goodness of fit and in contingency tables with PAST.
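The statistics in point 3 can be illustrated with a minimal Python sketch, assuming the usual textbook formulas; the function names and the counts are hypothetical, and the course itself performs these tests in PAST. The p-value helper applies only to 1 degree of freedom, where the chi-square tail can be written with the complementary error function.

```python
import math

def chi_square_gof(observed, expected):
    """Pearson chi-square statistic for goodness of fit."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def g_statistic(observed, expected):
    """G (log-likelihood ratio) statistic; empty cells contribute 0."""
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

def chi_square_2x2_yates(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] with Yates's correction."""
    n = a + b + c + d
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(x):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Hypothetical counts: 30 and 20 observed against 25 and 25 expected.
chi2 = chi_square_gof([30, 20], [25, 25])
print(chi2, p_value_1df(chi2))
print(g_statistic([30, 20], [25, 25]))
print(chi_square_2x2_yates(10, 20, 20, 10))
```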

4 - Alpha error (Type I) and beta error (Type II); a priori and a posteriori power. Estimation of sample sizes for comparing means with a normal distribution. Number of observations needed for a measurement with the desired accuracy.
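As one possible illustration of point 4, the sketch below uses the common normal-approximation formulas for a two-sided test: n = 2((z_{1-α/2} + z_{1-β})σ/δ)² per group for comparing two means, and n = (z_{1-α/2}σ/E)² for estimating a mean within a margin E. These formulas and all names here are assumptions chosen for the example; the course may present different variants.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Observations per group to detect a difference delta between two means
    (two-sided test, known common standard deviation sigma)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

def n_for_margin(sigma, margin, alpha=0.05):
    """Observations needed so that the confidence interval of the mean
    has half-width no larger than margin (known sigma)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z_alpha * sigma / margin) ** 2)

# Hypothetical inputs: a difference of one standard deviation, 80% power.
print(n_per_group(sigma=1.0, delta=1.0))
print(n_for_margin(sigma=10.0, margin=2.0))
```

Because the formulas use z rather than t, they slightly underestimate the sample size when the variance must be estimated from small samples.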

5 - Student's t distribution. Test for the mean of a sample and confidence interval of the mean. Comparison between the means of two dependent samples and of two independent samples. Tests for homogeneity of variance: F test, Bartlett test, Levene test. Overview of methods for comparing two means with different variances. Estimation of the minimum sizes of the two samples, with the t distribution and the z distribution. Balancing two samples.

Student's t-test exercises with the PAST program, with equal and unequal variances.
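To make the distinction between equal and different variances concrete, here is a minimal Python sketch of the two-sample t statistic in its pooled form and in Welch's form (one widely used method for unequal variances; the course overview may cover others). Function names and data are hypothetical, and the sketch stops at the statistic and its degrees of freedom, leaving the p-value to a t table or to software such as PAST.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.fmean(x) - statistics.fmean(y)) / se, n1 + n2 - 2

def welch_t(x, y):
    """Welch's t statistic for unequal variances; returns (t, approximate df)."""
    n1, n2 = len(x), len(y)
    q1 = statistics.variance(x) / n1
    q2 = statistics.variance(y) / n2
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(q1 + q2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (q1 + q2) ** 2 / (q1 ** 2 / (n1 - 1) + q2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples with clearly different variances.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
print(pooled_t(a, b))
print(welch_t(a, b))
```

With equal sample sizes the two t values coincide and only the degrees of freedom differ, which is one reason balanced samples are convenient.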

6 - One-way analysis of variance (ANOVA): comparison between two or more means. The Fisher-Snedecor F distribution and its relationship with Student's t distribution. Conditions of validity of ANOVA and tests for homoscedasticity with k samples: Hartley test, Cochran test, Bartlett test, Levene test and its variants. A priori (planned) multiple comparisons; a posteriori (post hoc) multiple comparisons: the alpha risk and the Bonferroni principle; the Bonferroni-Dunn method, Tukey's HSD, SNK and the sequential methods, the Dunnett test, the Duncan test. Applications of ANOVA and multiple comparisons with the PAST program.

ANOVA exercises with the PAST program.
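The one-way ANOVA of point 6 partitions the total variability into a between-groups and a within-groups component. The following Python sketch, with a hypothetical function name and made-up data, computes the F ratio from that partition; it does not compute the p-value, which requires the F distribution.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    # Within-groups sum of squares: observations around their group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2
                    for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical data: three groups of three observations.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

With only two groups, the F value equals the square of the pooled-variance t statistic, which is the relationship between the two distributions mentioned in point 6.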

7 - Analysis of variance with two (two-way) or more crossed criteria. Methods to reduce the number of observations: Latin squares. Relative efficiency of an experimental design. Missing data in tables with two or more crossed factors. Analysis of the interaction between two factors, with repeated measures. Interpretation of the interaction, with graphical representations. Hierarchical or nested analysis at two or more levels. Interaction in multi-factor, crossed, nested and mixed ANOVA.

Assumptions of validity of ANOVA, transformations of data; the Box-Cox method for the most appropriate transformation.

8 - Descriptive statistics for bivariate distributions. Simple linear regression: estimation of the slope b and of the intercept a; significance and confidence interval of the slope and of the intercept. Choice of the sample for testing the significance of the slope and of the intercept. The coefficient of determination R-squared. Regression through the origin: advantages and disadvantages. Inverse prediction or calibration. Comparison between the slopes of two independent samples. Introduction to the analysis of covariance (comparison of Y means with different X values).

Linear regression with repeated Y values. Calculation of the regression terms by polynomial coefficients.
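As a small illustration of the estimates named in point 8, the least-squares slope b, the intercept a and the coefficient of determination can be computed from the sums of squares and cross-products. The Python function below is a sketch with hypothetical names and data, not the course's procedure, and it omits the significance tests and confidence intervals.

```python
import statistics

def simple_linear_regression(x, y):
    """Return (a, b, r_squared) for the least-squares line y = a + b * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = s_xy / s_xx                       # slope
    a = mean_y - b * mean_x               # intercept
    r_squared = s_xy ** 2 / (s_xx * s_yy)  # share of Y variability explained
    return a, b, r_squared

# Hypothetical data.
print(simple_linear_regression([1, 2, 3], [1, 3, 2]))
```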

## Recommended readings

A) Lamberto Soliani (2015) Statistica di base. Piccin, Padova.

B) For nonparametric statistics:

Lamberto Soliani (2018) Statistica non parametrica, classica e moderna. Piccin, Padova.

International reference texts:

- Sokal R. R. and F. J. Rohlf 2012. Biometry: the principles and practice of statistics in biological research. 4th edition. W. H. Freeman and Co.: New York. 937 pp

- Zar Jerrold (2010). Biostatistical Analysis, Fifth Edition. Pearson Education International, New Jersey, 944 pp

International texts freely available online, with topics useful to chemists:

- EPA 530/R-09-007, March 2009, Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities. Unified Guidance, Environmental Protection Agency, United States (pp. 888).

- EM 1110-1-4014, 31 Jan 2008, Environmental Quality - ENVIRONMENTAL STATISTICS, Department of the Army, U. S. Army Corps of Engineers (pp. 544).

## Teaching methods

Lectures; use of international software.

## Assessment methods and criteria

Method of assessing learning

Oral examination with discussion of examples, to verify the learning of the concepts and methods of inferential statistics and the ability to explain the software output and to interpret the results. The grade depends on the extent of the syllabus studied, on the individual topics discussed in class, on the correctness of the hypothesis formulated and of the statistical procedure used, and on the correctness of the conclusions drawn from the test result and of the scientific language used.

## Other information

During the examination period there will be one exam session per week, except in July and August, when there will be two.

For exact and up-to-date information, and to arrange a meeting with the teacher, send an e-mail.

E-mail: lamberto.soliani@unipr.it