Statistical Analysis


Statistics and computing have much in common: both are concerned with making sense of raw data. Historically their emphases differed. Computing concentrated on turning machine-level data that people cannot read into something intelligible to the user, while statistics concentrated on summarising data that was already intelligible in ways that are more accessible and, where possible, more insightful. More recently this distinction has narrowed, and computing has moved beyond low-level data processing towards the kinds of questions that interest the statistician. Terms such as big data, business intelligence and statistical machine learning are now buzzwords of everyday computing.

It is therefore important for the computing professional not only to keep up to date with statistical methods, but also to be able to translate readily between statistical and computational methods. The aim of this blog is to give the reader the tools needed to understand and implement basic statistical methods as computational algorithms.

In this blog we will implement the algorithms in the Go programming language, although they can easily be ported to other languages.
As the focus of this article is mainly on the algorithms, we will give only a quick tour of the statistical concepts and provide references for further reading. The goal is to reinforce the statistical concepts introduced through the examples provided and their algorithmic implementations.

Data analysis

  1. Data
    1. Discrete - values that can be counted exactly.
    2. Continuous - values measured on a continuous scale.
  2. Grouped Data ![[Pasted image 20230802051333.png]]
  3. Frequency (f) - the number of occasions on which each value, or class, occurs.
    1. Relative frequency - each frequency expressed as a percentage or fraction of the total frequency. ![[Pasted image 20230802051438.png]]
  4. Histogram - a graphical representation of a frequency or relative frequency distribution. The frequency of any one class is given by the area of its column. If the class intervals are all the same width, the height of each rectangle indicates the frequency on the vertical scale. ![[Pasted image 20230802051742.png]]
  5. Frequency polygon - the figure formed by joining the centre points of the tops of the rectangles of a frequency histogram with straight lines and extended to include the two zero frequency columns on the sides. ![[Pasted image 20230802051856.png]]
  6. Frequency curve - obtained by ‘smoothing’ the boundary of the frequency polygon, or by plotting centre values and joining with a smooth curve.
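
The definitions above can be sketched in Go as follows. This is a minimal sketch, not a definitive implementation: the function names `Frequencies`, `RelativeFrequencies` and `GroupIntoClasses` and the sample data are assumptions made for illustration, and grouping assumes classes of equal width, so that the counts returned are also the column heights of the histogram.

```go
package main

import (
	"fmt"
	"sort"
)

// Frequencies counts how many times each discrete value occurs.
func Frequencies(values []int) map[int]int {
	freq := make(map[int]int)
	for _, v := range values {
		freq[v]++
	}
	return freq
}

// RelativeFrequencies expresses each frequency as a fraction of the
// total frequency. Multiply by 100 for a percentage.
func RelativeFrequencies(freq map[int]int) map[int]float64 {
	total := 0
	for _, f := range freq {
		total += f
	}
	rel := make(map[int]float64, len(freq))
	if total == 0 {
		return rel
	}
	for v, f := range freq {
		rel[v] = float64(f) / float64(total)
	}
	return rel
}

// GroupIntoClasses counts continuous values into equal-width classes.
// Class i covers [lower+i*width, lower+(i+1)*width). Values outside
// the covered range are ignored (an assumption of this sketch).
func GroupIntoClasses(values []float64, lower, width float64, classes int) []int {
	counts := make([]int, classes)
	if width <= 0 {
		return counts
	}
	for _, v := range values {
		if v < lower {
			continue
		}
		if i := int((v - lower) / width); i < classes {
			counts[i]++
		}
	}
	return counts
}

func main() {
	values := []int{2, 3, 3, 4, 4, 4, 5} // hypothetical discrete data
	freq := Frequencies(values)
	rel := RelativeFrequencies(freq)

	// Map iteration order is random in Go, so sort the values first.
	keys := make([]int, 0, len(freq))
	for v := range freq {
		keys = append(keys, v)
	}
	sort.Ints(keys)
	for _, v := range keys {
		fmt.Printf("value %d: f = %d, relative f = %.3f\n", v, freq[v], rel[v])
	}

	measured := []float64{1.2, 1.9, 2.0, 3.5, 4.1, 9.9} // hypothetical continuous data
	fmt.Println(GroupIntoClasses(measured, 1.0, 1.0, 4))
}
```

A map suits discrete data because the distinct values are not known in advance, whereas grouped data uses a slice indexed by class because the classes are fixed up front.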

Measures of Central tendency

  1. Mean (arithmetic mean) $\bar{x}=\frac{\Sigma xf}{n}=\frac{\Sigma xf}{\Sigma f}$
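  2. Mode - the value of the variable that occurs most often. For a grouped distribution: $\text{mode} = L+\left(\frac{l}{l+u}\right)c$, where $L$ is the lower boundary of the modal class, $l$ is the difference in frequency between the modal class and the class below it, $u$ is the difference in frequency between the modal class and the class above it, and $c$ is the class width. ![[Pasted image 20230802051952.png]]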
  3. Median - the value of the middle term when all values are put in ascending or descending order. With an even number of terms, the median is the average of the two middle terms.
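
The three measures above might be implemented along the following lines. This is a sketch under stated assumptions: the function names and sample data are illustrative, `Mean` assumes the value and frequency slices have the same length, and `GroupedMode` assumes equal-width classes, reads $l$ and $u$ as the frequency differences between the modal class and its lower and upper neighbours, and treats a missing neighbour as having zero frequency.

```go
package main

import (
	"fmt"
	"sort"
)

// Mean returns sum(x*f) / sum(f) for values xs with frequencies fs.
// It assumes len(xs) == len(fs) and returns 0 for an empty data set.
func Mean(xs []float64, fs []int) float64 {
	var sumXF float64
	n := 0
	for i, x := range xs {
		sumXF += x * float64(fs[i])
		n += fs[i]
	}
	if n == 0 {
		return 0
	}
	return sumXF / float64(n)
}

// Median returns the middle term of the sorted values, or the average
// of the two middle terms when the number of terms is even.
func Median(values []float64) float64 {
	n := len(values)
	if n == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...) // leave the input untouched
	sort.Float64s(sorted)
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

// GroupedMode applies mode = L + (l/(l+u))*c to equal-width classes,
// where counts[i] is the frequency of the class starting at
// lower + i*width. On ties the first class with the highest count wins.
func GroupedMode(lower, width float64, counts []int) float64 {
	if len(counts) == 0 {
		return 0
	}
	m := 0 // index of the modal class
	for i, f := range counts {
		if f > counts[m] {
			m = i
		}
	}
	below, above := 0, 0
	if m > 0 {
		below = counts[m-1]
	}
	if m < len(counts)-1 {
		above = counts[m+1]
	}
	l := float64(counts[m] - below)
	u := float64(counts[m] - above)
	L := lower + float64(m)*width
	if l+u == 0 {
		return L // every class is empty, so there is no meaningful mode
	}
	return L + l/(l+u)*width
}

func main() {
	// Hypothetical data: value 1 occurs once, 2 twice, 3 once.
	fmt.Println(Mean([]float64{1, 2, 3}, []int{1, 2, 1}))
	fmt.Println(Median([]float64{4, 1, 3, 2}))
	// Hypothetical classes 10-15, 15-20, 20-25, 25-30, 30-35.
	fmt.Println(GroupedMode(10, 5, []int{3, 7, 12, 8, 2}))
}
```

In the grouped example the modal class is 20-25, so $L = 20$, $l = 12 - 7 = 5$, $u = 12 - 8 = 4$ and $c = 5$, giving a mode of $20 + \frac{5}{9}\times 5 \approx 22.78$.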

Measures of dispersion

  1. Standard Deviation ![[Pasted image 20230802052219.png]]
  2. Normal distribution curve - describes large numbers of observations distributed symmetrically about the mean ![[Pasted image 20230802053001.png]]
    • about 68% of observations lie within $\pm 1$ sd of the mean
    • about 95% of observations lie within $\pm 2$ sd of the mean
    • about 99.7% of observations lie within $\pm 3$ sd of the mean. ![[Pasted image 20230802053030.png]]
  3. Standardised normal curve - the axis of symmetry of the normal curve becomes the vertical axis, which carries a scale of relative frequency. The horizontal axis carries a scale of z-values, expressed as multiples of the standard deviation. The curve therefore represents a distribution with zero mean and unit standard deviation.
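
As a rough sketch of how these measures of dispersion could be computed in Go, the code below calculates a standard deviation, converts values to z-values, and measures the fraction of observations within a given number of standard deviations of the mean. The function names and data are assumptions for illustration. The code also assumes the population form of the standard deviation (dividing by $n$); the sample form divides by $n-1$ instead. Each z-value is taken as the distance of a value from the mean divided by the standard deviation.

```go
package main

import (
	"fmt"
	"math"
)

// Mean returns the arithmetic mean, or 0 for an empty slice.
func Mean(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

// StdDev returns the population standard deviation: the square root
// of the mean squared deviation from the mean (divides by n, not n-1).
func StdDev(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	m := Mean(values)
	sumSq := 0.0
	for _, v := range values {
		d := v - m
		sumSq += d * d
	}
	return math.Sqrt(sumSq / float64(len(values)))
}

// ZScores converts each value to a multiple of the standard deviation
// measured from the mean. If every value is identical (sd == 0) the
// result is all zeros.
func ZScores(values []float64) []float64 {
	z := make([]float64, len(values))
	m, sd := Mean(values), StdDev(values)
	if sd == 0 {
		return z
	}
	for i, v := range values {
		z[i] = (v - m) / sd
	}
	return z
}

// FractionWithin returns the fraction of values lying within k
// standard deviations of the mean, boundaries included.
func FractionWithin(values []float64, k float64) float64 {
	if len(values) == 0 {
		return 0
	}
	m, sd := Mean(values), StdDev(values)
	count := 0
	for _, v := range values {
		if math.Abs(v-m) <= k*sd {
			count++
		}
	}
	return float64(count) / float64(len(values))
}

func main() {
	data := []float64{2, 4, 4, 4, 5, 5, 7, 9} // hypothetical observations
	fmt.Println("mean:", Mean(data))
	fmt.Println("sd:", StdDev(data))
	fmt.Println("z-values:", ZScores(data))
	for k := 1.0; k <= 3; k++ {
		fmt.Printf("within %.0f sd: %.2f\n", k, FractionWithin(data, k))
	}
}
```

For a small data set like this one the fractions will not match 68%, 95% and 99.7%; those figures describe the normal curve itself and are only approached by large, normally distributed samples.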

Examples

![[Pasted image 20230802053101.png]]