[Statistics 101] (3) Summarizing Data

Summarizing Categorical Data

Use pie graphs to show the percentage of individuals falling into each category
Use crosstabs (two-way tables) to display information from two categorical variables at once

[Example] You get the following data from a survey.

The number of individuals who took part in the survey: 1,000
The number of males: 400
The number of females: 600

Age < 30: 550 (250 males and 300 females)
Age >= 30: 450 (150 males and 300 females)
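
The survey counts above can be laid out as a two-way table. Here is a minimal Python sketch, assuming the counts are kept in a nested dictionary (the variable names are only illustrative):

```python
# Counts from the survey example: rows are age groups, columns are sexes.
counts = {
    "Age < 30": {"Male": 250, "Female": 300},
    "Age >= 30": {"Male": 150, "Female": 300},
}

# Row totals (by age group), column totals (by sex), and the grand total.
row_totals = {age: sum(row.values()) for age, row in counts.items()}
col_totals = {sex: sum(row[sex] for row in counts.values()) for sex in ("Male", "Female")}
grand_total = sum(row_totals.values())

# Print the two-way table with totals in the margins.
print(f"{'':<10}{'Male':>8}{'Female':>8}{'Total':>8}")
for age, row in counts.items():
    print(f"{age:<10}{row['Male']:>8}{row['Female']:>8}{row_totals[age]:>8}")
print(f"{'Total':<10}{col_totals['Male']:>8}{col_totals['Female']:>8}{grand_total:>8}")

# Percentage of each sex: the kind of figure a pie graph would display.
shares = {sex: 100 * n / grand_total for sex, n in col_totals.items()}
```

The margins reproduce the totals given in the example (400 males, 600 females, 1,000 respondents).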

Summarizing Numerical Data

Mean (Arithmetic Mean)

Average: obtained by adding all values and dividing by the number of values in the data set.
Outliers can drive the average upward or downward significantly.
The average of an entire population is denoted with the Greek letter μ (mu). – population mean
The average of a sample from the population is denoted with x̄ (the letter x with a bar over it). – sample mean
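
To see how strongly a single outlier can move the mean, here is a small Python sketch. The data values are made up for illustration.

```python
from statistics import mean

values = [10, 12, 11, 13, 12]    # hypothetical data set
with_outlier = values + [100]    # the same data plus one extreme value

mean_plain = mean(values)            # 58 / 5
mean_outlier = mean(with_outlier)    # 158 / 6

print(mean_plain, mean_outlier)
```

One extreme value more than doubles the mean here, even though the other five values are unchanged.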

Median

The numeric value separating the higher half of a sample from the lower half
- Order the numbers from smallest to largest
- If the data set has an odd number of values, choose the one that is exactly in the middle
- If the data set has an even number of values, take the two numbers in the middle and average them to find the median
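
The steps above can be sketched in Python. The function name `median_by_hand` and the sample data are illustrative, and the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_hand(data):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(data)              # order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd count: the single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 1, 5]        # sorted: 1, 5, 7
even_data = [7, 1, 5, 3]    # sorted: 1, 3, 5, 7

print(median_by_hand(odd_data), median(odd_data))
print(median_by_hand(even_data), median(even_data))
```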

Inter-Quartile Range (IQR)

Put the data in ascending order
Divide the data into two groups at the median (low group and high group)
Find the median of the low group (Q1)
Find the median of the high group (Q3)
IQR = Q3 – Q1
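
A minimal Python sketch of this procedure follows. One assumption is not stated in the text: when the number of values is odd, the median itself is left out of both groups. Other conventions exist and can give slightly different quartiles. The function name and data are illustrative.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the low and high groups."""
    ordered = sorted(data)              # ascending order
    n = len(ordered)
    half = n // 2
    low = ordered[:half]
    # Assumption: for an odd count, exclude the median from both groups.
    high = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(low), median(high)

data = [1, 3, 4, 6, 7, 9, 12, 15]       # hypothetical data set
q1, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q3, iqr)
```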

Box and Whiskers

Box: From Q1 to Q3
Whiskers: extend from the box down to at most (Q1 – 1.5*IQR) and up to at most (Q3 + 1.5*IQR)
Values beyond the whiskers are outliers (extreme values)
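
As a rough illustration, the whisker limits and the outliers can be computed like this in Python. The data set is hypothetical and has an even number of values, so it splits cleanly into a low and a high group.

```python
from statistics import median

data = sorted([1, 3, 4, 6, 7, 9, 12, 40])   # hypothetical data, even length
half = len(data) // 2
q1 = median(data[:half])     # median of the low group
q3 = median(data[half:])     # median of the high group
iqr = q3 - q1

# Limits beyond which a value is treated as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(lower_limit, upper_limit, outliers)
```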

Mode

The mode is an element that occurs most frequently.
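
In Python, the standard library can find the mode directly. This short sketch uses made-up data. `multimode` (available from Python 3.8) is shown because a data set can have more than one most frequent value.

```python
from statistics import mode, multimode

colors = ["red", "blue", "red", "green", "blue", "red"]   # hypothetical data
most_common = mode(colors)           # the single most frequent element

tied = multimode([1, 1, 2, 2, 3])    # all elements tied for most frequent

print(most_common, tied)
```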

Standard Deviation

It measures the spread from the Mean.
Roughly, it means the average distance from the average (center).

The standard deviation of an entire population is denoted with the Greek letter σ (sigma). – population standard deviation
The standard deviation of a sample from the population is denoted with the letter s. – sample standard deviation

[Note] When you calculate the sample standard deviation, divide by ‘(n-1)’ instead of ‘n’. This is called Bessel’s correction; it makes the sample variance an unbiased estimate of the population variance.

Interpretations of the standard deviation
- s ≥ 0
- A small standard deviation means that the values are close to the mean of the data set, on average.
- A standard deviation is affected by outliers.
- A standard deviation has the same unit as the original data.
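
The difference between dividing by n and by (n-1) can be sketched in Python as follows. The data set is hypothetical, and `pstdev` / `stdev` from the standard library are used only to cross-check the hand calculation.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data set
n = len(data)
m = sum(data) / n                    # mean of the data
sum_sq = sum((x - m) ** 2 for x in data)   # sum of squared distances from the mean

population_sd = sqrt(sum_sq / n)         # divide by n: treats the data as the whole population
sample_sd = sqrt(sum_sq / (n - 1))       # divide by (n-1): Bessel's correction for a sample

print(population_sd, pstdev(data))
print(sample_sd, stdev(data))
```

The sample version is always a little larger than the population version for the same data, and the gap shrinks as n grows.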

Standard Scores (Z-values)

A Z-value tells you how many standard deviations a value lies from the mean: Z = (value – mean) / standard deviation.
A Z-value of +2 means that a value is two standard deviations above the mean.
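
A Z-value is a one-line calculation. In this Python sketch, the function name and the numbers (a mean of 100 and a standard deviation of 15) are assumptions chosen for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_above = z_score(130, 100, 15)   # 30 above the mean, sd 15
z_below = z_score(85, 100, 15)    # 15 below the mean, sd 15

print(z_above, z_below)
```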

Empirical Rule

For nearly symmetric, mound-shaped (bell-shaped) data sets,
- About 68% of the data lie within ONE standard deviation of the mean
- About 95% of the data lie within TWO standard deviations of the mean
- About 99.7% of the data lie within THREE standard deviations of the mean
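
These percentages can be checked against an exactly normal distribution. The Python sketch below assumes a standard normal curve (mean 0, standard deviation 1). Real data that are only roughly bell-shaped will match these figures only approximately.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_one = share_within(1)    # roughly 0.68
within_two = share_within(2)    # roughly 0.95

print(round(within_one, 4), round(within_two, 4))
```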

Degrees of Freedom (df)

The degrees of freedom (df) of an estimate is the number of independent pieces of information on which the estimate is based.

Suppose you want to check the height of Martians and you have met only 2 of them so far. Their heights are 190 cm and 210 cm.

You can get the mean of the sample: x̄ = (190 + 210)/2 = 200 [cm]

To get the standard deviation, you need to calculate the squared distance between each sample value and the sample mean.

d1 = (190-200)^2 = 100
d2 = (210-200)^2 = 100

Are the two values (190 and 210) independent of each other when calculating the standard deviation?

The answer is NO, because both of them were used to calculate the mean (x̄). If you know one value and the mean, you can work out the other value.

In general, when the sample size is ‘n’ and you know the mean of the sample, you do not need to know all n values. If you know (n-1) of them, you can calculate the last value. In short, when calculating the standard deviation, only (n-1) values are independent, so the sample has (n-1) degrees of freedom.

That’s the reason why (n-1) is used to calculate the sample standard deviation.

In general, the degrees of freedom equal the number of values minus the number of parameters estimated from those values.

For example, the sample standard deviation is defined as

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$

It uses ‘n’ values and 1 estimated parameter (the sample mean x̄). Therefore, the degrees of freedom are (n-1).
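
The Martian example can be worked through in Python as a quick check. The variable names are illustrative, and `statistics.stdev` is used only to confirm the hand calculation.

```python
from statistics import mean, stdev

heights = [190, 210]                 # the two Martian heights in cm
n = len(heights)
x_bar = mean(heights)                # sample mean: 200

deviations = [h - x_bar for h in heights]    # -10 and +10
squared = [d ** 2 for d in deviations]       # d1 = 100, d2 = 100

# Deviations from the sample mean always sum to zero, so once (n-1) of them
# are known the last one is fixed: only (n-1) are free to vary.
last_deviation = -sum(deviations[:-1])

s = (sum(squared) / (n - 1)) ** 0.5          # sample standard deviation

print(x_bar, squared, s)
```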

Other Types of Means

Geometric Mean

Geometric Mean is used to describe compound proportional growth.

Suppose you invest $1000 that yields 10% the first year and 20% the second year. After one year, you will have $1100 (1000*1.1), and after two years, you will have $1320 (1100*1.2).

What is the average rate of return per year?

The arithmetic mean is (10 + 20)/2 = 15%. But growing at 15% for 2 years gives (1000*1.15)*1.15 = $1322.5, not $1320, so the arithmetic mean overstates the return.

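The geometric mean of n values x_1, ..., x_n is calculated like this:

$$ \mathrm{GM} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} $$

The geometric mean of the example, taken over the yearly growth factors 1.1 and 1.2, is

$$ \mathrm{GM} = \sqrt{1.1 \times 1.2} = \sqrt{1.32} \approx 1.1489 $$

That is an average return of about 14.89% per year.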

Using the geometric mean, the result is (1000*1.1489)*1.1489 = $1320 (after rounding), which matches the actual outcome.
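
The investment example can be reproduced in Python. This is a small sketch in which the growth factors 1.10 and 1.20 stand for the 10% and 20% yields. `math.prod` and `statistics.geometric_mean` require Python 3.8 or later.

```python
from math import prod
from statistics import geometric_mean

growth = [1.10, 1.20]                # yearly growth factors for 10% and 20%
years = len(growth)

gm = prod(growth) ** (1 / years)     # n-th root of the product
final_geometric = 1000 * gm ** years         # grow $1000 at the geometric-mean rate

am = sum(growth) / years             # arithmetic mean of the factors: 1.15
final_arithmetic = 1000 * am ** years        # overstates the true result

print(round(gm, 4), round(final_geometric, 2), round(final_arithmetic, 2))
```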

Harmonic Mean

Harmonic Mean is used to describe the average of rates.

Suppose you are traveling between A and B. The distance between A and B is 120 km. You traveled from A to B at 60km/h and traveled back from B to A at 40km/h. What is the average speed of the round trip?

The arithmetic mean is 50 km/h. But the situation is more complex. It took 2 hours from A to B and 3 hours from B to A. Therefore, you traveled 240km in 5 hours. The average speed should be 48km/h (=240/5), not 50km/h.

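The harmonic mean of n values x_1, ..., x_n is calculated like this:

$$ \mathrm{HM} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} $$

The harmonic mean of the example is

$$ \mathrm{HM} = \frac{2}{\frac{1}{60} + \frac{1}{40}} = 48 \text{ km/h} $$

This matches the average speed computed above, because both legs of the trip cover the same distance.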
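
The round-trip example can be checked in Python. In this sketch the variable names are illustrative, and `statistics.harmonic_mean` is used only as a cross-check.

```python
from statistics import harmonic_mean

speeds = [60, 40]        # km/h for A to B and for B to A
distance = 120           # km for each leg

# Harmonic mean: number of values divided by the sum of their reciprocals.
hm = len(speeds) / sum(1 / v for v in speeds)

# Direct calculation: total distance divided by total time.
total_time = sum(distance / v for v in speeds)       # 2 h + 3 h
average_speed = (distance * len(speeds)) / total_time

print(hm, average_speed, harmonic_mean(speeds))
```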
