[Statistics 101] (3) Summarizing Data

Summarizing Categorical Data

  • The percentage of individuals falling into each category: Pie Graphs
  • Use crosstabs (two-way tables) to display information from 2 categorical variables at once

[Example] You get the following data from a survey.

The number of individuals who took part in the survey: 1,000
The number of males: 400
The number of females: 600

Age < 30: 550 (250 males and 300 females)
Age >= 30: 450 (150 males and 300 females)
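
As a rough illustration, the survey counts above can be arranged into a two-way table with a few lines of Python. The helper name `two_way_table` and the dictionary layout are assumptions made for this sketch; only the counts come from the example.

```python
# Cell counts taken from the survey example above: (sex, age group) -> count.
counts = {
    ("Male", "Age < 30"): 250,
    ("Male", "Age >= 30"): 150,
    ("Female", "Age < 30"): 300,
    ("Female", "Age >= 30"): 300,
}


def two_way_table(cells):
    """Return (row totals, column totals, grand total) for a two-way table."""
    row_totals, col_totals = {}, {}
    for (row, col), n in cells.items():
        row_totals[row] = row_totals.get(row, 0) + n
        col_totals[col] = col_totals.get(col, 0) + n
    return row_totals, col_totals, sum(cells.values())


row_totals, col_totals, total = two_way_table(counts)
print(row_totals)  # {'Male': 400, 'Female': 600}
print(col_totals)  # {'Age < 30': 550, 'Age >= 30': 450}
print(total)       # 1000

# Percentage of individuals falling into each category (what a pie graph shows).
for sex, n in row_totals.items():
    print(f"{sex}: {100 * n / total:.1f}%")  # Male: 40.0%, Female: 60.0%
```

The row and column totals reproduce the figures listed in the example (400 males, 600 females, 550 under 30, 450 aged 30 or over).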

Summarizing Numerical Data

Mean (Arithmetic Mean)

  • Average: obtained by adding all values and dividing by the number of values in the data set.
  • Outliers can drive the average upward or downward significantly.
  • The average of an entire population is denoted with the Greek letter μ (mu). – population mean
  • The average of a sample from the population is denoted with x̄ (the letter x with a bar over it). – sample mean
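
A small sketch of how a single extreme value pulls the mean, using Python's standard `statistics` module. The numbers are made up for illustration and are not from any real data set.

```python
from statistics import mean

values = [10, 12, 11, 13, 12]   # hypothetical data
with_outlier = values + [100]   # the same data plus one extreme value

print(mean(values))                   # 11.6
print(round(mean(with_outlier), 2))   # 26.33
```

One extreme value more than doubles the average here, even though five of the six values are close to 12.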

Median

  • The numeric value separating the higher half of a sample from the lower half
    • Order the numbers from smallest to largest
    • If the data set has an odd number of values, choose the one that is exactly in the middle
    • If the data set has an even number of values, take the two numbers in the middle and average them to find the median
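
The steps above can be sketched in Python as follows. The function name `median` and the sample lists are assumptions for illustration; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Median by sorting, then taking the middle value (or the mean of the two middle values)."""
    ordered = sorted(values)          # order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the value exactly in the middle
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values


print(median([5, 1, 3]))      # 3
print(median([4, 1, 3, 2]))   # 2.5
```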

Inter-Quartile Range (IQR)

  • Put the data in ascending order
  • Divide the data into 2 groups at the median (low group & high group)
  • Find the median of the low group (Q1)
  • Find the median of the high group (Q3)
  • IQR = Q3 – Q1
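
A minimal sketch of this procedure in Python. One detail is an assumption, since the steps above do not specify it: when the number of values is odd, the median itself is left out of both groups. Other quartile conventions exist and can give slightly different results. The data are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) using the 'median of each half' method."""
    ordered = sorted(values)                 # put the data in ascending order
    n = len(ordered)
    half = n // 2
    low = ordered[:half]                     # low group
    # Assumption: for an odd count, skip the median value itself.
    high = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    q1 = median_of_sorted(low)
    q3 = median_of_sorted(high)
    return q1, q3, q3 - q1


print(quartiles_and_iqr([1, 3, 5, 7, 9, 11, 13, 15]))  # (4.0, 12.0, 8.0)
print(quartiles_and_iqr([1, 2, 3, 4, 5, 6, 7]))        # (2, 6, 4)
```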

Box and Whiskers

  • Box: From Q1 to Q3
  • Whiskers: (Q1 – 1.5*IQR) ~ Q1 and Q3 ~ (Q3 + 1.5*IQR)
  • Values outside of the box and whiskers are outliers (extreme values)
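
The 1.5*IQR rule can be sketched as below. The function names and the data are assumptions for illustration, and Q1 and Q3 are passed in directly rather than computed.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Values lying outside the box-and-whiskers range."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


data = [1, 3, 5, 7, 9, 11, 13, 40]   # hypothetical data with Q1 = 4 and Q3 = 12
print(outlier_fences(4, 12))          # (-8.0, 24.0)
print(find_outliers(data, 4, 12))     # [40]
```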

Mode

  • The mode is the value that occurs most frequently in the data set.
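
For illustration, Python's standard `statistics` module provides `mode` and, for data with more than one most-frequent value, `multimode` (Python 3.8 or later). The sample data are made up.

```python
from statistics import mode, multimode

numbers = [1, 2, 2, 3]
colors = ["red", "blue", "blue", "red", "green"]   # two values tie for most frequent

print(mode(numbers))       # 2
print(multimode(colors))   # ['red', 'blue']
```

The mode also works for categorical data such as the colors here, where a mean or median would not be defined.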

Standard Deviation

  • It measures the spread from the Mean.
  • Roughly, it means the average distance from the average (center).
  • The standard deviation of an entire population is denoted with the Greek letter σ (sigma). – population standard deviation
  • The standard deviation of a sample from the population is denoted with the letter s. – sample standard deviation
  • [Note] When you calculate the sample standard deviation, divide by ‘(n-1)’ instead of ‘n’. This is called Bessel’s correction; it makes the sample variance an unbiased estimate of the population variance.
  • Interpretations of the standard deviation
    • s ≥ 0
    • A small standard deviation means that the values are close to the middle of the data set, on average.
    • A standard deviation is affected by outliers.
    • A standard deviation has the same unit as the original data.
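
A short sketch contrasting the population formula (divide by n) with the sample formula (divide by n-1). The function names and the data are assumptions for illustration; the results are checked against `statistics.pstdev` and `statistics.stdev` from the standard library.

```python
import math
import statistics


def population_sd(values):
    """Population standard deviation (sigma): divide by n."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def sample_sd(values):
    """Sample standard deviation (s): divide by n - 1 (Bessel's correction)."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical data, mean = 5
print(population_sd(data))                # 2.0
print(round(sample_sd(data), 3))          # 2.138

# The standard library agrees with both hand-written versions.
print(math.isclose(population_sd(data), statistics.pstdev(data)))  # True
print(math.isclose(sample_sd(data), statistics.stdev(data)))       # True
```

Dividing by n-1 always gives a slightly larger value than dividing by n, and the difference shrinks as the sample grows.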

Standard Scores (Z-values)

  • A standard score tells you how many standard deviations a value lies from the mean: Z = (value – mean) / standard deviation.
  • A Z-value of +2 means that a value is two standard deviations above the mean.
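
As a quick sketch, a Z-value is the distance from the mean divided by the standard deviation. The function name `z_score` and the numbers (mean 100, standard deviation 15) are hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd


print(z_score(130, 100, 15))   # 2.0  -> two standard deviations above the mean
print(z_score(85, 100, 15))    # -1.0 -> one standard deviation below the mean
```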

Empirical Rule

  • For nearly symmetric, mound-shaped (bell-shaped) data sets,
    • About 68% of the data lie within ONE standard deviation of the mean
    • About 95% of the data lie within TWO standard deviations of the mean
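
The rule can be checked informally with simulated bell-shaped data. This is only a sketch: the data are random draws from a normal distribution with an arbitrary mean of 170 and standard deviation of 10 (both assumptions), so the shares come out close to, not exactly, 68% and 95%.

```python
import random
import statistics

random.seed(0)                                   # fixed seed so the run is repeatable
data = [random.gauss(170, 10) for _ in range(20_000)]

x_bar = statistics.fmean(data)                   # sample mean
s = statistics.stdev(data)                       # sample standard deviation


def share_within(values, center, spread, k):
    """Fraction of values lying within k * spread of the center."""
    return sum(abs(v - center) <= k * spread for v in values) / len(values)


within_one = share_within(data, x_bar, s, 1)
within_two = share_within(data, x_bar, s, 2)
print(f"within 1 sd: {within_one:.1%}")          # roughly 68%
print(f"within 2 sd: {within_two:.1%}")          # roughly 95%
```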

Degrees of Freedom (df)

The degrees of freedom (df) of an estimate is the number of independent pieces of information on which the estimate is based.

Suppose you want to check the height of Martians and you have met only 2 of them so far. Their heights are 190 cm and 210 cm.

You can get the mean of the sample: x̄ = (190 + 210) / 2 = 200 [cm]

To get the standard deviation, you need to calculate the squared distance between each sample value and the sample mean.

d1 = (190-200)^2 = 100
d2 = (210-200)^2 = 100

Are the two values (190 and 210) independent of each other when calculating the standard deviation?

The answer is NO, because both of them were used to calculate the mean (x̄). If you know one value and the mean (x̄), you can work out the other value.

In general, when the sample size is ‘n’ and you know the mean of the sample, you do not need to know all n values. If you know (n-1) of them, you can calculate the last one. In short, when calculating the standard deviation, only (n-1) values are independent, so the sample has (n-1) degrees of freedom.

That’s the reason why (n-1) is used to calculate the standard deviation.
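
The Martian example can be worked through in a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen for illustration.

```python
import math

heights = [190, 210]                      # the two Martians, in cm
n = len(heights)
x_bar = sum(heights) / n                  # 200.0

# Once the mean is known, the last value is fixed by the other n - 1 values.
known = heights[:-1]                      # pretend only the first n - 1 values are known
recovered_last = n * x_bar - sum(known)
print(recovered_last)                     # 210.0

# The deviations from the mean always sum to zero, so only n - 1 of them are free.
deviations = [h - x_bar for h in heights]
print(deviations, sum(deviations))        # [-10.0, 10.0] 0.0

# Sample standard deviation: divide the sum of squared deviations by n - 1.
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)   # (100 + 100) / 1 = 200.0
sample_sd = math.sqrt(sample_variance)
print(round(sample_sd, 2))                # 14.14
```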

In general, the degrees of freedom equal the number of values minus the number of parameters that had to be estimated from those values.

For example, the sample standard deviation is defined as

s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

It uses ‘n’ values and 1 estimated parameter (the sample mean x̄). Therefore, the degrees of freedom are (n-1).
