### Statistics: A Very Short Introduction

David J. Hand, Oxford University Press 2008

**Surrounded by statistics**

**Simple Descriptions**

**Collecting good data**

**Probability**

**Estimation and inference**

**Statistical models and methods**

**Statistical computing**

**Notable People:** John Chambers, Blaise Pascal, Pierre de Fermat, Christiaan Huygens, Jacob Bernoulli, Pierre-Simon Laplace, Abraham de Moivre, Siméon Denis Poisson, Antoine Cournot, John Venn, Andrei Kolmogorov (Kolmogorov's axioms)

**Terms:**

Statistics - the technology of extracting meaning from data, for handling uncertainty, predicting the future, making inferences about the unknown, and summarizing data.

Technology - application of science and its discoveries.

Data - usually numbers (results, measurements, counts, or the output of other processes) that carry meaning.

Goodhart's law - "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes..." because the system evolves to meet the target. Ex. measuring programmers' efficiency in lines of code (and sometimes speed) leads to over-abstraction, complexity, and bugs, often at the cost of project failures and of expensive changes late in development.

Prosecutor's Fallacy - a probability taken out of context (limited information, or emphasis on the wrong statistic) can lead to false conclusions. The chance that a random person has all of the criminal's properties (limp, tall, wearing red, white) is very low, so a suspect who matches is assumed to be the criminal. But in a large population several people will match: if, say, 20 people fit the description, the statistic for any one of them is 1/20 for, 19/20 against.
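The arithmetic behind the 1/20 figure can be sketched as follows. The population size and the match probability are illustrative assumptions, not numbers from the book; they are chosen so that about 20 people fit the description.

```python
from fractions import Fraction

# Assumed, illustrative numbers: they are not taken from the book.
population = 2_000_000            # people who could have committed the crime
p_match = Fraction(1, 100_000)    # chance a random person fits the description

# Expected number of people in the population who fit the description.
expected_matches = population * p_match

# If exactly one of the matching people is the criminal and nothing else
# distinguishes them, each matching person is the criminal with this probability.
p_guilty_given_match = 1 / expected_matches

print("P(match | random person) =", p_match)
print("expected matches         =", expected_matches)
print("P(guilty | match)        =", p_guilty_given_match)
```

The fallacy is to quote the first number (tiny) as if it were the complement of the last one (which is far from certainty).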

Publication bias - publishing only research with significant findings (as opposed to also publishing that nothing significant was found). A "significant finding" may have been reached by running a study N times and reporting only the Nth run in which it occurred, disregarding all the runs that went against it.
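A rough sketch of why repeating a study N times matters. It assumes a 5% significance threshold (a common convention, not a number from these notes) and independent studies of an effect that does not exist; the function name is hypothetical.

```python
# Assumed significance threshold; the 5% level is a convention,
# not a figure taken from the notes above.
ALPHA = 0.05

def chance_of_false_finding(n_studies: int, alpha: float = ALPHA) -> float:
    """Probability that at least one of n independent studies of a
    non-existent effect comes out 'significant' purely by chance."""
    return 1 - (1 - alpha) ** n_studies

for n in (1, 5, 20):
    print(n, round(chance_of_false_finding(n), 3))
```

Under these assumptions, the more attempts go unreported, the more likely it is that the one published "finding" is a fluke.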

Variable - a characteristic of the objects under study.

Value - the quantity recorded for a variable on a particular object.

Ordinal - concerned only with the order of the values.

Magnitude - concerned with the size of the quantity.

Ratio - comparing the amount of one thing with the amount of another (usually a standard unit).

Absolute scale - treating units as well-defined discrete values.

Distribution - how the values are spread; summary statistics characterize the distribution: which values are similar, large, small, or typical.

Average - a value representative of the numbers in the set.

Arithmetic mean - a statistic; the sum of the values divided by the number of values.

Median - the centre value, where half the numbers are larger and half are smaller.

Mode - the value that is most frequent in the sample.
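A minimal sketch of the three averages, using Python's standard `statistics` module on a made-up sample (the numbers are hypothetical):

```python
import statistics

# Hypothetical sample, made up for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of the values divided by how many there are
median = statistics.median(values)  # middle of the sorted values (average of the two middle ones here)
mode = statistics.mode(values)      # most frequent value

print("mean:", mean, "median:", median, "mode:", mode)
```

The three summaries disagree even on this small sample, which is why it matters which "average" is being quoted.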

Dispersion - how widely the data is spread around the average: how many values are larger or smaller, how tightly they bunch, and how different the values are from each other.

Range - a measure of dispersion as the difference between the smallest and largest values.

Variance/mean squared deviation - the mean of the squared differences between each value and the mean (squaring makes every difference positive, so that positive and negative differences don't cancel each other out).

Standard deviation - the square root of the mean squared deviation, which brings the measure back to the original units.
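The three dispersion measures can be worked through step by step. This sketch uses a hypothetical sample and divides by the number of values (the mean squared deviation as defined above); sample variance as often reported by software divides by one less than that.

```python
import math

# Hypothetical sample, made up for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Squaring keeps positive and negative deviations from cancelling out.
squared_deviations = [(v - mean) ** 2 for v in values]

variance = sum(squared_deviations) / len(values)  # mean squared deviation
std_dev = math.sqrt(variance)                     # back in the original units
value_range = max(values) - min(values)

print("range:", value_range, "variance:", variance, "std dev:", std_dev)
```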

Skewness - measures the asymmetry of the distribution of values (a longer right tail or a longer left tail).

Quantiles - values that divide the distribution into parts.

Quartiles - the lower and upper quartiles mark off the bottom and top 25% of the values.

Deciles - dividing the set into tenths.

Percentiles - dividing the set into one-hundredths.
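As a sketch, `statistics.quantiles` (Python 3.8+) returns the cut points for any number of equal parts. The data here is hypothetical, and the function's default interpolation method is only one of several conventions for computing quantiles.

```python
import statistics

# Hypothetical data: the integers 1..99.
values = list(range(1, 100))

quartiles = statistics.quantiles(values, n=4)      # 3 cut points -> 4 parts
deciles = statistics.quantiles(values, n=10)       # 9 cut points -> 10 parts
percentiles = statistics.quantiles(values, n=100)  # 99 cut points -> 100 parts

print("quartiles:", quartiles)
print("deciles:", deciles)
print("number of percentile cut points:", len(percentiles))
```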

Selection bias - bias introduced when the people, groups, or data are not selected at random, so the sample does not represent the population.

Incomplete data - data missing for unrelated reasons, through selection bias, or because of the nature of the questions; it impedes showing clear relationships.

Incorrect data - definitions may be ambiguous, as may classifications from multiple sources. Other causes include limited measurement precision, rounding errors, recording errors, arithmetic errors, inconsistent standards, and mixed units of measurement.

Error propagation - a single wrong value can have an outsized effect on the outcome, especially in a long processing chain. Even harder to detect are systems where multiple errors cancel each other out, creating a deceptively acceptable result from incorrect internal calculations.

Outlier - a value very different from the others, arising by chance or from some error in collection.

Observational data - the data capture process cannot be manipulated, ex. recording galaxy data. Good for observing associations, but weak evidence of causation.

Experimental data - conditions are purposely manipulated or controlled to try different settings, ex. the effectiveness of a stimulant at different levels of tiredness. Good for observing causation when subjects are randomly allocated to conditions, which gives greater confidence in the observed results.

Law of large numbers - the more data that is collected, the closer the sample average tends to be to the true value; greater accuracy, but at a cost.
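A small simulation can illustrate the idea. It assumes a fair six-sided die (true mean 3.5) and a fixed random seed so the run is repeatable; the helper name is hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def mean_of_rolls(n: int) -> float:
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

# The true mean of a fair die is 3.5; larger samples tend to land closer to it.
for n in (10, 1_000, 100_000):
    print(n, mean_of_rolls(n))
```

The "cost" side of the definition shows up as the hundredfold jumps in sample size needed for each visible gain in accuracy.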

Experimental design - create groups of people who are as similar as possible via randomization, then A/B test, with one group getting A and the other getting B. A measure of how imbalanced the groups could be by chance tells us how much confidence to place in the conclusions.

Double blind - concealing the random allocation to eliminate subconscious bias, so that neither patient nor experimenter knows which treatment the patient is getting.

Factorial experimental design - testing combinations of the levels of several factors in the same experiment, so that the most information is obtained from limited resources.
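The matrix of experimental combinations in a full factorial design can be enumerated with `itertools.product`. The factors and levels below are hypothetical, chosen only to illustrate the idea.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration.
factors = {
    "dose": ["low", "high"],
    "tiredness": ["rested", "tired"],
    "time_of_day": ["morning", "evening"],
}

# Every combination of levels is one experimental condition.
conditions = [
    dict(zip(factors, levels)) for levels in product(*factors.values())
]

print(len(conditions), "conditions")
for condition in conditions:
    print(condition)
```

Three two-level factors give 2 x 2 x 2 = 8 conditions, and every subject contributes information about all three factors at once.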

Representative - a smaller sample which has the same proportions as the larger population.

Survey sampling - analyzing a representative sample instead of the whole population for practicality's sake: lower cost, less time spent, and less change over the time the survey takes.

Randomly allocating and sampling - allocating is assigning people to a group; sampling is drawing people from a population.

Sampling frame - a list of the population from which we can randomly choose; without one, bias creeps in.

Simple random sampling - choosing people at random from a sampling frame.

Cluster sampling - sampling smaller groupings based on a factor like location.

Stratum - a non-overlapping subgroup.

Strata - plural of stratum; two or more subgroups.

Stratified sampling - dividing the population in the sampling frame into strata and sampling from each, while maintaining proportionality.
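A minimal sketch contrasting simple random sampling with proportional stratified sampling. The sampling frame, the strata names, and both helper functions are hypothetical.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: (person id, stratum) pairs, 70% urban, 30% rural.
frame = [(i, "urban") for i in range(700)] + [(i, "rural") for i in range(700, 1000)]

def simple_random_sample(frame, k):
    """Every unit in the frame is equally likely to be chosen."""
    return rng.sample(frame, k)

def stratified_sample(frame, k):
    """Proportional allocation: each stratum gets its population share of k.
    Rounding may make the total differ slightly from k in general."""
    strata = {}
    for unit in frame:
        strata.setdefault(unit[1], []).append(unit)
    sample = []
    for members in strata.values():
        share = round(k * len(members) / len(frame))
        sample.extend(rng.sample(members, share))
    return sample

srs = simple_random_sample(frame, 100)
strat = stratified_sample(frame, 100)

print("simple random:", Counter(stratum for _, stratum in srs))
print("stratified:   ", Counter(stratum for _, stratum in strat))
```

The simple random sample matches the 70/30 split only approximately, while the stratified sample matches it by construction.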

Central Limit Theorem - when independent random variables are added, their properly normalized sum tends toward a normal distribution (a bell curve).
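A simulation sketch of the theorem, assuming sums of uniform random numbers (which look nothing like a bell curve individually). The counts and the seed are arbitrary choices.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_terms = 30      # how many uniform variables are added together
n_sums = 20_000   # how many sums are simulated

# A uniform(0, 1) variable has mean 1/2 and variance 1/12,
# so the sum of n_terms of them has mean n_terms/2 and variance n_terms/12.
mu = n_terms * 0.5
sigma = math.sqrt(n_terms / 12)

# "Properly normalized": subtract the mean and divide by the standard deviation.
standardized = [
    (sum(rng.random() for _ in range(n_terms)) - mu) / sigma
    for _ in range(n_sums)
]

# For a normal distribution about 68% of values lie within one standard deviation.
within_one_sd = sum(abs(z) < 1 for z in standardized) / n_sums
print("fraction within one standard deviation:", within_one_sd)
```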

Probability calculus - assigning numbers between 0 and 1 to uncertain events to represent their probability.

Degree of belief - a person's subjective probability depends on the information they have about the event, so it differs between people and changes as more information becomes available.

Frequentist - the probability of an event is the proportion of times the event would happen if identical circumstances were repeated an infinite number of times.

Classical probability - the number of favourable outcomes as a proportion of all possible outcomes, with no outcome more likely than another; the probability of an event is the sum of the probabilities of the elementary outcomes it is composed of.

Independence - the occurrence of one does not affect the probability of the other.

Dependent - the probability that one will occur depends on whether the other has.

Completely dependent - when one event determines the outcome of another, ex. heads is up, thus tails is down.

Joint probability - the probability that two events will both occur.

Conditional probability - the probability that an event will occur given that another has; closely related to the joint probability of the two events.

Bayes's theorem - for rewriting joint and conditional probabilities: the probability of A times the probability of B given A equals the probability of B times the probability of A given B. Both are equal to the joint probability of A and B.
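These relationships can be checked exactly on a fair die. The events A ("even number") and B ("greater than 3") are hypothetical choices for illustration, and `prob` is a helper defined only for this sketch.

```python
from fractions import Fraction

outcomes = range(1, 7)                    # faces of a fair die
A = {o for o in outcomes if o % 2 == 0}   # event A: an even number
B = {o for o in outcomes if o > 3}        # event B: greater than 3

def prob(event):
    """Classical probability: favourable outcomes over all six outcomes."""
    return Fraction(len(event), 6)

p_a, p_b = prob(A), prob(B)
p_joint = prob(A & B)         # P(A and B)
p_b_given_a = p_joint / p_a   # P(B | A)
p_a_given_b = p_joint / p_b   # P(A | B)

# Both products recover the joint probability.
assert p_a * p_b_given_a == p_b * p_a_given_b == p_joint

# Bayes's theorem: P(A | B) = P(B | A) * P(A) / P(B)
assert p_a_given_b == p_b_given_a * p_a / p_b

print("P(A and B) =", p_joint, " P(B|A) =", p_b_given_a, " P(A|B) =", p_a_given_b)
```

Using `Fraction` keeps the arithmetic exact, so the equalities hold without floating-point tolerance.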

Sample - the values of some of the objects under study, a subset of the population of values.

Random variable - a variable whose outcome cannot be predicted with certainty.

Cumulative probability distribution - tells us the probability of drawing a value lower than one we choose. The probability of finding a value less than X gets larger the larger X is.

Probability density - a curve for which the area under it between two values gives the probability of drawing a value in that range.

Bernoulli distribution - two possible values, one with probability p and the other with probability 1-p; one of them is certain to come up.

Binomial distribution - extends the Bernoulli distribution: the number of successes in a sequence of n independent experiments, each with a yes/no result.
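A minimal sketch of the two distributions and of a cumulative probability, using the standard binomial formula. The function names are hypothetical, and `math.comb` needs Python 3.8+. Note that the cumulative function here uses "at most k", a common convention, rather than "strictly lower than".

```python
from math import comb

def bernoulli_pmf(x: int, p: float) -> float:
    """Probability of outcome x (1 = success, 0 = failure)."""
    return p if x == 1 else 1 - p

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Cumulative probability of at most k successes; grows as k grows."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

# Ten tosses of a fair coin: chance of each possible number of heads.
for k in range(11):
    print(k, round(binomial_pmf(k, 10, 0.5), 4), round(binomial_cdf(k, 10, 0.5), 4))
```

With n = 1 the binomial distribution reduces to the Bernoulli distribution.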

**Briefs:**

Statistics does not require extensive, tedious arithmetic manipulation. Software has transformed the discipline, aiding perception and providing new instruments for monitoring and guiding, systems for decision making, and insight, analysis, and understanding of structure and patterns in the data.

Statistical applications: decision making, forecasting, real-time monitoring, fraud detection, census enumeration, analysis of gene sequences.

Poor material to work with yields poor results.

Often raw data is not numbers from a human perspective but physical phenomena; however, it can be measured, recorded, and translated into numerical representations.

How statistics are presented can differ by which summaries are emphasized. Sometimes data may be inadequate to answer a particular question. Statistics can also be gamed by focusing effort on improving the value of one measure at the expense of others (shoplifting down due to increased spending on security, at the expense of a rise in personal theft).

New discoveries may contradict earlier conclusions within a short period of time, and this complication of the topic leads to mistrust.

Elementary misunderstandings of basic statistics can misinform people.

Data is evidence: it provides grounding, links ideas to reality, and lets us validate and test our understanding. However, a poor match between data and understanding can be a consequence of poor data quality.

We can use data to guide us, and use statistical methods to extract information from data to influence the behaviour of a system toward a favourable outcome.

Every advanced country has its own national statistical office.

Greater statistics is everything related to learning from data, at every step, regardless of the data-analytic discipline.

Statistics is not about calculation but rather investigation.

Data usually has objects of study, and characteristics of those objects.

Ambiguity may be removed by expressing things in numerical form, since numbers are universally understood.

Data is nature's evidence, seen through the lens of the measuring instrument.

Extreme values may alter our statistics disproportionately, depending on which summary is used.
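A quick sketch of that sensitivity, comparing the mean and the median on a hypothetical set of survey ages with and without one mistyped value:

```python
import statistics

# Hypothetical survey answers, made up for illustration.
ages = [19, 20, 20, 21, 22]
with_error = ages + [210]  # one mistyped age

mean_clean = statistics.mean(ages)
mean_error = statistics.mean(with_error)
median_clean = statistics.median(ages)
median_error = statistics.median(with_error)

print("mean:  ", mean_clean, "->", mean_error)
print("median:", median_clean, "->", median_error)
```

A single extreme value drags the mean far from every real respondent, while the median barely moves.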

Discarding incomplete records leads to inaccuracy, and making data up based on averages leads to oversimplification; instead, the statistical model must account for the probability of data being missing. The data collection phase must therefore do its best to minimize the problem.

Intelligent guesses may fix problems like an age of 210 in a university age survey.

Computers can automate detection and correction, provided the checks are programmed correctly. Cross-checks can be programmed to verify that the data entered is correct.
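One way such checks might look, as a sketch only: a range check plus a cross-check between two fields. The field names, the plausibility limits, and the `check_record` helper are all hypothetical.

```python
# Hypothetical plausibility limits for ages in a university survey.
MIN_AGE, MAX_AGE = 15, 110

def check_record(record: dict, survey_year: int) -> list:
    """Return a list of problems found in one survey record."""
    problems = []
    age = record["age"]

    # Range check: flag values that cannot plausibly be right.
    if not MIN_AGE <= age <= MAX_AGE:
        problems.append(f"age {age} outside {MIN_AGE}-{MAX_AGE}")

    # Cross-check: age should agree with birth year to within one year.
    if abs((survey_year - record["birth_year"]) - age) > 1:
        problems.append("age does not match birth year")

    return problems

print(check_record({"age": 21, "birth_year": 1987}, survey_year=2008))
print(check_record({"age": 210, "birth_year": 1987}, survey_year=2008))
```

When a cross-check fails, the second field also suggests a correction (here, that 210 was probably meant to be 21), which is the kind of intelligent guess a program can propose but a person should confirm.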

Subjective approaches are useful for assigning probabilities to unique events; probability is then no longer an objective property of the external world but rather a property of the interaction between observer and environment.

The classical approach is to compose the probability from equally likely elementary events, dealing with known certainties, like a die.

All approaches conform to the same axioms but offer a different mapping to the real world; the calculus is the same but the interpretation is different, which may lead to different conclusions.

The joint probability of A and B is the probability of A times the conditional probability of B given A.

Because a die cannot show 2, 4, and 6 at the same time, the probability of an even number is the sum of the probabilities of 2, 4, and 6. Likewise, the probability of heads or tails is the sum of the probability of heads and the probability of tails.

Certain shapes of probability distribution curve arise again and again in natural phenomena.

Random variables and their distributions

Chapter 4, Probability, page 69: Poisson distribution
