Course 25: Introduction to Statistics for Data Analysts

Oct 4, 2021
Statistics is an extremely powerful tool for a data analyst or data scientist. Broadly speaking, statistics is the use of mathematics to perform technical analysis of data. A basic visualization such as a pie chart can give you high-level insights, but with statistics it is possible to examine data in a much more focused, information-driven way. Statistics helps us reach concrete conclusions about our data.

By using statistics, we can get more detailed information about the exact structure of our data and how we can optimally apply other data science techniques to gain more information. In this article, we’ll look at 5 statistical concepts you need to know and how they can be applied most effectively.

1. Origin

The word “statistics” goes back to the classical Latin status (state) which, through a series of successive developments, led to the French statistique, first attested in 1771.

  • Classical Latin: status (state)

  • Italian: stato (state), statista (statesman)

  • Italian (1633): statistica

  • Modern Latin (1672): statisticus

  • French (1771): statistique

Around the same time, Statistik appeared in German, while English speakers used the expression “political arithmetic” until 1798, when the word “statistics” entered the language.

2. Statistics (plural) and statistics (singular)

As its etymology suggests, the discipline originally concerned affairs of state.

Today we generally distinguish statistics in the plural (collections of figures) from statistics in the singular (the discipline).

Statistics in the first sense is the systematic study, by numerical methods (censuses, surveys, counts, …), of the social facts that characterize a state.

Among the first works dealing with this aspect, we can mention “Le Détail de la France”, written by Boisguilbert in 1697, and, in 1722, “Description de la France” by Piganiol de La Force.

The second meaning did not appear until around 1830. This is the one discussed in this course. We will define statistics as a set of mathematical techniques for interpreting phenomena (e.g. social facts) for which an exhaustive study of all the factors is impossible because of their large number or their complexity. We will use the following terms:

  • individual: a single element considered in the statistical analysis

  • population: the set of all individuals considered

  • sample: the portion of the population that is actually examined

Statistics can be further divided into two main areas:

  • descriptive statistics, which deals with collecting and formatting data and with determining a number of characteristic quantities of the population

  • statistical inference, which aims to draw conclusions about the population from the study of a sample
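
To make these terms concrete, here is a minimal Python sketch. The population (10,000 heights drawn from a normal distribution), the sample size of 200, and the normal-approximation confidence interval are all assumptions chosen for illustration; they do not come from the course material.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: the heights (in cm) of 10,000 individuals.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Sample: the portion of the population that is actually examined.
sample = random.sample(population, 200)

# Descriptive statistics: summarise the data we have in hand.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Statistical inference: use the sample to say something about the population.
# Approximate 95% confidence interval for the population mean
# (normal approximation, a common choice for a sample of this size).
margin = 1.96 * sample_sd / len(sample) ** 0.5
interval = (sample_mean - margin, sample_mean + margin)

# In practice the population mean is unknown; it is computed here only
# so that the estimate can be compared with it.
population_mean = statistics.mean(population)

print(f"sample mean = {sample_mean:.1f} cm, sd = {sample_sd:.1f} cm")
print(f"95% interval: {interval[0]:.1f} to {interval[1]:.1f} cm")
print(f"population mean: {population_mean:.1f} cm")
```

The first two computed values describe the sample itself; the interval is an inferential statement about the whole population, made without measuring every individual.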

3. History

Throughout history, heads of state have sought to gauge the power of the nations they ruled by means of partial or complete censuses (of population, territory, production, etc.).

As early as 3000 BC, records of observations on goods and people are found in Mesopotamia.

In 1200 BC, evaluations of agricultural production were carried out in China.

At the beginning of the Common Era, a count of the wealth of the Roman Empire took place; it is mentioned in the Gospel of Luke.

In the Middle Ages, surveys were carried out on the orders of Charlemagne and then of William the Conqueror. In both cases, the goal was to get a more precise idea of the country’s wealth.

In the 17th century, to avoid burdensome and costly censuses, William Petty (1623-1687) developed a method of estimating the population of London from the average proportions between:

  • the number of houses

  • the number of hearths (households) per house

  • the composition of families
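
One plausible reading of this method is a simple chain of multiplications: houses, times hearths per house, times persons per hearth. The sketch below assumes that reading; the function name and all figures are hypothetical and are not Petty’s actual data.

```python
def estimate_population(houses: int,
                        hearths_per_house: float,
                        persons_per_hearth: float) -> float:
    """Estimate a population as houses x hearths per house x persons per hearth."""
    return houses * hearths_per_house * persons_per_hearth


# Hypothetical figures for illustration only; they are not Petty's data.
houses = 100_000
hearths_per_house = 1.5
persons_per_hearth = 5.0

estimate = estimate_population(houses, hearths_per_house, persons_per_hearth)
print(f"estimated population: {estimate:,.0f}")
```

The appeal of such an approach is that only the number of houses has to be counted in full; the two averages can be obtained from a much smaller set of observations.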

In the 19th century, full censuses grew in importance and, in 1853, the 1st International Congress of Statistics was held in Brussels under the leadership of Adolphe Quetelet (1796-1874), a Belgian astronomer and mathematician and one of the founders of statistical science.
The objective of this congress was to standardize the techniques used to compile national statistics, in order to facilitate comparisons.

At the beginning of the 20th century, a debate arose between supporters of censuses (carried out on the entire population) and supporters of surveys (conducted on a representative sample of the population).

Censuses are not always possible or desirable. In some cases they are too expensive (for example, questioning the entire population of a country). They can also contain errors. Sometimes they are downright absurd: measuring the average strength of a type of car by driving every car of that type into a wall would be commercially unacceptable.

To overcome these drawbacks, we turn to a statistical survey, which consists of inferring the properties of a whole population from the analysis of a sample.

It is essential that the sample is selected and analyzed adequately. In particular, the sample must be representative of the population. An unrepresentative sample is said to be biased.

The importance of the choice of sample is illustrated by the case of straw votes.

At the beginning of the 20th century, many American newspapers conducted “straw votes” by asking millions of people for their written opinion a few weeks before elections.

In 1936, the Literary Digest predicted, using a sample of 2,400,000 voters, the victory of the Republican candidate Alf Landon.

George Gallup, using a poll of 4,000 carefully chosen people, predicted the victory of the Democrat Franklin D. Roosevelt.

Roosevelt’s victory spelled the end of straw votes, whose samples were often biased (the Literary Digest cards had been sent to telephone subscribers and car owners; this wealthy electorate was more favorable to the Republicans).
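
The effect of a biased sample can be reproduced with a small simulation. Everything in the sketch below is a hypothetical assumption made for illustration: the size of the electorate, the share of wealthy voters, and their voting probabilities are invented, and a simple random draw stands in for a carefully chosen poll.

```python
import random

random.seed(1936)  # fixed seed so the sketch is reproducible


def make_voter() -> tuple[bool, bool]:
    """Return (is_wealthy, votes_democrat) for one hypothetical voter."""
    # Assumed for illustration: 20% of voters are wealthy; wealthy voters
    # back the Democrat with probability 0.35, the others with 0.65.
    wealthy = random.random() < 0.20
    votes_democrat = random.random() < (0.35 if wealthy else 0.65)
    return wealthy, votes_democrat


def democrat_share(voters) -> float:
    """Fraction of the given voters who vote Democrat."""
    return sum(v[1] for v in voters) / len(voters)


electorate = [make_voter() for _ in range(200_000)]
true_share = democrat_share(electorate)

# Biased sample: large, but drawn only from wealthy voters.
wealthy_voters = [v for v in electorate if v[0]]
biased_sample = random.sample(wealthy_voters, 20_000)

# Representative sample: small, but drawn at random from everyone.
random_sample = random.sample(electorate, 4_000)

print(f"true Democrat share:      {true_share:.3f}")
print(f"biased sample (20,000):   {democrat_share(biased_sample):.3f}")
print(f"random sample (4,000):    {democrat_share(random_sample):.3f}")
```

Under these assumptions the large sample points to the wrong winner while the small one lands close to the true share: increasing the size of a sample does not correct a bias in how it was selected.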
