An introduction to the Data Science world requires building the right frame around the topic. This is usually done through a set of straight-to-the-point books that I will be summarising in this blog.
The second book I cover is written by Jeffrey Leek. Its title is “The Elements of Data Analytic Style“. You can get it here. It is a primer on basic statistical concepts that are worth keeping in mind when embarking on a scientific journey.
This summary is a way to acknowledge value in a twofold manner: first, I praise the book and congratulate the author; second, I share with the community a very personal summary of the book.
Let me try to share with you the main learning points I collected from this book. As always, here goes my personal disclaimer: reading this very personal and non-comprehensive summary by no means replaces reading the book it refers to; on the contrary, this post is an invitation to read the entire work.
In approximately 100 pages, the book provides the following key points:
Type of analysis
Figure 2.1, titled the data analysis question type flow chart, is the foundation of the book. It classifies the different types of data analysis. The basic one is a descriptive analysis (reporting results with no interpretation). A step further is an exploratory analysis (would the proposed statements still hold, in a qualitative way, using a different sample?).
If this also holds true in a quantitative manner, then we are in an inferential analysis. If we can use a subset of measures to predict some others, then we can talk about a predictive analysis. The next step, certainly less frequent, is the possibility to seek a cause: then we are in a causal analysis. Finally, and very rarely, if we go beyond statistics and find a deterministic relation, then we have a mechanistic analysis.
Correlation does not imply causation
This is key to understand. The additional element needed to really grasp it is the existence of confounding variables, i.e. additional variables, not covered by the statistical work we are engaged in, that connect the variables we are studying. Two telling examples are mentioned in the book:
– The consumption of ice cream and the murder rate are correlated. However, there is no causality. There is a confounder: the temperature.
– Shoe size and literacy are correlated. However, there is a confounder here: age.
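The ice cream example can be made concrete with a small simulation. This is a minimal sketch with made-up numbers, not data from the book: temperature drives both simulated outcomes, so they correlate even though neither causes the other, and adjusting for the confounder makes the correlation vanish.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: temperature is the confounder that drives both
# ice cream consumption and (in this toy world) the murder rate.
temperature = rng.normal(20, 8, 1000)
ice_cream = 2.0 * temperature + rng.normal(0, 5, 1000)
murders = 0.5 * temperature + rng.normal(0, 5, 1000)

# The raw correlation between the two outcomes looks sizeable...
raw_corr = np.corrcoef(ice_cream, murders)[0, 1]

# ...but the partial correlation controlling for temperature is near zero:
# regress each outcome on temperature and correlate the residuals.
def residuals(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

partial_corr = np.corrcoef(residuals(ice_cream, temperature),
                           residuals(murders, temperature))[0, 1]

print(f"raw correlation: {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

The contrast between the two numbers is the whole point: the association is real, but it lives entirely in the confounder.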
Other typical mistakes
Overfitting: Using a single unsplit data set for both model building and testing.
Data dredging: Fitting a large number of models to a data set.
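The overfitting mistake above can be illustrated with a toy experiment. This is only a sketch with invented data: an overly flexible model (a degree-15 polynomial) fits the build half of the data better than a straight line, but a held-out test half exposes that much of that fit was noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a mildly noisy linear relationship.
x = rng.uniform(-1, 1, 60)
y = 1.5 * x + rng.normal(0, 0.3, 60)

# Split once into a build half and a test half instead of reusing all of it.
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial fit on a given sample."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# A degree-15 polynomial chases the training noise...
wiggly = np.polyfit(x_train, y_train, 15)
# ...while a straight line captures the real structure.
line = np.polyfit(x_train, y_train, 1)

print("degree 15 - train:", mse(wiggly, x_train, y_train),
      "test:", mse(wiggly, x_test, y_test))
print("degree 1  - train:", mse(line, x_train, y_train),
      "test:", mse(line, x_test, y_test))
```

Evaluating on data the model has never seen is exactly what the single unsplit data set cannot give you.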
Components of a data set
A data set is not only the raw data: it also includes the tidy data set, a code book describing each variable and its values in the tidy data set, and a script showing how to get from the raw data to the tidy data set.
The data set should be understandable even if you, as producer or curator of the data set, are not there.
Type of variables
Continuous, ordinal, categorical, missing and censored.
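These five types map naturally onto a data frame. The sketch below uses pandas and entirely hypothetical values (the column names and numbers are my own, not from the book); censoring is represented the usual way, with a flag marking whether the event was actually observed within the study window.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records illustrating the five variable types.
df = pd.DataFrame({
    # Continuous, with one missing value encoded as NaN.
    "weight_kg": [71.2, 80.5, np.nan, 65.0],
    # Ordinal: categories with a meaningful order.
    "severity": pd.Categorical(["low", "high", "mid", "low"],
                               categories=["low", "mid", "high"],
                               ordered=True),
    # Categorical: labels with no order.
    "country": pd.Categorical(["ES", "FR", "ES", "DE"]),
    # Censored: for subjects still event-free at day 365 we only know
    # the true time exceeds 365, which the flag below records.
    "survival_days": [120, 365, 365, 48],
    "event_observed": [True, False, False, True],
})

print(df.dtypes)
print("missing values per column:")
print(df.isna().sum())
```

Encoding the type explicitly (rather than leaving everything as strings or floats) is what lets later analysis steps treat each variable correctly.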
Some useful tips
– Plot your data: graphs are the preferred way to represent it.
– Explore your data thoroughly before jumping to statistical analysis.
– Fit a linear regression and compare it against the scatterplot of the original data.
– More data usually beats better algorithms.
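The tip about comparing a regression with the scatterplot can be shown numerically. In this sketch (my own made-up data, not an example from the book), a straight line fitted to curved data reports an innocuous-looking slope, while the residuals still carry obvious structure that a scatterplot or residual plot would reveal at a glance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with curvature: a parabola plus noise.
x = np.linspace(0, 10, 200)
y = 0.3 * (x - 5) ** 2 + rng.normal(0, 0.5, 200)

# A straight-line fit reports a near-zero slope here...
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# ...but the residuals curve with x instead of hovering around zero,
# which is exactly what eyeballing the fit over the scatterplot shows.
curvature_signal = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]
print(f"slope: {slope:.2f}, "
      f"residual/curvature correlation: {curvature_signal:.2f}")
```

A summary coefficient alone would have hidden the misfit; plotting the fit over the data, or the residuals against x, would not.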
Section 9 provides some hints on how to write up an analysis. Section 10 plays a similar role for creating graphs. Section 11 hints at how to present the analysis to the community. Section 12 covers how to make the entire analysis reproducible. Section 14 provides a checklist and Section 15 additional references.
This is a Security Bloggers Network syndicated blog post authored by itsecuriteer. Read the original post at: Security and risk