Book Review: Executive Data Science by Brian Caffo, Roger D. Peng and Jeffrey Leek

As an introduction to the Data Science world, one needs to build the right frame around the topic. This is usually done via a set of straight-to-the-point books that I mention or summarise in this blog. This is the third one. All of them carry the “data science” label.

This third book is written by Brian Caffo, Roger D. Peng and Jeffrey Leek. Its title is “Executive Data Science”. You can get it here. If you had to choose only one among the three books I have discussed in this blog, this one is probably the most comprehensive.

The collection of bullet points that I have extracted from the book acknowledges its value in a twofold manner: first, I praise the book and congratulate the authors; second, I condense into a few lines a very personal selection of points taken from the book.

As always, here goes my personal disclaimer: reading this very personal and non-comprehensive list of bullet points by no means replaces reading the book it refers to; on the contrary, this post is an invitation to read the entire work.

In approximately 150 pages, the book literally provides the following key points (please consider all bullet points as being in inverted commas, i.e. they quote text from the book):

– “Descriptive statistics have many uses, most notably helping us get familiar with a data set”.
– Inference is the process of making conclusions about populations from samples.
– The most notable example of experimental design is randomization.
– Two types of learning: supervised and unsupervised.
– Machine Learning focuses on learning.
– Code and software play an important role to see if the data that you have is suitable for answering the question that you have.
– The five phases of a data science project are: question, exploratory data analysis, formal modeling, interpretation and communication.
– There are two common languages for analyzing data. The first one is the R programming language. R is a statistical programming language that allows you to pull data out of a database, analyze it, and produce visualizations very quickly. The other major programming language used for this type of analysis is Python. Python is a similar language that allows you to pull data out of databases, analyze and manipulate it, visualize it, and connect to downstream production.
– Documentation basically implies a way to integrate the analysis code and the figures and plots that the data scientist has created with plain text that explains what’s going on. One example is the R Markdown framework. Another example is IPython notebooks.
– Shiny by RStudio is a way to build data products that you can share with people who don’t necessarily have a lot of data science experience.
– Data Engineer and Data Scientist: A data engineer builds out your system for actually computing on that infrastructure. A data scientist needs to be able to do statistics.
– Data scientists: They usually know how to use R or Python, which are general-purpose data science languages that people use to analyze data. They know how to do some kind of visualization, often interactive visualization with something like D3.js. And they’ll likely know SQL in order to pull data out of a relational database.
– kaggle.com is also mentioned as a data science website.
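The R/Python workflow the book describes (pull data out of a database with SQL, analyze it, and summarise it) can be sketched in a few lines of Python. The `sales` table and its values below are hypothetical, purely for illustration; a real project would also add a visualization step, e.g. with matplotlib or ggplot2:

```python
import sqlite3
from statistics import mean

# A small in-memory database standing in for a production data store
# (table name and values are made up for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 95.5), ("south", 180.0), ("south", 140.5)],
)

# Pull the data out with SQL, as the book says data scientists typically do.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()

# Analyze: average sale amount per region (a tiny piece of descriptive statistics).
by_region = {}
for region, amount in rows:
    by_region.setdefault(region, []).append(amount)
averages = {region: mean(amounts) for region, amounts in by_region.items()}
print(averages)  # {'north': 107.75, 'south': 160.25}
```

The same pull–analyze–visualize loop could be written almost line for line in R with DBI and dplyr.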

The authors also provide useful comments on creating, managing and growing a data science team. They start with the basics, e.g. “It’s very helpful to right up front have a policy on the Code of Conduct”.

– Data science is an iterative process.
– The authors also mention the different types of data science questions (as already covered in the summary of the book titled “The Elements of Data Analytic Style”).
– They also provide an exploratory data analysis checklist.
– Some words on how to start with modeling.
– Instead of starting to discuss causal analysis, they talk about associational analysis.
– They also provide some tips on data cleaning, interpretation and communication.
– Confounding: The apparent relationship or lack of relationship between A and B may be due to their joint relationship with C.
– A/B testing: comparing two options, A and B, to see which one performs better.
– It’s important not to confuse randomization, a strategy used to combat lurking and confounding variables, with random sampling, a strategy used to help with generalizability.
– p-value and null hypothesis are also mentioned.
– Finally they link to knit.
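The A/B testing and p-value points above can be illustrated with a minimal two-proportion z-test in Python. The conversion counts and the normal approximation are my own assumptions for the sketch, not taken from the book:

```python
import math

# Hypothetical A/B test: conversions out of visitors for two page variants.
conv_a, n_a = 120, 1000   # variant A: 12.0% conversion rate
conv_b, n_b = 150, 1000   # variant B: 15.0% conversion rate

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled proportion under the null hypothesis
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal distribution, via erf.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With these made-up numbers the p-value lands just under 0.05, i.e. a marginally significant difference; note that only random assignment of visitors to A and B (randomization) lets us attribute that difference to the variant rather than to a confounder.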

Happy data-ing!


This is a Security Bloggers Network syndicated blog post authored by itsecuriteer. Read the original post at: Security and risk