Data Science


ncosgrave@cct.ie

Moodle links: https://moodle.cct.ie/course/view.php?id=1604



Projects portfolio



Data Analytics courses

Data Analytics - Machine Learning Courses


  • Posts


  • Python for Data Science and Machine Learning Bootcamp - Beginner level
https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/
  • Machine Learning, Data Science and Deep Learning with Python - Beginner level - Similar to the previous one
https://www.udemy.com/course/data-science-and-machine-learning-with-python-hands-on/
  • Data Science: Supervised Machine Learning in Python - More advanced level
https://www.udemy.com/course/data-science-supervised-machine-learning-in-python/
  • Mathematical Foundation For Machine Learning and AI
https://www.udemy.com/course/mathematical-foundation-for-machine-learning-and-ai/
  • The Data Science Course 2019: Complete Data Science Bootcamp
https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/


  • Coursera - By Stanford University



  • Columbia University - COURSE FEES USD 1,400



Possible sources of data


Irish Government Data Portal https://data.gov.ie/
UK Government Data Portal https://data.gov.uk/
UK National Health Service Data https://digital.nhs.uk/data-and-information
EU Open Data Portal http://data.europa.eu/euodp/en/data/
US Government Data Portal https://www.data.gov/
Canadian Government Data Portal https://open.canada.ca/en/open-data
Indian Government Open Data https://data.gov.in/
World Bank https://data.worldbank.org/
International Monetary Fund https://www.imf.org/en/Data
World Health Organisation http://www.who.int/gho/en/
UNICEF https://data.unicef.org/
Federal Drug Administration https://www.fda.gov/Drugs/InformationOnDrugs/ucm079750.htm
Google Public Data Explorer https://www.google.com/publicdata/directory
Human Rights Data Analysis Group https://hrdag.org/
Armed Conflict Data http://www.pcr.uu.se/research/UCDP/
Amazon Web Services Open Data Registry https://registry.opendata.aws/
Pew Research Datasets http://www.pewinternet.org/datasets/
CERN Open Data http://opendata.cern.ch/
Kaggle https://www.kaggle.com/
UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php
Open Data Network https://www.opendatanetwork.com/
Linked Open Data - University of Münster https://www.uni-muenster.de/LODUM/
US National Climate Data https://www.ncdc.noaa.gov/data-access/quick-links#loc-clim
US Medicare Hospital Quality Data https://data.medicare.gov/data/hospital-compare
Yelp Data https://www.yelp.com/dataset/challenge
US Census Data https://www.census.gov/data.html
Broad Institute Cancer Program Data http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi
National Centers for Environmental Information https://www.ncdc.noaa.gov/data-access
Centers for Disease Control and Prevention https://www.cdc.gov/datastatistics/
Open Data Monitor https://opendatamonitor.eu/
Plenario http://plenar.io/
British Film Institute http://www.bfi.org.uk/education-research/film-industry-statistics-research
Edinburgh University Datasets http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
DataHub http://datahub.io



What is data


Data types - Levels of Measurement

Data types (Levels of Measurement)


Provides                                          Nominal   Ordinal   Interval   Ratio
The "order" of values is known                    -         Yes       Yes        Yes
"Counts", aka "Frequency of Distribution"         Yes       Yes       Yes        Yes
Mode                                              Yes       Yes       Yes        Yes
Median                                            -         Yes       Yes        Yes
Mean                                              -         -         Yes        Yes
Can quantify the difference between each value    -         -         Yes        Yes
Can add or subtract values                        -         -         Yes        Yes
Can multiply and divide values                    -         -         -          Yes
Has "true zero"                                   -         -         -          Yes


In Practice:

  • Most schemes accommodate just two levels of measurement: nominal and ordinal.
  • Nominal attributes are also called "categorical", "enumerated", or "discrete". However, "enumerated" and "discrete" imply order.
  • There is one special case: the dichotomy (otherwise known as a "boolean" attribute).
  • Ordinal attributes are also called "numeric" or "continuous"; however, "continuous" implies mathematical continuity.



Nominal


Ordinal


Interval


Ratio


What is Metadata


What is an example


What is a dataset


What is Data Analytics

What is Data Science - Data Analytics - Predictive Data Analytics - Data Analysis - Data Mining - AI - Machine Learning


Data analysis is the process of inspecting, cleansing, transforming, and modeling data with a goal of discovering useful information, suggesting conclusions, and supporting decision-making.


Data mining is a particular data analysis technique that focuses on the modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing on business information.


Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.[1] Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use. https://en.wikipedia.org/wiki/Data_mining


What’s the Difference Between Data Analytics and Data Analysis: https://www.getsmarter.com/blog/career-advice/difference-data-analytics-data-analysis/#:~:text=Data%20analysis%20and%20data%20analytics,data%20analysis%20is%20a%20subcomponent.


While trying to find a definition for Machine Learning, I realised that many experts agree that there is no standard definition of ML.


This post explains the definition of ML well: https://machinelearningmastery.com/what-is-machine-learning/

These videos are also excellent for understanding what ML is:

https://www.youtube.com/watch?v=f_uwKZIAeM0
https://www.youtube.com/watch?v=ukzFI9rgwfU
https://www.youtube.com/watch?v=WXHM_i-fgGo
https://www.coursera.org/lecture/machine-learning/what-is-machine-learning-Ujm7v


One of the most frequently cited definitions is Tom Mitchell's. He provides a definition in the opening line of the preface to his book Machine Learning:

Tom Mitchell

The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.

So, in short, we can say that ML is about writing computer programs that improve themselves with experience.


Tom Mitchell also provides a more complex and formal definition:

Tom Mitchell

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Don't let the definition of terms scare you off; this is a very useful formalism. It can be used as a design tool to help us think clearly about:

E: What data to collect.
T: What decisions the software needs to make.
P: How we will evaluate its results.

Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. In this case: https://www.coursera.org/lecture/machine-learning/what-is-machine-learning-Ujm7v

E: Watching you label emails as spam or not spam.
T: Classifying emails as spam or not spam.
P: The number (or fraction) of emails correctly classified as spam/not spam.



Styles of Learning - Types of Machine Learning


Machine learning types.jpg



Supervised Learning

https://en.wikipedia.org/wiki/Supervised_learning


Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.

In other words, it infers a function from labeled training data consisting of a set of training examples.

In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.


https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

The majority of practical machine learning uses supervised learning.

Supervised learning is when you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variables (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers, the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.


https://www.datascience.com/blog/supervised-and-unsupervised-machine-learning-algorithms

Supervised machine learning is the more commonly used of the two. It includes such algorithms as linear and logistic regression, multi-class classification, and support vector machines. Supervised learning is so named because the data scientist acts as a guide to teach the algorithm what conclusions it should come up with. It’s similar to the way a child might learn arithmetic from a teacher. Supervised learning requires that the algorithm’s possible outputs are already known and that the data used to train the algorithm is already labeled with correct answers. For example, a classification algorithm will learn to identify animals after being trained on a dataset of images that are properly labeled with the species of the animal and some identifying characteristics.



  • Supervised Learning - Regression


  • Supervised Learning - Classification
  • Naive Bayes
  • Decision Trees
  • K-Nearest Neighbour
  • Perceptrons - Neural Networks and Support Vector Machines



Unsupervised Learning


  • Unsupervised Learning - Clustering


  • Unsupervised Learning - Association Rules



Reinforcement Learning


Descriptive Data Analysis

  • Rather than find hidden information in the data, descriptive data analysis looks to summarize the dataset.
  • The measures most commonly used in descriptive data analysis include:
  • Central tendency (Mean, Mode, Median)
  • Variability (Standard deviation, Min/Max)



Central tendency

https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

A central tendency (or measure of central tendency) is a single value that attempts to describe a set of data by identifying the central position within that set of data.

The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others. In the following sections, we will look at the mean, mode and median, and learn how to calculate them and under what conditions they are most appropriate to be used.



Mean

Mean (Arithmetic)

The mean (or average) is the most popular and well known measure of central tendency.

The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.

So, if we have n values in a data set and they have values x₁, x₂, ..., xₙ, the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x₁ + x₂ + ... + xₙ) / n

The mean is essentially a model of your data set. It is the value that is most common. You will notice, however, that the mean is not often one of the actual values that you have observed in your data set. However, one of its important properties is that it minimises error in the prediction of any one value in your data set. That is, it is the value that produces the lowest amount of error from all other values in the data set.


An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.



When not to use the mean

The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

Staff     1     2     3     4     5     6     7     8     9     10
Salary    15k   18k   16k   14k   15k   15k   12k   17k   90k   95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the $12k to 18k range. The mean is being skewed by the two large salaries. Therefore, in this situation, we would like to have a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.


Another time when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e., the frequency distribution for our data is skewed). If we consider the normal distribution - as this is the most frequently assessed in statistics - when the data is perfectly normal, the mean, median and mode are identical. Moreover, they all represent the most typical value in the data set. However, as the data becomes skewed the mean loses its ability to provide the best central location for the data because the skewed data is dragging it away from the typical value. However, the median best retains this position and is not as strongly influenced by the skewed values. This is explained in more detail in the skewed distribution section later in this guide.



Mean in R
mean(iris$Sepal.Width)



Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark - in this case, 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th score in our data set and average them to get a median of 55.5.



Median in R

median(iris$Sepal.Length)



Mode

The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. An example of a mode is presented below:

Mode-1.png


Normally, the mode is used for categorical data where we wish to know which is the most common category, as illustrated below:

Mode-1a.png


We can see above that the most common form of transport, in this particular data set, is the bus. However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency, such as below:

Mode-2.png


We are now stuck as to which mode best describes the central tendency of the data. This is particularly problematic when we have continuous data because we are more likely not to have any one value that is more frequent than the other. For example, consider measuring 30 people's weights (to the nearest 0.1 kg). How likely is it that we will find two or more people with exactly the same weight (e.g., 67.4 kg)? The answer is: probably very unlikely - many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight; that is, to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data.


Another problem with the mode is that it will not provide us with a very good measure of central tendency when the most common mark is far away from the rest of the data in the data set, as depicted in the diagram below:

Mode-3.png


In the above diagram the mode has a value of 2. We can clearly see, however, that the mode is not representative of the data, which is mostly concentrated around the 20 to 30 value range. To use the mode to describe the central tendency of this data set would be misleading.



To get the Mode in R
install.packages("modeest")
library(modeest)
> mfv(iris$Sepal.Width, method = "mfv")



Skewed Distributions and the Mean and Median

We often test whether our data is normally distributed because this is a common assumption underlying many statistical tests. An example of a normally distributed set of data is presented below:

Skewed-1.png

When you have a normally distributed sample you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set for its calculation, and any change in any of the scores will affect the value of the mean. This is not the case with the median or mode.

However, when our data is skewed, for example, as with the right-skewed data set below:

Skewed-2.png

we find that the mean is being dragged in the direction of the skew. In these situations, the median is generally considered to be the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean. A classic example of the above right-skewed distribution is income (salary), where higher earners provide a false representation of the typical income if expressed as a mean and not a median.

If dealing with a normal distribution, and tests of normality show that the data is non-normal, it is customary to use the median instead of the mean. However, this is more a rule of thumb than a strict guideline. Sometimes, researchers wish to report the mean of a skewed distribution if the median and mean are not appreciably different (a subjective assessment), and if it allows easier comparisons to previous research to be made.



Summary of when to use the mean, median and mode

Please use the following summary table to know what the best measure of central tendency is with respect to the different types of variable:

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

For answers to frequently asked questions about measures of central tendency, please go to: https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median-faqs.php



Measures of Variation


Range

The range simply shows the minimum and maximum values of a variable.

In R:

> min(iris$Sepal.Width)
> max(iris$Sepal.Width)
> range(iris$Sepal.Width)


Range can be used on Ordinal, Ratio and Interval scales



Quartile

https://statistics.laerd.com/statistical-guides/measures-of-spread-range-quartiles.php

Quartiles tell us about the spread of a data set by breaking the data set into quarters, just like the median breaks it in half.

For example, consider the marks of 100 students, which have been ordered from the lowest to the highest scores.


  • The first quartile (Q1): Lies between the 25th and 26th student's marks.
    • So, if the 25th and 26th student's marks are 45 and 45, respectively:
      • (Q1) = (45 + 45) ÷ 2 = 45
  • The second quartile (Q2): Lies between the 50th and 51st student's marks.
    • If the 50th and 51st students' marks are 58 and 59, respectively:
      • (Q2) = (58 + 59) ÷ 2 = 58.5
  • The third quartile (Q3): Lies between the 75th and 76th student's marks.
    • If the 75th and 76th student's marks are 71 and 71, respectively:
      • (Q3) = (71 + 71) ÷ 2 = 71


In the above example, we have an even number of scores (100 students, rather than an odd number, such as 99 students). This means that when we calculate the quartiles, we take the sum of the two scores around each quartile and then halve them (hence Q1 = (45 + 45) ÷ 2 = 45). However, if we had an odd number of scores (say, 99 students), we would only need to take one score for each quartile (that is, the 25th, 50th and 75th scores). You should recognize that the second quartile is also the median.


Quartiles are a useful measure of spread because they are much less affected by outliers or a skewed data set than the equivalent measures of mean and standard deviation. For this reason, quartiles are often reported along with the median as the best choice of measure of spread and central tendency, respectively, when dealing with skewed and/or data with outliers. A common way of expressing quartiles is as an interquartile range. The interquartile range describes the difference between the third quartile (Q3) and the first quartile (Q1), telling us about the range of the middle half of the scores in the distribution. Hence, for our 100 students:

IQR = Q3 - Q1 = 71 - 45 = 26


However, it should be noted that in journals and other publications you will usually see the interquartile range reported as 45 to 71, rather than the calculated value of 26.

A slight variation on this is the semi-interquartile range, which is half the interquartile range. Hence, for our 100 students:

semi-IQR = 26 ÷ 2 = 13


Quartile in R
quantile(iris$Sepal.Length)
Result 
0%    25%    50%    75%    100% 
4.3   5.1    5.8    6.4    7.9

The 0% and 100% values are equivalent to the min and max values.



Box Plots

boxplot(iris$Sepal.Length,
       col = "blue", 
       main="iris dataset", 
       ylab = "Sepal Length")



Variance

https://statistics.laerd.com/statistical-guides/measures-of-spread-absolute-deviation-variance.php

Another method for calculating the deviation of a group of scores from the mean, such as the 100 students we used earlier, is to use the variance. Unlike the absolute deviation, which uses the absolute value of the deviation in order to "rid itself" of the negative values, the variance achieves positive values by squaring each of the deviations instead. Adding up these squared deviations gives us the sum of squares, which we can then divide by the total number of scores in our group of data (in other words, 100 because there are 100 students) to find the variance. Therefore, for our 100 students, the variance is 211.89, as shown below:

Variance = Σ(xᵢ - x̄)² / n = 211.89

  • Variance describes the spread of the data.
  • It is a measure of deviation of a variable from the arithmetic mean.
  • The technical definition is the average of the squared differences from the mean.
  • A value of zero means that there is no variability; all the numbers in the data set are the same.
  • A higher number would indicate a large variety of numbers.



Variance in R
var(iris$Sepal.Length)



Standard Deviation

https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php

The standard deviation is a measure of the spread of scores within a set of data. Usually, we are interested in the standard deviation of a population. However, as we are often presented with data from a sample only, we can estimate the population standard deviation from a sample standard deviation. These two standard deviations - sample and population standard deviations - are calculated differently. In statistics, we are usually presented with having to calculate sample standard deviations, and so this is what this article will focus on, although the formula for a population standard deviation will also be shown.


The sample standard deviation formula is:

s = √( Σ(xᵢ - x̄)² / (n - 1) )


The population standard deviation formula is:

σ = √( Σ(xᵢ - μ)² / N )

  • The Standard Deviation is the square root of the variance.
  • This measure is the most widely used to express deviation from the mean in a variable.
  • The higher the value the more widely distributed are the variable data values around the mean.
  • Assuming the frequency distribution is approximately normal, about 68% of all observations fall within +/- 1 standard deviation of the mean.
  • Approximately 95% of all observations fall within two standard deviations of the mean (if data is normally distributed).



Standard Deviation in R
sd(iris$Sepal.Length)



Z Score

  • A z-score represents how far a particular value is from the mean, expressed as a number of standard deviations.
  • z-scores are also known as standardized residuals.
  • Note: the mean and standard deviation are sensitive to outliers.
> x <- (iris$Sepal.Width - mean(iris$Sepal.Width)) / sd(iris$Sepal.Width)   # z-scores for every observation
> x
> x[77]   # z-score for a single observation (row 77)
> # or compute it directly for that single value:
> x <- (iris$Sepal.Width[77] - mean(iris$Sepal.Width)) / sd(iris$Sepal.Width)
> x



Shape of Distribution


Skewness

  • Skewness is a method for quantifying the lack of symmetry in the distribution of a variable.
  • A skewness value of zero indicates that the variable is distributed symmetrically. A positive value indicates a longer tail to the right (the bulk of the values lies to the left), and a negative value indicates a longer tail to the left.
Skewness.png



Skewness in R
> install.packages("moments")
> library(moments)
> skewness(iris$Sepal.Width)



Histograms in R

> hist(iris$Petal.Width)



Kurtosis

  • Kurtosis is a measure that gives an indication of how peaked the distribution is.
  • Variables with a pronounced peak toward the mean have a high kurtosis score, and variables with a flat peak have a low kurtosis score.



Kurtosis in R
> kurtosis(iris$Sepal.Length)



Correlation & Simple and Multiple Regression

  • 17/06: Recorded class - Correlation & Regression



Correlation

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. https://en.wikipedia.org/wiki/Correlation_and_dependence


Correlation1.png


Where moderate to strong correlations are found, we can use this to make a prediction about the value of one variable given what is known about the other variables.

The following are examples of correlations:

  • there is a correlation between ice cream sales and temperature.
  • Phytoplankton population at a given latitude and surface sea temperature
  • Blood alcohol level and the odds of being involved in a road traffic accident



Measuring Correlation


Pearson's r - The Correlation Coefficient

Karl Pearson (1857-1936)

The correlation coefficient, developed by Karl Pearson, provides a much more exact way of determining the type and degree of a linear correlation between two variables.

Pearson's r, also known as the Pearson product-moment correlation coefficient, is a measure of the strength of the relationship between two variables and is given by this equation:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)² )


where x̄ and ȳ are the means of the x (independent) and y (dependent) variables, respectively, and xᵢ and yᵢ are the individual observations of each variable.


The direction of the correlation:

  • Values of Pearson's r range between -1 and +1.
  • Values greater than zero indicate a positive correlation, with 1 being a perfect positive correlation.
  • Values less than zero indicate a negative correlation, with -1 being a perfect negative correlation.


The degree of the correlation:

Degree of correlation Interpretation
0.8 to 1.0 Very strong
0.6 to 0.8 Strong
0.4 to 0.6 Moderate
0.2 to 0.4 Weak
0 to 0.2 Very weak or non-existent
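

Pearson's r in R

A minimal sketch using the built-in iris data; the choice of variables is an assumption for illustration, not part of the lecture notes:

cor(iris$Petal.Length, iris$Petal.Width, method = "pearson")   # returns a value between -1 and +1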



The coefficient of determination

The r² value is termed the coefficient of determination because it measures the proportion of variance in the dependent variable that is determined by its relationship with the independent variables. This is calculated from two values:


  • The total sum of squares: SStot = Σ(yᵢ - ȳ)²
  • The residual sum of squares: SSres = Σ(yᵢ - ŷᵢ)²


The total sum of squares is the sum of the squared differences between the actual values (yᵢ) and their mean. The residual sum of squares is the sum of the squared differences between the predicted values (ŷᵢ) and their respective actual values. The coefficient of determination is then:

r² = 1 - (SSres / SStot)

r² is used to gain some idea of the goodness of fit of a model. It is a measure of how well the regression predictions approximate the actual data points. An r² of 1 means that the predicted values perfectly fit the actual data.
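

Coefficient of determination in R

A minimal sketch, again assuming the iris variables used above; for a simple (one-predictor) relationship, r² is just the square of Pearson's r:

r <- cor(iris$Petal.Length, iris$Petal.Width)
r^2   # coefficient of determination

The same value is reported as "Multiple R-squared" by summary(lm(Petal.Width ~ Petal.Length, data = iris)).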



Testing the "generalizability" of the correlation

Having determined the value of the correlation coefficient (r) for a pair of variables, you should next determine the likelihood that the value of r occurred purely by chance. In other words, what is the likelihood that the relationship in your sample reflects a real relationship in the population?

Before carrying out any test, the alpha (α) level should be set. This is a measure of how willing we are to be wrong when we say that there is a relationship between two variables. A commonly used level in research is 0.05.

An α level of 0.05 means that you could possibly be wrong up to 5 times out of 100 when you state that there is a relationship in the population based on a correlation found in the sample.

In order to test whether the correlation in the sample can be generalized to the population, we must first identify the null hypothesis (H₀) and the alternative hypothesis (H₁).

This is a test against the population correlation coefficient (ρ), so these hypotheses are:


  • H₀: ρ = 0 (there is no correlation in the population)
  • H₁: ρ ≠ 0 (there is a correlation in the population)


Next, we calculate the value of the test statistic using the following equation:

t = r√(n - 2) / √(1 - r²)


So, for a correlation coefficient value of -0.8, a value of 0.9 and a sample size of 102, this would be:



Checking the t-tables for an α level of 0.05 and a two-tailed test (because we are testing whether ρ is less than or greater than 0), we get a critical value of 2.056. As the value of the test statistic (25.29822) is greater than the critical value, we can reject the null hypothesis and conclude that there is likely to be a correlation in the population.
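

Testing the correlation in R

A minimal sketch, assuming the iris variables used above; cor.test() performs the same test directly and reports the t statistic and p-value:

r <- cor(iris$Petal.Length, iris$Petal.Width)
n <- nrow(iris)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # test statistic t = r*sqrt(n-2)/sqrt(1-r^2)
t_stat
2 * pt(-abs(t_stat), df = n - 2)            # two-tailed p-value
cor.test(iris$Petal.Length, iris$Petal.Width)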



Correlation and Causation

Even if you find the strongest of correlations, you should never interpret it as more than just that... a correlation.


Causation indicates a relationship between two events where one event is affected by the other. In statistics, when the value of one event, or variable, increases or decreases as a result of other events, it is said there is causation.

Let's say you have a job and get paid a certain rate per hour. The more hours you work, the more income you will earn, right? This means there is a relationship between the two events and also that a change in one event (hours worked) causes a change in the other (income). This is causation in action! https://study.com/academy/lesson/causation-in-statistics-definition-examples.html


Given any two correlated events A and B, the following relationships are possible:

  • A causes B
  • B causes A
  • A and B are both the product of a common underlying cause, but do not cause each other
  • Any relationship between A and B is simply the result of coincidence.


Although a correlation between two variables could possibly indicate the presence of:

  • a causal relationship between the variables in either direction (x causes y, or y causes x); or
  • the influence of one or more confounding variables, i.e. another variable that has an influence on both variables;

it can also indicate the absence of any connection. In other words, it can be entirely spurious, the product of pure chance. In the following slides, we will look at a few examples...



Examples

Causality or coincidence?



Simple Linear Regression

SimpleLinearRegression2.png


The purpose of regression analysis is to:

  • Predict the value of the dependent variable as a function of the value(s) of at least one independent variable.
  • Explain how changes in an independent variable are manifested in the dependent variable
  • The dependent variable is the variable that is to be predicted or explained.
  • An independent variable is the variable (or variables) used to predict or explain the dependent variable.


The regression equation:

ŷ = b₀ + b₁x

  • Dependent variable: y
  • Independent variable: x
  • Slope: b₁
The slope is the amount of change in units of y for each unit change in x.
  • y-intercept: b₀
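

Simple Linear Regression in R

A minimal sketch using the built-in iris data; treating Petal.Length as the independent variable and Petal.Width as the dependent variable is an assumption for illustration:

model <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(model)                                   # coefficients give the y-intercept (b0) and the slope (b1)
predict(model, data.frame(Petal.Length = 4.5))   # predicted y for a new value of x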



Multiple Linear Regression

With Simple Linear Regression, we saw that we could use a single independent variable (x) to predict one dependent variable (y). Multiple Linear Regression is a development of Simple Linear Regression predicated on the assumption that if one variable can be used to predict another with a reasonable degree of accuracy then using two or more variables should improve the accuracy of the prediction.



Uses for Multiple Linear Regression:

When implementing Multiple Linear Regression, variables added to the model should make a unique contribution towards explaining the dependent variable. In other words, the multiple independent variables in the model should be able to predict the dependent variable better than any one of the variables would do in a Simple Linear Regression model.



The Multiple Linear Regression Model:

ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₖxₖ


Multicollinearity:

Before adding variables to the model, it is necessary to check for correlation between the independent variables themselves. The greater the degree of correlation between two independent variables, the more information they hold in common about the dependent variable. This is known as multicollinearity.


Because it is difficult to properly apportion the information each independent variable carries about the dependent variable, including highly correlated independent variables in the model can result in unstable estimates for the coefficients. Unstable coefficient estimates result in unrepeatable studies.



Adjusted r²:

Recall that the coefficient of determination (r²) is a measure of how well our model as a whole explains the values of the dependent variable. Because models with larger numbers of independent variables will inevitably explain more variation in the dependent variable, the adjusted r² value penalises models with a large number of independent variables. As such, adjusted r² can be used to compare the performance of models with different numbers of independent variables.
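

Multiple Linear Regression in R

A minimal sketch on the iris data (the choice of predictors is an assumption for illustration). The multicollinearity check assumes the car package is installed:

model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
summary(model)                 # reports both R-squared and Adjusted R-squared
summary(model)$adj.r.squared   # adjusted r-squared on its own
car::vif(model)                # variance inflation factors; large values suggest multicollinearity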



RapidMiner Linear Regression examples


  • Example 1:
  • In the parameters for the Split Data operator, click on the Edit Enumerations button and enter two rows in the dialog box that opens. The first value should be 0.7 and the second should be 0.3. You can, of course, choose other values for the train and test split, provided that they sum to 1.
  • If you want the regression to be reproducible, check the Use Local Random Seed box and enter a seed value of your choosing in the local random seed box.


  • Linear Regression operator:
  • Set feature selection to none.
  • If you are doing multiple linear regression, check the eliminate collinear features box.
  • If you want to have a Y-intercept calculated, check the use bias box.
  • Set the ridge parameter to 0.


  • After running the model, clicking on the Linear Regression tab in the results will show you the coefficient values, the t statistic and the p-value.
  • Note that if the p-value is less than your chosen α, you can also reject the null hypothesis.



RapidMiner Linear regression-examples1 fig1.png


RapidMiner Linear regression-examples1 fig2.png


RapidMiner Linear regression-examples1 fig3.png




Naive Bayes

Lecture and Tutorial: https://moodle.cct.ie/mod/scorm/player.php?a=4&currentorg=tuto&scoid=8&sesskey=wc2PiHQ6F5&display=popup&mode=normal

Note: in all the Naive Bayes examples given, the Performance operator used is Performance (Binomial Classification).



Naïve Bayes is a classification technique that uses data about prior events to derive a probability of future events.

Bayesian classifiers utilize training data to calculate an observed probability for each class based on feature values. When such classifiers are later used on unlabeled data, they use those observed probabilities to predict the most likely class, given the features in the new data. It is a very simple idea, but it gives us a classification method that often delivers results on par with more sophisticated algorithms.

The Naïve Bayes algorithm is named as such because it makes a couple of naïve assumptions about the data. In particular, it assumes that all of the features in a dataset are equally important and independent. These assumptions are rarely true in most real-world applications. However, in most cases when these assumptions are violated, Naïve Bayes still performs fairly well. This is true even in extreme circumstances where strong dependencies are found among the features. Due to the algorithm's versatility and accuracy across many types of conditions, Naïve Bayes is often a strong first candidate for classification learning tasks.



Bayesian classifiers are typically best applied to problems in which the information from numerous attributes should be considered simultaneously in order to estimate the probability of an outcome.

While many algorithms ignore features that have weak effects, Bayesian methods utilize all available evidence to subtly change the predictions.

If a large number of features have relatively minor effects, taken together their combined impact could be quite large. Bayesian probability theory is rooted in the idea that the estimated likelihood of an event should be based on the evidence at hand.



Bayesian classifiers have been used for:

  • Text classification, such as spam filtering, author identification, and topic modeling
  • A common application of the algorithm uses the frequency of the occurrence of words in past emails to identify junk email.
  • In weather forecasting, the chance of rain describes the proportion of prior days with similar measurable atmospheric conditions in which precipitation occurred. A 60 percent chance of rain, therefore, suggests that on 6 out of 10 days on record with similar atmospheric conditions, it rained.
  • Intrusion detection and anomaly detection on computer networks
  • Diagnosis of medical conditions, given a set of observed symptoms.



Probability Primer

The probability of an event can be estimated from observed data by dividing the number of trials in which an event occurred by the total number of trials.

  • For instance, if it rained 3 out of 10 days, the probability of rain can be estimated as 30 percent.
  • Similarly, if 10 out of 50 email messages are spam, then the probability of spam can be estimated as 20 percent.
  • The notation P(A) is used to denote the probability of event A, as in P(spam) = 0.20


Events are possible outcomes, such as sunny and rainy weather, a heads or tails result in a coin flip, or spam and not spam email messages.

A trial is a single opportunity for the event to occur, such as a day's weather, a coin flip, or an email message.



Independent and dependent events

If the two events are totally unrelated, they are called independent events. For instance, the outcome of a coin flip is independent of whether the weather is rainy or sunny.

On the other hand, a rainy day and the presence of clouds are dependent events. The presence of clouds is likely to be predictive of a rainy day. In the same way, the appearance of the word Viagra is predictive of a spam email.

If all events were independent, it would be impossible to predict any event using data about other events. Dependent events are the basis of predictive modeling.



Likelihood and Marginal Likelihood

  • The probability that the word 'Viagra' was used in previous spam messages is called the likelihood.
  • The probability that the word 'Viagra' appeared in any email (spam or ham) is known as the marginal likelihood.



Mutually exclusive and collectively exhaustive

In probability theory and logic, a set of events is Mutually exclusive or disjoint if they cannot both occur at the same time. A clear example is the set of outcomes of a single coin toss, which can result in either heads or tails, but not both. https://en.wikipedia.org/wiki/Mutual_exclusivity

A set of events is jointly or collectively exhaustive if at least one of the events must occur. For example, when rolling a six-sided die, the events 1, 2, 3, 4, 5, and 6 (each consisting of a single outcome) are collectively exhaustive, because they encompass the entire range of possible outcomes. https://en.wikipedia.org/wiki/Collectively_exhaustive_events


If a trial has n outcomes that cannot occur simultaneously, such as heads or tails, or spam and ham (non-spam), then knowing the probability of n-1 outcomes reveals the probability of the remaining one.

In other words, if there are two outcomes and we know the probability of one, then we automatically know the probability of the other. For example, given the value P(spam) = 0.20 , we are able to calculate P(ham) = 1 – 0.20 = 0.80



Joint Probability

Joint Probability (Independence)


Often, we are interested in monitoring several non-mutually exclusive events for the same trial. If some other events occur at the same time as the event of interest, we may be able to use them to make predictions.

In the case of spam detection, consider, for instance, a second event based on the outcome that the email message contains the word Viagra. For most people, this word is only likely to appear in a spam message. Its presence in a message is therefore a very strong piece of evidence that the email is spam.

We know that 20 percent of all messages were spam and 5 percent of all messages contained the word "Viagra". Our job is to quantify the degree of overlap between these two probabilities. In other words, we hope to estimate the probability of both spam and the word "Viagra" co-occurring, which can be written as P(spam ∩ Viagra).

If we assume that P(spam) and P(Viagra) are independent, we could then easily calculate P(spam ∩ Viagra) - the probability of both events happening at the same time (note that they are not independent, however!).

Because 20 percent of all messages are spam, and 5 percent of all emails contain the word Viagra, we could assume that 5 percent of 20 percent (0.05 * 0.20 = 0.01), or 1 percent of all messages, are spam messages that contain the word Viagra.

More generally, for any two independent events A and B, the probability of both happening is: P(A ∩ B) = P(A) * P(B)

In reality, it is far more likely that P(spam) and P(Viagra) are highly dependent, which means that this calculation is incorrect. So we need to employ conditional probability.



Conditional probability - The Bayes' Theorem

Thomas Bayes (1763): An essay toward solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society, 370-418.


The relationship between dependent events can be described using Bayes' Theorem, as shown in the following equation:

P(A|B) = P(B|A) × P(A) / P(B)

The notation P(A|B) can be read as the probability of event A given that event B occurred. This is known as conditional probability, since the probability of A is dependent or conditional on the occurrence of event B.



Prior Probability

Suppose that you were asked to guess the probability that an incoming email was spam. Without any additional evidence, the most reasonable guess would be the probability that any prior message was spam (that is, 20 percent in the preceding example). This estimate is known as the prior probability.



Posterior Probability

Now suppose that you obtained an additional piece of evidence. You are told that the incoming email contains the word 'Viagra'.

By applying Bayes' theorem to the evidence, we can compute the posterior probability that measures how likely the message is to be spam.

In the case of spam classification, if the posterior probability is greater than 50 percent, the message is more likely to be spam than ham, and it can potentially be filtered out.

The following equation is the Bayes' theorem for the given evidence:


BayesTheorem-Posterior probability.png



Applying Bayes' Theorem - Example

We need information on how frequently Viagra has occurred as spam or ham. Let's assume these numbers:

                 Viagra
Frequency      Yes      No       Total
Spam           4        16       20
Ham            1        79       80
Total          5        95       100


  • The likelihood table reveals that P(Viagra|spam) = 4/20 = 0.20, indicating that the probability is 20 percent that a spam message contains the term Viagra.
  • Additionally, since the theorem says that P(B|A) x P(A) = P(A ∩ B), we can calculate P(spam ∩ Viagra) as P(Viagra|spam) x P(spam) = (4/20) x (20/100) = 0.04.
  • This is four times greater than the previous estimate under the faulty independence assumption.


To compute the posterior probability we simply use:

P(spam|Viagra) = P(Viagra|spam) × P(spam) / P(Viagra) = (0.20 × 0.20) / 0.05 = 0.80


  • Therefore, the probability is 80 percent that a message is spam, given that it contains the word "Viagra".
  • Therefore, any message containing this term should be filtered.
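

A quick check of the arithmetic in R, using the values from the likelihood table above:

likelihood <- 4/20      # P(Viagra|spam)
prior      <- 20/100    # P(spam)
marginal   <- 5/100     # P(Viagra)
likelihood * prior / marginal   # posterior P(spam|Viagra) = 0.8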



Applying Bayes' Theorem - Example

  • Let's extend our spam filter by adding a few additional terms to be monitored: "money", "groceries", and "unsubscribe".
  • We will assume that the Naïve Bayes learner was trained by constructing a likelihood table for the appearance of these four words in 100 emails, as shown in the following table:


ApplyingBayesTheorem-Example.png


As new messages are received, the posterior probability must be calculated to determine whether the messages are more likely to be spam or ham, given the likelihood of the words found in the message text. For example, suppose that a message contains the terms Viagra and Unsubscribe, but does not contain either Money or Groceries. Using Bayes' theorem, we can define the problem as shown in the equation below, which captures the probability that a message is spam, given that the words 'Viagra' and 'Unsubscribe' are present and that the words 'Money' and 'Groceries' are not:

P(spam | Viagra ∩ ¬Money ∩ ¬Groceries ∩ Unsubscribe) = P(Viagra ∩ ¬Money ∩ ¬Groceries ∩ Unsubscribe | spam) × P(spam) / P(Viagra ∩ ¬Money ∩ ¬Groceries ∩ Unsubscribe)


For a number of reasons, this is computationally difficult to solve. As additional features are added, tremendous amounts of memory are needed to store probabilities for all of the possible intersecting events.



Class-Conditional Independence

The work becomes much easier if we can exploit the fact that Naïve Bayes assumes independence among events. Specifically, Naïve Bayes assumes class-conditional independence, which means that events are independent so long as they are conditioned on the same class value.

Assuming conditional independence allows us to simplify the equation using the probability rule for independent events, which you may recall is P(A ∩ B) = P(A) * P(B). This results in a much easier-to-compute formulation:


ApplyingBayesTheorem-ClassConditionalIndependance.png


Using the values in the likelihood table, we can start filling numbers into these equations. Because the denominator is the same in both cases, it can be ignored for now. The overall likelihood of spam is then:

P(Viagra|spam) × P(¬Money|spam) × P(¬Groceries|spam) × P(Unsubscribe|spam) × P(spam) = 0.012


While the likelihood of ham given the occurrence of these words is:

P(Viagra|ham) × P(¬Money|ham) × P(¬Groceries|ham) × P(Unsubscribe|ham) × P(ham) = 0.002

Because 0.012/0.002 = 6, we can say that this message is six times more likely to be spam than ham. However, to convert these numbers to probabilities, we need one last step.


The probability of spam is equal to the likelihood that the message is spam divided by the likelihood that the message is either spam or ham:

P(spam) = 0.012 / (0.012 + 0.002) = 0.857

The probability that the message is spam is 0.857. As this is over the threshold of 0.5, the message is classified as spam.



Naïve Bayes - Problems

Suppose we received another message, this time containing the terms Viagra, Money, Groceries, and Unsubscribe. The likelihood of spam is:

P(Viagra|spam) × P(Money|spam) × P(Groceries|spam) × P(Unsubscribe|spam) × P(spam) = 0

Surely this is a misclassification, right? This problem arises when an event never occurs for one or more levels of the class. For instance, the term Groceries had never previously appeared in a spam message. Consequently, P(spam|Groceries) = 0%.

Because probabilities in Naïve Bayes are multiplied out, this 0% value causes the posterior probability of spam to be zero, giving the presence of the word Groceries the ability to effectively nullify and overrule all of the other evidence.

Even if the email was otherwise overwhelmingly expected to be spam, the zero likelihood for the word Groceries will always result in a probability of spam being zero.


A solution to this problem involves using something called the Laplace estimator



Naïve Bayes - Laplace Estimator

The Laplace estimator, named after the French mathematician Pierre-Simon Laplace, essentially adds a small number to each of the counts in the frequency table, which ensures that each feature has a nonzero probability of occurring with each class.

Typically, the Laplace estimator is set to 1, which ensures that each class-feature combination is found in the data at least once. The Laplace estimator can be set to any value and does not necessarily even have to be the same for each of the features.

Using a value of 1 for the Laplace estimator, we add one to each numerator in the likelihood function. The sum of all the 1s added to the numerator must then be added to each denominator. The likelihood of spam is therefore:



While the likelihood of ham is:


This means that the probability of spam is 80 percent and the probability of ham is 20 percent; a more plausible result than the one obtained when Groceries alone determined the result.



Naïve Bayes - Numeric Features

Because Naïve Bayes uses frequency tables for learning the data, each feature must be categorical in order to create the combinations of class and feature values comprising the matrix.

Since numeric features do not have categories of values, the preceding algorithm does not work directly with numeric data.

One easy and effective solution is to discretize numeric features, which simply means that the numbers are put into categories known as bins. For this reason, discretization is also sometimes called binning.

This method is ideal when there are large amounts of training data, a common condition when working with Naïve Bayes.

There is also a version of Naïve Bayes that uses a kernel density estimator that can be used on numeric features with a normal distribution.
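

Naïve Bayes in R

The examples in this module use RapidMiner, but an equivalent sketch is possible in R with the e1071 package (assumed installed); its naiveBayes() function handles numeric features by fitting a normal density per class by default:

library(e1071)
model <- naiveBayes(Species ~ ., data = iris)
predict(model, iris[1:5, ])                 # predicted classes for the first five rows
predict(model, iris[1:5, ], type = "raw")   # posterior probabilities for each class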




RapidMiner Examples

  • Example 1:



  • Example 2:
Download the directory including the data, video explanation and RapidMiner process file at File:NaiveBayes-RapidMiner Example2.zip



  • Example 3:



Decision Trees

  • 10/06: Recorded class - Decision tree



What is a Decision Tree

A predictive model based on a series of branching Boolean tests, essentially representing knowledge as logical structures [Noel Cosgrave slides]



How Decision Trees Work

  • Decision tree learners build models in the form of tree structure.
  • The model is a series of logical decisions on an attribute (tests for which the answer is true or false)
  • Decision nodes split the data using these tests
  • Leaf nodes assign a predicted class based on a combination of the decisions in higher nodes
  • Data travels through the tree from root to leaf nodes
  • The tree generalizes the data. It produces a compact description of the data


  • A decision rule is the path from the root node to the leaf node
  • Each rule is mutually exclusive
  • Every row (or observation) in the data is covered by a single rule
  • Covering a data observation means that the observation satisfies the conditions of the rule


  • A dataset can have many possible decision trees
  • In practice, we want small & accurate trees
  • Smaller trees are more general
  • Smaller trees are also more accurate
  • Easier to understand by humans (and to communicate!)


  • What is the inductive bias in Decision Tree learning?
  • Shorter trees are preferred over longer trees.
  • Prefer trees that place high information gain attributes close to the root


  • More succinct hypotheses are preferred. Why?
  • There are fewer succinct hypotheses than complex ones
  • If a succinct hypothesis explains the data this is unlikely to be a coincidence
  • However, not every succinct hypothesis is a reasonable one.



Example 1:

The model for predicting the future success of a movie can be represented as a simple tree.

DecisionTree-NoelCosgraveSlides 1.png


DecisionTree-NoelCosgraveSlides 2.png



Example 2:

This decision tree is used to decide whether or not to provide a loan to a customer


DecisionTree-NoelCosgraveSlides 3.png


  • 15 training examples
  • Yes = approval, No = rejection
  • The x/y values mean that x out of the y training examples that reach this leaf node have the class of the leaf. This is the confidence.
  • The x value is the support count
  • x/total training examples is the support



The ID3 algorithm

ID3 (Quinlan, 1986) is an early algorithm for learning Decision Trees. The learning of the tree is top-down. The algorithm is greedy: it looks at a single attribute and its gain at each step. This may fail when a combination of attributes is needed to improve the purity of a node.


At each split, the question is "which attribute should be tested next? Which logical test gives us more information?". This is determined by the measures of entropy and information gain. These are discussed later.


A new decision node is then created for each outcome of the test and examples are partitioned according to this value. The process is repeated for each new node until all the examples are classified correctly or there are no attributes left.



The C5.0 algorithm

C5.0 (Quinlan, 1993) is a refinement of C4.5, which in turn improved upon ID3. It is the industry standard for producing decision trees. It performs well for most problems out of the box. Unlike many other machine-learning techniques, it has high explanatory power: it can be understood and explained.
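
The lecture examples below use RapidMiner, but a minimal C5.0 sketch is possible in R with the C50 package (assumed installed); the iris data is used purely for illustration:

library(C50)
model <- C5.0(Species ~ ., data = iris)
summary(model)                 # prints the tree and the training error
predict(model, iris[1:5, ])    # predicted classes for the first five rows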




Example in RapidMiner



K-Nearest Neighbour

  • 15/06: Recorded class - K-Nearest Neighbour



Boosting


Gradient boosting

https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

https://freakonometrics.hypotheses.org/tag/gradient-boosting

https://en.wikipedia.org/wiki/Gradient_boosting

https://www.researchgate.net/publication/326379229_Exploring_the_clinical_features_of_narcolepsy_type_1_versus_narcolepsy_type_2_from_European_Narcolepsy_Network_database_with_machine_learning

http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html


Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant (1988, 1989): "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. https://en.wikipedia.org/wiki/Gradient_boosting

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.


Boosting is a sequential process; i.e., trees are grown using the information from a previously grown tree one after the other. This process slowly learns from data and tries to improve its prediction in subsequent iterations. Let's look at a classic classification example:


How does boosting work.png


Four classifiers (in 4 boxes), shown above, are trying hard to classify + and - classes as homogeneously as possible. Let's understand this picture well:

  • Box 1: The first classifier creates a vertical line (split) at D1. It says anything to the left of D1 is + and anything to the right of D1 is -. However, this classifier misclassifies three + points.
  • Box 2: The next classifier says don't worry I will correct your mistakes. Therefore, it gives more weight to the three + misclassified points (see the bigger size of +) and creates a vertical line at D2. Again it says, anything to the right of D2 is - and left is +. Still, it makes mistakes by incorrectly classifying three - points.
  • Box 3: The next classifier continues to bestow support. Again, it gives more weight to the three - misclassified points and creates a horizontal line at D3. Still, this classifier fails to classify the points (in a circle) correctly.
  • Remember that each of these classifiers has a misclassification error associated with it.
  • Boxes 1,2, and 3 are weak classifiers. These classifiers will now be used to create a strong classifier Box 4.
  • Box 4: It is a weighted combination of the weak classifiers. As you can see, it does a good job of classifying all the points correctly.

That's the basic idea behind boosting algorithms. The very next model capitalizes on the misclassification/error of the previous model and tries to reduce it.
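A minimal sketch of this idea using scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision stump; the synthetic dataset is purely illustrative:

 from sklearn.datasets import make_classification
 from sklearn.ensemble import AdaBoostClassifier
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import accuracy_score

 # Synthetic binary classification data
 X, y = make_classification(n_samples=500, n_features=10, random_state=42)
 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

 # Each weak learner (a decision stump by default) is fitted in sequence;
 # examples misclassified by earlier stumps receive higher weights,
 # and the final prediction is a weighted combination of all stumps.
 boosted = AdaBoostClassifier(n_estimators=50, random_state=42)
 boosted.fit(X_train, y_train)

 print("Accuracy:", accuracy_score(y_test, boosted.predict(X_test)))

Gradient boosting follows the same sequential idea, but each new tree is fitted to the errors of the current ensemble, allowing an arbitrary differentiable loss to be optimised (see GradientBoostingClassifier in the same library).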



Perceptrons - Neural Networks and Support Vector Machines

  • 22/06: Recorded class - The Perceptron Algorithm - Support Vector Machines & Neural Networks



Association Rules - Market Basket Analysis


Association Rules example in RapidMiner

Ensure you have the Weka extension installed. To do this, click on Get more operators from the marketplace at the bottom left of the RapidMiner window.


This kind of analysis is done with the goal of discovering patterns in data.



Clustering

Clustering is the task of finding groups of data that are similar when no class label is available.


This is a type of unsupervised learning because there is no training stage. Also, because it is unsupervised learning there is no "ground truth", so the results are frequently subjective.


Clustering can be used as an exploratory technique to discover naturally occurring groups that can be later used in classification.


X-means clustering is a development of k-means that refines the cluster assignment by using an information criterion, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), to keep the best splits.


Unlike supervised learning and in common with all unsupervised approaches, a clustering algorithm runs on the whole data set. There is no train/test split.

It creates cluster labels, usually just a, b, c, ... or 1, 2, 3, ..., and assigns each observation to one cluster label (exclusive clustering) or to one or more cluster labels (fuzzy clustering). As such, there is no intrinsic meaning to the cluster labels.


The assignment of an observation to a cluster label is inferred from some similarity (or dissimilarity) measure.


No model is generated, so if we obtain new data we have to go through the whole process again from the beginning.


For example, let's say that we have a list of customers and we want to divide them into a few groups. In this case, we can use a clustering algorithm to try to find patterns of groups (the best way to separate our customers).
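A minimal sketch of such a customer segmentation with scikit-learn's KMeans; the two features (annual spend and number of orders) and their values are invented for illustration:

 import numpy as np
 from sklearn.cluster import KMeans
 from sklearn.preprocessing import StandardScaler

 # Hypothetical customers: [annual spend, number of orders]
 customers = np.array([
     [200, 2], [250, 3], [300, 4], [1100, 12],
     [1200, 10], [2200, 25], [2300, 28], [2400, 30],
 ])

 # Scale the features so neither dominates the distance measure
 X = StandardScaler().fit_transform(customers)

 # Ask for k = 3 groups; there is no train/test split, the whole data set is used
 kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
 labels = kmeans.fit_predict(X)
 print(labels)  # the cluster label (0, 1 or 2) assigned to each customer

If new customers arrive, the clustering would normally be rerun on the extended data set, which is consistent with the point above that no reusable model is generated.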



RapidMiner example 1


  • We can try with different values of k (number of clusters): 2, 3, 4, 5 and compare performances.



Time Series Analysis

E-Lecture: https://moodle.cct.ie/mod/scorm/view.php?id=61932



Text Analytics / Mining



Model Evaluation

E-learning link: https://moodle.cct.ie/mod/scorm/player.php?a=5&currentorg=tuto&scoid=10&sesskey=4EXk0T1DT7&display=popup&mode=normal



Why evaluate models

When we build machine learning models, whether for classification or regression, we need some indication of how the model will perform on previously unseen data. We need a measure of model quality.


Also, when we build multiple models of different types (Naïve Bayes and Decision Tree, for example), we need a means of comparing the performance of the models.



Evaluation of regression models

Understanding Regression Error Metrics in Python: https://www.dataquest.io/blog/understanding-regression-error-metrics/

Regression Error

The evaluation of regression models involves calculations on the errors (also known as residuals or innovations).

Errors are the differences between the predicted values, represented as \hat{y}_i, and the actual values, denoted y_i.

Regression errors.png

 Actual   Predicted   Absolute error
 5        6           1
 6.5      5.5         1
 8        9.5         1.5
 8        6           2
 7.5      10          2.5
Mean Absolute Error - MAE
The Mean Absolute Error (MAE) is calculated by taking the sum of the absolute differences between the actual and predicted values (i.e. the errors with the sign removed) and multiplying it by the reciprocal of the number of observations.
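In symbols (writing y_i for the actual value, \hat{y}_i for the predicted value and n for the number of observations, which is the usual convention):

 \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|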

Note that the value returned by the equation depends on the range of the values in the dependent variable; it is scale dependent.

MAE is preferred by many as the evaluation metric of choice as it gives equal weight to all errors, irrespective of their magnitude.


Mean Squared Error - MSE

The Mean Squared Error (MSE) is very similar to the MAE, except that it is calculated by taking the sum of the squared differences between the actual and predicted values and multiplying it by the reciprocal of the number of observations. Note that squaring the differences also removes their sign.
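Using the same notation as for the MAE:

 \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2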


As with MAE, the value returned by the equation is dependent on the range of the values in the dependent variable. It is scale dependent.


Root Mean Squared Error

The Root Mean Squared Error (RMSE) is essentially the square root of the MSE: it is calculated by taking the square root of the mean of the squared differences between the actual and predicted values.
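Again with the same notation:

 \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}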


As with MAE and MSE, the value returned by the equation is dependent on the range of the values in the dependent variable. It is scale dependent.


MSE and its related metric, RMSE, have been both criticized because they both give heavier weight to larger magnitude errors (outliers). However, this property may be desirable in some circumstances, where large magnitude errors are undesirable, even in small numbers.
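As a quick sketch, the three metrics can be computed with scikit-learn on the five actual/predicted pairs from the table above (swapping which column is treated as "actual" would not change these particular values, since the errors enter only through their magnitude):

 import numpy as np
 from sklearn.metrics import mean_absolute_error, mean_squared_error

 actual    = np.array([5.0, 6.5, 8.0, 8.0, 7.5])
 predicted = np.array([6.0, 5.5, 9.5, 6.0, 10.0])

 mae  = mean_absolute_error(actual, predicted)   # 1.6
 mse  = mean_squared_error(actual, predicted)    # 2.9
 rmse = np.sqrt(mse)                             # about 1.70
 print(mae, mse, rmse)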


Relative Error

The relative error (also known as approximation error) is an average measure of the difference between an actual value and the estimate of that value. It is given by the average of the absolute difference between the values divided by the actual value.


Check this formula, because I think there is an error in the professor's slide.


Mean Absolute Percentage Error

Mean Absolute Percentage Error (MAPE) is a scale-independent measure of the performance of a regression model. It is calculated by summing the absolute value of the difference between the actual value and the predicted value, divided by the actual value. This sum is then multiplied by the reciprocal of the number of observations and by 100 to obtain a percentage.
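In symbols (same notation as above, with the result expressed as a percentage):

 \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|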


Although it offers a scale-independent measure, MAPE is not without problems:

  • It cannot be employed if any of the actual values are exactly zero, as this would result in a division-by-zero error.
  • Where predicted values frequently exceed the actual values, the percentage error can exceed 100%
  • It penalizes negative errors more than positive errors, meaning that models that routinely predict below the actual values will have a higher MAPE.


R squared

R², or the Coefficient of Determination, is the ratio of the amount of variance explained by a model to the total amount of variance in the dependent variable, and lies in the range [0,1].


Values close to 1 indicate that a model will be better at predicting the dependent variable.


R squared is calculated by summing the squared differences between the predicted values and the actual values (the top part of the equation) and dividing that by the sum of squared deviations of the actual values from their mean (the bottom part of the equation). The resulting value is then subtracted from 1.
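In symbols (with \bar{y} denoting the mean of the actual values):

 R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}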


A high R² is not necessarily an indicator of a good model, as it could be the result of overfitting.


Spearman’s rho

Spearman's rho is a measure of the monotonic relationship between two variables. Although similar to Pearson's correlation, it differs in that the value is calculated after the numeric values are replaced with their ranks.


Converting the values to ranks results in the smallest value having a rank of 1, the second smallest a rank of 2, and so on. The same ranking is applied to both variables. A standard Pearson's correlation is then carried out on the ranked data.


...

Given the data in the table below:

 x    y
 7    2
 3    5
 9    11
 11   10

After ranking, the data would be:

 rank(x)   rank(y)
 2         1
 1         2
 3         4
 4         3

The Pearson correlation between the rankings is then ##, which is the value of Spearman's rho.
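A quick check of this example with SciPy, which ranks the values and then computes Pearson's correlation on the ranks internally:

 from scipy.stats import spearmanr

 x = [7, 3, 9, 11]
 y = [2, 5, 11, 10]

 rho, p_value = spearmanr(x, y)
 print(rho)  # 0.6 for this data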



Evaluation of classification models


Confusion Matrix or Coincidence Matrix
ConfusionMatrix.png
ConfusionMatrix-Example.png
Accuracy
This is the number of examples correctly predicted as a fraction of the total number.
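In terms of the confusion matrix counts (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives):

 \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}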


Balanced Accuracy

If the balance in your response variable is close to perfect, i.e. if the numbers of examples for each class to be predicted are close to each other, and if your emphasis is on the number of correct predictions, then accuracy is an appropriate metric. However, if your dataset exhibits class imbalance, accuracy is likely to give misleading results. In such cases, Balanced Accuracy is likely to give a much better indication of how well classes are being predicted.
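For a binary problem, balanced accuracy is commonly taken to be the mean of sensitivity and specificity (both defined below); this formula follows that standard definition rather than anything specific to the slides:

 \mathrm{Balanced\ Accuracy} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}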


In fact, accuracy is often not a good measure of the performance of a model. Take the example of predicting a nasty, but treatable, illness. 1 in every 10,000 people has some disposition to the illness. If we detect it, it is treatable; if not, it is fatal.

If we assume our classifier always predicts 'no', because it is lazy and doesn't take the data into account, it will be correct 99.99% of the time. So it will have 99.99% accuracy.

Such a classifier is clearly not doing what it was designed to do, and because it fails to detect the condition of interest, it is, therefore, worse than useless.

This is a problem of class imbalance: when one or more classes are (often massively) more prevalent than others.


For reasons such as this, we need other notions of performance and quality for Data Mining and Machine learning methods.


Sensitivity and Specificity

Sensitivity: Proportion of positive examples correctly classified.


Specificity: Proportion of negative examples correctly classified
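In terms of the confusion matrix counts:

 \mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}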


Classification is often a balance between conservative and aggressive decision making.

For example, we could predict that everybody has the fatal disease or we could predict that nobody has the disease. Sensitivity and Specificity capture this trade-off. These terms come from the medical domain.


Precision and Recall

These are very closely related to sensitivity and specificity; but whereas the former come from the medical domain, these come from the domain of information retrieval. As with sensitivity and specificity, for most real-world problems it is difficult for a model to be highly precise and also exhibit high recall.


Precision:

Otherwise termed the positive predictive value, precision is the proportion of predicted positive examples that are truly positive. High precision means that only very likely positives are predicted as positive. Precise models are trustworthy.

For the fatal disease case, high precision means that those identified as sufferers really are sufferers.


Recall:

Recall is a measure of how complete the results are.

Basically the same as sensitivity, but with a subtle difference in interpretation.

High recall means capturing a large portion of the positive examples.

For predicting the fatal disease, high recall means that the majority of those who have the disease are identified.
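In terms of the confusion matrix counts:

 \mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN}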


...




The F1-Score

The F1-Score (also called the F-Score or the F-measure) is a way to combine both precision and recall into a single measure. It is a value in the range [0,1], with 1 indicating perfect precision and recall.


This makes it easier to compare models, but it does not address the trade-off between precision and recall as it regards them to be equally important.

The F1-Score uses the harmonic mean instead of the arithmetic mean. The harmonic mean places a higher emphasis on the lower of the two values, so both precision and recall must be reasonably high to achieve a high F1-Score; it also only takes the positive class into account (true negatives play no part).
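The usual formulation is:

 F_1 = 2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}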


We could assign weights to the precision or recall elements of the F1-Score, but it is difficult to do this without the weights being arbitrary.

Instead of weighting the F1-Score, we can use it in combination with other more globally encapsulating measures of a model's strengths and weaknesses.


Matthews Correlation Coefficient

The F1-Score is adequate as a metric when precision and recall are considered equally important, or when the relative weighting between the two can be determined non-arbitrarily.

An alternative for cases where that does not apply is the Matthews Correlation Coefficient (MCC). It returns a value in the interval [-1, 1], where -1 suggests total disagreement between predicted and actual values, 0 indicates that any agreement is the product of random chance, and +1 suggests perfect prediction.


So, if the value is -1, every instance that is true will be predicted as false and every instance that is false will be predicted as true. If the value is +1, every instance that is true will be predicted as true and every instance that is false will be predicted as false.


Unlike any of the metrics we have seen in previous slides, the Matthews Correlation coefficient takes into account all four categories in the confusion matrix.
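The standard formula, using all four confusion matrix counts, is:

 \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}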


Cohen's Kappa

Cohen's Kappa is a measure of the amount of agreement between two raters classifying N items into C mutually-exclusive categories.


It is defined by the equation below, where p_o is the observed agreement between raters and p_e is the hypothetical agreement that would be expected to occur by random chance:

 \kappa = \frac{p_o - p_e}{1 - p_e}


Landis and Koch (1977) suggest an interpretation of the magnitude of the results as follows:


Less than 0 = poor; 0.00–0.20 = slight; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = substantial; 0.81–1.00 = almost perfect agreement.


  • Calculate p_o:

    The agreement on the positive class is 72 instances and on the negative class is 24 instances, so the observed agreement is 96 instances out of a total of 120: p_o = 96/120 = 0.8. Note this is the same as the accuracy.

  • Calculate the probability of random agreement on the «positive» class:

The probability that both actual and predicted would agree on the positive class at random is the product of the proportions of the total that the positive class makes up in the actual labels and in the predicted labels.

For the actual class, this is:

For the predicted class this is:

The total probability that both actual and predicted will randomly agree on the positive class is

  • Calculate the probability of random agreement on the «negative» class:

The probability that both actual and predicted would agree on the negative class at random is the product of the proportions of the total that the negative class makes up in the actual labels and in the predicted labels.

For the actual class, this is

For the predicted class this is

The total probability that both actual and predicted will randomly agree on the negative class is

  • Calculate p_e:

The probability p_e is simply the sum of the two probabilities calculated above:

  • Calculate kappa:

This indicates 'fair agreement' according to the scale suggested by Landis and Koch (1977).
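A minimal sketch with scikit-learn; the two label vectors below are hypothetical toy data, not the 120 instances from the example above:

 from sklearn.metrics import cohen_kappa_score

 # Hypothetical actual and predicted class labels
 actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg']
 predicted = ['pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg']

 print(cohen_kappa_score(actual, predicted))  # 0.5 -> 'moderate' on the Landis and Koch scale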

The Receiver Operating Characteristic Curve

The Receiver Operating Characteristic (ROC) curve has its origins in radio signal detection, but in this context it is a method to visually evaluate the performance of a classifier. It is a 2D plot with the false positive rate on the x-axis and the true positive rate on the y-axis.


There are 4 key points on a ROC curve:

  • (0,0): the classifier never predicts positive
  • (1,1): the classifier always predicts positive
  • (0,1): a perfect classifier that never issues a false positive
  • The line y = x: random classification (a coin toss); the standard baseline


Any classifier is:

  • better the closer it is to the point (0,1)
  • conservative if it is on the left-hand side of the graph
  • liberal if it is on the upper right of the graph


To create a ROC curve we do the following:

  • Rank the predictions of a classifier by confidence in (or probability of) correct classification
  • Order them (highest first)
  • Plot each prediction's impact on the true positive rate and false-positive rate.


Classifiers are considered conservative if they make positive classifications in the presence of strong evidence, so they make fewer false-positive errors, typically at the cost of low true positive rates.

Classifiers are considered liberal if they make positive classifications with weak evidence so they classify nearly all positives correctly, typically at the cost of high false-positive rates.


Many real-world data sets are dominated by negative instances. The left-hand side of the ROC curve is, therefore, the more interesting region.

ROC curve.png
The Area Under the ROC Curve - AUC

Although the ROC curve can provide a quick visual indication of the performance of a classifier, it can be difficult to interpret.


It is possible to reduce the curve to a meaningful number (a scalar) by computing the area under the curve.


AUC falls in the range [0,1], with 1 indicating a perfect classifier, 0.5 a classifier no better than a random choice and 0 a classifier that predicts everything incorrectly.


A convention for interpreting AUC is:

  • 0.9 - 1.0 = A (outstanding)
  • 0.8 - 0.9 = B (excellent / good)
  • 0.7 - 0.8 = C (acceptable / fair)
  • 0.6 - 0.7 = D (poor)
  • 0.5 - 0.6 = F (no discrimination)


Note that ROC curves with similar AUCs may be shaped very differently, so the AUC can be misleading and shouldn't be computed without some qualitative examination of the ROC curve itself.
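A minimal sketch of plotting a ROC curve and computing the AUC with scikit-learn; the logistic regression model and the synthetic data are purely illustrative:

 import matplotlib.pyplot as plt
 from sklearn.datasets import make_classification
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import roc_curve, roc_auc_score

 X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

 model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
 scores = model.predict_proba(X_test)[:, 1]  # confidence in the positive class

 fpr, tpr, thresholds = roc_curve(y_test, scores)
 print("AUC:", roc_auc_score(y_test, scores))

 plt.plot(fpr, tpr, label="classifier")
 plt.plot([0, 1], [0, 1], linestyle="--", label="random baseline (y = x)")
 plt.xlabel("False positive rate")
 plt.ylabel("True positive rate")
 plt.legend()
 plt.show()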


...



References

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-174. DOI: 10.2307/2529310.



Python for Data Science


NumPy and Pandas


Data Visualization with Python


Text Analytics in Python


Dash - Plotly


Scrapy


R


R tutorial


RapidMiner



Assessments




Diploma in Predictive Data Analytics assessment

  • Assessment brief: /home/adelo/1-system/1-disco_local/1-mis_archivos/.stockage/desktop-dis/it_cct/Diploma_in_Predictive_Data_Analytics/0-PredictiveAnalyticsProject.pdf


  • Possible sources of data for the project
https://moodle.cct.ie/mod/page/view.php?id=61395


  • User Review Datasets
https://kavita-ganesan.com/user-review-datasets/#.Xw-CWXVKhaQ
http://www.cs.cornell.edu/people/pabo/movie-review-data/



Notes

  • There is an error on slide 41. MAE = here (see the recording)