{{Sidebar}}

* 17/06: Recorded class - Correlation & Regression
:* https://drive.google.com/drive/folders/1TW494XF-laGGJiLFApz8bJfMstG4Yqk_

<html><button class="averte" onclick="aver()">aver</button></html>
<html>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script>
function aver() {
  // Highlight the page title, then open this page's edit view.
  window.document.getElementById('firstHeading').style.color = "red";
  // The wiki encoded the "&" in this URL as "&amp;", so strip the "amp;" residue before navigating.
  var link = "http://wiki.sinfronteras.ws/index.php?title=P%C3%A1gina_de_pruebas_3+&+action=edit";
  window.location = link.replace("amp;", "");
}
$(document).ready(function() {
    // Re-highlight the page title whenever these inputs change.
    $('#totalItems, #enteredItems').keyup(function() {
        window.document.getElementById('firstHeading').style.color = "red";
    });
    window.document.getElementById('firstHeading').style.color = "red";
});
</script>
</html>

<br />
==Projects portfolio==


<br />
==Data Analytics courses==


<br />
==Possible sources of data==


<br />
==What is data==


<br />
===Qualitative vs quantitative data===


<br />
====Discrete and continuous data====


<br />
===Structured vs Unstructured data===


<br />
===Data Levels and Measurement===


<br />
===What is an example===


<br />
===What is a dataset===


<br />
===What is Metadata===


<br />
==What is Data Science==


<br />
===Supervised Learning===


<br />
===Unsupervised Learning===


<br />
===Reinforcement Learning===


<br />
==Some real-world examples of big data analysis==


<br />
==Statistics==


<br />
==Descriptive Data Analysis==


<br />
===Central tendency===


<br />
====Mean====


<br />
=====When not to use the mean=====


<br />
====Median====


<br />
====Mode====


<br />
====Skewed Distributions and the Mean and Median====


<br />
====Summary of when to use the mean, median and mode====
measures-central-tendency-mean-mode-median-faqs.php


<br />
===Measures of Variation===


<br />
====Range====


<br />
====Quartile====


<br />
====Box Plots====


<br />
====Variance====


<br />
====Standard Deviation====


<br />
====Z Score====


<br />
===Shape of Distribution===


<br />
====Probability distribution====


<br />
=====The Normal Distribution=====


<br />
====Histograms====


<br />
====Skewness====


<br />
====Kurtosis====


<br />
====Visualization of measure of variations on a Normal distribution====


<br />
==Simple and Multiple regression==


<br />
===Correlation===
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. https://en.wikipedia.org/wiki/Correlation_and_dependence


Where moderate to strong correlations are found, we can use them to make a prediction about the value of one variable given what is known about the other variables.


The following are examples of correlations:
* There is a correlation between ice cream sales and temperature.
* Phytoplankton population at a given latitude and surface sea temperature.
* Blood alcohol level and the odds of being involved in a road traffic accident.


[[File:Correlation1.png|800px|thumb|center|]]


<br />
====Measuring Correlation====
The correlation coefficient, developed by Karl Pearson (1857-1936), provides a much more exact way of determining the type and degree of a linear correlation between two variables.


<br />
=====Pearson correlation coefficient - Pearson's r=====
Pearson's r, also known as the '''Pearson product-moment correlation coefficient''', is a measure of the strength of the relationship between two variables and is given by this equation:

<math>
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2\sum_{i=1}^n(y_i - \bar{y})^2}}
</math>


Where <math>\bar{x}</math> and <math>\bar{y}</math> are the means of the x (independent) and y (dependent) variables, respectively, and <math>x_i</math> and <math>y_i</math> are the individual observations of each variable.


'''The direction of the correlation:'''
* Values of Pearson's r range between -1 and +1.
* Values greater than zero indicate a positive correlation, with 1 being a perfect positive correlation.
* Values less than zero indicate a negative correlation, with -1 being a perfect negative correlation.


'''The degree of the correlation:'''
{| class="wikitable"
!Degree of correlation
!Interpretation
|-
|0.8 to 1.0
|Very strong
|-
|0.6 to 0.8
|Strong
|-
|0.4 to 0.6
|Moderate
|-
|0.2 to 0.4
|Weak
|-
|0.0 to 0.2
|Very weak or non-existent
|}
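
A minimal Python sketch of this formula; the <code>pearson_r</code> helper and the temperature/sales figures are invented for illustration, with NumPy's built-in <code>corrcoef</code> used only as a cross-check:

<syntaxhighlight lang="python">
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient, computed
    directly from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean: (x_i - x_bar)
    dy = y - y.mean()  # deviations from the mean: (y_i - y_bar)
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical data: daily temperature (C) vs ice cream sales (units)
temperature = [14, 16, 20, 23, 26, 30]
sales = [215, 325, 332, 406, 522, 614]

print(pearson_r(temperature, sales))          # ~0.97 -> very strong positive correlation
print(np.corrcoef(temperature, sales)[0, 1])  # NumPy's built-in gives the same value
</syntaxhighlight>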


<br />
=====The coefficient of determination <math>R^2</math>=====
The value <math>R^2</math> is termed the '''coefficient of determination''' because it measures the proportion of variance in the dependent variable that is determined by its relationship with the independent variables. It is calculated from two values:

* The total sum of squares: <math> SS_{tot} = \sum_{i=1}^n(y_i - \bar{y})^2 </math>
* The residual sum of squares: <math> SS_{res} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 </math>


The total sum of squares is the sum of the squared differences between the actual <math>y</math> values (<math>y_i</math>) and their mean. The residual sum of squares is the sum of the squared differences between the predicted <math>y</math> values (<math>\hat{y}_i</math>) and their respective actual values.

<math>
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
</math>


<math>R^2</math> is used to gain some idea of the '''goodness of fit''' of a model. It is a measure of how well the regression predictions approximate the actual data points. An <math>R^2</math> of 1 means that the predicted values perfectly fit the actual data.
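
A minimal Python sketch of this calculation; the data points are invented, and <code>np.polyfit</code> is used here only to produce the fitted line whose predictions we score:

<syntaxhighlight lang="python">
import numpy as np

def r_squared(y_actual, y_predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = ((y_actual - y_predicted) ** 2).sum()      # residual sum of squares
    ss_tot = ((y_actual - y_actual.mean()) ** 2).sum()  # total sum of squares
    return 1 - ss_res / ss_tot

# Hypothetical data: fit a least-squares line, then score its predictions
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y, 1)      # simple linear regression
print(r_squared(y, slope * x + intercept))  # close to 1 -> a very good fit
</syntaxhighlight>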


<br />
====Correlation <math>\neq</math> Causation====
Even if you find the strongest of correlations, you should never interpret it as more than just that... a correlation.


<blockquote>
Causation indicates a relationship between two events where one event is affected by the other. In statistics, when the value of one event, or variable, increases or decreases as a result of other events, it is said there is causation.


Let's say you have a job and get paid a certain rate per hour. The more hours you work, the more income you will earn, right? This means there is a relationship between the two events and also that a change in one event (hours worked) causes a change in the other (income). This is causation in action! https://study.com/academy/lesson/causation-in-statistics-definition-examples.html


Given any two correlated events A and B, the following relationships are possible:
* A causes B
* B causes A
* A and B are both the product of a common underlying cause, but do not cause each other
* Any relationship between A and B is simply the result of coincidence.
</blockquote>


Although a correlation between two variables could possibly indicate the presence of:
* a causal relationship between the variables in either direction (x causes y, or y causes x); or
* the influence of one or more confounding variables, that is, another variable that has an influence on both variables.


It can also indicate the absence of any connection. In other words, it can be entirely spurious, the product of pure chance. The following slides look at a few examples...


<br />
=====Examples=====
Causality or coincidence?


<div style="text-align: center;">
<pdf width="2000" height="600">File:Correlation_examples-Causality_vs_coincidence.pdf</pdf>
[[File:Correlation_examples-Causality_vs_coincidence.pdf]]
</div>


<br />
====Testing the "generalizability" of the correlation====
Having determined the value of the correlation coefficient ('''r''') for a pair of variables, you should next determine the '''likelihood''' that the value of '''r''' occurred purely by chance. In other words, what is the likelihood that the relationship in your sample reflects a real relationship in the population?


Before carrying out any test, the alpha (<math>\alpha</math>) level should be set. This is a measure of how willing we are to be wrong when we say that there is a relationship between two variables. A commonly-used <math>\alpha</math> level in research is 0.05.


An <math>\alpha</math> level of 0.05 means that you could possibly be wrong up to 5 times out of 100 when you state that there is a relationship in the population based on a correlation found in the sample.


In order to test whether the correlation in the sample can be generalised to the population, we must first identify the null hypothesis <math>H_0</math> and the alternative hypothesis <math>H_A</math>. This is a test against the population correlation coefficient (<math>\rho</math>), so these hypotheses are:

* <math> H_0 : \rho = 0 </math> - There is no correlation in the population
* <math> H_A : \rho \neq 0 </math> - There is correlation


Next, we calculate the value of the test statistic using the following equation:

<math>
t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
</math>


So for a correlation coefficient <math>r</math> of -0.8 (hence <math>r^2 = 0.64</math>) and a sample size of 102, this would be:

<math>
t^* = \frac{-0.8\sqrt{100}}{\sqrt{1 - 0.64}} = \frac{-8}{0.6} = -13.33
</math>


Checking the t-tables for an <math>\alpha</math> level of 0.05 and a two-tailed test (because we are testing if <math>\rho</math> is less than or greater than 0), with <math>n - 2 = 100</math> degrees of freedom, we get a critical value of approximately 1.984. As the absolute value of the test statistic (13.33) is greater than the critical value, we can reject the null hypothesis and conclude that there is likely to be a correlation in the population.
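
This test is easy to script. Below is a minimal Python sketch; the <code>correlation_t_test</code> helper is invented for illustration, and it assumes SciPy is available for the t-distribution lookup:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

def correlation_t_test(r, n, alpha=0.05):
    """Test H0: rho = 0 for a sample correlation r computed from n observations."""
    t_star = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)  # test statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)      # two-tailed critical value
    return t_star, t_crit, abs(t_star) > t_crit

# The worked example above: r = -0.8, n = 102
t_star, t_crit, reject = correlation_t_test(-0.8, 102)
print(round(t_star, 2), round(t_crit, 3), reject)  # -13.33 1.984 True -> reject H0
</syntaxhighlight>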


<br />
===Simple Linear Regression===


<br />
===Multiple Linear Regression===


<br />
===RapidMiner Linear Regression examples===


<br />
==K-Nearest Neighbour==


<br />
==Decision Trees==


<br />
===The algorithm===


<br />
====Basic explanation of the algorithm====


<br />
====Algorithms addressed in Noel's Lecture====


<br />
=====The ID3 algorithm=====


<br />
=====The C5.0 algorithm=====


<br />
===Example in RapidMiner===


<br />
==Random Forests==
https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=4s


<br />
==Naive Bayes==


<br />
===Probability===


<br />
===Independent and dependent events===


<br />
===Mutually exclusive and collectively exhaustive===


<br />
===Marginal probability===
The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. https://en.wikipedia.org/wiki/Marginal_distribution
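
As a quick illustration of marginal vs conditional probability, here is a minimal Python sketch; the Weather/Traffic joint-distribution table is invented for this example:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical joint distribution P(Weather, Traffic):
# rows = weather (sunny, rainy); columns = traffic (light, heavy)
joint = np.array([[0.40, 0.10],
                  [0.15, 0.35]])

p_weather = joint.sum(axis=1)  # marginal P(Weather): sum Traffic out
p_traffic = joint.sum(axis=0)  # marginal P(Traffic): sum Weather out

# Conditional probability: P(heavy | rainy) = P(rainy, heavy) / P(rainy)
p_heavy_given_rainy = joint[1, 1] / p_weather[1]

print(p_weather)            # [0.5 0.5]
print(p_traffic)            # [0.55 0.45]
print(p_heavy_given_rainy)  # 0.7
</syntaxhighlight>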


<br />
===Joint Probability===


<br />
===Conditional probability===


<br />
====Kolmogorov definition of Conditional probability====


<br />
====Bayes' theorem====


<br />
=====Likelihood and Marginal Likelihood=====


<br />
=====Prior Probability=====


<br />
=====Posterior Probability=====


<br />
===Applying Bayes' Theorem===


<br />
====Scenario 1 - A single feature====


<br />
====Scenario 2 - Class-conditional independence====


<br />
====Scenario 3 - Laplace Estimator====


<br />
===Naïve Bayes - Numeric Features===


<br />
===RapidMiner Examples===


<br />
==Perceptrons - Neural Networks and Support Vector Machines==


<br />
==Boosting==


<br />
===Gradient boosting===


<br />
==K Means Clustering==


<br />
===Clustering class of Noel's course===


<br />
====RapidMiner example 1====


<br />
==Principal Component Analysis PCA==


<br />
==Association Rules - Market Basket Analysis==


<br />
===Association Rules example in RapidMiner===


<br />
==Time Series Analysis==


<br />
==[[Text Analytics|Text Analytics / Mining]]==


<br />
==Model Evaluation==


<br />
===Why evaluate models===


<br />
===Evaluation of regression models===


<br />
===Evaluation of classification models===


<br />
===References===
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-174. DOI: 10.2307/2529310.


<br />
==[[Python for Data Science]]==


<br />
===[[NumPy and Pandas]]===


<br />
===[[Data Visualization with Python]]===


<br />
===[[Text Analytics in Python]]===


<br />
===[[Dash - Plotly]]===


<br />
===[[Scrapy]]===


<br />
==[[R]]==


<br />
===[[R tutorial]]===


<br />
==[[RapidMiner]]==


<br />
==Assessments==


<br />
===Diploma in Predictive Data Analytics assessment===


<br />
==Notas==


<br />
==References==
