Difference between revisions of "Página de pruebas 3"

From Sinfronteras
{{Sidebar}}


<html><button class="averte" onclick="aver()">aver</button></html>


<html>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script>
function aver() {
  // Build the edit URL and strip the HTML-encoded "amp;" left over from the wikitext.
  var link = "http://wiki.sinfronteras.ws/index.php?title=P%C3%A1gina_de_pruebas_3+&+action=edit";
  var link2 = link.replace("amp;", "");
  window.location = link2;
  // sleep() does not exist in JavaScript; schedule the style change instead.
  // (It only takes effect if the navigation above does not complete first.)
  setTimeout(function() {
    window.document.getElementById('firstHeading').style.color = "red";
  }, 2000);
}
$(document).ready(function() {
  $('#totalItems, #enteredItems').keyup(function() {
    window.document.getElementById('firstHeading').style.color = "red";
  });
  window.document.getElementById('firstHeading').style.color = "red";
});
</script>
</html>
<br />
==Projects portfolio==


<br />
==Data Analytics courses==


<br />
==Possible sources of data==


<br />
==What is data==


<br />
===Qualitative vs quantitative data===


<br />
====Discrete and continuous data====


<br />
===Structured vs Unstructured data===


<br />
===Data Levels and Measurement===


<br />
===What is an example===


<br />
===What is a dataset===


<br />
===What is Metadata===


<br />
==What is Data Science==


<br />
===Supervised Learning===


<br />
===Unsupervised Learning===


<br />
===Reinforcement Learning===


<br />
==Some real-world examples of big data analysis==


<br />
==Statistics==


<br />
==Descriptive Data Analysis==


<br />
===Central tendency===


<br />
====Mean====


<br />
=====When not to use the mean=====


<br />
====Median====


<br />
====Mode====


<br />
====Skewed Distributions and the Mean and Median====


<br />
====Summary of when to use the mean, median and mode====
measures-central-tendency-mean-mode-median-faqs.php


<br />
===Measures of Variation===


<br />
====Range====


<br />
====Quartile====


<br />
====Box Plots====


<br />
====Variance====


<br />
====Standard Deviation====


<br />
====Z Score====


<br />
===Shape of Distribution===


<br />
====Probability distribution====


<br />
=====The Normal Distribution=====


<br />
====Histograms====
 
<br />
====Skewness====


<br />
====Kurtosis====


<br />
====Visualization of measures of variation on a Normal distribution====


<br />
==Simple and Multiple regression==


<br />
===Correlation===


<br />
====Measuring Correlation====


<br />
=====Pearson correlation coefficient - Pearson's r=====


<br />
=====The coefficient of determination <math>R^2</math>=====


<br />
====Correlation <math>\neq</math> Causation====


<br />
====Testing the "generalizability" of the correlation====


<br />
===Simple Linear Regression===


<br />
===Multiple Linear Regression===


<br />
===RapidMiner Linear Regression examples===


<br />
==K-Nearest Neighbour==


<br />
==Decision Trees==


<br />
===The algorithm===


<br />
====Basic explanation of the algorithm====


<br />
====Algorithms addressed in Noel's Lecture====


<br />
=====The ID3 algorithm=====


<br />
=====The C5.0 algorithm=====


<br />
===Example in RapidMiner===


<br />
==Random Forests==
https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=4s


<br />
==Naive Bayes==


<br />
===Probability===


<br />
===Independent and dependent events===


<br />
===Mutually exclusive and collectively exhaustive===


<br />
===Marginal probability===
The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. https://en.wikipedia.org/wiki/Marginal_distribution


<br />
===Joint Probability===


<br />
===Conditional probability===


<br />
====Kolmogorov definition of Conditional probability====


<br />
====Bayes's theorem====


<br />
=====Likelihood and Marginal Likelihood=====


<br />
=====Prior Probability=====


<br />
=====Posterior Probability=====


<br />
===Applying Bayes' Theorem===


<br />
====Scenario 1 - A single feature====


<br />
====Scenario 2 - Class-conditional independence====


<br />
====Scenario 3 - Laplace Estimator====


<br />
===Naïve Bayes - Numeric Features===


<br />
===RapidMiner Examples===


<br />
==Perceptrons - Neural Networks and Support Vector Machines==


<br />
==Boosting==


<br />
===Gradient boosting===


<br />
==K Means Clustering==


<br />
===Clustering class of the Noel course===


<br />
====RapidMiner example 1====


<br />
==Principal Component Analysis PCA==


<br />
==Association Rules - Market Basket Analysis==


<br />
===Association Rules example in RapidMiner===


<br />
==Time Series Analysis==


<br />
==[[Text Analytics|Text Analytics / Mining]]==


<br />
==Model Evaluation==


<br />
===Why evaluate models===


<br />
===Evaluation of regression models===

{| class="wikitable"
|+
|-
! colspan="3" style="vertical-align:top;" |Regression Error:
The evaluation of regression models involves calculations on the errors (also known as residuals or innovations).
Errors are the differences between the predicted values, represented as <math>\hat{y}</math>, and the actual values, denoted <math>y</math>.
{|
![[File:Regression_errors.png|300px|center|link=Special:FilePath/Regression_errors.png]]
!
{| class="wikitable"
!<math>y</math>
!<math>\hat{y}</math>
!<math>\left \vert y - \hat{y} \right \vert</math>
|-
|5
|6
|1
|-
|6.5
|5.5
|1
|-
|8
|9.5
|1.5
|-
|8
|6
|2
|-
|7.5
|10
|2.5
|}
|}
|-
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Mean Absolute Error - MAE</h5>
|The Mean Absolute Error (MAE) is calculated by taking the sum of the absolute differences between the actual and predicted values (i.e. the errors with the sign removed) and multiplying it by the reciprocal of the number of observations.
Note that the value returned by the equation depends on the range of the values of the dependent variable: it is '''scale dependent'''.
MAE is preferred by many as the evaluation metric of choice, as it gives equal weight to all errors irrespective of their magnitude.
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
<math>
MAE = \frac{1}{n} \sum_{i=1}^{n} \left \vert Y_i - \hat{Y}_i \right \vert
</math>
<div class="mw-collapsible-content">
<br /><math>
MAE = \frac{1 + 1 + 1.5 + 2 + 2.5}{5} = \frac{8}{5} = 1.6
</math>
</div>
</div>
|-
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Mean Squared Error - MSE</h5>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
The Mean Squared Error (MSE) is very similar to the MAE, except that it is calculated by taking the sum of the squared differences between the actual and predicted values and multiplying it by the reciprocal of the number of observations. Note that squaring the differences also removes their sign.
<div class="mw-collapsible-content">
<br />
As with MAE, the value returned by the equation depends on the range of the values of the dependent variable. It is '''scale dependent'''.
</div>
</div>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
<math>
MSE = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2
</math>
<div class="mw-collapsible-content">
<br /><math>
MSE = \frac{1^2 + 1^2 + 1.5^2 + 2^2 + 2.5^2}{5} = \frac{14.5}{5} = 2.9
</math>
</div>
</div>
|-
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Root Mean Squared Error</h5>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
The Root Mean Squared Error (RMSE) is basically the same as MSE, except that it is calculated by taking the square root of the mean of the squared differences between the actual and predicted values.
<div class="mw-collapsible-content">
<br />
As with MAE and MSE, the value returned by the equation depends on the range of the values of the dependent variable. It is '''scale dependent'''.

MSE and its related metric, RMSE, have both been criticized because they give heavier weight to larger-magnitude errors (outliers). However, this property may be desirable in some circumstances, where large-magnitude errors are undesirable, even in small numbers.
</div>
</div>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
<math>
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 }
</math>
<div class="mw-collapsible-content">
<br /><math>
RMSE = \sqrt{2.9} \approx 1.70
</math>
</div>
</div>
|-
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">Mean Absolute Percentage Error</h5>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
Mean Absolute Percentage Error (MAPE) is a '''scale-independent''' measure of the performance of a regression model. It is calculated by summing the absolute values of the errors, each divided by its actual value, multiplying by the reciprocal of the number of observations, and then multiplying by 100 to obtain a percentage.
<div class="mw-collapsible-content">
<br />
Although it offers a scale-independent measure, MAPE is not without problems:
* It cannot be used if any of the actual values is exactly zero, as this would result in a division by zero.
* Where predicted values frequently exceed the actual values, the percentage error can exceed 100%.
* It penalizes negative errors more than positive errors, meaning that models that routinely predict below the actual values will have a higher MAPE.
</div>
</div>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
<math>
MAPE = \frac{1}{n} \sum_{i=1}^n \left \vert \frac{Y_i - \hat{Y}_i}{Y_i} \right \vert \times 100
</math>
<div class="mw-collapsible-content">
<br /><math>
MAPE = \frac{1}{5} \left( \frac{1}{5} + \frac{1}{6.5} + \frac{1.5}{8} + \frac{2}{8} + \frac{2.5}{7.5} \right) \times 100 \approx 22.49\%
</math>
</div>
</div>
|-
! style="vertical-align:top;" |<h5 style="text-align:left; vertical-align:top">R squared</h5>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
<math>R^2</math>, or the Coefficient of Determination, is the ratio of the amount of variance explained by a model to the total amount of variance in the dependent variable, and normally lies in the range [0,1].

Values close to 1 indicate that a model will be better at predicting the dependent variable.
<div class="mw-collapsible-content">
<br />
R squared is calculated by summing up the squared differences between the predicted values and the actual values (the top part of the equation) and dividing that by the squared deviations of the actual values from their mean (the bottom part of the equation). The resulting value is then subtracted from 1.

A high <math>R^2</math> is not necessarily an indicator of a good model, as it could be the result of overfitting.
</div>
</div>
|
<div class="mw-collapsible mw-collapsed" data-expandtext="+/-" data-collapsetext="+/-">
<math>
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
= 1 - \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2}
</math>
</div>
|}


<br />
===Evaluation of classification models===


<br />
===References===
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-174. DOI: 10.2307/2529310.


<br />
==[[Python for Data Science]]==


<br />
===[[NumPy and Pandas]]===


<br />
===[[Data Visualization with Python]]===


<br />
===[[Text Analytics in Python]]===


<br />
===[[Dash - Plotly]]===


<br />
===[[Scrapy]]===


<br />
==[[R]]==


<br />
===[[R tutorial]]===


<br />
==[[RapidMiner]]==


<br />
==Assessments==


<br />
===Diploma in Predictive Data Analytics assessment===


<br />
==Notes==


<br />
==References==
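The worked examples in the regression-error table under Model Evaluation can be cross-checked with a short script. This is a minimal sketch, not part of the original page: the variable names are illustrative, and the five <math>(y, \hat{y})</math> pairs are taken from the table.

```python
import math

# (actual, predicted) pairs from the Regression Error table.
y     = [5, 6.5, 8, 8, 7.5]
y_hat = [6, 5.5, 9.5, 6, 10]
n = len(y)

errors = [yi - yh for yi, yh in zip(y, y_hat)]

mae  = sum(abs(e) for e in errors) / n                          # 1.6
mse  = sum(e ** 2 for e in errors) / n                          # 2.9
rmse = math.sqrt(mse)                                           # about 1.70
mape = sum(abs(e / yi) for e, yi in zip(errors, y)) / n * 100   # about 22.49

# R squared: 1 - SS_res / SS_tot
y_bar  = sum(y) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, mape, r2)
```

Note that on this toy data <math>R^2</math> comes out negative, which simply means these five predictions do worse than always predicting the mean of <math>y</math>.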

Latest revision as of 21:50, 10 March 2021


