Difference between revisions of "Página de pruebas 3"

 
(663 intermediate revisions by the same user not shown)
Line 1: Line 1:
This is the dashboard we have just built to analyze laptops' data from Amazon.
+
{{Sidebar}}
  
It is currently displaying a dataset that includes laptops of different brands and series. Remember that this is a dataset that we have built by scraping data from Amazon, but the module that is intended to load a new dataset is not ready. We are currently working on it. So when this module is ready, the application is going to be able to scrape data from Amazon, from a page like this, in real time.
+
<html><button class="averte" onclick="aver()">aver</button></html>
  
The other page we are currently working on is the Sentiment analysis page. That is also a very important topic for the application, but it's not ready yet.
+
<html>
 +
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
 +
<script>
 +
function aver() {
 +
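  // Target URL of the edit view for this page
  var link = "http://wiki.sinfronteras.ws/index.php?title=P%C3%A1gina_de_pruebas_3+&+action=edit";
 +
  // Drop any "amp;" left over from HTML escaping of the ampersand
  var link2 = link.replace("amp;", "");
 +
  window.location = link2;
 +
  // Browser JavaScript has no blocking sleep(); the navigation above replaces this page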
 +
  window.document.getElementById('firstHeading').style.color = "red"
 +
}
 +
$(document).ready( function() {
 +
    $('#totalItems, #enteredItems').keyup(function(){
 +
        window.document.getElementById('firstHeading').style.color = "red"
 +
    }); 
 +
    window.document.getElementById('firstHeading').style.color = "red"
 +
});
 +
</script>
 +
</html>
  
So before showing the main features of the application, I wanted to show you how we are scraping the data from Amazon.
+
<br />
So this is the script we built with a Python framework called Scrapy. I'm going to run it. So here we are scraping the data from Amazon. It is quite fast; here we are just scraping about 35 laptops.
+
==Projects portfolio==
  
It has created this file, so let's see the file. It's a JSON file with the details of 35 computers, so we can see all the details of the laptops: the technical details and the reviews. So when this module is ready, it will run the script we have just seen and scrape the data from Amazon in real time.
 
  
So let's talk about the home page. So the home page has been designed to allow the user to discover and visualize the data. So you are able to customize the data you want to display by selecting the brand, series, and the range of prices.
+
<br />
 +
==Data Analytics courses==
  
So let's say that you want to analyze all the brands at the same time. When you select a brand, it automatically selects all the series for this brand, but then you can filter the brands you want. It takes a moment because it is processing the data.
 
  
Or, I don't know, maybe we are interested in expensive laptops. So let's select computers over 1000 dollars, for example. You can see that now the application only proposes computers in this range of prices; there aren't a lot, actually. You can see that they are gaming laptops, which are usually expensive.
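A minimal sketch of this kind of brand and price filtering, assuming each scraped laptop is a dictionary with hypothetical `brand`, `price` and `rating` fields. The records, the field names and the `filter_laptops` helper are illustrative assumptions, not the application's actual schema or code:

```python
# Hypothetical laptop records; field names and values are assumptions for illustration.
laptops = [
    {"brand": "Acer", "price": 1088.0, "rating": 4.3},
    {"brand": "Acer", "price": 349.0, "rating": 4.1},
    {"brand": "HP", "price": 1249.0, "rating": 4.0},
    {"brand": "Lenovo", "price": 429.0, "rating": 3.9},
]


def filter_laptops(records, brands=None, min_price=0.0, max_price=float("inf")):
    """Keep records whose brand is in `brands` (all brands if None)
    and whose price lies in [min_price, max_price]."""
    return [
        r for r in records
        if (brands is None or r["brand"] in brands)
        and min_price <= r["price"] <= max_price
    ]


# All brands, but only laptops priced at 1000 dollars or more.
expensive = filter_laptops(laptops, min_price=1000)

# A single brand, any price.
acer_only = filter_laptops(laptops, brands={"Acer"})

print([r["price"] for r in expensive])
print(len(acer_only))
```

Keeping the brand and price conditions in one function means the same call can serve both cases: every brand within a price range, or one brand with no price limit.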
+
<br />
 +
==Possible sources of data==
  
But in other cases it is better to analyze only one brand. Let's, for example, analyze Acer computers.
 
  
The first chart that we have included compares average customer reviews and prices. You can see that the blue bars show the values for all items, that is, for all the series of the brand,
+
<br />
but we have also included a red bar that displays the values for selected items only. For example, if you want to know the price of a specific computer, let's select this one... You can see that this is a very expensive one, $1088, and that it actually has a very good customer review score of 4.3, so it's apparently a very good computer.
+
==What is data==
  
The second panel we have included is a Bubble chart that shows the Average customer reviews vs. Prices.
 
  
We have included this chart because one of the main features that can be analyzed when talking about sales is the relationship between price and customer satisfaction. So with this kind of chart, we would try to determine a trend to establish a relationship between price and customer review.
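One common way to put a number on such a trend is the Pearson correlation coefficient. The sketch below is only an illustration: the prices and ratings are made-up values, and `pearson_r` is a hypothetical helper written for this example, not something taken from the application.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up prices (in dollars) and average review scores for four laptops.
prices = [349.0, 429.0, 799.0, 1088.0]
ratings = [4.1, 3.9, 4.2, 4.3]

r = pearson_r(prices, ratings)
print(f"Pearson r between price and rating: {r:.2f}")
```

A value near +1 or -1 would suggest a strong linear relationship between price and review score, and a value near 0 a weak one. With only a handful of laptops, the result should be read with caution.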
+
<br />
 +
===Qualitative vs quantitative data===
  
One nice feature of these charts is that you can select the brand that you want to visualize. If you click on a brand, that brand is excluded from the chart, but if you double-click, only the brand you clicked will be shown.
 
  
The other panel we have included is to visualize the most frequent words in customer reviews. Word clouds provide a nice visualization of the most frequent words. But if you need to be more precise, you can use the word count chart, which provides the exact number of times a word has been mentioned.
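The exact counts behind a word count chart can be sketched with Python's standard `collections.Counter`. The review texts, the small stop-word list and the `word_counts` helper below are assumptions made for illustration; they are not the application's data or code.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "is", "and", "for", "it", "this", "of", "to"}


def word_counts(reviews, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all review texts."""
    counts = Counter()
    for text in reviews:
        # Lowercase the text and keep runs of letters (and apostrophes) as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


# Made-up review snippets.
reviews = [
    "Great screen and good for gaming",
    "The screen is great",
    "Good laptop, the game runs well",
]

counts = word_counts(reviews)
for word, n in counts.most_common(3):
    print(word, n)
```

A word cloud draws on the same counts but maps them to font size, which is why the chart of exact counts is the better choice when precision matters.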
+
<br />
 +
====Discrete and continuous data====
  
So let's for example analyse the information provided by the wordcloud. We can see that some of the most frequent words in customer reviews are:
 
  
Words like "good" or "great" indicate that it is a computer that users have liked, but we already knew that customers liked this computer from the average customer review score, which is 4.3.
+
<br />
 +
===Structured vs Unstructured data===
  
But one piece of information that we didn't know, and that is provided by the word cloud through words like "gaming" or "game", is that this is a gaming laptop.
 
  
We finally found the word "screen", which is probably the word that provides the most important information from this word cloud. We can see that users are talking about the screen of this laptop, but we cannot be sure if they are saying something good or bad about the screen. We can actually infer that it is something good based on the good customer review score, or based on the other words present in the word cloud that convey a positive sentiment, like "great" and "good", but in the end we cannot be sure about what customers are saying about the screen. This is why there are other analyses that can bring more information, like sentiment analysis, which is the topic we are currently working on.
+
<br />
 +
===Data Levels and Measurement===
 +
 
 +
 
 +
<br />
 +
===What is an example===
 +
 
 +
 
 +
<br />
 +
===What is a dataset===
 +
 
 +
 
 +
<br />
 +
===What is Metadata===
 +
 
 +
 
 +
<br />
 +
==What is Data Science==
 +
 
 +
 
 +
<br />
 +
===Supervised Learning===
 +
 
 +
 
 +
 
 +
<br />
 +
===Unsupervised Learning===
 +
 
 +
 
 +
<br />
 +
===Reinforcement Learning===
 +
 
 +
 
 +
<br />
 +
==Some real-world examples of big data analysis==
 +
 
 +
 
 +
<br />
 +
==Statistic==
 +
 
 +
 
 +
<br />
 +
==Descriptive Data Analysis==
 +
 
 +
 
 +
<br />
 +
===Central tendency===
 +
 
 +
 
 +
<br />
 +
====Mean====
 +
 
 +
 
 +
<br />
 +
=====When not to use the mean=====
 +
 
 +
 
 +
<br />
 +
====Median====
 +
 
 +
 
 +
<br />
 +
====Mode====
 +
 
 +
 
 +
<br />
 +
====Skewed Distributions and the Mean and Median====
 +
 
 +
 
 +
<br />
 +
====Summary of when to use the mean, median and mode====
 +
measures-central-tendency-mean-mode-median-faqs.php
 +
 
 +
 
 +
<br />
 +
===Measures of Variation===
 +
 
 +
 
 +
<br />
 +
====Range====
 +
 
 +
 
 +
<br />
 +
====Quartile====
 +
 
 +
 
 +
<br />
 +
====Box Plots====
 +
 
 +
 
 +
 
 +
<br />
 +
====Variance====
 +
 
 +
 
 +
<br />
 +
====Standard Deviation====
 +
 
 +
 
 +
<br />
 +
==== Z Score ====
 +
 
 +
 
 +
<br />
 +
===Shape of Distribution===
 +
 
 +
 
 +
<br />
 +
====Probability distribution====
 +
 
 +
 
 +
<br />
 +
=====The Normal Distribution=====
 +
 
 +
 
 +
<br />
 +
====Histograms====
 +
 
 +
 
 +
<br />
 +
====Skewness====
 +
 
 +
 
 +
<br />
 +
====Kurtosis====
 +
 
 +
 
 +
<br />
 +
====Visualization of measure of variations on a Normal distribution====
 +
 
 +
 
 +
<br />
 +
==Simple and Multiple regression==
 +
 
 +
 
 +
<br />
 +
===Correlation===
 +
 
 +
 
 +
<br />
 +
====Measuring Correlation====
 +
 
 +
 
 +
<br />
 +
=====Pearson correlation coefficient - Pearson's r=====
 +
 
 +
 
 +
<br />
 +
=====The coefficient of determination <math>R^2</math>=====
 +
 
 +
 
 +
<br />
 +
====Correlation <math>\neq</math> Causation====
 +
 
 +
 
 +
<br />
 +
====Testing the "generalizability" of the correlation ====
 +
 
 +
 
 +
<br />
 +
===Simple Linear Regression===
 +
 
 +
 
 +
<br />
 +
===Multiple Linear Regression===
 +
 
 +
 
 +
<br />
 +
===RapidMiner Linear Regression examples===
 +
 
 +
 
 +
<br />
 +
==K-Nearest Neighbour==
 +
 
 +
 
 +
<br />
 +
==Decision Trees==
 +
 
 +
 
 +
<br />
 +
===The algorithm===
 +
 
 +
 
 +
<br />
 +
====Basic explanation of the algorithm====
 +
 
 +
 
 +
<br />
 +
====Algorithms addressed in Noel's Lecture====
 +
 
 +
 
 +
<br />
 +
=====The ID3 algorithm=====
 +
 
 +
 
 +
<br />
 +
=====The C5.0 algorithm=====
 +
 
 +
 
 +
<br />
 +
===Example in RapidMiner===
 +
 
 +
 
 +
<br />
 +
==Random Forests==
 +
https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=4s
 +
 
 +
 
 +
<br />
 +
==Naive Bayes==
 +
 
 +
 
 +
<br />
 +
===Probability===
 +
 
 +
 
 +
<br />
 +
===Independent and dependent events===
 +
 
 +
 
 +
<br />
 +
===Mutually exclusive and collectively exhaustive===
 +
 
 +
 
 +
<br />
 +
===Marginal probability===
 +
The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. https://en.wikipedia.org/wiki/Marginal_distribution
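As a small worked illustration of the difference, the sketch below computes a marginal and a conditional probability from a table of joint counts. The counts (100 emails, classified as spam or ham and by whether they contain the word "free") are invented for the example, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint counts of (class, contains_free) over 100 emails.
joint_counts = {
    ("spam", True): 30,
    ("spam", False): 10,
    ("ham", True): 20,
    ("ham", False): 40,
}
total = sum(joint_counts.values())


def marginal(value, position):
    """P(variable at `position` == value), summing over the other variable."""
    hits = sum(c for key, c in joint_counts.items() if key[position] == value)
    return Fraction(hits, total)


def conditional_free_given_class(contains_free, email_class):
    """P(contains_free | class) = P(class and contains_free) / P(class)."""
    joint = Fraction(joint_counts[(email_class, contains_free)], total)
    return joint / marginal(email_class, 0)


# Marginal: probability that an email contains "free", whatever its class.
p_free = marginal(True, 1)

# Conditional: probability that an email contains "free" given that it is spam.
p_free_given_spam = conditional_free_given_class(True, "spam")

print(p_free, p_free_given_spam)
```

With these invented counts the marginal is (30 + 20) / 100 = 1/2, while the conditional is 30 / (30 + 10) = 3/4: knowing that the email is spam changes the probability of seeing the word.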
 +
 
 +
 
 +
<br />
 +
===Joint Probability===
 +
 
 +
 
 +
<br />
 +
===Conditional probability===
 +
 
 +
 
 +
<br />
 +
====Kolmogorov definition of Conditional probability====
 +
 
 +
 
 +
<br />
 +
====Bayes' theorem====
 +
 
 +
 
 +
<br />
 +
=====Likelihood and Marginal Likelihood=====
 +
 
 +
 
 +
<br />
 +
=====Prior Probability=====
 +
 
 +
 
 +
<br />
 +
=====Posterior Probability=====
 +
 
 +
 
 +
<br />
 +
===Applying Bayes' Theorem===
 +
 
 +
 
 +
<br />
 +
====Scenario 1 - A single feature====
 +
 
 +
 
 +
<br />
 +
====Scenario 2 - Class-conditional independence====
 +
 
 +
 
 +
<br />
 +
====Scenario 3 - Laplace Estimator====
 +
 
 +
 
 +
<br />
 +
===Naïve Bayes -  Numeric Features===
 +
 
 +
 
 +
<br />
 +
===RapidMiner Examples===
 +
 
 +
 
 +
<br />
 +
==Perceptrons - Neural Networks and Support Vector Machines==
 +
 
 +
 
 +
<br />
 +
==Boosting==
 +
 
 +
 
 +
<br />
 +
===Gradient boosting===
 +
 
 +
 
 +
<br />
 +
==K Means Clustering==
 +
 
 +
 
 +
<br />
 +
===Clustering class of the Noel course===
 +
 
 +
 
 +
<br />
 +
====RapidMiner example 1====
 +
 
 +
 
 +
<br />
 +
==Principal Component Analysis PCA==
 +
 
 +
 
 +
<br />
 +
==Association Rules - Market Basket Analysis==
 +
 
 +
 
 +
<br />
 +
===Association Rules example in RapidMiner===
 +
 
 +
 
 +
<br />
 +
==Time Series Analysis==
 +
 
 +
 
 +
<br />
 +
==[[Text Analytics|Text Analytics / Mining]]==
 +
 
 +
 
 +
<br />
 +
==Model Evaluation==
 +
 
 +
 
 +
<br />
 +
===Why evaluate models===
 +
 
 +
 
 +
<br />
 +
===Evaluation of regression models===
 +
 
 +
 
 +
<br />
 +
===Evaluation of classification models===
 +
 
 +
 
 +
<br />
 +
===References===
 +
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-174. DOI: 10.2307/2529310.
 +
 
 +
 
 +
<br />
 +
==[[Python for Data Science]]==
 +
 
 +
 
 +
<br />
 +
===[[NumPy and Pandas]]===
 +
 
 +
 
 +
<br />
 +
===[[Data Visualization with Python]]===
 +
 
 +
 
 +
<br />
 +
===[[Text Analytics in Python]]===
 +
 
 +
 
 +
<br />
 +
===[[Dash - Plotly]]===
 +
 
 +
 
 +
<br />
 +
===[[Scrapy]]===
 +
 
 +
 
 +
<br />
 +
==[[R]]==
 +
 
 +
 
 +
<br />
 +
===[[R tutorial]]===
 +
 
 +
 
 +
<br />
 +
==[[RapidMiner]]==
 +
 
 +
 
 +
<br />
 +
==Assessments==
 +
 
 +
 
 +
<br />
 +
===Diploma in Predictive Data Analytics assessment===
 +
 
 +
 
 +
<br />
 +
==Notes==
 +
 
 +
 
 +
<br />
 +
==References==
 +
 
 +
 
 +
<br />

Latest revision as of 21:50, 10 March 2021





Contents

Projects portfolio


Data Analytics courses


Possible sources of data


What is data


Qualitative vs quantitative data


Discrete and continuous data


Structured vs Unstructured data


Data Levels and Measurement


What is an example


What is a dataset


What is Metadata


What is Data Science


Supervised Learning


Unsupervised Learning


Reinforcement Learning


Some real-world examples of big data analysis


Statistic


Descriptive Data Analysis


Central tendency


Mean


When not to use the mean


Median


Mode


Skewed Distributions and the Mean and Median


Summary of when to use the mean, median and mode

measures-central-tendency-mean-mode-median-faqs.php



Measures of Variation


Range


Quartile


Box Plots


Variance


Standard Deviation


Z Score


Shape of Distribution


Probability distribution


The Normal Distribution


Histograms


Skewness


Kurtosis


Visualization of measure of variations on a Normal distribution


Simple and Multiple regression


Correlation


Measuring Correlation


Pearson correlation coefficient - Pearson's r


The coefficient of determination R²


Correlation ≠ Causation


Testing the "generalizability" of the correlation


Simple Linear Regression


Multiple Linear Regression


RapidMiner Linear Regression examples


K-Nearest Neighbour


Decision Trees


The algorithm


Basic explanation of the algorithm


Algorithms addressed in Noel's Lecture


The ID3 algorithm


The C5.0 algorithm


Example in RapidMiner


Random Forests

https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=4s



Naive Bayes


Probability


Independent and dependent events


Mutually exclusive and collectively exhaustive


Marginal probability

The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. https://en.wikipedia.org/wiki/Marginal_distribution



Joint Probability


Conditional probability


Kolmogorov definition of Conditional probability


Bayes' theorem


Likelihood and Marginal Likelihood


Prior Probability


Posterior Probability


Applying Bayes' Theorem


Scenario 1 - A single feature


Scenario 2 - Class-conditional independence


Scenario 3 - Laplace Estimator


Naïve Bayes - Numeric Features


RapidMiner Examples


Perceptrons - Neural Networks and Support Vector Machines


Boosting


Gradient boosting


K Means Clustering


Clustering class of the Noel course


RapidMiner example 1


Principal Component Analysis PCA


Association Rules - Market Basket Analysis


Association Rules example in RapidMiner


Time Series Analysis


Text Analytics / Mining


Model Evaluation


Why evaluate models


Evaluation of regression models


Evaluation of classification models


References

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-174. DOI: 10.2307/2529310.



Python for Data Science


NumPy and Pandas


Data Visualization with Python


Text Analytics in Python


Dash - Plotly


Scrapy


R


R tutorial


RapidMiner


Assessments


Diploma in Predictive Data Analytics assessment


Notes


References