Test page 3

1 Relational databases vs Non-relational databases
- 1.1 Relational Databases
- 1.2 Non-relational databases
- 1.3 Comparing the two
2 Projects portfolio
3 Data Analytics courses
4 Possible sources of data
5 What is data
- 5.1 Qualitative vs quantitative data
  - 5.1.1 Discrete and continuous data
- 5.2 Structured vs Unstructured data
- 5.3 Data Levels and Measurement
- 5.4 What is an example
- 5.5 What is a dataset
- 5.6 What is Metadata
6 What is Data Science
- 6.1 Supervised Learning
- 6.2 Unsupervised Learning
- 6.3 Reinforcement Learning
7 Some real-world examples of big data analysis
8 Statistics
9 Descriptive Data Analysis
- 9.1 Central tendency
- 9.2 Measures of Variation
- 9.3 Shape of Distribution
10 Simple and Multiple regression
- 10.1 Correlation
- 10.2 Simple Linear Regression
- 10.3 Multiple Linear Regression
- 10.4 RapidMiner Linear Regression examples
11 K-Nearest Neighbour
12 Decision Trees
- 12.1 The algorithm
  - 12.1.1 Basic explanation of the algorithm
  - 12.1.2 Algorithms addressed in Noel's Lecture
    - 12.1.2.1 The ID3 algorithm
    - 12.1.2.2 The C5.0 algorithm
- 12.2 Example in RapidMiner
13 Random Forests
14 Naive Bayes
- 14.1 Probability
- 14.2 Independent and dependent events
- 14.3 Mutually exclusive and collectively exhaustive
- 14.4 Marginal probability
- 14.5 Joint Probability
- 14.6 Conditional probability
  - 14.6.1 Kolmogorov definition of Conditional probability
  - 14.6.2 Bayes' theorem
- 14.7 Applying Bayes' Theorem
- 14.8 Naïve Bayes - Numeric Features
- 14.9 RapidMiner Examples
15 Perceptrons - Neural Networks and Support Vector Machines
16 Boosting
- 16.1 Gradient boosting
17 K Means Clustering
- 17.1 Clustering class of the Noel course
  - 17.1.1 RapidMiner example 1
18 Principal Component Analysis PCA
19 Association Rules - Market Basket Analysis
- 19.1 Association Rules example in RapidMiner
20 Time Series Analysis
21 Text Analytics / Mining
22 Model Evaluation
- 22.1 Why evaluate models
- 22.2 Evaluation of regression models
- 22.3 Evaluation of classification models
- 22.4 References
23 Python for Data Science
- 23.1 NumPy and Pandas
- 23.2 Data Visualization with Python
- 23.3 Text Analytics in Python
- 23.4 Dash - Plotly
- 23.5 Scrapy
24 R
- 24.1 R tutorial
25 RapidMiner
26 Assessments
- 26.1 Diploma in Predictive Data Analytics assessment
27 Notes
28 References

Relational databases vs Non-relational databases

https://www.jamesserra.com/archive/2015/08/relational-databases-vs-non-relational-databases/

Relational databases are also called relational database management systems (RDBMS) or SQL databases. The most popular of these are Microsoft SQL Server, Oracle Database and MySQL.

Non-relational databases are also called NoSQL databases. The most popular are MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j. These databases are usually grouped into four categories: key-value stores, graph stores, column stores, and document stores (see Types of NoSQL databases).

All relational databases can be used to manage transaction-oriented applications (online transaction processing, OLTP), and most non-relational databases in the document-store and column-store categories can also be used for OLTP. OLTP databases can be thought of as "operational" databases: they are characterized by frequent, short transactions that include updates and touch a small amount of data, and concurrency of thousands of transactions is very important (examples include banking applications and online reservations). Data integrity is critical, so they support ACID transactions (Atomicity, Consistency, Isolation, Durability). Data warehouses, by contrast, are considered "analytical" databases, characterized by long, complex queries that touch a large amount of data and require a lot of resources; updates are infrequent. An example is an analysis of sales over the past year.
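The atomicity part of ACID can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the `accounts` table, the account names and the `transfer` helper are hypothetical and not taken from the source. A transfer either applies both updates or, if a constraint fails, neither.

```python
import sqlite3

# Hypothetical two-account ledger, held in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to `bob` is executed first and then undone by the rollback, so no partial update remains visible.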

Relational databases usually work with structured data, while non-relational databases usually work with semi-structured data (e.g. XML, JSON).

Relational Databases

A relational database is organized based on the relational model of data, as proposed by E.F. Codd in 1970. This model organizes data into one or more tables (or "relations") of rows and columns, with a unique key for each row. Generally, each entity type that is described in a database has its own table with the rows representing instances of that type of entity and the columns representing values attributed to that instance. Since each row in a table has its own unique key, rows in a table can be linked to rows in other tables by storing the unique key of the row to which it should be linked (where such unique key is known as a "foreign key"). Codd showed that data relationships of arbitrary complexity can be represented using this simple set of concepts.
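As a minimal sketch of these concepts, the following uses Python's built-in sqlite3 module. The `customers` and `orders` tables and their contents are hypothetical examples, not part of the source. Each table has a unique key per row, and `orders.customer_id` is a foreign key that links an order to its customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# One table per entity type; each row has its own unique (primary) key.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "  # foreign key
    "item TEXT NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, "keyboard"), (11, 1, "monitor")],
)

# Follow the foreign key back to the customer with a join.
rows = conn.execute(
    "SELECT c.name, o.item FROM orders AS o "
    "JOIN customers AS c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()
print(rows)

# An order that points at a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (12, 99, 'mouse')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```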

Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.

The reasons for the dominance of relational databases are: simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data.

But to offer all of this, relational databases have to be incredibly complex internally. For example, a relatively simple SELECT statement could have dozens of potential query execution paths, which a query optimizer evaluates at run time. All of this is hidden from users, but under the hood the RDBMS determines the best “execution plan” to answer each request, typically using cost-based algorithms.
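One way to peek at the plan an optimizer chooses is SQLite's `EXPLAIN QUERY PLAN` statement, shown here through Python's sqlite3 module. This is a sketch under assumptions: the `orders` table, the index name `idx_orders_customer` and the `plan` helper are made up for illustration, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

query = "SELECT total FROM orders WHERE customer_id = 42"


def plan(conn, sql):
    """Return the human-readable steps SQLite chose for `sql`."""
    # The last column of each EXPLAIN QUERY PLAN row is the description text.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_without_index = plan(conn, query)

# Give the optimizer another access path and ask again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = plan(conn, query)

print(plan_without_index)
print(plan_with_index)
```

The same SELECT is answered differently once the index exists, without any change to the query text, which is the sense in which the execution plan is hidden from users.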

For large databases, especially ones used for web applications, the main concern is scalability. As more and more applications are created in environments that have massive workloads (e.g. Amazon), their scalability requirements can change very quickly and grow very large. Relational databases scale well, but usually only when that scaling happens on a single server (“scale-up”). When the capacity of that single server is reached, you need to “scale-out” and distribute that load across multiple servers, moving into so-called distributed computing. This is when the complexity of relational databases starts to cause problems with their potential to scale. If you try to scale to hundreds or thousands of servers the complexities become overwhelming.

Non-relational databases

A NoSQL database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases.

Motivations for this approach include:

Simplicity of design. There is no need to deal with the "impedance mismatch" between the object-oriented approach used to write applications and the schema-based tables and rows of a relational database. For example, all the customer order information can be stored in one document instead of being joined from many tables, which results in less code to write, debug, and maintain.

Better "horizontal" scaling to clusters of machines, which addresses the problem of the number of concurrent users skyrocketing for applications that are accessible via the web and mobile devices. Using documents makes it much easier to scale out, as all the information for a customer order is contained in one place instead of being spread across multiple tables. NoSQL databases natively and automatically spread data across an arbitrary number of servers (auto-sharding), without requiring application changes or even requiring the application to be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.

Finer control over availability. Servers can be added or removed without application downtime. Most NoSQL databases support data replication, storing multiple copies of data across the cluster or even across data centers, to ensure high availability and disaster recovery.

Easy capture of all kinds of data ("Big Data"), including unstructured and semi-structured data. This allows for a flexible database that can easily and quickly accommodate any new type of data and is not disrupted by changes in content structure. This is because document databases are schemaless, allowing you to freely add fields to JSON documents without having to first define changes (schema-on-read instead of schema-on-write). Documents can have a different number of fields than other documents. For example, a patient record may or may not contain fields that list allergies.

Speed. The data structures used by NoSQL databases (e.g. JSON documents) differ from those used by default in relational databases, making many operations faster in NoSQL than in relational databases because tables do not have to be joined. This comes at the cost of increased storage space due to duplication of data, but storage is now cheap enough that this is usually not an issue. In fact, most NoSQL databases do not even support joins.

Cost. NoSQL databases usually use clusters of cheap commodity servers, while RDBMS tend to rely on expensive proprietary servers and storage systems. Also, RDBMS licenses can be quite expensive, while many NoSQL databases are open source and therefore free.
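The schema-on-read idea from the list above can be sketched in plain Python with the standard json module. The patient documents and field names below are hypothetical, and a real document database would add storage, indexing and querying on top of this.

```python
import json

# Two hypothetical patient documents; only the second has an "allergies" field.
raw_documents = [
    '{"patient_id": 1, "name": "A. Smith"}',
    '{"patient_id": 2, "name": "B. Jones", "allergies": ["penicillin", "latex"]}',
]
patients = [json.loads(doc) for doc in raw_documents]

# Schema-on-read: the reader decides how to treat a missing field.
allergy_counts = {p["patient_id"]: len(p.get("allergies", [])) for p in patients}

# Adding a new field to one document needs no schema change elsewhere.
patients[0]["blood_type"] = "O+"
field_sets = [sorted(p) for p in patients]

print(allergy_counts)
print(field_sets)
```

The trade-off is that every reader has to handle missing or extra fields itself, since no schema guarantees their presence.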

The particular suitability of a given NoSQL database depends on the problem it must solve.

NoSQL databases are increasingly used in big data and real-time web applications. They became popular with the introduction of the web, when databases went from a max of a few hundred users on an internal company application to thousands or millions of users on a web application. NoSQL systems are also called “Not only SQL” to emphasize that they may also support SQL-like query languages.

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Barriers to the adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and huge existing investments in SQL. Also, most NoSQL stores lack true ACID transactions or only support transactions in certain circumstances and at certain levels (e.g., document level).

Comparing the two

One of the most severe limitations of relational databases is that each field can hold only a single value. In a bank example, each aspect of a customer’s relationship with the bank is stored as separate rows in separate tables: the customer’s master details are in one table, the account details in another, the loan details in yet another, investments in a different table, and so on. All these tables are linked to each other through relations such as primary keys and foreign keys.

Non-relational databases, specifically key-value stores, are radically different from this model. Key-value pairs allow you to store several related items in one “row” of data in the same table. We place the word “row” in quotes because a row here is not really the same thing as a row of a relational table. For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan and investment details. All data relating to one customer would be conveniently stored together as one record.
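The bank example can be sketched with a plain Python dict standing in for a key-value store. Everything here is an assumption made for illustration: the `put` and `get` helpers, the key format `customer:1001` and the record contents are hypothetical and do not reflect the API of any particular NoSQL product.

```python
import json

# A plain dict stands in for a key-value store: key -> serialized value.
store = {}


def put(key, record):
    """Serialize the whole record and store it under one key."""
    store[key] = json.dumps(record)


def get(key):
    """Fetch and deserialize the record stored under `key`."""
    return json.loads(store[key])


# All data for one customer lives together in a single record.
put(
    "customer:1001",
    {
        "name": "C. Rivera",
        "accounts": [{"type": "checking", "balance": 2500}],
        "loans": [{"type": "mortgage", "principal": 180000}],
        "investments": [],
    },
)

customer = get("customer:1001")  # one lookup, no joins
total_balance = sum(a["balance"] for a in customer["accounts"])
print(customer["name"], total_balance, len(customer["loans"]))
```

A single lookup returns the customer's accounts, loans and investments together, whereas the relational layout would need one table and one join per aspect.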

Projects portfolio

Data Analytics courses

Possible sources of data

What is data

Qualitative vs quantitative data

Discrete and continuous data

Structured vs Unstructured data

Data Levels and Measurement

What is an example

What is a dataset

What is Metadata

What is Data Science

Supervised Learning

Unsupervised Learning

Reinforcement Learning

Some real-world examples of big data analysis

Statistics

Descriptive Data Analysis

Central tendency

Mean

When not to use the mean

Median

Mode

Skewed Distributions and the Mean and Median

Summary of when to use the mean, median and mode

measures-central-tendency-mean-mode-median-faqs.php

Measures of Variation

Range

Quartile

Box Plots

Variance

Standard Deviation

Z Score

Shape of Distribution

Probability distribution

The Normal Distribution

Histograms

Skewness

Kurtosis

Visualization of measure of variations on a Normal distribution

Simple and Multiple regression

Correlation

Measuring Correlation

Pearson correlation coefficient - Pearson's r

The coefficient of determination $R^{2}$

Correlation $\neq$ Causation

Testing the "generalizability" of the correlation

Simple Linear Regression

Multiple Linear Regression

RapidMiner Linear Regression examples

K-Nearest Neighbour

Decision Trees

The algorithm

Basic explanation of the algorithm

Algorithms addressed in Noel's Lecture

The ID3 algorithm

The C5.0 algorithm

Example in RapidMiner

Random Forests

https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=4s

Naive Bayes

Probability

Independent and dependent events

Mutually exclusive and collectively exhaustive

Marginal probability

The marginal probability is the probability of a single event occurring, irrespective of the outcomes of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. https://en.wikipedia.org/wiki/Marginal_distribution
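The difference can be made concrete with a small table of counts. The numbers below are invented for illustration (100 hypothetical emails, classified as spam or ham, and whether they contain the word "offer"), and the two helper functions are only a sketch of the definitions above.

```python
from fractions import Fraction

# Hypothetical counts of 100 emails by (class, contains the word "offer").
counts = {
    ("spam", True): 15,
    ("spam", False): 5,
    ("ham", True): 10,
    ("ham", False): 70,
}
total = sum(counts.values())


def marginal_class(label):
    """P(class = label), summing over the other variable."""
    return Fraction(sum(n for (c, _), n in counts.items() if c == label), total)


def conditional_class_given_word(label, has_word):
    """P(class = label | word = has_word): joint count over count of the condition."""
    joint = counts[(label, has_word)]
    condition = sum(n for (_, w), n in counts.items() if w == has_word)
    return Fraction(joint, condition)


p_spam = marginal_class("spam")                                   # 20 of 100 emails
p_spam_given_offer = conditional_class_given_word("spam", True)   # 15 of the 25 with "offer"
print(p_spam, p_spam_given_offer)
```

With these counts the marginal probability of spam is 1/5, while the conditional probability of spam given that the word appears is 3/5: knowing that the condition holds restricts the calculation to the 25 emails that contain the word.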

Joint Probability

Conditional probability

Kolmogorov definition of Conditional probability

Bayes' theorem

Likelihood and Marginal Likelihood

Prior Probability

Posterior Probability

Applying Bayes' Theorem

Scenario 1 - A single feature

Scenario 2 - Class-conditional independence

Scenario 3 - Laplace Estimator

Naïve Bayes - Numeric Features

RapidMiner Examples

Perceptrons - Neural Networks and Support Vector Machines

Boosting

Gradient boosting

K Means Clustering

Clustering class of the Noel course

RapidMiner example 1

Principal Component Analysis PCA

Association Rules - Market Basket Analysis

Association Rules example in RapidMiner

Time Series Analysis

Text Analytics / Mining

Model Evaluation

Why evaluate models

Evaluation of regression models

Evaluation of classification models

References

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-174. DOI: 10.2307/2529310.