Difference between revisions of "RapidMiner"

[[File:Rapidminer_chart1.png|550px|thumb|center|See svg file here: http://perso.sinfronteras.ws/images/9/90/Rapidminer_chart1.svg]]
 
====Turbo Prep====

=====Introduction=====
https://rapidminer.wistia.com/medias/ar0409ddaw

=====Data Cleansing=====
https://rapidminer.wistia.com/medias/fui6gj5e8h

=====Merging Data=====
https://rapidminer.wistia.com/medias/1tdvusdi9q

=====Data Pivoting=====
https://rapidminer.wistia.com/medias/ribdg4pfg5
 
===Model and Validate===
https://rapidminer.com/training/videos/#model-validate

====Creating a Decision Tree Model====
https://rapidminer.wistia.com/medias/foaj0o4si9

====Applying the Model====
https://rapidminer.wistia.com/medias/d4mfrw6bt0

====Testing a Model====
https://rapidminer.wistia.com/medias/imjv4717hu

====Validating a Model====
https://rapidminer.wistia.com/medias/qwsalih5st

====Finding the right Model====
https://rapidminer.wistia.com/medias/qmjo3wp59c

====Optimization of the Model Parameters====
https://rapidminer.wistia.com/medias/qie4nlf3in

====Automate Model Selection and Optimization====
https://rapidminer.wistia.com/medias/626q0hgcis

====Auto Model====

=====Classification=====
https://rapidminer.wistia.com/medias/41ksc7dfx9

=====Clustering and Outliers=====
https://rapidminer.wistia.com/medias/bbhv1ttm6p
  

Revision as of 21:46, 26 June 2020


https://rapidminer.com/


The RapidMiner Book - Data Mining For The Masses: https://docs.rapidminer.com/downloads/DataMiningForTheMasses.pdf


RapidMiner is a data science software platform developed by the company of the same name that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the machine learning process, including data preparation, results visualization, model validation and optimization. RapidMiner is developed on an open core model. The RapidMiner Studio Free Edition, which is limited to 1 logical processor and 10,000 data rows, is available under the AGPL license. Commercial pricing starts at $2,500 and is available from the developer.



Installation

Download the package and follow the instructions on the official site: https://docs.rapidminer.com/latest/studio/installation/

./RapidMiner-Studio.sh

We will need to create a RapidMiner account.



Training Videos

https://rapidminer.com/training/videos/

The example used in this training is a data set containing information about the customers of a company. We want to predict which customers are most likely to cancel their membership. If we know which customers will probably leave, the company can take the necessary measures to encourage them to stay as customers.

To that end, our data shows historical information about customers that we already know to have been «loyal» or «churners». With this data we can build a model that predicts how the other customers, whose behavior we don't know yet, are likely to behave («loyal» or «churner»).



Introductions

https://rapidminer.com/training/videos/#introductions

Important terms: https://docs.rapidminer.com/latest/studio/getting-started/important-terms.html



GUI Intro

https://rapidminer.wistia.com/medias/dxnsrftr9i



The views


Design view

Work areas for specific tasks...

Process panel: used to design any process, such as:

  • Data loading
  • Forecasting
  • ...


  • To get started with a very simple process we can place an operator into the Process panel:
    • For example, we can go to the Data Access operators and drag and drop a «Retrieve» operator into the Process panel.
    • After doing this and selecting (clicking) the operator in the Process panel, the Parameters panel changes and lets us select, through the folder icon, the file we want to load. We can, for example, select the «Titanic» data set that comes pre-loaded in RapidMiner Studio.
    • Then, in order to run the process, we need to connect the port of the «Retrieve» operator to the result port.
    • Finally, we click the Run button (>) (or press F11). RapidMiner then executes the process and automatically opens the Results view, where the data set is displayed as a table by default.


Ports:



Results view

Work areas for specific tasks...



Auto Model view


Operators


Repository
  • Through the Repository panel you can access your data and processes.
  • When starting a project, it is recommended to create a new repository with two sub-folders: data and processes



Parameters


Global Search


Adding extensions - The RapidMiner Marketplace

https://rapidminer.wistia.com/medias/9nu4i7b5ea

  • To add extensions go to: Extensions > Marketplace:
    • Top Downloads: some of the most popular extensions.
    • The following extensions are recommended:
      • Text Processing
      • Web Mining
      • Python/R integration
      • Anomaly Detection
      • Series extension
      • RapidMiner Radoop


  • After an extension is installed, the «Extension» folder in the «Operators» panel shows a new folder for each installed extension.
  • Some extensions also add a new «View». For example, the «Radoop» extension adds the «Hadoop Data» view.


  • To manage and uninstall extensions go to: Extensions > Manage Extensions


  • Another way to install extensions is to go directly to the Marketplace website at: https://marketplace.rapidminer.com
    • Download the .jar file and place it in the «extension folder»: /home/adelo/.RapidMiner/extensions/
    • Restart RapidMiner and the extension will become available.



Data Preparation AND ETL

https://rapidminer.com/training/videos/#data-preparation



Importing data

https://rapidminer.wistia.com/medias/t3b4v3hceb

The data used in this video is available here: http://docs.rapidminer.com/studio/getting-started/customer-churn-data.xlsx

  1. We start our new project by creating a new repository with 2 sub-folders. Let's call it:
    • MyFirstPrediction
      • data
      • processes
  2. Then, to import the data we can click the «Import Data» button and look for the file, or we can just drag and drop the file into RapidMiner.
    • In the second step (format your columns) we can change some of the properties of the attributes (columns):
      • Change type: Real, Integer, etc
      • Rename column
      • Change Role: The default for each column is «General Attribute». We can change the role to: id, label, weight...



Data loading via a process

https://rapidminer.wistia.com/medias/fyfagg7rp9

  1. Define the read operator we want:
    • Go to the Operators panel. For this case we will use the «Read Excel» operator, so we drag and drop it into our Process panel.
  2. Go to the Parameters panel and click «Import configuration wizard»
    • Look for the file
    • After the file is loaded into the «Import configuration wizard» you can make the configuration you want:
      • Change the role of the Churn attribute to «Label».
  3. Connect the output port of the operator to the result port.
  4. Hit the Run process button (>) (F11)


  • It is very important to note the difference between this method of loading the data and the previous one, explained in the «Importing data» section. This method loads the data into RAM; it does not store it in the local repository.


  • Save the process into your local repository:
  • File > Save process: then select the folder, or
  • Go to the folder into the correct repository where you want to save the process: Right click > Store process here.


  • Import/Export processes: That is only to save the process in another folder (export) and then import the process to our Process panel
  • File > Import/Export process
  • After you import a process, it will be into the process panel but not saved into your local repository. You can save it by right clicking the processes folder in your local repository > store process here.


  • Delete a Repository:
  • When you delete it from the Repository panel, you are only deleting the link to it, not the files and folders inside it. If you want to delete them, you need to go to the file system and do it there (~/.RapidMiner/repositories).


  • We can also save the data into our repository through the process by adding a «Store» operator:
    • Drag and drop the «Store» operator into the Process panel (onto the line that connects the «Read Excel» operator and the result port).
    • Then, in the Parameters panel, click the folder icon and indicate where you want to store the data.
    • Then, hit the Run process button. That will store the data in your local repository.



Visualizing data

https://rapidminer.wistia.com/medias/w623uxkoga

  • Attribute = Column
    • Regular attributes
    • Special attributes:
      • Label: When an attribute is marked as «Label», it is the attribute that we want our model to learn to predict. We use the regular attributes to do so.
  • Example = Row
  • Example set = The entire data


  • Data tab:
    • When displaying data (in the Results view) we can sort the examples by an attribute by clicking on it: one click for ascending order, a second click for descending, and a third click to remove the sorting. By holding the Ctrl key we can sort by multiple attributes.
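
The sorting behavior described above can be sketched outside RapidMiner in plain Python. This is only an illustration of single-attribute and multi-attribute sorting; the rows and their values are made up, and nothing here reflects how the Results view is implemented.

```python
from operator import itemgetter

# Hypothetical examples (rows); the values are invented for illustration.
rows = [
    {"Name": "Ann", "Gender": "female", "Age": 34},
    {"Name": "Bob", "Gender": "male", "Age": 51},
    {"Name": "Cid", "Gender": "male", "Age": 27},
    {"Name": "Dee", "Gender": "female", "Age": 45},
]

# One click: ascending order by a single attribute.
by_age = sorted(rows, key=itemgetter("Age"))

# Second click: descending order by the same attribute.
by_age_desc = sorted(rows, key=itemgetter("Age"), reverse=True)

# Ctrl + click: sort by several attributes (Gender first, then Age).
by_gender_then_age = sorted(rows, key=itemgetter("Gender", "Age"))

print([row["Name"] for row in by_age])
print([row["Name"] for row in by_age_desc])
print([row["Name"] for row in by_gender_then_age])
```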


  • Statistics tab: RapidMiner does some automatic data discovery.
    • We can display the data in different chart styles (Histogram, Scatter, Pie, etc).
      • We can display multiple attributes in the same chart by clicking the attributes shown in the «Plots box» while holding the Ctrl key.


  • To change the standard colors of the chart you can go to:
    • Settings > Preferences:
      • Color for minimum value in chart keys
      • Color for maximum value in chart keys


  • Advanced charts tab:
    • Example (using the «customer-churn-data»):
      • Drag and drop the «Age» attribute to the «Domain dimension» (which represents the x axis) and the «Last transaction» attribute to the «Empty axis» («Numerical axis») (which represents the y axis)
  • With «Series: LastTransaction» selected (by clicking) in the chart configuration box:
  • Title: Number of RapidMiner users in million
  • Visualization: Lines and shapes
  • Some format configurations:
  • Item shape: Diamond
  • Color: Yellow
  • Line style: solid
  • Aggregation: Average
  • Indicators:
  • Indicator type: Band
  • Indicator 1: Drag and Drop age to this field.
  • Indicator 2: Drag and Drop age to this field.


  • Selecting Domain dimension from chart configuration box:
  • Title: Weeks from today


  • Selecting global configuration from chart configuration box:
  • Chart title: Prediction of RapidMiner studio users
  • Plot background: Change color to black


  • Then you can export the plot:
  • File > Print/Export Image




Turbo Prep


Connecting to Databases

https://rapidminer.wistia.com/medias/th4jxcrww6



Data preparation

https://rapidminer.wistia.com/medias/p6kou5484b

  • In reality data is never complete and without issues. Here we'll show you some of the operators that help to prepare and clean up the data.


  • Loading the data:
  • Let's start by dragging and dropping the «Read Excel» operator into our Process panel.
  • Then, select the data file via the Parameters panel
  • Connect the output port of the operator to the result port
  • Run the process (> button or F11)


  • Some of the problems the data can have are:
  • White spaces in some rows: This can cause different problems. For example, in this data we have the attribute «Gender», which has 2 possible values: «male» or «female». If some of the rows have white spaces at the beginning or the end of the value (« female»), that will be interpreted as another value, so we will have three values for the «Gender» attribute: «male», «female» and « female».
  • Undefined (missing) values
  • Repeated rows (duplicates)
  • Wrong value format: In our data some of the values for the attribute «Gender» are «m» instead of «male». As we said before, we need this attribute to have only 2 values: «male» or «female».


  • To identify problems in the data we should start by looking at the statistics of the data. Here we can see, for example:
  • How many values are missing for each attribute.
  • Duplicates.
Some of the problems the data can have shown in the statistics panel.png


  • It is very important to notice that in our data we have customers that we know if they are «loyal» or «churners» and customers that we don't know yet how they are going to behave (so we don't have loyalty information for those ones). So the ones with «loyalty» information have to be placed in one subset to be used as input to create our model; and the customers without «loyalty» information are going to be grouped in another subset so we can predict how they will act.



White spaces
  • To remove white spaces we use the «trim» operator:
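
The effect of trimming can be sketched in plain Python. This is only an illustration of the idea, not how the «trim» operator is implemented; the sample values are made up.

```python
# Hypothetical «Gender» values; two of them carry stray white space.
genders = ["male", "female", " female", "male "]

# Without trimming, the stray spaces produce extra distinct values.
print(sorted(set(genders)))

# str.strip() removes leading and trailing white space from each value.
trimmed = [value.strip() for value in genders]
print(sorted(set(trimmed)))
```

After trimming, only the two intended values remain.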



Undefined values

For example, in our data we have some rows with missing «age» or «gender» values. Since there are only a few of them in our data, we are going to get rid of those rows. To do this:

  • Add the «Filter examples» operator to the Process panel.
  • To configure the operator, click the «Add filter» button in the Parameters panel.
  • In the window that opens, select the attributes we want to filter on and set the filter to «Is not missing»:
    • Gender: Is not missing
    • Age: Is not missing
    • Note that in order to select the attributes in this window, the data must have been loaded using the «Import configuration wizard».
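
As a rough equivalent outside RapidMiner, the same filter can be sketched in plain Python. The rows, the use of None for a missing value, and the helper name is_complete are assumptions made for illustration only.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"Name": "Ann", "Gender": "female", "Age": 34},
    {"Name": "Bob", "Gender": None, "Age": 51},
    {"Name": "Cid", "Gender": "male", "Age": None},
    {"Name": "Dee", "Gender": "male", "Age": 27},
]

def is_complete(row, attributes):
    """Return True when none of the given attributes is missing."""
    return all(row[name] is not None for name in attributes)

# Keep only the rows where both Gender and Age are present.
kept = [row for row in rows if is_complete(row, ["Gender", "Age"])]
print([row["Name"] for row in kept])
```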



Duplicates
  • Drag and drop the «Remove duplicates» operator into our Process panel:
  • Then, in the Parameters panel:
    • Attribute filter type: All (this means that two examples (rows) are considered duplicates if they are identical with respect to all attribute values).
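
The «All» setting can be illustrated with a short Python sketch in which two rows count as duplicates only when every attribute value matches. The data and the function name remove_duplicates are hypothetical, and keeping the first occurrence is a choice made for this sketch.

```python
# Hypothetical rows as (Name, Gender, Age) tuples.
rows = [
    ("Ann", "female", 34),
    ("Bob", "male", 51),
    ("Ann", "female", 34),  # identical to the first row in every attribute
    ("Ann", "female", 35),  # differs in Age, so it is not a duplicate
]

def remove_duplicates(examples):
    """Keep the first occurrence of each row, comparing all attribute values."""
    seen = set()
    unique = []
    for example in examples:
        if example not in seen:
            seen.add(example)
            unique.append(example)
    return unique

unique_rows = remove_duplicates(rows)
print(unique_rows)
```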



Wrong value format - Replace operator

We need to solve the problem with the «Gender» values that are «m» instead of «male». We can use the «Replace operator» for that:

  • We Drag and Drop the «Replace operator» into our process panel.
  • In the Parameters panel we set:
    • Attribute filter type: Single
    • Attribute: Gender
    • Replace what: if we put just m here, every m in the attribute would be replaced. For example, the value «female» would become «femaleale». So we only need to replace «m» when it is not part of a word.
    • Replace by: male
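
The difference between replacing every «m» and replacing only a standalone «m» can be sketched with Python's re module. The pattern \bm\b (word boundaries around m) is an assumption chosen for this illustration; the notes above do not state the exact expression entered in «Replace what».

```python
import re

# Hypothetical «Gender» values, some abbreviated to "m".
genders = ["m", "male", "female", "m"]

# Naive replacement: every "m" is replaced, even inside other words.
naive = [value.replace("m", "male") for value in genders]
print(naive)

# \b marks a word boundary, so only a standalone "m" matches.
fixed = [re.sub(r"\bm\b", "male", value) for value in genders]
print(fixed)
```

The naive version corrupts «male» and «female», while the word-boundary version only touches the abbreviated values.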



We need to split the data in 2 subsets
  • Subset 1: The ones with «loyalty» information.
  • Subset 2: The ones without «loyalty» information.


  • For splitting we can use the «Split operator» or the «Filter examples operator»:
    • We Drag and Drop the «Filter examples operator» into our process panel.
      • In the Parameters panel we click in «Add filter» and we set:
        • Churn: is not missing


  • Now we can store these examples (rows) using the «Store operator»:
    • We Drag and Drop the «Store operator» into our process panel
    • We connect the «Unmatch port» of the «Filter examples operator» used to split the data to the «Input port» of the «Store operator»
    • Then, in the Parameters panel of the «Store operator»:
      • Repository entry: click in the folder icon:
        • Choose the location to save the data and give a name to the data subset 2: «Local Repository/Data/unlabeled_customers»
    • Connect the port of the «Store operator» to the «Result port»
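
The split itself can be sketched in plain Python: rows that pass the «Churn: is not missing» filter form the labeled subset, and the rows that do not match form the unlabeled subset. The sample rows and the use of None for a missing value are assumptions for illustration.

```python
# Hypothetical customers; None means the Churn value is missing.
rows = [
    {"Name": "Ann", "Churn": "loyal"},
    {"Name": "Bob", "Churn": None},
    {"Name": "Cid", "Churn": "churn"},
    {"Name": "Dee", "Churn": None},
]

# Matching rows: loyalty information is known, used to build the model.
labeled_customers = [row for row in rows if row["Churn"] is not None]

# Unmatched rows: no loyalty information yet, to be predicted later.
unlabeled_customers = [row for row in rows if row["Churn"] is None]

print([row["Name"] for row in labeled_customers])
print([row["Name"] for row in unlabeled_customers])
```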



Remove unnecessary attributes
  • Notice that the Churn attribute of the unlabeled_customers subset contains only missing values, so we can get rid of this attribute:
    • Use the «Select Attributes operator»:
      • Attribute filter type: single
      • Attribute: Churn
      • Tick the «Invert selection» box
  • We can also remove the attribute «Name» (which will not be used for anything) using the «Select Attributes operator» as shown above.
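
The idea of selecting one attribute and then inverting the selection can be sketched in Python. The function select_attributes and its invert flag are hypothetical names that mimic the settings above; they are not part of any RapidMiner API.

```python
def select_attributes(rows, names, invert=False):
    """Keep only the named attributes, or everything except them if invert is True."""
    selected = []
    for row in rows:
        if invert:
            selected.append({k: v for k, v in row.items() if k not in names})
        else:
            selected.append({k: v for k, v in row.items() if k in names})
    return selected

# Hypothetical unlabeled customers: Churn is always missing here.
unlabeled_customers = [
    {"Name": "Bob", "Age": 51, "Churn": None},
    {"Name": "Dee", "Age": 27, "Churn": None},
]

# Selecting «Churn» with the selection inverted drops that attribute.
without_churn = select_attributes(unlabeled_customers, ["Churn"], invert=True)
print(without_churn)
```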



Set Churn as the label
  • Set «Churn» as the «label» (so that the model knows which attribute is the target variable)
    • We use the «Set role operator». In the Parameter panel:
      • Attribute name: Churn
      • Target role: label
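
What the label role means for a model can be illustrated in Python by separating the target attribute from the regular attributes. The helper split_label and the sample rows are assumptions for this sketch; RapidMiner keeps the label inside the example set rather than splitting it off.

```python
def split_label(rows, label_name):
    """Separate the label (target) values from the regular attributes."""
    features = [{k: v for k, v in row.items() if k != label_name} for row in rows]
    labels = [row[label_name] for row in rows]
    return features, labels

# Hypothetical labeled customers.
labeled_customers = [
    {"Gender": "female", "Age": 34, "Churn": "loyal"},
    {"Gender": "male", "Age": 51, "Churn": "churn"},
]

features, labels = split_label(labeled_customers, "Churn")
print(features)
print(labels)
```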



Grouping operators

To simplify the Process panel we can also group operators that are used for similar purposes into one operator. In our case we could group all the operators used for data cleaning. To do this, select with the mouse all the operators we want to group, then right click > Move into new subprocess.



Model and Validate

https://rapidminer.com/training/videos/#model-validate



Operationalize

https://rapidminer.com/training/videos/#operationalize



Collaboration of RapidMiner Studio and Server

https://rapidminer.wistia.com/medias/yradsqx3nu



RapidMiner Server


Introduction to RapidMiner Server

https://rapidminer.wistia.com/medias/liyeu0vbmo



RapidMiner Server Installation


RapidMiner Server Installation - Preparations

https://rapidminer.wistia.com/medias/vv1za6naow



RapidMiner Server Installation - Walk-through

https://rapidminer.wistia.com/medias/ky3ooclf3a



Introducing RapidMiner Radoop

https://rapidminer.wistia.com/medias/hen3txskvh