Difference between revisions of "Python for Data Science"

From Sinfronteras
(Missing Data)
(Keep a python script running on a remote server)
 
(184 intermediate revisions by the same user not shown)
<br />
 
For a standard Python tutorial go to [[Python]]
 
 +
 +
 +
<br />
 +
==Courses==
 +
*Udemy - Python for Data Science and Machine Learning Bootcamp
 +
 +
:https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/
  
  
 
<br />
 
<br />
 
===Installation===
 
https://linuxize.com/post/how-to-install-anaconda-on-ubuntu-18-04/
+
Installation from the official Anaconda Web site: https://docs.anaconda.com/anaconda/install/
  
https://www.digitalocean.com/community/tutorials/how-to-install-the-anaconda-python-distribution-on-ubuntu-18-04
 
  
 +
<br />
  
<br />
 
 
===Anaconda comes with a few IDEs===
  
  
 
<br />
 
<br />
 +
 
==Jupyter==
 
 
Jupyter comes with Anaconda.
 
  
 
<br />
 
<br />
===Online Jupyter===
+
===Remote connection===
There are many sites that provide solutions to run your Jupyter Notebook in the cloud: https://www.dataschool.io/cloud-services-for-jupyter-notebook/
+
https://jupyter-notebook.readthedocs.io/en/stable/public_server.html
 
 
I have tried:
 
  
*https://cocalc.com/app
 
  
::https://cocalc.com/projects/595bf475-61a7-47fa-af69-ba804c3f23f9/files/?session=default
+
::Looks good, but some options are not free
 
  
  
*https://www.kaggle.com/
+
<syntaxhighlight lang="shell">
 +
(base) adelo@vmi346715:~/.jupyter$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mykey.key -out mycert.pem
 +
Generating a RSA private key
 +
......................................+++++
 +
....................................+++++
 +
writing new private key to 'mykey.key'
 +
-----
 +
You are about to be asked to enter information that will be incorporated
 +
into your certificate request.
 +
What you are about to enter is what is called a Distinguished Name or a DN.
 +
There are quite a few fields but you can leave some blank
 +
For some fields there will be a default value,
 +
If you enter '.', the field will be left blank.
 +
-----
 +
Country Name (2 letter code) [AU]:IE
 +
State or Province Name (full name) [Some-State]:Dublin
 +
Locality Name (eg, city) []:Dublin
 +
Organization Name (eg, company) [Internet Widgits Pty Ltd]:.
 +
Organizational Unit Name (eg, section) []:.
 +
Common Name (e.g. server FQDN or YOUR name) []:sinfronteras   
 +
Email Address []:adeloaleman@gmail.com
 +
</syntaxhighlight>
  
::https://www.kaggle.com/adeloaleman/kernel1917a91630/edit
 
::Looks good, but I could not find a way to add a TOC
 
  
 +
<br />
 +
===Share Jupyter Notebook online===
 +
* '''GitHub:'''
 +
: https://docs.github.com/en/github/managing-files-in-a-repository/working-with-jupyter-notebook-files-on-github
 +
: Example: https://github.com/adeloaleman/AmazonLaptopsDashboard/blob/master/DataAnalysis/data_analysis2.ipynb
  
*https://drive.google.com
 
  
:*https://colab.research.google.com
+
* '''Nbviewer:'''
::This is the one I am using now
+
: https://nbviewer.jupyter.org/
 +
: Example: https://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/main/tutorial/06%20-%20Linking%20and%20Interactions.ipynb
  
  
 
<br />
 
<br />
+
===Customize Jupyter===
  
  
 
<br />
 
<br />
==Most popular Python Data Science Libraries==
+
====Themes====
 +
https://github.com/dunovank/jupyter-themes
  
*NumPy
+
See the theme shown on this page: https://gist.github.com/pierrejoubert73/902cc94d79424356a8d20be2b382e1ab
*SciPy
 
*Pandas
 
*Seaborn
 
*Scikit-learn
 
*Matplotlib
 
*Plotly
 
*PySpark
 
  
  
<br />
+
jt  -t oceans16    -cellw 98%  -lineh 120  -fs 14  -nfs 14  -dfs 14  -ofs 14
==NumPy==
 
  
*NumPy (or Numpy) is a linear algebra library for Python. It is important for data science with Python because almost all of the libraries in the PyData ecosystem rely on NumPy as one of their main building blocks.
 
  
*NumPy is also fast, because it has bindings to C libraries. For more on why you would use arrays instead of lists, see this [http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists StackOverflow post].
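The difference between lists and arrays can be sketched with a small element-wise operation. This is only an illustration: the variable names are made up for the example, and it makes no timing claim, it just shows that an array applies one expression to every element while a list needs an explicit loop or comprehension.

```python
import numpy as np

values = list(range(10))

# Plain Python list: element-wise work needs an explicit loop or comprehension.
squared_list = [v * v for v in values]

# NumPy array: the expression is written once and applied to every element.
arr = np.array(values)
squared_arr = arr * arr

print(squared_list)
print(squared_arr.tolist())
```

Both print statements show the same numbers; the array version is the form that the rest of the PyData libraries build on.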
+
https://www.kaggle.com/getting-started/97540
 +
jt  -t monokai      -cellw 98%  -lineh 120  -fs 14  -nfs 14  -dfs 14  -ofs 14  -f fira  -nf ptsans  --kl  -cursw 2  -cursc r  -T
  
  
 
<br />
 
<br />
===Installation===
 
It is highly recommended that you install Python using the Anaconda distribution, so that all underlying dependencies (such as linear algebra libraries) stay in sync when you install with conda.
 
  
 +
====Extensions====
 +
This post mentions some nice extensions and configurations that can be applied: https://towardsdatascience.com/bringing-the-best-out-of-jupyter-notebooks-for-data-science-f0871519ca29
  
If you have Anaconda, install NumPy by:
+
<br />
 +
=====Unofficial Jupyter Notebook Extensions=====
 +
https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/index.html
  
conda install numpy
+
<span style="color: green">'''This is very important. There are very nice extensions in this package:'''</span>
<br />If you are not using Anaconda distribution:
 
  
*
+
* toc2: https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/toc2/README.html
 +
* Collapsible Headings
 +
* ... etc
  
pip install numpy
+
<br />
 +
======Installation======
 +
https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html
  
 +
<span style="color: red">'''I had some issues installing it. The default method indicated is:'''</span>
  
 +
pip install jupyter_contrib_nbextensions
 +
jupyter contrib nbextension install --user
  
Then, to use it:<syntaxhighlight lang="python3">
+
<span style="color: red">'''With the method above I could not install the package correctly. The installation returned no errors and the extension showed up in Jupyter Notebook, but I could not enable the extensions.'''</span>
import numpy as np
 
arr = np.arange(0,10)
 
</syntaxhighlight>
 
 
 
  
===Arrays===
 
{| class="wikitable"
 
! colspan="2" rowspan="2" |
 
! colspan="2" rowspan="2" |Method/Operation
 
! rowspan="2" |Description/Comments
 
!Example
 
|-
 
!<syntaxhighlight lang="python3">
 
import numpy as np
 
</syntaxhighlight>
 
|-
 
! rowspan="10" |<h5 style="text-align:left">Methods for creating NumPy Arrays</h5>
 
|<h5 style="text-align:left">From a Python List</h5>
 
| colspan="2" |'''''<code>array()</code>'''''
 
|We can create an array by directly converting a list or list of lists.
 
|<code>my_list = [1,2,3]</code>
 
<code>np.array(my_list)</code>
 
  
 +
<span style="color: red">'''Apparently it is a problem with the install location. I was using conda, but conda is having problems: installing packages takes a very long time and afterwards the package does not seem to be available.'''</span>
  
<code>my_matrix = [[1,2,3],[4,5,6],[7,8,9]]</code>
 
  
<code>np.array(my_matrix)</code>
+
<span style="color: red">'''In the following post I found a solution for installing nbextensions using pip:'''</span>
|-
+
https://github.com/ipython-contrib/jupyter_contrib_nbextensions/issues/1127
| rowspan="9" |<h5 style="text-align:left">From Built-in NumPy Methods</h5>
 
| colspan="2" |'''''<code>arange()</code>'''''
 
|Return evenly spaced values within a given interval.
 
|<code>np.arange(0,10)</code>
 
<code>np.arange(0,11,2)</code>
 
|-
 
| colspan="2" |'''''<code>zeros()</code>'''''
 
|Generate arrays of zeros.
 
|<code>np.zeros(3)</code>
 
<code>np.zeros((5,5))</code>
 
|-
 
| colspan="2" |'''''<code>ones()</code>'''''
 
|Generate arrays of ones.
 
|<code>np.ones(3)</code>
 
<code>np.ones((3,3))</code>
 
|-
 
| colspan="2" |'''''<code>linspace()</code>'''''
 
|Return evenly spaced numbers over a specified interval.
 
|<code>np.linspace(0,10,3)</code>
 
<code>np.linspace(0,10,50)</code>
 
|-
 
| colspan="2" |'''''<code>eye()</code>'''''
 
|Creates an identity matrix.
 
|<code>np.eye(4)</code>
 
|-
 
| rowspan="4" |'''''<code>random</code>'''''
 
|'''''<code>rand()</code>'''''
 
|Create an array of the given shape and populate it with random samples from a uniform distribution over <code>[0, 1)</code>.
 
|<syntaxhighlight lang="python3">
 
np.random.rand(2)
 
np.random.rand(5,5)
 
  
 +
pip install --upgrade jupyter_contrib_nbextensions
 +
jupyter contrib nbextension install  --sys-prefix  --symlink
  
# Another way to invoke a function:
+
<span style="color: red">'''I think I used «--symlink», but I am not completely sure. I also ran the --upgrade, but I think the --sys-prefix  --symlink options are what made the difference.'''</span>
from numpy.random import rand
 
# Then you can call the function directly
 
rand(5,5)
 
</syntaxhighlight><br />
 
|-
 
|'''''<code>randn()</code>'''''
 
|Return a sample (or samples) from the "standard normal" distribution, unlike <code>rand</code>, which samples from a uniform distribution.
 
|<code>np.random.randn(2)</code>
 
<code>np.random.randn(5,5)</code>
 
|-
 
|'''''<code>randint()</code>'''''
 
|Return random integers from <code>low</code> (inclusive) to <code>high</code> (exclusive).
 
|<code>np.random.randint(1,100)</code>
 
<code>np.random.randint(1,100,10)</code>
 
|-
 
|'''<code>seed()</code>'''
 
|Sets the random seed of the NumPy pseudo-random number generator. The seed is the input from which NumPy generates its pseudo-random numbers, so setting it makes random processes reproducible. See [[wikipedia:Random_seed|s1]] and [https://www.sharpsightlabs.com/blog/numpy-random-seed/ s2] for an explanation.
 
|<code>np.random.seed(101)</code>
 
|-
 
! rowspan="4" |<h5 style="text-align:left">Other Array Attributes and Methods</h5>
 
| rowspan="4" |
 
| colspan="2" |''<code>'''reshape()'''</code>''
 
|Returns an array containing the same data with a new shape.
 
|<code>arr.reshape(5,5)</code>
 
|-
 
| colspan="2" |'''''<code>max()</code>, <code>min()</code>, <code>argmax()</code>, <code>argmin()</code>'''''
 
|Find the max or min values, or find their index locations using <code>argmax</code> or <code>argmin</code>.
 
|<code>arr.max()</code>
 
<code>arr.argmax()</code>
 
|-
 
| colspan="2" |''<code>'''shape'''</code>''
 
|Shape is an attribute that arrays have (not a method).
 
|I did not understand this; needs review!
 
  
  
<nowiki>#</nowiki>Length of array
 
  
arr_length = arr2d.shape[1]
+
If the '''Nbextensions''' tab is not shown, try reinstalling https://github.com/Jupyter-contrib/jupyter_nbextensions_configurator
<br />
 
|-
 
| colspan="2" |''<code>'''dtype'''</code>''
 
|You can also grab the data type of the object in the array.
 
|<code>arr.dtype</code>
 
|-
 
!<nowiki>-</nowiki>
 
!-
 
! colspan="2" |-
 
!-
 
!-
 
|-
 
! rowspan="8" |<h5 style="text-align:left">Indexing and Selection</h5>
 
  
<div style="text-align:left">
+
pip install jupyter_nbextensions_configurator
*How to select elements or groups of elements from an array.
+
or
*The general format is '''arr_2d[row][col]''' or '''arr_2d[row,col]'''. I usually recommend the comma notation for clarity.
+
conda install -c conda-forge jupyter_nbextensions_configurator
</div>
 
|
 
| colspan="2" |
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style="">
 
'''Creating sample array for the following examples:'''
 
<div class="mw-collapsible-content">
 
<syntaxhighlight lang="python3">
 
import numpy as np
 
arr = np.arange(0,10)
 
# 1D Array:
 
arr = np.arange(0,11)
 
#Show
 
arr
 
Output: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
 
  
# 2D Array
 
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
 
#Show
 
arr_2d
 
Output:
 
array([[ 5, 10, 15],
 
      [20, 25, 30],
 
      [35, 40, 45]])
 
</syntaxhighlight>
 
</div>
 
</div>
 
|-
 
| rowspan="2" |<h5 style="text-align:left">Bracket Indexing and Selection (Slicing)</h5>
 
| colspan="2" |
 
|Note: Slicing an array (slice_of_arr = arr[0:6]) does not copy the data; the result is a view of the original array, which avoids unnecessary memory use. To get an independent copy, use the '''copy()''' method. See the important note below.
 
|<syntaxhighlight lang="python3">
 
#Get a value at an index
 
arr[8]
 
  
#Get values in a range
+
<br />
arr[1:5]
 
  
slice_of_arr = arr[0:6]
+
====CustomJS and CustomCSS files====
 +
This is a good post: https://forums.fast.ai/t/jupyter-notebook-enhancements-tips-and-tricks/17064
  
#2D
+
Keyboard Shortcut Customization: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Custom%20Keyboard%20Shortcuts.html
arr_2d[1]
 
arr_2d[1][0]
 
arr_2d[1,0] # Same as above
 
  
#Shape (2,2) from top right corner
 
arr_2d[:2,1:]
 
#Output:
 
array([[10, 15],
 
      [25, 30]])
 
  
#Shape bottom row
+
<br />
arr_2d[2,:]
+
custom.js
</syntaxhighlight><br />
+
<syntaxhighlight lang="js">
|-
+
/** My settings */
| colspan="2" |
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style="">
 
'''Fancy Indexing''':
 
<div class="mw-collapsible-content">
 
Fancy indexing allows you to select entire rows or columns out of order.
 
 
 
Example:<syntaxhighlight lang="python3">
 
# Set up matrix
 
arr2d = np.zeros((10,10))
 
  
# Length of array
+
// This is to enable syntax highlighting for SQL code:
arr_length = arr2d.shape[1]
+
// https://stackoverflow.com/questions/43641362/adding-syntax-highlighting-to-jupyter-notebook-cell-magic
 +
require(['notebook/js/codecell'], function(codecell) {
 +
  codecell.CodeCell.options_default.highlight_modes['magic_text/x-mssql'] = {'reg':[/^%%sql/]} ;
 +
  Jupyter.notebook.events.one('kernel_ready.Kernel', function(){
 +
  Jupyter.notebook.get_cells().map(function(cell){
 +
      if (cell.cell_type == 'code'){ cell.auto_highlight(); } }) ;
 +
  });
 +
});
  
# Set up array
 
for i in range(arr_length):
 
    arr2d[i] = i
 
   
 
arr2d
 
# Output:
 
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 
      [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
 
      [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
 
      [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
 
      [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
 
      [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
 
      [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
 
      [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
 
      [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
 
      [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
 
  
# Fancy indexing allows the following
+
// My plain theme
arr2d[[6,4,2,7]]
+
// This is a good post where I took some ideas to write the following function: https://forums.fast.ai/t/jupyter-notebook-enhancements-tips-and-tricks/17064
# Output:
+
function plainTheme() {
array([[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
+
    var input_promp_fields = document.getElementsByClassName("prompt_container");
      [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
+
    var text_render_fields = document.getElementsByClassName("text_cell_render");
      [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
 
      [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.]])
 
</syntaxhighlight><br />
 
</div>
 
</div>
 
|-
 
| rowspan="2" |<h5 style="text-align:left">Broadcasting</h5>
 
  
 +
    if (input_promp_fields[0].style.visibility == "collapse"){
 +
        action = "visible";
 +
        input_marginLeft = "0px";
 +
        border_top  = "3px";
 +
        prompt_width = "74px";
 +
        padding_top = "0px";
 +
        output_margin = "40px";
 +
    }else{
 +
        action = "collapse";
 +
        input_marginLeft = "74px";
 +
        border_top  = '0px';
 +
        prompt_width = "74px";
 +
        padding_top = "40px";
 +
        output_margin = "40px";
 +
    }
  
(Setting a value with index range)
+
    // If we want to use !important we must do it this way, using jQuery:
| colspan="2" rowspan="2" |
+
    // https://makitweb.com/how-to-add-important-to-css-property-with-jquery/
| rowspan="2" |Setting a value with index range:
+
    var text_cell_fields = document.getElementsByClassName("text_cell");
Numpy arrays differ from a normal Python list because of their ability to broadcast.
+
    $(text_cell_fields).ready(function(){
|arr[0:5]=100<br />'''#'''Show
+
        $('.input_prompt').css({
arr
+
            'cssText': `width: 40px !important; max-width: ${prompt_width} !important; min-width: ${prompt_width} !important;`
 +
        });
 +
    });
  
Output: array([100, 100, 100, 100, 100,  5,  6,  7,  8,  9,  10])
+
    $(document).ready(function(){
|-
+
        $(".prompt_container").css(
|'''#'''Setting all the values of an Array
+
            'visibility', `${action}`
arr[:]=99
+
        );
|-
+
       
|<h5 style="text-align:left">Get a copy of an Array</h5>
+
        $(".input").css(
| colspan="2" |'''<code>copy''()''</code>'''
+
            'padding-left', `${input_marginLeft}`
|Note: Slicing an array (slice_of_arr = arr[0:6]) does not copy the data; the result is a view of the original array, which avoids unnecessary memory use. To get an independent copy, use the '''copy()''' method. See the important note below.
+
        );
|arr_copy = arr.copy()
+
       
|-
+
        $(".output_subarea").css(
|<h5 style="text-align:left">Important notes on Slices</h5>
+
            'margin-left', `${output_margin}`
| colspan="2" |
+
        );
| colspan="2" |<div class="mw-collapsible mw-collapsed" style=""><syntaxhighlight lang="python3">
+
                   
slice_of_arr = arr[0:6]
+
        $('.cell').css({
#Show slice
+
            'cssText': `border-top-width: ${border_top} !important; border-bottom-width: ${border_top} !important;`
slice_of_arr
+
        });
Output: array([0, 1, 2, 3, 4, 5])
+
       
 +
        $(".collapsible_headings_ellipsis").css({
 +
            'cssText': `padding-top:${padding_top} !important; border-top-width: ${border_top} !important; border-bottom-width: ${border_top} !important;`
 +
        });
  
#Making changes in slice_of_arr
+
        $(".text_cell_render").css({
slice_of_arr[:]=99
+
            'cssText': `margin-left: -10px;`
#Show slice
+
        });
slice_of_arr
+
    });           
Output: array([99, 99, 99, 99, 99, 99])
+
}
  
#Now note the changes also occur in our original array!
+
Jupyter.keyboard_manager.command_shortcuts.add_shortcut('Alt-Ctrl-Q', {
#Show
+
    help : '...',
arr
+
    help_index : 'zz',
Output: array([99, 99, 99, 99, 99, 99, 6, 7, 8, 9, 10])
+
    handler : function (event) {
 +
        plainTheme();
 +
    return false;
 +
    }}
 +
);
  
#When we create a sub-array slicing an array (slice_of_arr = arr[0:6]), data is not copied, it's a view of the original array! This avoids memory problems!
+
Jupyter.keyboard_manager.edit_shortcuts.add_shortcut('Alt-Ctrl-Q', {
 +
    help : '...',
 +
    help_index : 'zz',
 +
    handler : function (event) {
 +
        plainTheme();
 +
    return false;
 +
    }}
 +
);
  
#To get a copy, use the copy() method
 
</syntaxhighlight>
 
</div>
 
|-
 
|<h5 style="text-align:left">Using brackets for selection based on comparison operators and booleans</h5>
 
| colspan="2" |
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style=""><syntaxhighlight lang="python3">
 
arr = np.arange(1,11)
 
arr > 4
 
# Output:
 
array([False, False, False, False,  True,  True,  True,  True,  True,
 
        True])
 
  
bool_arr = arr>4
+
// This could be very useful. It allows adding text automatically to a cell
bool_arr
+
// https://forums.fast.ai/t/jupyter-notebook-enhancements-tips-and-tricks/17064/27
# Output:
+
Jupyter.keyboard_manager.edit_shortcuts.add_shortcut('Ctrl-Shift-J', {
array([False, False, False, False,  True,  True,  True,  True,  True,
+
    help : '...',
         True])
+
    help_index : 'zz',
 +
    handler : function (event) {
 +
        document.body.style.background = 'blue'
 +
        var target = Jupyter.notebook.get_selected_cell()
 +
        var cursor = target.code_mirror.getCursor()
 +
        var before = target.get_pre_cursor()
 +
        var after = target.get_post_cursor()
 +
        target.set_text(before + 'from IPython.core.display import display, HTML; \n\taverrrdisplay(HTML("<style>.container { width:98% !important;}</style>"))' + after)
 +
        cursor.ch += 20 // where to put your cursor
 +
        target.code_mirror.setCursor(cursor)
 +
         return false;
 +
    }}
 +
);
  
arr[bool_arr]
 
# Output:
 
array([ 5,  6,  7,  8,  9, 10])
 
  
arr[arr>2]
+
// To get the real value of a css field: https://stackoverflow.com/questions/26074476/document-body-style-backgroundcolor-doesnt-work-with-external-css-style-sheet
# Output:
+
// window.getComputedStyle(document.body).backgroundColor
array([ 3,  4,  5,  6,  7,  8,  9, 10])
+
// window.getComputedStyle(document.getElementsByClassName("input_area")[0]).backgroundColor
 
 
x = 2
 
arr[arr>x]
 
# Output:
 
array([ 3,  4,  5,  6,  7,  8,  9, 10])
 
 
</syntaxhighlight>
 
</syntaxhighlight>
</div>
 
|-
 
!-
 
!-
 
! colspan="2" |-
 
!-
 
!-
 
|-
 
!<h5 style="text-align:left">Arithmetic operations</h5>
 
|
 
| colspan="2" |<code>arr + arr</code>
 
<code>arr - arr</code>
 
 
<code>arr * arr</code>
 
 
<code>arr/arr</code>
 
 
<code>1/arr</code>
 
 
<code>arr**3</code>
 
|Warning on division by zero, but not an error!
 
<code>0/0 -> nan</code>
 
 
<code>1/0 -> inf</code>
 
|<syntaxhighlight lang="python3">
 
import numpy as np
 
arr = np.arange(0,10)
 
 
arr + arr
 
# Output:
 
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
 
 
arr**3
 
# Output:
 
array([  0,  1,  8,  27,  64, 125, 216, 343, 512, 729])
 
</syntaxhighlight>
 
|-
 
! rowspan="5" |<h5 style="text-align:left">[https://docs.scipy.org/doc/numpy/reference/ufuncs.html Universal Array Functions]</h5>
 
| rowspan="5" |
 
| colspan="2" |<code>np.sqrt(arr)</code>
 
|Taking Square Roots
 
| rowspan="5" |<syntaxhighlight lang="python3">
 
np.sin(arr)
 
# Output:
 
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
 
      -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])
 
</syntaxhighlight>
 
|-
 
| colspan="2" |<code>np.exp(arr)</code>
 
|Calculating the exponential (e^x)
 
|-
 
| colspan="2" |<code>np.max(arr)</code>
 
same as <code>arr.max()</code>
 
|Max
 
|-
 
| colspan="2" |<code>np.sin(arr)</code>
 
|Sin
 
|-
 
| colspan="2" |<code>np.log(arr)</code>
 
|Natural logarithm
 
|}
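The notes on slices in the table above can be sketched as follows. This is a minimal illustration, not part of the original course material; <code>view</code> and <code>independent</code> are names chosen for the example, and <code>np.shares_memory</code> is used only to make the view/copy distinction visible.

```python
import numpy as np

arr = np.arange(0, 11)

# Slicing returns a view: it shares memory with the original array.
view = arr[0:6]
view[:] = 99          # the scalar is broadcast over the whole slice

# copy() allocates new memory, so later edits do not touch `arr`.
independent = arr.copy()
independent[:] = 0

print(arr)                                   # first six elements were changed through the view
print(np.shares_memory(arr, view))           # the slice and the original share data
print(np.shares_memory(arr, independent))    # the copy does not
```

Writing through the view changes the original array, while writing to the copy leaves it untouched.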
 
 
 
<br />
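Boolean selection and the division-by-zero behaviour described in the NumPy table can be tried together in a short sketch. The variable names are illustrative, and <code>np.errstate</code> is used here only to silence the warning inside the example; without it NumPy emits a RuntimeWarning but still returns a result.

```python
import numpy as np

arr = np.arange(0, 5)          # array([0, 1, 2, 3, 4])

# A comparison produces a boolean array that can be used as a selector.
mask = arr > 2
selected = arr[mask]

# Division by zero yields nan or inf (with a warning), not an exception.
# np.errstate suppresses the warning for this block only.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = arr / arr          # first element is 0/0
    inverse = 1 / arr          # first element is 1/0

print(selected)
print(ratio[0], inverse[0])
```

The first element of <code>ratio</code> is <code>nan</code> and the first element of <code>inverse</code> is <code>inf</code>; the remaining elements are ordinary floats.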
 
==Pandas==
 
You can think of pandas as an extremely powerful version of Excel, with a lot more features. In this section of the course, you should go through the notebooks in this order:
 
  
  
 
<br />
 
<br />
===Series===
+
custom.css
A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also does not need to hold numeric data; it can hold any arbitrary Python object.
+
<syntaxhighlight lang="css">
 +
/*  My settings  */
  
{| class="wikitable"
+
.container { width:98% !important; }
! rowspan="2" |
+
/* document.getElementById("notebook-container").style.minWidth = "50%"; */
! rowspan="2" |
+
/* document.getElementById("notebook-container").style.maxWidth = "50%"; */
! rowspan="2" |Method/Operator
 
! rowspan="2" |Description/Comments
 
!Example
 
|-
 
!<syntaxhighlight lang="python3">
 
import pandas as pd
 
</syntaxhighlight>
 
|-
 
! rowspan="3" |<h4 style="text-align:left">Creating Pandas Series</h4>
 
  
 +
#notebook-container {
 +
width:98% !important;
 +
}
  
<div style="text-align:left">
+
.CodeMirror-gutters {
You can convert a <code>list</code>, <code>numpy array</code>, or <code>dictionary</code> to a Series.
+
background-color: transparent !important;
</div>
+
background: transparent !important;
|<h5 style="text-align:left">From a List</h5>
+
}
|<code>pd.Series(my_list)</code>
 
| colspan="2" rowspan="3" |<syntaxhighlight lang="python3">
 
# Creating some test data:
 
labels = ['a','b','c']
 
my_list = [10,20,30]
 
arr = np.array([10,20,30])
 
d = {'a':10,'b':20,'c':30}
 
  
 +
.CodeMirror-linenumber {
 +
margin-left: -20px !important;
 +
}
  
pd.Series(data=my_list)
+
.output_subarea {
pd.Series(my_list)
+
margin-left: 40px !important;
pd.Series(arr)
+
}
# Output:
 
0    10
 
1    20
 
2    30
 
dtype: int64
 
  
pd.Series(data=my_list,index=labels)
+
#toc .fa-fw {
pd.Series(my_list,labels)
+
color: blue !important;
pd.Series(arr,labels)
+
}
pd.Series(d)
 
# Output:
 
a    10
 
b    20
 
c    30
 
dtype: int64
 
</syntaxhighlight>
 
|-
 
|<h5 style="text-align:left">From a NumPy Array</h5>
 
|<code>pd.Series(arr)</code>
 
|-
 
|<h5 style="text-align:left">From a Dictionary</h5>
 
|<code>pd.Series(d)</code>
 
|-
 
!<h4 style="text-align:left">Data in a Series</h4>
 
  
|
+
#toc .highlight_on_scroll {
|
+
margin-left: -4px !important;
| colspan="2" |A pandas Series can hold a variety of object types, even functions (although you are unlikely to use this).<syntaxhighlight lang="python3">
+
pd.Series(data=labels)
+
}
# Output:
 
0    a
 
1    b
 
2    c
 
dtype: object
 
  
# Holding functions in a Series
+
#toc {
pd.Series([sum,print,len])
+
padding-left: 10px !important;
# Output:
+
}
0      <built-in function sum>
 
1      <built-in function print>
 
2      <built-in function len>
 
dtype: object
 
</syntaxhighlight>
 
|-
 
!<h4 style="text-align:left">Index in Series</h4>
 
|
 
|
 
| colspan="2" |The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).<syntaxhighlight lang="python3">
 
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
 
ser1
 
# Output:
 
USA        1
 
Germany    2
 
USSR      3
 
Japan      4
 
dtype: int64
 
  
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
+
/*  I have also changed the color
 +
/*  #a6e22e  by  #388bfd
 +
*  in the entire custom.css
 +
*/
  
ser1['USA']
+
/* I have also changed some of the properties of the toc directly above in the code:
# Output:
 
1
 
  
# Operations are then also done based off of index:
+
#toc-wrapper {
ser1 + ser2
+
z-index: 90;
# Output:
+
position: fixed !important;
Germany    4.0
+
display: flex;
Italy      NaN
+
flex-direction: column;
Japan      8.0
+
overflow: hidden;
USA        2.0
+
padding: 10px;
USSR      NaN
+
padding-top: 40px !important;
dtype: float64
+
border-style: solid;
 +
border-width: thin;
 +
border-right-width: medium !important;
 +
background-color: #1e1e1e !important;
 +
}
 +
#toc-wrapper.ui-draggable.ui-resizable.sidebar-wrapper {
 +
border-color: rgba(93,92,82,.25) !important;
 +
}
 +
#toc a,
 +
#navigate_menu a,
 +
.toc {
 +
color: #f8f8f0 !important;
 +
font-size: 16pt !important;
 +
}
 +
#toc li > span:hover {
 +
background-color: rgba(93,92,82,.25) !important;
 +
}
 +
#toc a:hover,
 +
#navigate_menu a:hover,
 +
.toc {
 +
color: #DAA520 !important;
 +
font-size: 16pt !important;
 +
}
 +
#toc-wrapper .toc-item-num {
 +
color: #388bfd !important;
 +
font-size: 16pt !important;
 +
}
 +
*/
 
</syntaxhighlight>
 
</syntaxhighlight>
|}
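The index-based arithmetic from the Series table can be sketched as below. The two Series are the ones used in the table; the <code>fill_value=0</code> variant is an addition for illustration, showing one way to treat a label that is missing on one side as zero instead of getting NaN.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])
ser2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# Addition aligns on index labels; labels present on only one side give NaN.
total = ser1 + ser2

# Series.add with fill_value=0 treats a missing label as 0 instead.
filled = ser1.add(ser2, fill_value=0)

print(total)
print(filled)
```

Because of the NaN values, the result of <code>ser1 + ser2</code> is converted to float, which matches the <code>dtype: float64</code> shown in the table.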
 
  
  
 
<br />
 
<br />
  
===DataFrames===
+
====Configurations from the Jupyter notebook====
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
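The idea of a DataFrame as several Series sharing one index can be sketched directly. The column names and values here are made up for the example; they are not the random data used elsewhere on this page.

```python
import pandas as pd

index = ["A", "B", "C"]
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([10.0, 20.0, 30.0], index=index)

# Each Series becomes a column; the shared index becomes the row labels.
df = pd.DataFrame({"W": w, "X": x})

col = df["W"]
print(type(col).__name__)            # a single column is a Series again
print(df.index.equals(col.index))    # and it carries the same index
```

Selecting a column returns a Series with the same row labels as the DataFrame, which is why Series operations carry over to columns.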
 
  
 +
<syntaxhighlight lang="python3">
 +
from IPython.display import display, HTML
  
<syntaxhighlight lang="python">
+
display(HTML("<style>.container { width:98% !important;}</style>"))
import pandas as pd
 
import numpy as np
 
  
from numpy.random import randn
+
display(HTML('<style>.prompt.input_prompt{display:none !important;}</style>'))
np.random.seed(101)
+
display(HTML('<style>.prompt.input_prompt{visibility: visible !important;}</style>'))
 +
display(HTML('<style>.prompt.input_prompt{margin-left: 50px}</style>'))
 +
display(HTML('<style>.prompt.input_prompt{visibility: visible !important; width: 0px !important; min-width: 0px !important}</style>'))
  
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
+
display(HTML('<style>.input_area{margin-left: -50px;}</style>'))
 +
display(HTML('<style>.input{margin-left: -20px;}</style>'))
  
df
+
display(HTML('<style>.output_area{margin-left: 55px}</style>'))
# Output:
 
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118  -0.319318  -0.848077    0.605965
 
C  -2.018168    0.740122    0.528813  -0.589001
 
D  0.188695  -0.758872  -0.933237    0.955057
 
E  0.190794    1.978757    2.605967    0.683509
 
</syntaxhighlight>
 
  
 +
# display(HTML('<style>.cell{margin-bottom: -5px !important; margin-top: -5px !important;}</style>'))
 +
# display(HTML('<style>.code_cell{margin-bottom: -5px !important; margin-top: -5px !important;}</style>'))
  
 
+
# display(HTML('<style>.output_wrapper{margin-bottom: 0px !important; margin-top: 0px !important;}</style>'))
'''DataFrame Columns are just Series:'''<syntaxhighlight lang="python3">
 
type(df['W'])
 
# Output:
 
pandas.core.series.Series
 
 
</syntaxhighlight>
 
</syntaxhighlight>
{| class="wikitable"
 
!
 
!
 
!Method/
 
Operator
 
!Description/Comments
 
!Example
 
|-
 
! rowspan="5" |<h4 style="text-align:left">Selection and Indexing</h4>
 
  
  
<div style="text-align:left">
+
<br />
Let's learn the various
 
  
methods to grab data
  
from a DataFrame
</div>
 
  
|<h5 style="text-align:left">Standard syntax</h5>
|<code>'''df[<nowiki>''</nowiki>]'''</code>
 
|
 
| rowspan="2" |<syntaxhighlight lang="python3">
 
# Pass a list of column names:
 
df[['W','Z']]
 
  
          W          Z
A  2.706850    0.503826
B  0.651118    0.605965
 
C  -2.018168  -0.589001
 
D  0.188695    0.955057
 
E  0.190794    0.683509
 
</syntaxhighlight>
 
|-
 
|<h5 style="text-align:left">SQL syntax</h5>
 
(NOT RECOMMENDED!)
 
|<code>'''df.W'''</code>
 
|
 
|-
 
|<h5 style="text-align:left">Selecting Rows</h5>
 
|'''<code>df.loc[<nowiki>''</nowiki>]</code>'''
 
|
 
|<syntaxhighlight lang="python3">
 
df.loc['A']
 
# Or select by position instead of label (row 'A' is at position 0):

df.iloc[0]
 
# Output:
 
W    2.706850
 
X    0.628133
 
Y    0.907969
 
Z    0.503826
 
Name: A, dtype: float64
 
</syntaxhighlight>
 
|-
 
|<h5 style="text-align:left">Selecting subset of rows and columns</h5>
 
|'''<code>df.loc[<nowiki>''</nowiki>,<nowiki>''</nowiki>]</code>'''
 
|
 
|<syntaxhighlight lang="python3">
 
df.loc['B','Y']
 
# Output:
 
-0.84807698340363147
 
  
df.loc[['A','B'],['W','Y']]
 
# Output:
 
          W          Y
 
A  2.706850    0.907969
 
B  0.651118  -0.848077
 
</syntaxhighlight>
 
|-
 
|<h5 style="text-align:left">Conditional Selection</h5>
 
|
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style="">
 
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
 
<div class="mw-collapsible-content">
 
<syntaxhighlight lang="python3">
 
df
 
# Output:
 
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118  -0.319318  -0.848077    0.605965
 
C  -2.018168    0.740122    0.528813  -0.589001
 
D  0.188695  -0.758872  -0.933237    0.955057
 
E  0.190794    1.978757    2.605967    0.683509
 
  
df>0
# Output:
 
    W      X      Y      Z
 
A  True    True    True    True
 
B  True    False  False  True
 
C  False  True    True    False
 
D  True    False  False  True
 
E  True    True    True    True
 
  
df[df>0]
# Output:
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118    NaN        NaN        0.605965
 
C  NaN        0.740122    0.528813    NaN
 
D  0.188695    NaN        NaN        0.955057
 
E  0.190794    1.978757    2.605967    0.683509
 
  
df[df['W']>0]
 
# Output:
 
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118  -0.319318  -0.848077    0.605965
 
D  0.188695  -0.758872  -0.933237    0.955057
 
E  0.190794    1.978757    2.605967    0.683509
 
  
df[df['W']>0]['Y']
# Output:  
 
A    0.907969
 
B  -0.848077
 
D  -0.933237
 
E    2.605967
 
Name: Y, dtype: float64
 
  
df[df['W']>0][['Y','X']]
# Output:
          Y          X
 
A  0.907969    0.628133
 
B  -0.848077  -0.319318
 
D  -0.933237  -0.758872
 
E  2.605967    1.978757
 
  
# For two conditions, wrap each one in parentheses and combine them with | (or) and & (and):
 
df[(df['W']>0) & (df['Y'] > 1)]
 
# Output:
 
          W          X          Y          Z
 
E  0.190794    1.978757    2.605967    0.683509
 
</syntaxhighlight>
 
</div>
 
</div>
 
|-
 
!<h4 style="text-align:left">Creating a new column</h4>
 
|
 
|
 
|
 
|<syntaxhighlight lang="python3">
 
df['new'] = df['W'] + df['Y']
 
</syntaxhighlight>
 
|-
 
!<h4 style="text-align:left">Removing Columns</h4>
 
|
 
|'''<code>df.drop()</code>'''
 
| colspan="2" |
 
<div class="mw-collapsible mw-collapsed" style="">
 
<syntaxhighlight lang="python3">
 
df.drop('new',axis=1)
 
# Output:
 
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118  -0.319318  -0.848077    0.605965
 
C  -2.018168    0.740122    0.528813  -0.589001
 
D  0.188695  -0.758872  -0.933237    0.955057
 
E  0.190794    1.978757    2.605967    0.683509
 
  
# Not inplace unless specified!
df
# Output:
 
          W          X          Y          Z        new
 
A  2.706850    0.628133    0.907969    0.503826    3.614819
 
B  0.651118  -0.319318  -0.848077    0.605965  -0.196959
 
C  -2.018168    0.740122    0.528813  -0.589001  -1.489355
 
D  0.188695  -0.758872  -0.933237    0.955057  -0.744542
 
E  0.190794    1.978757    2.605967    0.683509    2.796762
 
  
df.drop('new',axis=1,inplace=True)
 
df
 
# Output:
 
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118  -0.319318  -0.848077    0.605965
 
C  -2.018168    0.740122    0.528813  -0.589001
 
D  0.188695  -0.758872  -0.933237    0.955057
 
E  0.190794    1.978757    2.605967    0.683509
 
  
  
# Can also drop rows this way:
df.drop('E',axis=0)
 
# Output:
 
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118  -0.319318  -0.848077    0.605965
 
C  -2.018168    0.740122    0.528813  -0.589001
 
D  0.188695  -0.758872  -0.933237    0.955057
 
</syntaxhighlight>
 
</div>
 
|-
 
! rowspan="2" |<h4 style="text-align:left">Resetting the index</h4>
 
|<h5 style="text-align:left">Reset to default</h5>
 
(0,1...n index)
 
|'''<code>df.reset_index()</code>'''
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style="">
 
<syntaxhighlight lang="python3">
 
df
 
# Output:
 
          W          X          Y          Z
 
A  2.706850    0.628133    0.907969    0.503826
 
B  0.651118  -0.319318  -0.848077    0.605965
 
C  -2.018168    0.740122    0.528813  -0.589001
 
D  0.188695  -0.758872  -0.933237    0.955057
 
E  0.190794    1.978757    2.605967    0.683509
 
  
df.reset_index()
# Output:
  index          W          X          Y          Z
0      A  2.706850    0.628133  0.907969  0.503826
 
1      B  0.651118  -0.319318 -0.848077  0.605965
 
2      C -2.018168    0.740122  0.528813  -0.589001
 
3      D  0.188695  -0.758872  -0.933237  0.955057
 
4      E  0.190794    1.978757  2.605967  0.683509
 
</syntaxhighlight>
 
</div>
 
|-
 
|<h5 style="text-align:left">Setting index to something else</h5>
 
|'''<code>df.set_index(<nowiki>''</nowiki>)</code>'''
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style="">
 
<syntaxhighlight lang="python3">
 
newind = 'CA NY WY OR CO'.split()
 
df['States'] = newind
 
  
df
# Output:
 
          W            X          Y          Z  States
 
A  2.706850    0.628133    0.907969  0.503826      CA
 
B  0.651118  -0.319318  -0.848077  0.605965      NY
 
C  -2.018168    0.740122    0.528813  -0.589001      WY
 
D  0.188695  -0.758872  -0.933237  0.955057      OR
 
E  0.190794    1.978757    2.605967  0.683509      CO
 
  
df.set_index('States')
 
# Output:
 
                W          X          Y          Z
 
States             
 
    CA  2.706850    0.628133    0.907969  0.503826
 
    NY  0.651118  -0.319318  -0.848077  0.605965
 
    WY  -2.018168    0.740122    0.528813  -0.589001
 
    OR  0.188695  -0.758872  -0.933237  0.955057
 
    CO  0.190794    1.978757    2.605967  0.683509
 
  
df
# Output:
 
          W            X          Y          Z  States
 
A  2.706850    0.628133    0.907969  0.503826      CA
 
B  0.651118  -0.319318  -0.848077  0.605965      NY
 
C  -2.018168    0.740122    0.528813  -0.589001      WY
 
D  0.188695  -0.758872  -0.933237  0.955057      OR
 
E  0.190794    1.978757    2.605967  0.683509      CO
 
  
# We need to add inplace=True to modify df itself:
df.set_index('States',inplace=True)
 
df
 
# Output:
 
                W          X          Y          Z
 
States             
 
    CA  2.706850    0.628133    0.907969  0.503826
 
    NY  0.651118  -0.319318  -0.848077  0.605965
 
    WY  -2.018168    0.740122    0.528813  -0.589001
 
    OR  0.188695  -0.758872  -0.933237  0.955057
 
    CO  0.190794    1.978757    2.605967  0.683509
 
</syntaxhighlight>
 
</div>
 
|-
 
! rowspan="2" |<h4 style="text-align:left">Multi-Indexed DataFrame</h4>
 
|<h5 style="text-align:left">Creating a Multi-Indexed DataFrame</h5>
 
|
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style="">
 
<syntaxhighlight lang="python3">
 
# Index Levels
 
outside = ['G1','G1','G1','G2','G2','G2']
 
inside = [1,2,3,1,2,3]
 
hier_index = list(zip(outside,inside))
 
hier_index = pd.MultiIndex.from_tuples(hier_index)
 
  
hier_index
 
# Output:
 
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
 
          labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
 
  
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
 
# Output:
 
              A          B
 
G1  1  0.153661  0.167638
 
    2  -0.765930  0.962299
 
    3  0.902826  -0.537909
 
G2  1  -1.549671  0.435253
 
    2  1.259904  -0.447898
 
    3  0.266207  0.412580
 
</syntaxhighlight>
 
</div>
 
|-
 
|<h5 style="text-align:left">Multi-Index and Index Hierarchy</h5>
 
|
 
| colspan="2" |<div class="mw-collapsible mw-collapsed" style="">
 
<syntaxhighlight lang="python3">
 
df.loc['G1']
 
# Output:
 
          A          B
 
1  0.153661  0.167638
 
2  -0.765930  0.962299
 
3  0.902826  -0.537909
 
  
df.loc['G1'].loc[1]
# Output:
 
A    0.153661
 
B    0.167638
 
Name: 1, dtype: float64
 
  
df.index.names
# Output:
FrozenList([None, None])
  
df.index.names = ['Group','Num']
 
df
 
# Output:
 
                  A          B
 
Group Num       
 
  G1  1  0.153661  0.167638
 
        2  -0.765930  0.962299
 
        3  0.902826  -0.537909
 
  G2  1  -1.549671  0.435253
 
        2  1.259904  -0.447898
 
        3  0.266207  0.412580
 
  
df.xs('G1')
# Output:
 
            A            B
 
Num       
 
1    0.153661    0.167638
 
2  -0.765930    0.962299
 
3    0.902826    -0.537909
 
 
 
# Pass the key as a tuple; recent pandas versions no longer accept a list here
df.xs(('G1',1))
 
# Output:
 
A    0.153661
 
B    0.167638
 
Name: (G1, 1), dtype: float64
 
 
 
df.xs(1,level='Num')
 
# Output:
 
              A          B
 
Group     
 
  G1  0.153661  0.167638
 
  G2  -1.549671  0.435253
 
</syntaxhighlight>
 
</div>
 
|}
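
The selection patterns from the table above can be tried end to end with the following minimal sketch. It assumes a seeded random DataFrame, so its values will not match the outputs printed in the table; the variable names (<code>subset</code>, <code>both</code>, <code>by_state</code>, and so on) are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: seeded random values, so the numbers differ
# from the outputs printed in the table above.
rng = np.random.default_rng(101)
df = pd.DataFrame(rng.standard_normal((5, 4)),
                  index=list("ABCDE"), columns=list("WXYZ"))

# Label-based vs. position-based row selection: row 'C' sits at position 2.
row_by_label = df.loc["C"]
row_by_position = df.iloc[2]

# Subset of rows and columns.
subset = df.loc[["A", "B"], ["W", "Y"]]

# Two conditions: wrap each in parentheses and combine with & (and) or | (or).
both = df[(df["W"] > 0) & (df["Y"] > 0)]

# drop() returns a new DataFrame; the original is untouched unless inplace=True.
df["new"] = df["W"] + df["Y"]
without_new = df.drop("new", axis=1)

# set_index() also returns a new object, so keep the result (or pass inplace=True).
df["States"] = "CA NY WY OR CO".split()
by_state = df.set_index("States")
```

Keeping the returned object instead of passing <code>inplace=True</code> leaves the original <code>df</code> available for comparison, which is convenient while exploring data.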
 
  
  
  
  
  
 
<br />
 
<br />
  
===Missing Data===
Let's show a few convenient methods to deal with Missing Data in pandas.
 
  
* <code>dropna()</code> method allows the user to analyze and drop Rows/Columns with Null values in different ways:
 
<blockquote>
 
<syntaxhighlight lang="python">
 
DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
 
</syntaxhighlight>
 
</blockquote>
 
  
* <code>fillna()</code> fills Null fields with a given value:
  
  
<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
 
  
  
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
  
df
 
# Output:
 
      A      B    C
 
0  1.0    5.0    1
 
1  2.0    NaN    2
 
2  NaN    NaN    3
 
  
  
# By default, dropna() drops every row that contains at least one null value:
df.dropna()
 
df.dropna(axis=0) # Same as default
 
# Output:
 
      A      B    C
 
0  1.0    5.0    1
 
  
  
# To keep only the columns without null values:
df.dropna(axis=1)
 
  
  
# If we want to display all the rows that have at least 2 non-null values:
df.dropna(thresh=2)
 
# Output:
 
      A      B    C
 
0  1.0    5.0    1
 
1  2.0    NaN    2
 
  
  
# Rows with at least 3 non-null values:
 
df.dropna(thresh=3)
 
# Output:
 
      A      B    C
 
0  1.0    5.0    1
 
  
  
# To fill null fields with a given value:
df.fillna(value='FILL VALUE')
# Output:

    A            B            C

0  1            5            1

1  2            FILL VALUE  2

2  FILL VALUE  FILL VALUE  3
  
  
# But many times what we want to do is to replace these null fields with, for example, the «mean» of the columns. We can do it this way:
df['A'].fillna(value=df['A'].mean())
 
# Output:
 
0    1.0
 
1    2.0
 
2    1.5  # *
 
Name: A, dtype: float64
 
 
 
# * The Null field has been filled with the mean of the column
 
</syntaxhighlight>
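
The mean-filling idea can be applied to every column at once, and <code>dropna()</code> can be restricted with its <code>subset</code> and <code>how</code> parameters. A minimal sketch, reusing the same three-column DataFrame as above; the result names (<code>filled</code>, <code>rows_with_a</code>, <code>cols_kept</code>) are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan],
                   "B": [5, np.nan, np.nan],
                   "C": [1, 2, 3]})

# df.mean() returns one mean per column; fillna() accepts that Series and
# fills each column's nulls with that column's own mean.
filled = df.fillna(df.mean())

# Drop a row only when a specific column is null (the subset parameter).
rows_with_a = df.dropna(subset=["A"])

# Drop only the columns that are entirely null (none here, so all three remain).
cols_kept = df.dropna(axis=1, how="all")
```

None of these calls modifies <code>df</code>; each one returns a new DataFrame.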
 
 
 
 
 
<br />
 
 
 
===GroupBy===
 
 
 
 
 
<br />
 
===Merging,Joining,and Concatenating===
 
 
 
 
 
<br />
 
===Operations===
 
 
 
 
 
<br />
 
===Data Input and Output===
 
  
  
 
<br />
 
<br />

Latest revision as of 15:47, 11 September 2024


For a standard Python tutorial go to Python



Courses

  • Udemy - Python for Data Science and Machine Learning Bootcamp
https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/



Anaconda

Anaconda is a free and open source distribution of the Python and R programming languages for data science and machine learning related applications (large-scale data processing, predictive analytics, scientific computing), that aims to simplify package management and deployment. Package versions are managed by the package management system conda. https://en.wikipedia.org/wiki/Anaconda_(Python_distribution)

In other words, Anaconda can be seen as a package (a distribution) that includes not only Python (or R) but also many libraries that are used in data science, as well as its own virtual environment system. It is an "all-in-one" install that is extremely popular in data science and machine learning.



Installation

Installation from the official Anaconda Web site: https://docs.anaconda.com/anaconda/install/



Anaconda comes with a few IDEs

  • Jupyter Lab
  • Jupyter Notebook
  • Spyder
  • Qtconsole
  • and others



Anaconda Navigator

Anaconda Navigator is a GUI that helps you to easily start important applications and manage the packages in your local Anaconda installation

You can open the Anaconda Navigator from the Terminal:

anaconda-navigator



Jupyter

Jupyter comes with Anaconda.

  • It is a development environment (IDE) where we can write code; it also lets us display images and write Markdown notes.
  • It is the most popular IDE in data science for exploring and analyzing data.
  • Other well-known editors and IDEs for Python are Sublime Text and PyCharm.
  • It comes in two flavours: Jupyter Lab and Jupyter Notebook.



Remote connection

https://jupyter-notebook.readthedocs.io/en/stable/public_server.html


A**1


(base) adelo@vmi346715:~/.jupyter$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mykey.key -out mycert.pem
Generating a RSA private key
......................................+++++
....................................+++++
writing new private key to 'mykey.key'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:IE	
State or Province Name (full name) [Some-State]:Dublin
Locality Name (eg, city) []:Dublin
Organization Name (eg, company) [Internet Widgits Pty Ltd]:.
Organizational Unit Name (eg, section) []:.
Common Name (e.g. server FQDN or YOUR name) []:sinfronteras    
Email Address []:adeloaleman@gmail.com
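
Once the certificate and key exist, the notebook server has to be told to use them. The following is a minimal sketch of a jupyter_notebook_config.py, assuming the classic Notebook server (the NotebookApp options described in the public-server guide linked above). The absolute paths (a home directory of /home/adelo is inferred from the prompt) and the port number are assumptions to adapt.

```python
# ~/.jupyter/jupyter_notebook_config.py (sketch, not a complete configuration)
c = get_config()  # provided by Jupyter when it loads this file

# Certificate and key generated with openssl above (paths are assumed).
c.NotebookApp.certfile = '/home/adelo/.jupyter/mycert.pem'
c.NotebookApp.keyfile = '/home/adelo/.jupyter/mykey.key'

# Listen on all interfaces so the server is reachable remotely.
c.NotebookApp.ip = '*'

# Do not try to open a browser on the server itself.
c.NotebookApp.open_browser = False

# Any free port; 9999 is only an example.
c.NotebookApp.port = 9999
```

With a self-signed certificate the browser will show a warning on the first https connection; the guide linked above also recommends setting a password before exposing the server.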



Share Jupyter Notebook online

  • GitHub:
https://docs.github.com/en/github/managing-files-in-a-repository/working-with-jupyter-notebook-files-on-github
Example: https://github.com/adeloaleman/AmazonLaptopsDashboard/blob/master/DataAnalysis/data_analysis2.ipynb


  • Nbviewer:
https://nbviewer.jupyter.org/
Example: https://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/main/tutorial/06%20-%20Linking%20and%20Interactions.ipynb



Customize Jupyter


Themes

https://github.com/dunovank/jupyter-themes

See the theme shown on this page: https://gist.github.com/pierrejoubert73/902cc94d79424356a8d20be2b382e1ab


jt   -t oceans16     -cellw 98%   -lineh 120   -fs 14   -nfs 14   -dfs 14   -ofs 14


https://www.kaggle.com/getting-started/97540

jt   -t monokai      -cellw 98%   -lineh 120   -fs 14   -nfs 14   -dfs 14   -ofs 14   -f fira   -nf ptsans   -N   -kl   -cursw 2   -cursc r   -T



Extensions

This post mentions some nice extensions and configurations: https://towardsdatascience.com/bringing-the-best-out-of-jupyter-notebooks-for-data-science-f0871519ca29


Unofficial Jupyter Notebook Extensions

https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/index.html

This is very important. There are very nice extensions in this package:


Installation

https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html

I had some issues installing it. The default installation method is:

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

I could not install the package correctly this way. The installation returned no errors and the extension showed up in Jupyter Notebook, but I could not enable the extensions.


It appears to be a problem with the installation location. I was using conda, but conda was causing problems: installing packages took a very long time and afterwards the package did not seem to be available.


In the following post I found a solution for installing nbextensions using pip: https://github.com/ipython-contrib/jupyter_contrib_nbextensions/issues/1127

pip install --upgrade jupyter_contrib_nbextensions
jupyter contrib nbextension install  --sys-prefix  --symlink

I think I used «--symlink», but I am not completely sure. I also ran the --upgrade, but I believe it was the --sys-prefix --symlink options that made the difference.


If the Nbextensions tab is not shown, try reinstalling the configurator: https://github.com/Jupyter-contrib/jupyter_nbextensions_configurator

pip install jupyter_nbextensions_configurator

or

conda install -c conda-forge jupyter_nbextensions_configurator



CustomJS and CustomCSS files

This is a good post: https://forums.fast.ai/t/jupyter-notebook-enhancements-tips-and-tricks/17064

Keyboard Shortcut Customization: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Custom%20Keyboard%20Shortcuts.html



custom.js
/** My settings */

// This is to enable syntax highlighting for SQL code: 
// https://stackoverflow.com/questions/43641362/adding-syntax-highlighting-to-jupyter-notebook-cell-magic
require(['notebook/js/codecell'], function(codecell) {
  codecell.CodeCell.options_default.highlight_modes['magic_text/x-mssql'] = {'reg':[/^%%sql/]} ;
  Jupyter.notebook.events.one('kernel_ready.Kernel', function(){
  Jupyter.notebook.get_cells().map(function(cell){
      if (cell.cell_type == 'code'){ cell.auto_highlight(); } }) ;
  });
});


// My plain theme
// This is a good post from which I took some ideas to write the following function: https://forums.fast.ai/t/jupyter-notebook-enhancements-tips-and-tricks/17064
function plainTheme() {
    var input_promp_fields = document.getElementsByClassName("prompt_container");
    var text_render_fields = document.getElementsByClassName("text_cell_render");

    if (input_promp_fields[0].style.visibility == "collapse"){
        action = "visible";
        input_marginLeft = "0px";
        border_top  = "3px";
        prompt_width = "74px";
        padding_top = "0px";
        output_margin = "40px";
    }else{
        action = "collapse";
        input_marginLeft = "74px";
        border_top  = '0px';
        prompt_width = "74px";
        padding_top = "40px";
        output_margin = "40px";
    }

    // To use !important we have to set it this way, using jQuery:
    // https://makitweb.com/how-to-add-important-to-css-property-with-jquery/
    var text_cell_fields = document.getElementsByClassName("text_cell");
    $(text_cell_fields).ready(function(){
        $('.input_prompt').css({
            'cssText': `width: 40px !important; max-width: ${prompt_width} !important; min-width: ${prompt_width} !important;`
        });
    });

    $(document).ready(function(){
        $(".prompt_container").css(
            'visibility', `${action}`
        );
        
        $(".input").css(
            'padding-left', `${input_marginLeft}`
        );
        
        $(".output_subarea").css(
            'margin-left', `${output_margin}`
        );
                    
        $('.cell').css({
            'cssText': `border-top-width: ${border_top} !important; border-bottom-width: ${border_top} !important;`
        });
        
        $(".collapsible_headings_ellipsis").css({
            'cssText': `padding-top:${padding_top} !important; border-top-width: ${border_top} !important; border-bottom-width: ${border_top} !important;`
        });

        $(".text_cell_render").css({
            'cssText': `margin-left: -10px;`
        });
    });            
}

Jupyter.keyboard_manager.command_shortcuts.add_shortcut('Alt-Ctrl-Q', {
    help : '...',
    help_index : 'zz',
    handler : function (event) {
        plainTheme();
    return false;
    }}
);

Jupyter.keyboard_manager.edit_shortcuts.add_shortcut('Alt-Ctrl-Q', {
    help : '...',
    help_index : 'zz',
    handler : function (event) {
        plainTheme();
    return false;
    }}
);


// This could be very useful. It lets you insert text automatically into a cell
// https://forums.fast.ai/t/jupyter-notebook-enhancements-tips-and-tricks/17064/27
Jupyter.keyboard_manager.edit_shortcuts.add_shortcut('Ctrl-Shift-J', {
    help : '...',
    help_index : 'zz',
    handler : function (event) {
        document.body.style.background = 'blue'
        var target = Jupyter.notebook.get_selected_cell()
        var cursor = target.code_mirror.getCursor()
        var before = target.get_pre_cursor()
        var after = target.get_post_cursor()
        target.set_text(before + 'from IPython.display import display, HTML\ndisplay(HTML("<style>.container { width:98% !important;}</style>"))' + after)
        cursor.ch += 20 // where to put your cursor
        target.code_mirror.setCursor(cursor)
        return false;
    }}
);


// To get the real value of a css field: https://stackoverflow.com/questions/26074476/document-body-style-backgroundcolor-doesnt-work-with-external-css-style-sheet
// window.getComputedStyle(document.body).backgroundColor
// window.getComputedStyle(document.getElementsByClassName("input_area")[0]).backgroundColor



custom.css
/*  My settings  */

.container { width:98% !important; }
/* document.getElementById("notebook-container").style.minWidth = "50%"; */
/* document.getElementById("notebook-container").style.maxWidth = "50%"; */

#notebook-container {
 width:98% !important;
}

.CodeMirror-gutters {
 background-color: transparent !important;
 background: transparent !important;
}

.CodeMirror-linenumber {
 margin-left: -20px !important;
}

.output_subarea {
 margin-left: 40px !important;
}

#toc .fa-fw {
 color: blue !important;
}

#toc .highlight_on_scroll {
 margin-left: -4px !important;
 
}

#toc {
 padding-left: 10px !important;
}

/*  I have also replaced the color
 *  #a6e22e   with   #388bfd
 *  in the entire custom.css
 */

/* I have also changed some of the properties of the toc directly in the theme code above:

#toc-wrapper {
 z-index: 90;
 position: fixed !important;
 display: flex;
 flex-direction: column;
 overflow: hidden;
 padding: 10px;
 padding-top: 40px !important;
 border-style: solid;
 border-width: thin;
 border-right-width: medium !important;
 background-color: #1e1e1e !important;
}
#toc-wrapper.ui-draggable.ui-resizable.sidebar-wrapper {
 border-color: rgba(93,92,82,.25) !important;
}
#toc a,
#navigate_menu a,
.toc {
 color: #f8f8f0 !important;
 font-size: 16pt !important;
}
#toc li > span:hover {
 background-color: rgba(93,92,82,.25) !important;
}
#toc a:hover,
#navigate_menu a:hover,
.toc {
 color: #DAA520 !important;
 font-size: 16pt !important;
}
#toc-wrapper .toc-item-num {
 color: #388bfd !important;
 font-size: 16pt !important;
}
*/



Configurations from the Jupyter notebook

from IPython.display import display, HTML

display(HTML("<style>.container { width:98% !important;}</style>"))

display(HTML('<style>.prompt.input_prompt{display:none !important;}</style>'))
display(HTML('<style>.prompt.input_prompt{visibility: visible !important;}</style>'))
display(HTML('<style>.prompt.input_prompt{margin-left: 50px}</style>'))
display(HTML('<style>.prompt.input_prompt{visibility: visible !important; width: 0px !important; min-width: 0px !important}</style>'))

display(HTML('<style>.input_area{margin-left: -50px;}</style>'))
display(HTML('<style>.input{margin-left: -20px;}</style>'))

display(HTML('<style>.output_area{margin-left: 55px}</style>'))

# display(HTML('<style>.cell{margin-bottom: -5px !important; margin-top: -5px !important;}</style>'))
# display(HTML('<style>.code_cell{margin-bottom: -5px !important; margin-top: -5px !important;}</style>'))

# display(HTML('<style>.output_wrapper{margin-bottom: 0px !important; margin-top: 0px !important;}</style>'))



Online Jupyter

There are many sites that provide solutions to run your Jupyter Notebook in the cloud: https://www.dataschool.io/cloud-services-for-jupyter-notebook/

I have tried:

  • https://cocalc.com/app
https://cocalc.com/projects/595bf475-61a7-47fa-af69-ba804c3f23f9/files/?session=default
It looks good, but some of its options are not free.


  • https://www.kaggle.com/
https://www.kaggle.com/adeloaleman/kernel1917a91630/edit
It looks good, but I could not find a way to add a TOC.


  • https://drive.google.com
  • https://colab.research.google.com
This is the one I am using now.



Some remarks


Executing Terminal Commands in Jupyter Notebooks

https://support.anaconda.com/hc/en-us/articles/360023858254-Executing-Terminal-Commands-in-Jupyter-Notebooks

If we are in the Notebook and want to run a shell command rather than Python code, we prefix it with ! (or use an IPython magic command, which starts with %, such as %ls).

Try, for example:

%ls 
!pwd

It's the same as if you opened up a terminal and typed it without the !
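
The ! prefix only works inside IPython or Jupyter; it is not valid in a plain Python script. A rough plain-Python equivalent, sketched here with the standard subprocess module and assuming a Unix-like system where the echo command is available:

```python
import subprocess

# Roughly what `!echo hello` does in a notebook cell: run the command
# and capture what it prints.
result = subprocess.run(["echo", "hello"],
                        capture_output=True, text=True, check=True)
output = result.stdout.strip()
print(output)
```

Passing the command as a list avoids going through the shell, so no quoting of arguments is needed.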



Creating Presentations in Jupyter Notebook with RevealJS


Some of the most popular Python Data Science Libraries

  • NumPy
  • SciPy
  • Pandas
  • Seaborn
  • Scikit-learn
  • Matplotlib
  • Plotly
  • PySpark



NumPy and Pandas


Data Visualization with Python


Natural Language Processing


Dash - Plotly


Scrapy


Using SQL in Jupyter

Connecting to a database in Jupyter


https://pypi.org/project/ipython-sql/

https://stackoverflow.com/questions/454854/no-module-named-mysqldb

https://stackoverflow.com/questions/5178292/pip-install-mysql-python-fails-with-environmenterror-mysql-config-not-found

https://docs.kyso.io/guides/sql-interface-within-jupyterlab

https://www.datacamp.com/community/tutorials/sql-interface-within-jupyterlab

https://stackoverflow.com/questions/43641362/adding-syntax-highlighting-to-jupyter-notebook-cell-magic

https://www.sqlshack.com/learn-jupyter-notebooks-for-sql-server/


Check the sources above. I think the only steps I needed the last time I installed it came from the first three sources:

pip install ipython-sql

sudo apt install default-libmysqlclient-dev

pip install mysqlclient

sudo apt-get install python3-mysqldb


Then add SQL syntax highlighting to Jupyter as described in the corresponding source above.
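
With ipython-sql installed, a notebook loads the extension with %load_ext sql, connects with %sql followed by a connection URL, and then runs queries in %%sql cells. Those magics only work inside IPython, so the sketch below shows the same connect-then-query flow in plain Python, using the standard-library sqlite3 module instead of MySQL. The laptops table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE laptops (brand TEXT, price REAL)")
conn.executemany("INSERT INTO laptops VALUES (?, ?)",
                 [("A", 900.0), ("B", 1200.0), ("A", 1100.0)])

# The same kind of query one would put in a %%sql cell.
rows = conn.execute(
    "SELECT brand, AVG(price) FROM laptops GROUP BY brand ORDER BY brand"
).fetchall()
conn.close()
print(rows)
```

For MySQL the flow is the same; only the driver (mysqlclient, installed above) and the connection details change.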