{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from dataset import Dataset\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The idea behind Dataset is to simplify most of the tasks that you normally do with a pandas DataFrame, which is especially helpful when you're starting out in Python. At any time, you can access the underlying `pandas DataFrame` that holds the data, in case you need the `numpy` representation of the values or want to access specific locations in your data.\n", "\n", "## Data loading\n", "\n", "To start with Dataset, load your data the same way you would with `pandas`, by passing the URL or file location to the constructor (`Dataset()`). If you need to pass additional `pandas` parameters in this call, such as the separator or whether to use headers, simply add them after the file location.\n", "\n", "I'm using a CSV file hosted by the University of Arizona." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "URL=\"https://www2.cs.arizona.edu/classes/cs120/fall17/ASSIGNMENTS/assg02/Pokemon.csv\"\n", "pokemon = Dataset(URL, delimiter=',', header=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this point, you have access to the methods provided by Dataset to describe the dataset, clean up your data, perform feature selection, or draw plots that are useful for feature engineering.\n", "\n", "If you already have a DataFrame and want to use it inside a Dataset, you can also import it with:\n", "\n", " >>> my_dataset = Dataset.from_dataframe(my_dataframe)\n", "\n", "### Access to internal DataFrame\n", "\n", "If you want to access your `DataFrame`, simply use the `features` property of the `Dataset` object:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack \\\n", "0 1.0 Bulbasaur Grass Poison 318.0 45.0 49.0 \n", "1 2.0 Ivysaur Grass Poison 405.0 60.0 62.0 \n", "2 3.0 Venusaur Grass Poison 525.0 80.0 82.0 \n", "3 3.0 VenusaurMega Venusaur Grass Poison 625.0 80.0 100.0 \n", "4 4.0 Charmander Fire NaN 309.0 39.0 52.0 \n", ".. ... ... ... ... ... ... ... \n", "795 719.0 Diancie Rock Fairy 600.0 50.0 100.0 \n", "796 719.0 DiancieMega Diancie Rock Fairy 700.0 50.0 160.0 \n", "797 720.0 HoopaHoopa Confined Psychic Ghost 600.0 80.0 110.0 \n", "798 720.0 HoopaHoopa Unbound Psychic Dark 680.0 80.0 160.0 \n", "799 721.0 Volcanion Fire Water 600.0 80.0 110.0 \n", "\n", " Defense Sp. Atk Sp. Def Speed Generation Legendary \n", "0 49.0 65.0 65.0 45.0 1.0 False \n", "1 63.0 80.0 80.0 60.0 1.0 False \n", "2 83.0 100.0 100.0 80.0 1.0 False \n", "3 123.0 122.0 120.0 80.0 1.0 False \n", "4 43.0 60.0 50.0 65.0 1.0 False \n", ".. ... ... ... ... ... ... \n", "795 150.0 100.0 150.0 50.0 6.0 True \n", "796 110.0 160.0 110.0 110.0 6.0 True \n", "797 60.0 150.0 130.0 70.0 6.0 True \n", "798 60.0 170.0 130.0 80.0 6.0 True \n", "799 120.0 130.0 90.0 70.0 6.0 True \n", "\n", "[800 rows x 13 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If, instead of displaying the entire pandas `DataFrame`, we want to see a specific feature, we can refer to it by name right after the `features` property. In this case, let's have a look at the feature called `Name`, which holds the pokemon name." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Bulbasaur\n", "1 Ivysaur\n", "2 Venusaur\n", "3 VenusaurMega Venusaur\n", "4 Charmander\n", " ... 
\n", "795 Diancie\n", "796 DiancieMega Diancie\n", "797 HoopaHoopa Confined\n", "798 HoopaHoopa Unbound\n", "799 Volcanion\n", "Name: Name, Length: 800, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.features.Name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of this call is a pandas `Series`.\n", "\n", "To show only the first 3 rows of the DataFrame instead:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk \\\n", "0 1.0 Bulbasaur Grass Poison 318.0 45.0 49.0 49.0 65.0 \n", "1 2.0 Ivysaur Grass Poison 405.0 60.0 62.0 63.0 80.0 \n", "2 3.0 Venusaur Grass Poison 525.0 80.0 82.0 83.0 100.0 \n", "\n", " Sp. Def Speed Generation Legendary \n", "0 65.0 45.0 1.0 False \n", "1 80.0 60.0 1.0 False \n", "2 100.0 80.0 1.0 False " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.features.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Number of features and number of samples in the dataset are accessible through the following properties:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nr of features: 13\n", "Nr of samples: 800\n" ] } ], "source": [ "print('Nr of features:', pokemon.num_features)\n", "print('Nr of samples:', pokemon.num_samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Access to numerical/categorical variables\n", "\n", "It is also possible to work only with the numerical or categorical variables in the dataset. To do that you just have to use the properties: `.categorical` or `.numerical` to access those portions of the dataframe that only contain those feature subtypes.\n", "\n", "If we want to access only the categorical, we type:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Type 1 Type 2 Legendary\n", "0 Bulbasaur Grass Poison False\n", "1 Ivysaur Grass Poison False\n", "2 Venusaur Grass Poison False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.categorical.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to access the numerical ones:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " # Total HP Attack Defense Sp. Atk Sp. Def Speed Generation\n", "0 1.0 318.0 45.0 49.0 49.0 65.0 65.0 45.0 1.0\n", "1 2.0 405.0 60.0 62.0 63.0 80.0 80.0 60.0 1.0\n", "2 3.0 525.0 80.0 82.0 83.0 100.0 100.0 80.0 1.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.numerical.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we only want the names of the numerical or categorical variables, we can use either:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Name', 'Type 1', 'Type 2', 'Legendary']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.categorical_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Name', 'Type 1', 'Type 2', 'Legendary']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.names('categorical')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which, of course, also applies to numerical." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set the target variable\n", "\n", "At this point, we can do something very useful when working with datasets: select which feature will be the target variable. 
By doing so, `Dataset` will separate that feature from the rest, allowing us to use special feature engineering methods that we will see later:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "pokemon.set_target('Legendary');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we select the target, that feature disappears from the `features` property, as you can see when we call the `head` method again:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk \\\n", "0 1.0 Bulbasaur Grass Poison 318.0 45.0 49.0 49.0 65.0 \n", "1 2.0 Ivysaur Grass Poison 405.0 60.0 62.0 63.0 80.0 \n", "2 3.0 Venusaur Grass Poison 525.0 80.0 82.0 83.0 100.0 \n", "\n", " Sp. Def Speed Generation \n", "0 65.0 45.0 1.0 \n", "1 80.0 60.0 1.0 \n", "2 100.0 80.0 1.0 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.features.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our feature is now in a new property called `target`. We can access it by calling it on our Dataset, which returns a `pandas Series` object to work with." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "795 True\n", "796 True\n", "797 True\n", "798 True\n", "799 True\n", "Name: Legendary, Length: 800, dtype: bool" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If at any point during your work you want to unset the target variable and make it part of the dataset again as a normal feature, you just have to call:\n", "\n", " >>> pokemon.unset_target()\n", " \n", "From that point, no target variable is defined within the dataset and all features are considered normal features.\n", "\n", "### Access to features\n", "\n", "From this point on, to access a DataFrame containing all features, including the target variable, we must use the `all` property because, as you can see, `features` no longer contains the target variable:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk \\\n", "0 1.0 Bulbasaur Grass Poison 318.0 45.0 49.0 49.0 65.0 \n", "1 2.0 Ivysaur Grass Poison 405.0 60.0 62.0 63.0 80.0 \n", "2 3.0 Venusaur Grass Poison 525.0 80.0 82.0 83.0 100.0 \n", "\n", " Sp. Def Speed Generation Legendary \n", "0 65.0 45.0 1.0 False \n", "1 80.0 60.0 1.0 False \n", "2 100.0 80.0 1.0 False " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.all.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data description\n", "\n", "The first thing we can do with our dataset is describe it, to learn the types of the variables and whether we have NAs or incomplete features." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12 Features. 800 Samples\n", "Available types: [dtype('float64') dtype('O')]\n", " · 3 categorical features\n", " · 9 numerical features\n", " · 1 categorical features with NAs\n", " · 0 numerical features with NAs\n", " · 12 Complete features\n", "--\n", "Target: Legendary (bool)\n", "'Legendary'\n", " · Min.: 0.0000\n", " · 1stQ: 0.0000\n", " · Med.: 0.0000\n", " · Mean: 0.0813\n", " · 3rdQ: 0.0000\n", " · Max.: 1.0000\n" ] } ], "source": [ "pokemon.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want more information about the values each feature takes, you can use the `summary()` method:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features Summary (all):\n", "'#' : float64 Min.(1.0) 1stQ(184.) Med.(364.) Mean(362.) 3rdQ(539.) Max.(721.)\n", "'Name' : object 800 categs. 'Bulbasaur'(1, 0.0013) 'Ivysaur'(1, 0.0013) 'Venusaur'(1, 0.0013) 'VenusaurMega Venusaur'(1, 0.0013) ...\n", "'Type 1' : object 18 categs. 
'Grass'(112, 0.1400) 'Fire'(98, 0.1225) 'Water'(70, 0.0875) 'Bug'(69, 0.0862) ...\n", "'Type 2' : object 18 categs. 'Poison'(97, 0.2343) 'nan'(35, 0.0845) 'Flying'(34, 0.0821) 'Dragon'(33, 0.0797) ...\n", "'Total' : float64 Min.(180.) 1stQ(330.) Med.(450.) Mean(435.) 3rdQ(515.) Max.(780.)\n", "'HP' : float64 Min.(1.0) 1stQ(50.0) Med.(65.0) Mean(69.2) 3rdQ(80.0) Max.(255.)\n", "'Attack' : float64 Min.(5.0) 1stQ(55.0) Med.(75.0) Mean(79.0) 3rdQ(100.) Max.(190.)\n", "'Defense' : float64 Min.(5.0) 1stQ(50.0) Med.(70.0) Mean(73.8) 3rdQ(90.0) Max.(230.)\n", "'Sp. Atk' : float64 Min.(10.0) 1stQ(49.7) Med.(65.0) Mean(72.8) 3rdQ(95.0) Max.(194.)\n", "'Sp. Def' : float64 Min.(20.0) 1stQ(50.0) Med.(70.0) Mean(71.9) 3rdQ(90.0) Max.(230.)\n", "'Speed' : float64 Min.(5.0) 1stQ(45.0) Med.(65.0) Mean(68.2) 3rdQ(90.0) Max.(180.)\n", "'Generation': float64 Min.(1.0) 1stQ(2.0) Med.(3.0) Mean(3.32) 3rdQ(5.0) Max.(6.0)\n", "'Legendary' : bool 2 categs. 'False'(735, 0.9187) 'True'(65, 0.0813) \n" ] } ], "source": [ "pokemon.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Manipulation\n", "\n", "### Remove columns\n", "\n", "You can easily remove columns from the data with the `drop_columns()` method. You can pass a single column/feature name or a list of features, and they will disappear from the dataset." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.drop_columns('#')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also remove all the columns that are not in a list. To do that, use `keep_columns()`, passing the feature name or list of feature names you want to keep in your dataset. 
For example, if we want to keep only the numerical features:\n", "\n", " pokemon.keep_columns(pokemon.numerical_features)\n", " \n", "or if we want to keep only a couple of well-known features:\n", "\n", " pokemon.keep_columns(['Total', 'Attack'])\n", " \n", "### Add columns\n", "\n", "You can also add columns or entire dataframes to your existing dataset. If you simply want to add a pandas `Series` to the existing dataset, call:\n", "\n", " pokemon.add_columns(my_data_series)\n", " \n", "If you want to add an entire dataframe, the mechanism is exactly the same:\n", "\n", " pokemon.add_columns(my_dataframe)\n", "\n", "In the following cells, all features are shown, then the categorical features are extracted to the variable `categoricals` and removed from the dataset. Then `categoricals` is added back to the dataset, resulting in a dataset equivalent to the one we started with." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def \\\n", "0 Bulbasaur Grass Poison 318.0 45.0 49.0 49.0 65.0 65.0 \n", "1 Ivysaur Grass Poison 405.0 60.0 62.0 63.0 80.0 80.0 \n", "2 Venusaur Grass Poison 525.0 80.0 82.0 83.0 100.0 100.0 \n", "\n", " Speed Generation \n", "0 45.0 1.0 \n", "1 60.0 1.0 \n", "2 80.0 1.0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Original dataset\n", "pokemon.features.head(3)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Total HP Attack Defense Sp. Atk Sp. Def Speed Generation\n", "0 318.0 45.0 49.0 49.0 65.0 65.0 45.0 1.0\n", "1 405.0 60.0 62.0 63.0 80.0 80.0 60.0 1.0\n", "2 525.0 80.0 82.0 83.0 100.0 100.0 80.0 1.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categoricals = pokemon.categorical\n", "\n", "pokemon.drop_columns(pokemon.categorical_features)\n", "pokemon.features.head(3)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Total HP Attack Defense Sp. Atk Sp. Def Speed Generation \\\n", "0 318.0 45.0 49.0 49.0 65.0 65.0 45.0 1.0 \n", "1 405.0 60.0 62.0 63.0 80.0 80.0 60.0 1.0 \n", "2 525.0 80.0 82.0 83.0 100.0 100.0 80.0 1.0 \n", "\n", " Name Type 1 Type 2 \n", "0 Bulbasaur Grass Poison \n", "1 Ivysaur Grass Poison \n", "2 Venusaur Grass Poison " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.add_columns(categoricals)\n", "pokemon.features.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Remove samples\n", "\n", "To remove some samples (rows) from the dataset, call the `drop_samples()` method, passing the list of indices you want to remove. For example:\n", "\n", " pokemon.drop_samples([34, 56, 78])\n", " \n", "### Samples Matching criteria\n", "\n", "If you want to select samples for which one of the features fulfills a certain criterion, you can get the list of indices of those samples by calling:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 70 samples for which 'Type 1' is valued 'Grass'\n" ] } ], "source": [ "print('There are', len(pokemon.samples_matching('Grass', 'Type 1')),\n", " 'samples for which \\'Type 1\\' is valued \\'Grass\\'')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Conversion\n", "\n", "Data conversion in Dataset is primarily between numerical types (`int` <-> `float`) and between categorical and numerical features.\n", "\n", "### Type conversion (int, float)\n", "\n", "To convert between float and int:\n", "\n", " pokemon.to_int('this_is_a_float_feature')\n", "\n", "or\n", "\n", " pokemon.to_float(['int_feature_1', 'int_feature_2'])\n", "\n", "Again, you can pass a single name or a list of names between brackets.\n", "\n", "### Categorical <-> Numerical\n", "\n", "We 
can also convert numerical features to categorical and vice versa (when it makes sense), using:\n", "\n", " pokemon.to_categorical('my_numerical_feature')\n", " \n", "or\n", "\n", " pokemon.to_numerical('my_categorical_feature')\n", " \n", "In our case, it seems that the feature `Generation` could be considered **categorical**. To convert it to a category, it is better to convert it to `int` first and then to a category. We can call both methods on the same line. We should also convert the target variable to the categorical type, since `bool` is often interpreted as a number (0/1)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features Summary (all):\n", "'Total' : float64 Min.(180.) 1stQ(330.) Med.(450.) Mean(435.) 3rdQ(515.) Max.(780.)\n", "'HP' : float64 Min.(1.0) 1stQ(50.0) Med.(65.0) Mean(69.2) 3rdQ(80.0) Max.(255.)\n", "'Attack' : float64 Min.(5.0) 1stQ(55.0) Med.(75.0) Mean(79.0) 3rdQ(100.) Max.(190.)\n", "'Defense' : float64 Min.(5.0) 1stQ(50.0) Med.(70.0) Mean(73.8) 3rdQ(90.0) Max.(230.)\n", "'Sp. Atk' : float64 Min.(10.0) 1stQ(49.7) Med.(65.0) Mean(72.8) 3rdQ(95.0) Max.(194.)\n", "'Sp. Def' : float64 Min.(20.0) 1stQ(50.0) Med.(70.0) Mean(71.9) 3rdQ(90.0) Max.(230.)\n", "'Speed' : float64 Min.(5.0) 1stQ(45.0) Med.(65.0) Mean(68.2) 3rdQ(90.0) Max.(180.)\n", "'Generation': object 6 categs. '1'(166, 0.2075) '2'(165, 0.2062) '3'(160, 0.2000) '4'(121, 0.1512) ...\n", "'Name' : object 800 categs. 'Bulbasaur'(1, 0.0013) 'Ivysaur'(1, 0.0013) 'Venusaur'(1, 0.0013) 'VenusaurMega Venusaur'(1, 0.0013) ...\n", "'Type 1' : object 18 categs. 'Grass'(112, 0.1400) 'Fire'(98, 0.1225) 'Water'(70, 0.0875) 'Bug'(69, 0.0862) ...\n", "'Type 2' : object 18 categs. 'Poison'(97, 0.2343) 'nan'(35, 0.0845) 'Flying'(34, 0.0821) 'Dragon'(33, 0.0797) ...\n", "'Legendary' : bool 2 categs. 
'False'(735, 0.9187) 'True'(65, 0.0813) \n" ] } ], "source": [ "pokemon.to_int('Generation').to_categorical(['Generation'])\n", "pokemon.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Histograms and density plots\n", "\n", "## Histograms\n", "\n", "This function helps you to plot double histograms to compare feature-distributions for all possible target values. So, you **must set the target variable** before calling this method, or provide the name of the variable you want to compare your distribution against.\n", "\n", "You can plot the histogram of any of the numerical variables with respect to the target variable to see what is its distribution, using (in this case we're plotting a feature called `Total` against the target variable –a binomial boolean feature previsouly set):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXsAAAEGCAYAAACEgjUUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAVK0lEQVR4nO3dfbBcdZ3n8fc3D3BjYMgDSYxcMAGzjji6MV5DKHVKDaPIsgbRshBqzDBQcXyY4LgWglO1MlVawo4l49SmwJTIBOQh2fgAi7urIWLNLhbJ3GB4DJgIIVyKkBgedGaIJuS7f/RJaEPCvbmnO307v/er6tY953fO6f7+ku7PPf3r07+OzESSdGQb1ekCJEntZ9hLUgEMe0kqgGEvSQUw7CWpAGM6XQDA8ccfnzNmzOh0GZLUVdatW/frzJwylH1HRNjPmDGD/v7+TpchSV0lIp4Y6r4O40hSAQx7SSqAYS9JBRgRY/aSNFS7du1iYGCAnTt3drqUw6anp4fe3l7Gjh077Nsw7CV1lYGBAY499lhmzJhBRHS6nLbLTHbs2MHAwAAzZ84c9u04jCOpq+zcuZPJkycXEfQAEcHkyZNrv5Ix7CV1nVKCfq9W9Newl6QCOGYvqavdvGZLS2/v/NNOGnSf0aNH85a3vGXf+g9/+EMONgvA5s2bOfvss3nwwQdbVeKwGPbSQbQyRIYSIOoe48aNY/369Z0u45A4jCNJLbB582be/e53M2fOHObMmcPPf/7zV+zz0EMPMXfuXGbPns1b3/pWNm7cCMB3v/vdfe2f/OQneemll1pen2EvSYfoxRdfZPbs2cyePZsPf/jDAEydOpVVq1Zx7733snz5chYvXvyK46699louueQS1q9fT39/P729vWzYsIHly5dz9913s379ekaPHs1NN93U8podxpGkQ3SgYZxdu3bx2c9+dl9g//KXv3zFcaeffjpf/epXGRgY4Nxzz2XWrFmsXr2adevW8Y53vANo/CGZOnVqy2s27CWpBa6++mqmTZvGfffdx549e+jp6XnFPueffz6nnXYa
P/rRjzjrrLP41re+RWaycOFCvva1r7W1PodxJKkFXnjhBaZPn86oUaO48cYbDzju/thjj3HyySezePFiFixYwP3338/8+fNZuXIl27ZtA+DZZ5/liSeGPHPxkHlmL6mrjZQrnT796U/zkY98hBtuuIEzzzyT8ePHv2KfFStWcOONNzJ27Fhe+9rX8qUvfYlJkybxla98hfe///3s2bOHsWPHsmTJEl7/+te3tL7IzJbe4HD09fWlX16ikcZLL0emDRs28KY3vanTZRx2B+p3RKzLzL6hHO8wjiQVwLCXpAIY9pJUAMNekgpg2EtSAQx7SSrAoNfZR8R3gLOBbZn5J1Xb3wP/Gfg98Cvgwsx8vtp2OXAR8BKwODN/3KbaJQn6r2/t7fVd+Kqbd+zYwfz58wHYunUro0ePZsqUKQCsXbuWo446qrX1tMhQPlT1T8B/B25oalsFXJ6ZuyPiKuBy4IsRcSpwHvBm4HXAnRHxHzKz9VO4SVIHTJ48ed+8OFdccQXHHHMMX/jCF/5gn8wkMxk1auQMngxaSWb+M/Dsfm0/yczd1eo9QG+1vAC4NTN/l5mPA5uAuS2sV5JGpE2bNnHqqadywQUX8OY3v5knn3ySCRMm7Nt+6623cvHFFwPwzDPPcO6559LX18fcuXO555572l5fK6ZL+EtgebV8Ao3w32uganuFiFgELAI46SQ/XSip+z3yyCPccMMN9PX1sXv37oPut3jxYi699FLmzZt32L7JqlbYR8TfAruBQ558OTOXAkuhMV1CnTokaSQ45ZRT6OsbfPaCO++8k0cffXTf+nPPPceLL77IuHHj2lbbsMM+Iv6Cxhu38/PlCXaeAk5s2q23apOkI17z5GejRo2iee6xnTt37lvOzMP+Zu6w3j2IiDOBS4EPZea/N226HTgvIo6OiJnALGBt/TIlqbuMGjWKiRMnsnHjRvbs2cMPfvCDfdvOOOMMlixZsm/9cHyf7VAuvbwFeA9wfEQMAF+mcfXN0cCqiAC4JzP/KjMfiogVwMM0hnc+45U4ktpqkEslO+mqq67iAx/4AFOnTuXtb387v/vd7wBYsmQJn/rUp7j++uvZvXs3733ve/8g/NvBKY6lg3CK45HJKY5f5hTHkqQ/YNhLUgEMe0ldZyQMPx9OreivYS+pq/T09LBjx45iAj8z2bFjBz09PbVuxy8cl9RVent7GRgYYPv27Z0u5bDp6emht7d38B1fhWEvqauMHTuWmTNndrqMruMwjiQVwLCXpAIY9pJUAMNekgpg2EtSAQx7SSqAYS9JBTDsJakAhr0kFcCwl6QCGPaSVADDXpIKYNhLUgGc9bIL+F2okuryzF6SCmDYS1IBDHtJKsCgYR8R34mIbRHxYFPbpIhYFREbq98Tq/aIiH+MiE0RcX9EzGln8ZKkoRnKmf0/AWfu13YZsDozZwGrq3WADwKzqp9FwDWtKVOSVMegYZ+Z/ww8u1/zAmBZtbwMOKep/YZsuAeYEBHTW1WsJGl4hjtmPy0zn66WtwLTquUTgCeb9huo2l4hIhZFRH9E9Jf0LfGS1Am136DNzARyGMctzcy+zOybMmVK3TIkSa9iuGH/zN7hmer3tqr9KeDEpv16qzZJUgcNN+xvBxZWywuB25raP1FdlTMPeKFpuEeS1CGDTpcQEbcA7wGOj4gB4MvAlcCKiLgIeAL4WLX7/wLOAjYB/w5c2IaaJUmHaNCwz8yPH2TT/APsm8Bn6hYlSWotP0ErSQUw7CWpAIa9JBXAsJekAhj2klQAw16SCmDYS1IBDHtJKoBhL0kFMOwlqQCGvSQVwLCXpAIY9pJUAMNekgpg2EtSAQx7SSqAYS9JBTDsJakAhr0kFcCwl6QCGPaSVADDXpIKMKbOwRHxN8DFQAIPABcC04FbgcnAOuDPM/P3NeuUhuTmNVs6XYI0Ig37zD4iTgAWA32Z+SfAaOA84Crg6sx8A/AccFErCpUkDV+tM/vq+HERsQt4DfA0
8D7g/Gr7MuAK4Jqa9yN1tVa+4jj/tJNadlsqx7DP7DPzKeDrwBYaIf8CjWGb5zNzd7XbAHDCgY6PiEUR0R8R/du3bx9uGZKkIagzjDMRWADMBF4HjAfOHOrxmbk0M/sys2/KlCnDLUOSNAR1rsY5A3g8M7dn5i7g+8A7gQkRsXd4qBd4qmaNkqSa6oT9FmBeRLwmIgKYDzwM3AV8tNpnIXBbvRIlSXXVGbNfA6wE7qVx2eUoYCnwReDzEbGJxuWX17WgTklSDbWuxsnMLwNf3q/5MWBunduVJLWWn6CVpALUvc5eqs1PvUrt55m9JBXAsJekAjiMUxg/ti+VyTN7SSqAYS9JBTDsJakAhr0kFcCwl6QCGPaSVADDXpIKYNhLUgEMe0kqgGEvSQUw7CWpAIa9JBXAsJekAhj2klQAw16SCmDYS1IBDHtJKoBhL0kFqBX2ETEhIlZGxCMRsSEiTo+ISRGxKiI2Vr8ntqpYSdLw1D2z/ybwfzLzj4H/CGwALgNWZ+YsYHW1LknqoGGHfUQcB/wpcB1AZv4+M58HFgDLqt2WAefULVKSVE+dM/uZwHbg+oj4RUR8OyLGA9My8+lqn63AtAMdHBGLIqI/Ivq3b99eowxJ0mDqhP0YYA5wTWa+Dfg39huyycwE8kAHZ+bSzOzLzL4pU6bUKEOSNJg6YT8ADGTmmmp9JY3wfyYipgNUv7fVK1GSVNewwz4ztwJPRsQbq6b5wMPA7cDCqm0hcFutCiVJtY2pefxfAzdFxFHAY8CFNP6ArIiIi4AngI/VvA9JUk21wj4z1wN9B9g0v87tSpJay0/QSlIBDHtJKoBhL0kFMOwlqQCGvSQVwLCXpAIY9pJUAMNekgpg2EtSAQx7SSqAYS9JBTDsJakAhr0kFcCwl6QCGPaSVADDXpIKUPebqnQQN6/Z0ukSJGkfz+wlqQCGvSQVwLCXpAIY9pJUAMNekgpQO+wjYnRE/CIi7qjWZ0bEmojYFBHLI+Ko+mVKkupoxZn9JcCGpvWrgKsz8w3Ac8BFLbgPSVINtcI+InqB/wR8u1oP4H3AymqXZcA5de5DklRf3TP7fwAuBfZU65OB5zNzd7U+AJxQ8z4kSTUNO+wj4mxgW2auG+bxiyKiPyL6t2/fPtwyJElDUGe6hHcCH4qIs4Ae4I+AbwITImJMdXbfCzx1oIMzcymwFKCvry9r1KEOcUoIqXsM+8w+My/PzN7MnAGcB/w0My8A7gI+Wu22ELitdpWSpFracZ39F4HPR8QmGmP417XhPiRJh6Als15m5s+An1XLjwFzW3G7klqo//rW32bfha2/TbWFn6CVpAIY9pJUAMNekgpg2EtSAQx7SSqAYS9JBTDsJakAhr0kFcCwl6QCtOQTtJK605rHn611/K9eenkyvPNPO6luOWojz+wlqQCGvSQVwLCXpAIY9pJUAMNekgpg2EtSAQx7SSqAYS9JBfBDVU1uXrNl8J0kqQt5Zi9JBTDsJakAhr0kFcCwl6QCDDvsI+LEiLgrIh6OiIci4pKqfVJErIqIjdXvia0rV5I0HHXO7HcD/yUzTwXmAZ+JiFOBy4DVmTkLWF2tS5I6aNhhn5lPZ+a91fJvgQ3ACcACYFm12zLgnLpFSpLqacl19hExA3gbsAaYlplPV5u2AtMOcswiYBHASScN/0sPvDZekgZX+w3aiDgG+B7wucz8TfO2zEwgD3RcZi7NzL7M7JsyZUrdMiRJr6JW2EfEWBpBf1Nmfr9qfiYiplfbpwPb6pUoSaqrztU4AVwHbMjMbzRtuh1YWC0vBG4bfnmSpFaoM2b/TuDPgQciYn3V9iXgSmBFRFwEPAF8rF6JkqS6hh32mfn/gDjI5vnDvV1JBeu/vrW313dha2+vi/kJWkkqgGEvSQUw7CWpAIa9JBXAsJekAvi1hFKXGe4UIadsebbFlaibeGYvSQUw7CWpAIa9JBXAsJekAhj2klQAw16SCmDYS1IBvM5eUku04itC
T9nyLKfNnNSCarQ/z+wlqQCGvSQVwGEcSUcuvwxlH8/sJakAntlLGlHWPN66Cdta/mZvq18pwGF7teCZvSQVwLCXpAIY9pJUAMNekgrQtrCPiDMj4tGI2BQRl7XrfiRJg2tL2EfEaGAJ8EHgVODjEXFqO+5LkjS4dp3ZzwU2ZeZjmfl74FZgQZvuS5I0iHZdZ38C8GTT+gBwWvMOEbEIWFSt/mtEPNqmWpodD/z6MNxPux0p/QD7MhIdQj++0NZCWqAL/k/+cqg7Hqgvrx/qwR37UFVmLgWWHs77jIj+zOw7nPfZDkdKP8C+jERHSj/AvjRr1zDOU8CJTeu9VZskqQPaFfb/AsyKiJkRcRRwHnB7m+5LkjSItgzjZObuiPgs8GNgNPCdzHyoHfd1iA7rsFEbHSn9APsyEh0p/QD7sk9kZqsKkSSNUH6CVpIKYNhLUgGOmLCPiBMj4q6IeDgiHoqIS6r2SRGxKiI2Vr8nVu0REf9YTedwf0TM6WwPXhYRPRGxNiLuq/ryd1X7zIhYU9W8vHrzm4g4ulrfVG2f0cn69xcRoyPiFxFxR7Xerf3YHBEPRMT6iOiv2rru8QUQERMiYmVEPBIRGyLi9G7sS0S8sfr/2Pvzm4j4XJf25W+q5/uDEXFLlQOte65k5hHxA0wH5lTLxwK/pDFVw38DLqvaLwOuqpbPAv43EMA8YE2n+9DUlwCOqZbHAmuqGlcA51Xt1wKfqpY/DVxbLZ8HLO90H/brz+eBm4E7qvVu7cdm4Pj92rru8VXVtwy4uFo+CpjQrX1p6tNoYCuNDxp1VV9ofBD1cWBctb4C+ItWPlc63sk2/uPdBvwZ8CgwvWqbDjxaLX8L+HjT/vv2G0k/wGuAe2l8AvnXwJiq/XTgx9Xyj4HTq+Ux1X7R6dqrenqB1cD7gDuqJ1nX9aOq6UBh33WPL+C4Klhiv/au68t+9b8fuLsb+8LLsw5Mqh77dwAfaOVz5YgZxmlWvaR5G40z4mmZ+XS1aSswrVo+0JQOJxymEgdVDX2sB7YBq4BfAc9n5u5ql+Z69/Wl2v4CMPnwVnxQ/wBcCuyp1ifTnf0ASOAnEbEuGtN9QHc+vmYC24Hrq+G1b0fEeLqzL83OA26plruqL5n5FPB1YAvwNI3H/jpa+Fw54sI+Io4Bvgd8LjN/07wtG38Gu+Ja08x8KTNn0zgzngv8cYdLOmQRcTawLTPXdbqWFnlXZs6hMZvrZyLiT5s3dtHjawwwB7gmM98G/BuNoY59uqgvAFRj2R8C/sf+27qhL9V7Cgto/CF+HTAeOLOV93FEhX1EjKUR9Ddl5ver5mciYnq1fTqNM2XokikdMvN54C4aL+EmRMTeD8I117uvL9X244Adh7nUA3kn8KGI2Exj5tP3Ad+k+/oB7Dv7IjO3AT+g8Ue4Gx9fA8BAZq6p1lfSCP9u7MteHwTuzcxnqvVu68sZwOOZuT0zdwHfp/H8adlz5YgJ+4gI4DpgQ2Z+o2nT7cDCankhjbH8ve2fqN6dnwe80PSyr6MiYkpETKiWx9F472EDjdD/aLXb/n3Z28ePAj+tzmY6KjMvz8zezJxB4yX2TzPzArqsHwARMT4ijt27TGN8+EG68PGVmVuBJyPijVXTfOBhurAvTT7Oy0M40H192QLMi4jXVFm29/+kdc+VTr8x0cI3ON5F46Xa/cD66ucsGuNYq4GNwJ3ApGr/oPEFK78CHgD6Ot2Hpr68FfhF1ZcHgf9atZ8MrAU20Xi5enTV3lOtb6q2n9zpPhygT+/h5atxuq4fVc33VT8PAX9btXfd46uqbzbQXz3GfghM7OK+jKdxVntcU1vX9QX4O+CR6jl/I3B0K58rTpcgSQU4YoZxJEkHZ9hLUgEMe0kqgGEvSQUw7CWpAIa9ihIRk5tmSNwaEU81rR91gP0nRcRfDeF2x0TE8+2pWqrPSy9VrIi4AvjXzPz6q+zzBmBlNqaueLXbGgP8OjMntLZKqTU8s5cqEXFp
NZf4gxHx11XzlcDeOdOvjIg/ioifRsS91XzoZ3eyZmmo2vKF41K3iYjTgAuAd9B4XqyNiJ/RmCDsDXvP7Kv5l87JzN9ExFTgbhrT0Uojmmf2UsO7gO9l5ouZ+VsaUwi8+wD7BXBlRNwP/AQ4MSKOP4x1SsPimb10aD5BY4bBOZm5OyIGaMxTIo1ontlLDf8X+HBEjKu+E2FB1fZbGl9zuddxNObo3x0Rf8YI+OILaSg8s5eAzFwbEbcA/1I1XZOZDwBU30z1APAj4BvA/6zW19KYVVEa8bz0UpIK4DCOJBXAsJekAhj2klQAw16SCmDYS1IBDHtJKoBhL0kF+P/zbE2axHljWwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pokemon.plot_histogram('Total')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or if you want, plot every numerical feature histogram):" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAzUAAADRCAYAAAD10ZzlAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAgAElEQVR4nO3de5hcdZ3n8feHBAwEJFxCNqbJJEIGiRmImTaAqMsSUETWoIvIZeUibpxRDOPIcpl5noHZhRXUGYZZWTACEhBIMOrAIzqKEdYRh2AHAgQCJEKAzgaSCYFxRhAC3/3j/DpUOtXd1V2Xc07V5/U8/XSdW9X3VJ1vnfqd3+UoIjAzMzMzMyurHfIOwMzMzMzMrB4u1JiZmZmZWam5UGNmZmZmZqXmQo2ZmZmZmZWaCzVmZmZmZlZqLtSYmZmZmVmpuVBj1kCSrpe0QdLKKsu+LCkk7Z2mJenvJa2R9LCkWa2P2MzMzKz8RucdAMDee+8dU6ZMyTsMs0EtX778XyJi/BCr3QB8A7ixcqakfYEPAc9WzP4IMC39HQJcnf4PyvliZVBjvjSd88XKwPliVpvBcqUQhZopU6bQ09OTdxhmg5L0zFDrRMQvJE2psugK4Dzg9op5c4EbI7sD7n2SxkmaGBHrB3sN54uVQS35Iul64DhgQ0TMSPP2BBYDU4C1wIkRsVmSgCuBY4HfAWdExANDvYbzxcqglnxpBeeLFd1gueLmZ2ZNJmkusC4iHuq3aBLwXMV0b5pn1iluAI7pN+8CYGlETAOWpmnYtmZzHlnNppmZGeBCjVlTSdoF+Avgr+p8nnmSeiT1bNy4sTHBmeUsIn4BvNhv9lxgYXq8EDi+Yv6NkbkPGCdpYmsiNTOzonOhxqy59gOmAg9JWgt0AQ9I+g/AOmDfinW70rztRMSCiOiOiO7x43Nvdm3WTBMqmmA+D0xIj2uu2fRFADOzzlOIPjVWXK+//jq9vb28+uqreYfSMmPGjKGrq4sdd9yx7ueKiEeAffqmU8GmOyL+RdIdwNmSFpENEPDyUP1prNicL40VESEpRrDdAmABQHd397C3t+brxFyB5uaLta9OzJeR5IoLNTao3t5edtttN6ZMmULWT7e9RQSbNm2it7eXqVOnDnt7SbcCRwB7S+oFLoqI6wZY/UdknZ7XkHV8PnNkUVtROF8a4oW+ATNS87INaX7NNZtWfJ2WK9C0fLEO0Gn5MtJccaGmDd2y7NmhVwJOOWTykOu8+uqrHZNEAJLYa6+9GGmTlYg4eYjlUyoeB/CFEb1QSfQ/Fms55srM+dIQdwCnA5el/7dXzHfN5gAG+94vYt51Wq5A0/LFKgyUB0XMgeHotHwZaa64UGND6pQk6tNp+2uN1WnHTz37W61mk6wwc5uks4BngBPT6q7ZbDOdlivQmftsjdFpx85I9tcDBVjhjRo1ipkzZ279W7t27YDrrl27lhkzZrQuOLOCKVO+RMTJETExInaMiK6IuC4iNkXEnIiYFhFHRcSLad2IiC9ExH4R8UcR4ZtpWF3KlCtmeStDvrimxoal1qZttaqlSnjnnXdmxYoVDX1ds1ZwvpjVxrliVjvnS3WuqbFSWrt2LR/4wAeYNWsWs2bN4l
e/+tV26zz66KPMnj2bmTNnctBBB7F69WoAvvOd72yd/7nPfY433nij1eGbtZTzpRxuWfbsgH/WGs4Vs9oVLV9cqLHCe+WVV7ZWd3784x8HYJ999uGuu+7igQceYPHixcyfP3+77a655hrOOeccVqxYQU9PD11dXaxatYrFixdz7733smLFCkaNGsXNN9/c6l0yaxrni1ltnCtmtStDvrj5mRVetSrP119/nbPPPntrMjz55JPbbXfYYYdx6aWX0tvbyyc+8QmmTZvG0qVLWb58Oe9973uBLEn32Wef7bY1Kyvni1ltnCtmtStDvrhQY6V0xRVXMGHCBB566CHefPNNxowZs906p5xyCocccgh33nknxx57LN/85jeJCE4//XS+8pWv5BC1WT6cL2a1KXKuSLoeOA7YEBEz0rw9gcXAFGAtcGJEbFY2dNSVZCMG/g44IyIeaFpw1pGKli9ufmal9PLLLzNx4kR22GEHbrrppqptMZ966ine+c53Mn/+fObOncvDDz/MnDlzWLJkCRs2ZPfze/HFF3nmmWdaHb5ZSzlfzGpT8Fy5ATim37wLgKURMQ1YmqYBPgJMS3/zgKsbHYxZ0fLFhRorpc9//vMsXLiQgw8+mMcff5yxY8dut85tt93GjBkzmDlzJitXruS0005j+vTpXHLJJXzoQx/ioIMO4uijj2b9et+/z9qb88WsNkXOlYj4BfBiv9lzgYXp8ULg+Ir5N6ah0O8Dxkma2NCArOMVLV+U3dR8kBWqV3d+DfjPwGvAb4AzI+KltOxC4CzgDWB+RPxkqCC6u7ujp8e3HGiUWkfKqWUIv1WrVnHggQfWG1LpVNtvScsjojunkLYqU770PxbLflfnoThf3uJ8Gb7BvruHyp16ts1Dp+YK1JcvkqYAP6z4PfZSRIxLjwVsjohxkn4IXBYRv0zLlgLnD3V/pzLlSzUD5UERc2A4OjVfhpsrtdTU3MD21Z13ATMi4iDgSeDC9ELTgZOAd6dt/o+kUcPZATMzMzMbnsiuUg9+pboKSfMk9Ujq2bhxYxMiM2uNIQs11ao7I+KnEbElTd4HdKXHc4FFEfH7iHgaWAPMbmC8ZmZmZpZ5oa9ZWfq/Ic1fB+xbsV5XmrediFgQEd0R0T1+/PimBmvWTI3oU/MZ4Mfp8STguYplvWmemZmZmTXWHcDp6fHpwO0V809T5lDg5Yhwhzhra3UN6SzpL4EtwLDvmCNpHtmIHEyeXO62jmZmZmbNJOlW4Ahgb0m9wEXAZcBtks4CngFOTKv/iGw45zVkQzqf2fKAzVpsxIUaSWeQDSAwJ94abWBY1Z3AAsg6po00DrMiacXAGmZm1nki4uQBFs2psm4AX2huRK1TbQCAsnf+t8YbUaFG0jHAecB/jIjfVSy6A7hF0t8C7yAbH/3+uqM0K48bgG8AN1bMuwu4MCK2SLqcbGCN8/sNrPEO4GeS/jAith/o3cxshMo2MpqZ2UgMWagZoLrzQuBtwF3ZCILcFxF/EhGPSroNeIysWdoX/APN6rFp0ybmzMkuQj3//POMGjWKvo6M999/PzvttFOe4W0nIn6RhtysnPfTisn7gBPS460DawBPS+obWOOfWxCqtaGy5YtZXpwrZrUrS74MWagZoLrzukHWvxS4tJ6grMB6vt3Y5+sevJnvXnvtxYoVKwC4+OKL2XXXXTn33HO3WSciiAh22KEU95L9DLA4PZ5EVsjp44E12o3zxaw2zhWz2jlfqnKmWimtWbOG6dOnc+qpp/Lud7+b5557jnHjxm1dvmjRIj772c8C8MILL/CJT3yC7u5uZs+ezX333TfQ0zZVvQNr+D4CNlJlzBezPDhXzGpXtHypa/Qzszw9/vjj3HjjjXR3d7Nly5YB15s/fz7nnXcehx56KGvXruW4445j5cqVLYzUA2tY/sqUL2Z5cq6Y1a5I+eJCjZXWfvvtR3d395Dr/exnP+OJJ57YOr1582ZeeeUVdt5552aGt5UH1r
AiKEu+mOXNuWJWuyLliws1Vlpjx47d+niHHXbgrQoQePXVV7c+joiWdWTzwBpWVEXMF7Micq6Y1a5I+eI+NdYWdthhB/bYYw9Wr17Nm2++yQ9+8IOty4466iiuuuqqrdN9nd2aISJOjoiJEbFjRHRFxHURsX9E7BsRM9Pfn1Ssf2lE7BcRB0TEj5sWmFmFouSLWdE5V8xql3e+uFBjbePyyy/nwx/+MO973/vo6uraOv+qq67i3nvv5aCDDmL69Ol861vfyjFKs2JwvpjVxrliVrs880WV1UR56e7ujp6enrzDaBuD3WitUi03XVu1ahUHHnhgvSGVTrX9lrQ8IoZuONpkZcqX/sdiu9/oz/nylnrzRdKXgM8CATwCnAlMBBYBewHLgU9HxGuDPU+Z86XSULlT6/f+cJ+3WTo1V8Dnl5GodnxXO3YHyoOyn3s6NV+GmyuuqTEzs0KRNAmYD3RHxAxgFHAScDlwRUTsD2wGzsovSjMzKxIXaszMrIhGAztLGg3sAqwHjgSWpOULgeNzis3MzArGhRozMyuUiFgHfB14lqww8zJZc7OXIqLvRgi9wKR8IjQzs6JxocaGVIR+V63UaftrjdVpx08z9lfSHsBcYCrZPZzGAscMY/t5knok9WzcuLHh8VljdFquQGfuszVGpx07I9lfF2psUGPGjGHTpk0dk0wRwaZNmxgzZkzeoVgJOV8a5ijg6YjYGBGvA98HDgfGpeZoAF3AugHiWhAR3RHRPX78+EbHZg3QabkCPr/YyHVavow0V3zzTRtUV1cXvb29dNLVzjFjxmwzDKFZrZwvDfMscKikXYBXgDlAD3A3cALZCGinA7c3+oWtNToxV8DnFxuZTsyXkeSKCzUDqGV4zLIPEViLHXfckalTp+YdhlkpOF8aIyKWSVoCPABsAR4EFgB3AoskXZLmXZdflFYP54pZ7ZwvtXGhxszMCiciLgIu6jf7KWB2DuGYmVnBDdmnRtL1kjZIWlkxb09Jd0lanf7vkeZL0t9LWiPpYUmzmhm8mZmZmZlZLQMF3MD2o85cACyNiGnA0jQN8BFgWvqbB1zdmDDNzMzMzMyqG7JQExG/AF7sN3su2Y3PYNsboM0FbozMfWQj1UxsVLBmZmZmti1JX5L0qKSVkm6VNEbSVEnLUuuZxZJ2yjtOs2Ya6ZDOEyJifXr8PDAhPZ4EPFex3oA3R/N9BMzMzMzqI2kSMB/ojogZwCjgJOBy4IqI2B/YDJyVX5RmzVf3fWoiGzR72ANn+z4C1o7cB83MzHIwGtg53cdpF2A9cCSwJC2vbFVj1pZGWqh5oa9ZWfq/Ic1fB+xbsd6AN0cza1M34D5oZmbWIhGxDvg62f2d1gMvA8uBlyJiS1ptwJYzZu1ipIWaO8hufAbb3gDtDuC0dAX6UODlimZqZm3PfdDMzKyVUu3/XGAq8A5gLNtfXBtse3cHsLZQy5DOtwL/DBwgqVfSWcBlwNGSVgNHpWmAH5HdR2AN8C3g802J2qxc6u6DZmZmNoCjgKcjYmNEvA58Hzic7EJZ3/0IB2w54+4A1i6GvPlmRJw8wKI5VdYN4Av1BmXWriIiJA27D5qkeWRN1Jg8eXLD4zIzs9J6FjhU0i7AK2S/z3qAu4ETgEVs26rGrC0NWagxs7q9IGliRKwfaR+0iFgALADo7u4edqHIzKyaW5Y9O+CyUw7xBZQyiIhlkpYADwBbgAfJzhd3AoskXZLmXZdflGbNV/foZ2Y2JPdBMzOzpomIiyLiXRExIyI+HRG/j4inImJ2ROwfEZ+MiN/nHadZM7mmxqyBUh+0I4C9JfUCF5H1Obst9Ud7Bjgxrf4j4FiyPmi/A85secBmZmZmbcCFGrMGch80MzMzs9ZzocbMzMzMOkq1/mTuR1Zu7lNjZmZmZmal5kKNmZmZmZmVmgs1ZmZmZmZWai7UmJmZmZlZqblQY2ZmZmZmpebRz8zMzMys5TwCmTWSCzVmZmZmVirVCkTW2d
z8zMzMzMzMSs2FGjMzMzMzK7W6CjWSviTpUUkrJd0qaYykqZKWSVojabGknRoVrJmZdQZJ4yQtkfS4pFWSDpO0p6S7JK1O//fIO04zMyuGERdqJE0C5gPdETEDGAWcBFwOXBER+wObgbMaEaiZmXWUK4F/jIh3AQcDq4ALgKURMQ1YmqbNzMzqHihgNLCzpNeBXYD1wJHAKWn5QuBi4Oo6X8cSd4wzs3YnaXfgg8AZABHxGvCapLnAEWm1hcA9wPmtj9DMzIpmxIWaiFgn6evAs8ArwE+B5cBLEbElrdYLTKo7SmuKWgpIHlrRzHIwFdgIfFvSwWTnlnOACRGxPq3zPDAhp/jMzKxg6ml+tgcwl+zk8w5gLHDMMLafJ6lHUs/GjRtHGoZZabgPmlnNRgOzgKsj4j3Av9OvqVlEBBDVNvb5xcys89TT/Owo4OmI2Agg6fvA4cA4SaNTbU0XsK7axhGxAFgA0N3dXfXEZNYuKvqgTY+IVyTdRtYH7ViyPmiLJF1D1gfNzTWt0/UCvRGxLE0vISvUvCBpYkSslzQR2FBt43Y8v7jpsZnZ4OoZ/exZ4FBJu0gSMAd4DLgbOCGtczpwe30hmrWNvj5oo9m2D9qStHwhcHxOsZkVRkQ8Dzwn6YA0q+/8cgfZeQV8fjEzswr19KlZJmkJ8ACwBXiQ7MrYncAiSZekedc1IlCzMqu3D5qkecA8gMmT3c/JOsIXgZtTk8yngDPJLsTdJuks4BngxBzjMzOzAqlr9LOIuAi4qN/sp4DZ9TyvWbvp1wftJeC7DKMPWjs2pzEbTESsALqrLJrT6lgayc3IrBkkjQOuBWaQ9TX7DPAEsBiYAqwFToyIzTmFaNZ09Q7pbA3iE13bq6sPWrvof5x7dD0zs4bou6/TCal2cxfgL8ju63SZpAvI+qV5CHRrW/X0qTGz2rkPmpmZNVzFfZ2ug+y+ThHxElnrgIVpNffZtLbnmpo61Fq74qvR5j5oZmbWJL6vkxku1Ji1jPugmZlZE/Td1+mL6QLalVS5r5OkAe/rhAeisTbg5mdmZmZm5VXtvk6zSPd1Ahjqvk4R0R0R3ePHj29JwGbN4EKNmZmZWUn5vk5mGTc/MzMzMys339fJOp4LNWZmZjYsgw2U48FxWq+d7uvkW1zYSLn5mZmZmZmZlZoLNWZmZmZmVmpufmZmuenfzMDNVszMzGwkXFNjZmZmZmal5poaMxsx17SYmZlZEbimxszMzMzMSs2FGjMzMzMzK7W6CjWSxklaIulxSaskHSZpT0l3SVqd/u/RqGDNzMzMzMz6q7em5krgHyPiXcDBwCrgAmBpREwDlqZpMzMzMzOzphhxoUbS7sAHgesAIuK1iHgJmAssTKstBI6vN0gzMzMzM7OB1DP62VRgI/BtSQcDy4FzgAkRsT6t8zwwob4QzdqDpHHAtcAMIIDPAE8Ai4EpwFrgxIjYnFOIVVWOcObRzczMzKyI6ml+NhqYBVwdEe8B/p1+Tc0iIsh+vG1H0jxJPZJ6Nm7cWEcYZqXh5ppmZmZmTVBPTU0v0BsRy9L0ErIfZC9ImhgR6yVNBDZU2zgiFgALALq7u6sWfMzaRUVzzTMga64JvCZpLnBEWm0hcA9wfusjNDMzs/76348N3GqhqEZcUxMRzwPPSTogzZoDPAbcAZye5p0O3F5XhGbtobK55oOSrpU0lhqba7pm08zMzGxg9dTUAHwRuFnSTsBTwJlkBaXbJJ0FPAOcWOdrmLWDvuaaX4yIZZKupEpzTUlVay07pWaz/xUxXw0zMzOzWtRVqImIFUB3lUVz6nleszZUV3NNs04kaRTQA6yLiOMkTQUWAXuRDU7z6dSU08xsQNWakFn7qfc+NWZWAzfXNBuRc8gG1OhzOXBFROwPbAbOyiUqMzMrnHqbn5lZ7dxc06xGkrqAjwKXAn8uScCRwClplYXAxcDVuQTYAXx128zKxIUasxbphOaa/hFkDfR3wHnAbml6L+CliN
iSpnuBSdU2lDQPmAcwebL7ZZmZdQI3PzMzs0KRdBywISKWj2T7iFgQEd0R0T1+/PgGR2dmZkXkmhozq5lrYqxFDgc+JulYYAzwdrKb146TNDrV1nQB63KM0axQPLCGdTrX1JiZWaFExIUR0RURU4CTgJ9HxKnA3cAJaTUPrGG2LQ+sYR3NhRozMyuL88kGDVhDdvX5upzjMSuEioE1rk3TfQNrLEmrLASOzyc6s9Zw8zMzMyusiLgHuCc9fgqYnWc8ZgU14oE1zNqFa2rMzMzMSqregTUkzZPUI6ln48aNDY7OrHVcqDEzMzMrr76BNdaSDQxwJBUDa6R1BhxYw6MFWrtwocbMzMyspDywhlmm4/rUeEhaMzMz6wDnA4skXQI8iAfWsDbXcYUaMzMzs3bkgTWsk7n5mZmZmZmZlZoLNWZmZmZmVmp1F2okjZL0oKQfpumpkpZJWiNpsaSd6g/TzMzMzMysukbU1JwDrKqYvhy4IiL2BzYDZzXgNczagi8CmJmZmTVeXQMFSOoCPgpcCvy5JJGNj35KWmUhcDFwdT2vY9ZG+i4CvD1N910EWCTpGrKLAM4Xs4IbbCTNUw6Z3MJIzMwM6q+p+TvgPODNNL0X8FJEbEnTvcCkahv6DrbWaSouAlybpvsuAixJqywEjs8nOjMzM7PyGnGhRtJxwIaIWD6S7X0HW+tAI74IYGZmZmYDq6f52eHAxyQdC4wha05zJTBO0uj0Q60LWFd/mOXmG35a5UUASUeMYPt5wDyAyZPdtMXMzMys0ohraiLiwojoiogpwEnAzyPiVOBu4IS02unA7XVHaVZ+fRcB1gKLyJqdbb0IkNYZ8CKAazbNzMzMBlbXQAEDOB9YJOkS4EHguia8hhVFz7drW6/7zObGUXARcSFwIUCqqTk3Ik6V9F2yiwCL8EUAMzMzsxFpSKEmIu4B7kmPnwJmN+J5zTqALwKYmZmZ1akZNTVmNghfBDAzMyse94EuNxdqzMzMrGF8Dx8zy4MLNWZmZlYILhCZ2Ui5UGNmpdX/B5B/9JiZmXUmF2rMzMysJdxnwcyaxYUaa4hlT7846PLfvJGdyHwl3YbDNTFmZmZWixHffNPMzKwZJO0r6W5Jj0l6VNI5af6eku6StDr93yPvWM3MrBhcqDEzs6LZAnw5IqYDhwJfkDQduABYGhHTgKVp2szMzIUaMzMrlohYHxEPpMe/BVYBk4C5wMK02kLg+HwiNDOzonGhxszMCkvSFOA9wDJgQkSsT4ueBybkFJZZYbi5plnGAwWYmVkhSdoV+B7wZxHxr5K2LouIkBQDbDcPmAcwebIHl7C219dc8wFJuwHLJd0FnEHWXPMySReQNdc8P8c420ato/h5cJvWcqHGzMwKR9KOZAWamyPi+2n2C5ImRsR6SROBDdW2jYgFwAKA7u7uqgUfs3aRai/Xp8e/lVTZXPOItNpC4B5yLNR4OG9rtrYp1DhZzMzag7IqmeuAVRHxtxWL7gBOBy5L/2/PITyzwnJzTetkbVOoMTOztnE48GngEUkr0ry/ICvM3CbpLOAZ4MSc4huUL7JZHtxc0zrdiAs1kvYFbiQr+QewICKulLQnsBiYAqwFToyIzfWHamZmnSAifglogMVzWhmLWRm4uaZZfaOf+T4CZjXy6DRmZtYMNTTXBDfXtA4w4pqasnRMMysIj07TAG7WY2a2nVI318zLfs9+d9Dlv5n8yRZFYo3SkD417phmNjhfBDAzs2Zwc02zTN2FmlZ0TPPVWWsnvghgZmUx1NXs/pp5dXuw3wK+H4iZ1VWoccc0s+Hx6DRmZmadYTgX5V0wr189o5/5PgJmw+CLAGZmbylSLZCZlV89NTXumGZWI18EMDMzKw8PJFA+9Yx+5o5p1ng936593e4zmxdH4/kiQBWDnTR8wjAzM7NaNWT0MzMbnC8CmJmZmTVPPTffNDMzMzMzy51rasxsGx5C3czMzMrGhRob1FA/cPd79sUWRWL2lg
H74hzy5dYGYm2tHQv4wx1xzMysLFyoMbO20f9HqMf9NzMz6wwu1JiZmVmpVatV66uVOmTqnrU9SblG1DSzflyoMbNCcjMZMzMzq5ULNWZmZmbWVnxhrPN4SGczMzMzMys119RYS9UymlrN7Z/NzKxQynZ1fNnTb43g+Zs3PNCINU4tufCbyZ8cdHm130w+LgfmQo2ZtY3tTiKjKgrI7gRsZtYS7TgcuhWfCzVmZtax/OPLzKw9uFBj5dXz7drW8xX6bTTjXi6DVbMPVb3eMoMdLz5GrMTK1uSrLdR6/unj75i2k1feuUnawFyoscKpbOM8mJr73rjwY2Zm1nCu6WwP7VJQcqHGzNpWrQXkvg7CZfwSNzNrBRdgGq+ytqfa27vfENsXpiVEHRpZoGrakM6SjpH0hKQ1ki5o1uuYlZ1zxax2zhez2jlfrJM0paZG0ijgKuBooBf4taQ7IuKxZryeWVk5V8xq53yxkai1xrbdtHO+uB9ZZuv7MGqA5vgd1qy+Wc3PZgNrIuIpAEmLgLlA6RPJrMFyzxU3KRjYLcueHfDk+ZvJnxyyirwZgzLkpSD7knu+tIJ/sFXX7Pel//PX89VYkPut5ZYvQ31WQzWbcg4Mz0AF9/73XoIRfHcP0S/5ljfm1P8aDdKsQs0k4LmK6V7gkCa9llmZNTxXilZIKdPJqd73rtHvfeXzlblA1EA+t5jVzvliHSW3gQIkzQPmpcl/k/TEMJ9ib+BfGhtVrtppf6rsy7m5BDI8n6k2s3Jf/qB1sWxrGPnSTsdRrRqwzyM5Ps/l1GFuMdz1B9k298+5hn0pYr7k/r7VyHE2VgHirHp+qVTEfIGmvXfD/s4twGcIFCcOqCmW7d/nWs9DwzhfbRdHPee6GrYfMFeaVahZB+xbMd2V5m0VEQuABSN9AUk9EdE90u2Lpp32x/syLEPmCtSeL+303tfK+9xR6sqXsrxvjrOxyhJnE9R9finKe+c4tleUWIoSBzRv9LNfA9MkTZW0E3AScEeTXsuszJwrZrVzvpjVzvliHaUpNTURsUXS2cBPgFHA9RHxaDNey6zMnCtmtXO+mNXO+WKdpml9aiLiR8CPmvX81NF0raDaaX+8L8PQ4Fxpp/e+Vt7nDlJnvpTlfXOcjVWWOBuuAeeXorx3jmN7RYmlKHGgiMg7BjMzMzMzsxFrVp8aMzMzMzOzlihkoUbSvpLulvSYpEclnZPm7ynpLkmr0/890nxJ+ntJayQ9LGlWvntQnaRRkh6U9MM0PVXSshT34tSRD0lvS9Nr0vIpecbdn6RxkpZIelzSKkmHlfmzkfSldJytlHSrpDFl/GwkHSPpiRTbBXnH0yiSrpe0QdLKinmlPd6G0q7ff3krcn5IWivpEUkrJPWkeVU/7xbGVIq8GyDOiyWtS+/nCknHViy7MMX5hKQPtyrOssk7X/LKiaIc90U5rkt3PoqIwv0BE4FZ6fFuwJPAdPfLz0YAAAnYSURBVOCrwAVp/gXA5enxscCPAQGHAsvy3ocB9uvPgVuAH6bp24CT0uNrgD9Njz8PXJMenwQszjv2fvuxEPhserwTMK6snw3ZzcmeBnau+EzOKNtnQ9YJ9DfAO9Nn8hAwPe+4GrRvHwRmASsr5pXyeKtxf9vy+y/n97TQ+QGsBfbuN6/q593CmEqRdwPEeTFwbpV1p6fP/m3A1HRMjMr78y/aXxHyJa+cKMpxX5Tjumzno0LW1ETE+oh4ID3+LbCK7MfnXLIf1KT/x6fHc4EbI3MfME7SxBaHPShJXcBHgWvTtIAjgSVplf7707efS4A5af3cSdqdLNmuA4iI1yLiJUr82ZANmLGzpNHALsB6yvfZzAbWRMRTEfEasIgs1tKLiF8AL/abXebjbVDt+P1XAGXMj4E+75YoS94NEOdA5gKLIuL3EfE0sIbs2LBtFTVfmp4TRTnui3Jcl+18VMhCTSVlzXveAywDJkTE+rToeWBCejwJeK5is940r0j+DjgPeD
NN7wW8FBFb0nRlzFv3Jy1/Oa1fBFOBjcC3lTWlu1bSWEr62UTEOuDrwLNkhZmXgeWU77Mp9PvcBKU83oarjb7/8lb09yiAn0paruzu7jDw552nMh2DZ6fmL9dXNFMqYpxFVIT3qUg5UaTjPrfjugzno0IXaiTtCnwP+LOI+NfKZZHVc5Vi6DZJxwEbImJ53rE0wGiyKtGrI+I9wL+TVT1uVbLPZg+yKwtTgXcAY4Fjcg3KhqVMx9twtMv3n9Xk/RExC/gI8AVJH6xcWMTPu4gxVbga2A+YSXax6m/yDcdGoJA5kfNxn9txXZbzUWELNZJ2JHsDb46I76fZL/RVY6X/G9L8dcC+FZt3pXlFcTjwMUlryapxjwSuJKuW67tXUGXMW/cnLd8d2NTKgAfRC/RGxLI0vYSskFPWz+Yo4OmI2BgRrwPfJ/u8yvbZFP19brSyHm81abPvvyIo9HuUaoyJiA3AD8iajgz0eeepFMdgRLwQEW9ExJvAt3irKU6h4iyw3N+nguVEIY77vI7rMp2PClmoSX0UrgNWRcTfViy6Azg9PT4duL1i/mlp1IVDgZcrqsVyFxEXRkRXREwh61z+84g4FbgbOCGt1n9/+vbzhLR+IUrBEfE88JykA9KsOcBjlPSzIWt2dqikXdJx17c/Zftsfg1MUzZq205kx9kdOcfUTGU93obUbt9/BVHY/JA0VtJufY+BDwErGfjzzlMpjsF+bfg/TvZ+QhbnScpGsZwKTAPub3V8JZBrvhQwJwpx3OdxXJfufBQtHJWg1j/g/WRVWQ8DK9LfsWR9F5YCq4GfAXum9QVcRTbiwyNAd977MMi+HcFbo5+9k+zAWwN8F3hbmj8mTa9Jy9+Zd9z99mEm0JM+n38A9ijzZwP8NfA42RfETWQjiJTus0k58mR6r/8y73gauF+3klW1v05WU3hWmY+3Gva3bb//cn5fC5kf6bvmofT3aF9sA33eLYyrFHk3QJw3pTgeJvuRNbFi/b9McT4BfCTvz7+of3nmS545UZTjvijHddnOR0pBmJmZmZmZlVIhm5+ZmZmZmZnVyoUaMzMzMzMrNRdqzMzMzMys1FyoMTMzMzOzUnOhxszMzMzMSs2FmpKQtJekFenveUnrKqZ3qrL+npL+pIbnHS3ppeZEbVYskv6t3/QZkr6RHl9ckVcrJX0snyjNWkPS8ZJC0rvS9BRJp1Qsnynp2Dqef62kvRsRq1mrSHojnQcelfSQpC9LGvL3sqSvpW2+1oo4bXsu1JRERGyKiJkRMRO4BriibzoiXquyyZ7AkIUaM9vGFSnHPglcX8uJzKzETgZ+mf4DTAFOqVg+k+yeFGad5JX02+rdwNHAR4CLathuHnBQRPz3pkZnA/IJuw1IOi9dWV4p6Ytp9mXAAelqw2WS3i7p55IekPSwpOPyjNmsyCJiFbAF8FVma0uSdiW7sd5ZZHeMh+y88YF03jgf+B/Ap9L0pyTNlvTPkh6U9CtJB6TnGiXp6+kc9HDFeajvtXaW9GNJ/62Fu2hWt4jYQFZYOVuZUalG5tfpWP8cgKQ7gF2B5SlXxkv6Xlrv15IOT+tdLOl6SfdIekrS/DR/rKQ7U83QSkmfSvP/WNL/lbRc0k8kTcznnSiH0XkHYPWRdAhwKvBess/zfkn3ABcA+6erzkjaETg+Iv5V0j7AvcAP84naLDc7S1pRMb0n2Z2Zt5Hy6k1gY6sCM2uxucA/RsSTkjZJ+mOy88a5EXEcgKQXyO4IfnaafjvwgYjYIuko4H8B/4XsR98UYGZatmfF6+wKLAJujIgbW7VzZo0SEU9JGgXsQ5Y3L0fEeyW9DbhX0k8j4mOS/q3iN9ctZDX/v5Q0GfgJcGB6yncB/wnYDXhC0tXAMcD/i4iPpu13T7/b/jcwNyI2poLOpcBnWrbzJeNCTfm9H/heRLwCIOkfgA8AP+23noDLJL2f7Mfavqmts/vTWCd5pe+kA1mfGqC7YvmXJP1X4LfApyIiWh
yfWaucDFyZHi9K00Nd6NodWChpGhDAjmn+UcA1EbEFICJerNjmduCrEXFzowI3y9GHgIMknZCmdwemAU/3W+8oYLqkvum3p9pRgDsj4vfA7yVtACYAjwB/I+ly4IcR8U+SZgAzgLvS84wC1jdpv9qCCzWd4zSy5JuVrqT1AmNyjsmsaK6IiK/nHYRZM6WalCOBP5IUZD+WArhziE3/J3B3RHxc0hTgnhpe7l7gGEm3+CKBlZGkdwJvABvILhB/MSJ+MsRmOwCHRsSr/Z4L4PcVs94ARqca01lkfdgukbQU+AHwaEQc1pg9aX/uU1N+/wR8PLVZ3pWsavSfyK4071ax3u7AhlSgORqY1PpQzcysAE4AboqIP4iIKRGxL9mV5jfZ9rxR7TyyLj0+o2L+XcDnJI2GrYWmPn8FbAauaugemLWApPFkgzN9IxXKfwL8aWoahqQ/lDS2yqY/Bb5Y8Twzq6xT+TrvAH4XEd8BvgbMAp4Axks6LK2zo6R3N2C32pYLNSUXEfcDtwK/Bu4Dro6IRyLiBbIOa49Iugy4CXifpEfIOoWuzi1oMzPL08lkV4ErfY/s3PBG6qz8JeBusiY0K1J7/q8CX5H0INu29LgWeBZ4WNJDbDuCGsA5ZP3ZvtqEfTFrtJ3TMf8o8DOyAspfp2XXAo8BD0haCXyT6q2e5gPdaTCBxxh6NNo/IusTvYJspLVL0si2JwCXp7xaAbyvzn1ra3JtsJmZmZmZlZlraszMzMzMrNRcqDEzMzMzs1JzocbMzMzMzErNhRozMzMzMys1F2rMzMzMzKzUXKgxMzMzM7NSc6HGzMzMzMxKzYUaMzMzMzMrtf8P5aNSx8ks+6sAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pokemon.plot_histogram(pokemon.numerical_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Density plots\n", "\n", "Same applies to density plots (if no arguments are passed, all numerical features are considered)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pokemon.plot_density()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Features importance\n", "\n", "To extract features importance, Dataset uses the ReliefF algorithm. By calling the method `features_importance()` you obtain a Python dictionary with the name of every feature and its relative importance to predict the target variable." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'HP': 0.05809940944881894,\n", " 'Defense': 0.0918786111111111,\n", " 'Attack': 0.10025405405405405,\n", " 'Sp. Def': 0.10831636904761896,\n", " 'Speed': 0.12509607142857132,\n", " 'Sp. Atk': 0.14393648097826098,\n", " 'Total': 0.23443208333333318}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.features_importance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to plot features importance, call `plot_importance()`:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pokemon.plot_importance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Covariance Matrix\n", "\n", "Another useful plot is the covariance matrix. This time, Dataset library adds an interesting and convenient functionality, which is, grouping features with similar covariance together in the same plot. To do so, it uses a hierarchical dendogram to determine what is the best possible order to reflect affinities between features.\n", "\n", "To use it, simply type the following:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pokemon.plot_covariance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Native pandas plots\n", "\n", "In case you want to access native pandas plotting functions, remeber that you can access the entire dataframe by simply accessing the property `.features`, and from there, you can reference individual variables by its names. From that point, in order to access native pandas `hist()` for the feature `Total`, we use:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXsAAAD4CAYAAAANbUbJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAVo0lEQVR4nO3df4zcdZ3H8efrCnKF9dpiYVIKupBUcsB6xW5QjwuZFZUiRvRyejRI6ImuJnjRs8nZqlE806TnWbwE/JF65eAO6MLxw3IFDxqOFb2IuIuFbSmVAiu21FYoti423C2+74/5bhjqtPvdme90dubzeiST+c7n++Pzee9357Wz3/nOfBURmJlZZ/ujVg/AzMyaz2FvZpYAh72ZWQIc9mZmCXDYm5kl4KhWDwBg7ty50d3d3fR+XnrpJY477rim99NsnVIHuJbpqlNq6ZQ6oHYtw8PDz0fECXnWnxZh393dzdDQUNP7GRwcpFwuN72fZuuUOsC1TFedUkun1AG1a5H0i7zr+zCOmVkCHPZmZglw2JuZJWDSsJd0iqQHJG2VtEXSp7P24yVtlPRkdj+nap0VkrZL2ibpgmYWYGZmk8vzyn4cWBYRfwq8HbhS0hnAcuD+iFgA3J89Jpt3CXAmsBj4lqQZzRi8mZnlM2nYR8SuiHgkm/4tsBWYD1wM3JAtdgPwgWz6YmAgIl6OiGeA7cA5RQ/czMzym9Ixe0ndwNnAT4BSROyCyh8E4MRssfnAL6tW25G1mZlZiyjvVxxL6gJ+AKyMiDsk/SYiZlfNfzEi5kj6JvDjiLgxa18L3BMRtx+0vX6gH6BUKi0aGBgopqLDGBsbo6urq+n9NFun1AGuZbrqlFo6pQ6oXUtfX99wRPTm2kBETHoDjgbuBT5b1bYNmJdNzwO2ZdMrgBVVy90LvONw21+0aFEcCQ888MAR6afZOqWOCNcyXXVKLZ1SR0TtWoChyJHhETH5J2glCVgLbI2Iq6tm3QVcDqzK7tdXtd8s6WrgJGAB8HCuvzxm01D38rtzLbesZ5ylOZfNY3TVRYVtyyzP1yWcC1wGjEjalLV9nkrI3yrpCuBZ4EMAEbFF0q3A41TO5LkyIl4pfORmZpbbpGEfET8CdIjZ5x9inZXAygbGZWZmBfInaM3MEuCwNzNLgMPezCwBDnszswQ47M3MEuCwNzNLgMPezCwBDnszswQ47M3MEuCwNzNLgMPezCwB
DnszswQ47M3MEuCwNzNLgMPezCwBDnszswQ47M3MEuCwNzNLwKRhL+k6SXskba5qu0XSpuw2OnFtWkndkg5UzftOMwdvZmb55Lng+PXAtcC/TTRExF9PTEtaDeyrWv6piFhY1ADNzKxxeS44/qCk7lrzJAn4MPDOYodlZmZFUkRMvlAl7DdExFkHtZ8HXB0RvVXLbQF+DuwHvhgRPzzENvuBfoBSqbRoYGCg3hpyGxsbo6urq+n9NFun1AHtUcvIzn2TLwSUZsLuA8X12zN/VnEbm6J22C95dEodULuWvr6+4Yn8nUyewziHswRYV/V4F/DGiHhB0iLge5LOjIj9B68YEWuANQC9vb1RLpcbHMrkBgcHORL9NFun1AHtUcvS5XfnWm5ZzzirRxp9Sr1q9NJyYduaqnbYL3l0Sh3QeC11n40j6SjgL4FbJtoi4uWIeCGbHgaeAt5c9+jMzKwQjZx6+S7giYjYMdEg6QRJM7Lp04AFwNONDdHMzBqV59TLdcCPgdMl7ZB0RTbrEl57CAfgPOAxSY8CtwGfjIi9RQ7YzMymLs/ZOEsO0b60RtvtwO2ND8vMzIrkT9CamSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJKO7qyHbEjOzcl/si2EUbXXVRS/o1s8bkuSzhdZL2SNpc1XaVpJ2SNmW391bNWyFpu6Rtki5o1sDNzCy/PIdxrgcW12j/RkQszG73AEg6g8q1ac/M1vnWxAXIzcysdSYN+4h4EMh70fCLgYGIeDkingG2A+c0MD4zMyuAImLyhaRuYENEnJU9vgpYCuwHhoBlEfGipGuBhyLixmy5tcD3I+K2GtvsB/oBSqXSooGBgQLKObyxsTG6urqa3k+z7dm7j90HWtN3z/xZhW6vHfbJyM59uZYrzaTQ/VL0z3oq2mG/5NEpdUDtWvr6+oYjojfP+vW+Qftt4KtAZPergY8CqrFszb8mEbEGWAPQ29sb5XK5zqHkNzg4yJHop9muuWk9q0da89766KXlQrfXDvsk75vhy3rGC90vRf+sp6Id9ksenVIHNF5LXadeRsTuiHglIn4PfJdXD9XsAE6pWvRk4Lm6R2dmZoWoK+wlzat6+EFg4kydu4BLJB0j6VRgAfBwY0M0M7NGTfo/p6R1QBmYK2kH8GWgLGkhlUM0o8AnACJii6RbgceBceDKiHilOUM3M7O8Jg37iFhSo3ntYZZfCaxsZFBmZlYsf12CmVkCHPZmZglw2JuZJcBhb2aWAIe9mVkCHPZmZglw2JuZJcBhb2aWAIe9mVkCHPZmZglw2JuZJcBhb2aWAIe9mVkCHPZmZglw2JuZJcBhb2aWAIe9mVkCJg17SddJ2iNpc1XbP0l6QtJjku6UNDtr75Z0QNKm7PadZg7ezMzyyfPK/npg8UFtG4GzIuItwM+BFVXznoqIhdntk8UM08zMGjFp2EfEg8Deg9rui4jx7OFDwMlNGJuZmRVEETH5QlI3sCEizqox7z+BWyLixmy5LVRe7e8HvhgRPzzENvuBfoBSqbRoYGCgvgqmYGxsjK6urqb302x79u5j94HW9N0zf1ah22uHfTKyc1+u5UozKXS/FP2znop22C95dEodULuWvr6+4YjozbP+UY10LukLwDhwU9a0C3hjRLwgaRHwPUlnRsT+g9eNiDXAGoDe3t4ol8uNDCWXwcFBjkQ/zXbNTetZPdLQrqvb6KXlQrfXDvtk6fK7cy23rGe80P1S9M96Ktphv+TRKXVA47XUfTaOpMuB9wGXRvbvQUS8HBEvZNPDwFPAm+senZmZFaKusJe0GPgc8P6I+F1V+wmSZmTTpwELgKeLGKiZmdVv0v85Ja0DysBcSTuAL1M5++YYYKMkgIeyM2/OA/5B0jjwCvDJiNhbc8NmZnbETBr2EbGkRvPaQyx7
O3B7o4MyM7Ni+RO0ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJaM3HMM2mqDvnp1jNrDa/sjczS4DD3swsAQ57M7MEOOzNzBLgsDczS4DD3swsAQ57M7MEOOzNzBLgsDczS4DD3swsAQ57M7METBr2kq6TtEfS5qq24yVtlPRkdj+nat4KSdslbZN0QbMGbmZm+eV5ZX89sPigtuXA/RGxALg/e4ykM4BLgDOzdb41cQFyMzNrnUnDPiIeBA6+aPjFwA3Z9A3AB6raByLi5Yh4BtgOnFPQWM3MrE6KiMkXkrqBDRFxVvb4NxExu2r+ixExR9K1wEMRcWPWvhb4fkTcVmOb/UA/QKlUWjQwMFBAOYc3NjZGV1dX0/tptj1797H7QGv67pk/q9Dt5d0nIzv3FdpvM5RmUuh+KfpnPRWd8lzplDqgdi19fX3DEdGbZ/2iv89eNdpq/jWJiDXAGoDe3t4ol8sFD+UPDQ4OciT6abZrblrP6pHWXIpg9NJyodvLu0+WtsH32S/rGS90vxT9s56KTnmudEod0Hgt9Z6Ns1vSPIDsfk/WvgM4pWq5k4Hn6h6dmZkVot6XIXcBlwOrsvv1Ve03S7oaOAlYADzc6CBt+ij6ilHLesbb4lW7WbubNOwlrQPKwFxJO4AvUwn5WyVdATwLfAggIrZIuhV4HBgHroyIV5o0djMzy2nSsI+IJYeYdf4hll8JrGxkUGZmVix/gtbMLAEOezOzBDjszcwS4LA3M0uAw97MLAEOezOzBDjszcwS4LA3M0uAw97MLAEOezOzBDjszcwS4LA3M0uAw97MLAEOezOzBDjszcwS4LA3M0uAw97MLAH1XoMWSacDt1Q1nQZ8CZgNfBz4ddb++Yi4p+4RmplZw+oO+4jYBiwEkDQD2AncCfwN8I2I+HohIzQzs4YVdRjnfOCpiPhFQdszM7MCKSIa34h0HfBIRFwr6SpgKbAfGAKWRcSLNdbpB/oBSqXSooGBgYbHMZmxsTG6urqa3k+z7dm7j90HWj2KYpRm4loOoWf+rOI2NkWd8lzplDqgdi19fX3DEdGbZ/2Gw17S64DngDMjYrekEvA8EMBXgXkR8dHDbaO3tzeGhoYaGkceg4ODlMvlpvfTbNfctJ7VI3UfgZtWlvWMu5ZDGF11UWHbmqpOea50Sh1QuxZJucO+iN/MC6m8qt8NMHGfDeS7wIYC+piWupff3ZJ+l/W0pFsza2NFHLNfAqybeCBpXtW8DwKbC+jDzMwa0NAre0nHAu8GPlHV/DVJC6kcxhk9aJ6ZmbVAQ2EfEb8D3nBQ22UNjcjMzArnT9CamSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJcNibmSWg0csSjgK/BV4BxiOiV9LxwC1AN5XLEn44Il5sbJhmZtaIhsI+0xcRz1c9Xg7cHxGrJC3PHn+ugH7M7AgZ2bmPpcvvPuL9jq666Ij3mYpmHMa5GLghm74B+EAT+jAzsyloNOwDuE/SsKT+rK0UEbsAsvsTG+zDzMwapIiof2XppIh4TtKJwEbgb4G7ImJ21TIvRsScGuv2A/0ApVJp0cDAQN3jyGtsbIyurq7Ctjeyc19h25qK0kzYfaAlXRfOtRxaz/xZxW1sivbs3deS/VJ0zUU/51upVi19fX3DEdGbZ/2Gwv41G5KuAsaAjwPliNglaR4wGBGnH27d3t7eGBoaKmQchzM4OEi5XC5se90tOKYJsKxnnNUjRbzd0nquZXpqVS1FH7Mv+jnfSrVqkZQ77Os+jCPpOEmvn5gG3gNsBu4CLs8WuxxYX28fZmZWjEb+dJeAOyVNbOfmiPgvST8FbpV0BfAs8KHGh2lmZo2oO+wj4mngz2q0vwCc38igzMysWP4ErZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJcNib
mSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mlgCHvZlZAhz2ZmYJcNibmSXAYW9mloCOuGBm3mvBLusZZ2mLrhtrZtZKjVyD9hRJD0jaKmmLpE9n7VdJ2ilpU3Z7b3HDNTOzejTyyn4cWBYRj2QXHh+WtDGb942I+HrjwzMzsyI0cg3aXcCubPq3krYC84samJmlJ+8h2bzyHrodXXVRof1OR4W8QSupGzgb+EnW9ClJj0m6TtKcIvowM7P6KSIa24DUBfwAWBkRd0gqAc8DAXwVmBcRH62xXj/QD1AqlRYNDAzUPYaRnftyLVeaCbsP1N3NtNEpdYBrma46pZa8dfTMn9X8wTRobGyMrq6u17T19fUNR0RvnvUbCntJRwMbgHsj4uoa87uBDRFx1uG209vbG0NDQ3WPYypn46weaf8TkDqlDnAt01Wn1JK3jnY4jDM4OEi5XH5Nm6TcYd/I2TgC1gJbq4Ne0ryqxT4IbK63DzMzK0Yjf7rPBS4DRiRtyto+DyyRtJDKYZxR4BMNjdDMzBrWyNk4PwJUY9Y99Q/HzMyawV+XYGaWAIe9mVkCHPZmZglw2JuZJcBhb2aWAIe9mVkCHPZmZglw2JuZJcBhb2aWAIe9mVkCHPZmZglw2JuZJcBhb2aWAIe9mVkCHPZmZglo/+uOmZk1KO+lTZvhSF0S0a/szcwS4LA3M0tA08Je0mJJ2yRtl7S8Wf2YmdnkmhL2kmYA3wQuBM6gchHyM5rRl5mZTa5Zr+zPAbZHxNMR8b/AAHBxk/oyM7NJKCKK36j0V8DiiPhY9vgy4G0R8amqZfqB/uzh6cC2wgfyh+YCzx+BfpqtU+oA1zJddUotnVIH1K7lTRFxQp6Vm3XqpWq0veavSkSsAdY0qf+aJA1FRO+R7LMZOqUOcC3TVafU0il1QOO1NOswzg7glKrHJwPPNakvMzObRLPC/qfAAkmnSnodcAlwV5P6MjOzSTTlME5EjEv6FHAvMAO4LiK2NKOvKTqih42aqFPqANcyXXVKLZ1SBzRYS1PeoDUzs+nFn6A1M0uAw97MLAEdE/aSTpH0gKStkrZI+nTWfrykjZKezO7nVK2zIvs6h22SLmjd6F9L0h9LeljSo1ktX8na264WqHyiWtLPJG3IHrdrHaOSRiRtkjSUtbVrLbMl3Sbpiew58452rEXS6dn+mLjtl/SZNq3l77Ln+2ZJ67IcKK6OiOiIGzAPeGs2/Xrg51S+quFrwPKsfTnwj9n0GcCjwDHAqcBTwIxW15GNTUBXNn008BPg7e1YSza+zwI3Axuyx+1axygw96C2dq3lBuBj2fTrgNntWktVTTOAXwFvardagPnAM8DM7PGtwNIi62j5DmriD2898G4qn8ydl7XNA7Zl0yuAFVXL3wu8o9XjrlHHscAjwNvasRYqn7G4H3hnVdi3XR3ZeGqFfdvVAvxJFixq91oOGv97gP9px1qysP8lcDyVsyQ3ZPUUVkfHHMapJqkbOJvKK+JSROwCyO5PzBab+OFO2JG1TQvZoY9NwB5gY0S0ay3/DPw98PuqtnasAyqfAr9P0nD2dR/QnrWcBvwa+Nfs8Nq/SDqO9qyl2iXAumy6rWqJiJ3A14FngV3Avoi4jwLr6Liwl9QF3A58JiL2H27RGm3T5jzUiHglIhZSeWV8jqSzDrP4tKxF0vuAPRExnHeVGm0tr6PKuRHxVirf5nqlpPMOs+x0ruUo4K3AtyPibOAlKocIDmU61wJA9uHN9wP/MdmiNdpaXkt2LP5iKodkTgKOk/SRw61So+2wdXRU2Es6mkrQ3xQRd2TNuyXNy+bPo/JKGdrkKx0i4jfAILCY9qvlXOD9kkapfPPpOyXdSPvVAUBEPJfd7wHupPLtru1Yyw5gR/bfIsBtVMK/HWuZcCHwSETszh63Wy3vAp6JiF9HxP8BdwB/ToF1dEzYSxKwFtgaEVdXzboLuDybvpzKsfyJ9kskHSPpVGAB8PCRGu/hSDpB
0uxseiaVX4QnaLNaImJFRJwcEd1U/sX+74j4CG1WB4Ck4yS9fmKayvHUzbRhLRHxK+CXkk7Pms4HHqcNa6myhFcP4UD71fIs8HZJx2ZZdj6wlSLraPUbEwW+wfEXVP6NeQzYlN3eC7yByhuET2b3x1et8wUq72JvAy5sdQ1V43oL8LOsls3Al7L2tqulanxlXn2Dtu3qoHKc+9HstgX4QrvWko1tITCU/Y59D5jTxrUcC7wAzKpqa7tagK9QeVG3Gfh3KmfaFFaHvy7BzCwBHXMYx8zMDs1hb2aWAIe9mVkCHPZmZglw2JuZJcBhb2aWAIe9mVkC/h8fJQD5CcwV0QAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pokemon.features.Total.hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data cleaning\n", "\n", "## NA's\n", "\n", "Identify and remove NA's is one of the basic capabilities in pandas. To ease that step, Dataset gives you info about features that contain NA's in the `describe()` method. If any, you can ask what features are presenting empty values by:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Type 2']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.nas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and remove them, to check that everything worked fine (no need to say what features to fix)." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.drop_na()\n", "pokemon.nas() # <- this should return an empty array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replace NA\n", "\n", "If you want to replace NA instead of removing, use `replace_na()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outliers\n", "\n", "To identify outliers, you can use the method `outliers()`. You can specify how many neighbours to consider when evaluating if a sample is an outlier (default value is 20). The method will tell you what indices in the dataset contain samples that might be outliers. From this point is your decision to remove them or not, and properly evaluate the effect of that eventual removal." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 11, 20, 42, 49, 54, 105, 106, 110, 377])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.outliers()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.drop_samples(pokemon.outliers())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation\n", "\n", "Besides all the correlogram capabilities provided by scikit-learn and many other libraries, Dataset lets you compute the correlation between features directly. No plot is produced, but all the relevant information is returned in a single call. The key parameter is the threshold used to decide whether two features are considered correlated."
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Total', 'Sp. Atk', 0.7448157606316953),\n", " ('Total', 'Sp. Def', 0.7409062430885279),\n", " ('Total', 'Attack', 0.7370599898479868),\n", " ('Total', 'HP', 0.7224040324193493)]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.correlated(threshold=0.7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To consider correlation with the target variable, first unset the target (which returns it to the list of features) and then compute its correlation with the remaining features of the same type." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Generation', 'Type 2', 0.2806336628669287),\n", " ('Type 2', 'Legendary', 0.26714501722584993),\n", " ('Type 1', 'Type 2', 0.24543264607050674),\n", " ('Generation', 'Type 1', 0.23603346946402512),\n", " ('Type 1', 'Legendary', 0.19440503723011787),\n", " ('Generation', 'Legendary', 0.1889106739606925)]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.unset_target().to_categorical('Legendary').categorical_correlated(threshold=0.1)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "pokemon.set_target('Legendary');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Under represented features\n", "\n", "Dataset has a method to check whether some values of a feature are under-represented. This happens when one or more of a feature's possible values appear in only a small number of samples.\n", "\n", "To check whether your dataset has this problem, simply type:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.under_represented_features()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which in our case returns an empty list." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge categories\n", "\n", "Our dataset does not have this problem. If it did, we could merge all under-represented categories together to obtain a more balanced representation. 
To do so, we use\n", "\n", "    my_data.merge_categories(column='color', old_values=['grey', 'black'], new_value='dark')\n", " \n", "to merge the values `grey` and `black` into a new category `dark`.\n", "\n", "### Merge values\n", "\n", "To merge values of a numerical feature, we use:\n", "\n", "    my_data.merge_values(column='years', old_values=['2001', '2002'], new_value='2000')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data transformation\n", "\n", "## One-hot encoding\n", "\n", "Some categorical features are better converted to numerical ones through one-hot encoding. To do so, we call the method `onehot_encode()`, specifying the list of features we want to convert. If no name is given, all categorical variables are _onehot-encoded_.\n", "\n", "In this case, the feature `Generation` was previously converted from numerical to categorical, and we now transform it into its dummy-variable version, which produces 6 new variables called `Generation_1` to `Generation_6`." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features Summary (all):\n", "'Attack' : float64 Min.(20.0) 1stQ(60.0) Med.(80.0) Mean(83.0) 3rdQ(103.) Max.(190.)\n", "'Defense' : float64 Min.(15.0) 1stQ(55.0) Med.(75.0) Mean(78.2) 3rdQ(100.) Max.(180.)\n", "'HP' : float64 Min.(1.0) 1stQ(55.0) Med.(70.0) Mean(70.7) 3rdQ(85.0) Max.(150.)\n", "'Name' : object 405 categs. 'Bulbasaur'(1, 0.0025) 'Ivysaur'(1, 0.0025) 'Venusaur'(1, 0.0025) 'VenusaurMega Venusaur'(1, 0.0025) ...\n", "'Sp. Atk' : float64 Min.(20.0) 1stQ(50.0) Med.(70.0) Mean(77.3) 3rdQ(100.) Max.(180.)\n", "'Sp. Def' : float64 Min.(20.0) 1stQ(55.0) Med.(75.0) Mean(75.4) 3rdQ(95.0) Max.(154.)\n", "'Speed' : float64 Min.(10.0) 1stQ(50.0) Med.(70.0) Mean(70.9) 3rdQ(92.0) Max.(160.)\n", "'Total' : float64 Min.(190.) 1stQ(352.) Med.(474.) Mean(455.) 3rdQ(530.) 
Max.(780.)\n", "'Type 1' : object 18 categs. 'Grass'(51, 0.1259) 'Fire'(50, 0.1235) 'Bug'(37, 0.0914) 'Normal'(36, 0.0889) ...\n", "'Type 2' : object 18 categs. 'Poison'(97, 0.2395) 'Flying'(33, 0.0815) 'Dragon'(32, 0.0790) 'Ground'(32, 0.0790) ...\n", "'Generation_1': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.18) 3rdQ(0.0) Max.(1.0)\n", "'Generation_2': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(1.0)\n", "'Generation_3': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.20) 3rdQ(0.0) Max.(1.0)\n", "'Generation_4': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.16) 3rdQ(0.0) Max.(1.0)\n", "'Generation_5': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.20) 3rdQ(0.0) Max.(1.0)\n", "'Generation_6': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(1.0)\n", "'Legendary' : object 2 categs. 'False'(394, 0.9728) 'True'(11, 0.0272) \n" ] } ], "source": [ "pokemon.onehot_encode('Generation').summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discretize\n", "\n", "On other occasions we need to turn a numerical variable into a categorical one by discretizing (binning) it, that is, by specifying the ranges of values, or bins, that define each category.\n", "\n", "For example, the feature `Speed` (which ranges between 10 and 160) could be discretized into three ranges, as follows:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features Summary (all):\n", "'Attack' : float64 Min.(20.0) 1stQ(60.0) Med.(80.0) Mean(83.0) 3rdQ(103.) Max.(190.)\n", "'Defense' : float64 Min.(15.0) 1stQ(55.0) Med.(75.0) Mean(78.2) 3rdQ(100.) Max.(180.)\n", "'HP' : float64 Min.(1.0) 1stQ(55.0) Med.(70.0) Mean(70.7) 3rdQ(85.0) Max.(150.)\n", "'Name' : object 405 categs. 'Bulbasaur'(1, 0.0025) 'Ivysaur'(1, 0.0025) 'Venusaur'(1, 0.0025) 'VenusaurMega Venusaur'(1, 0.0025) ...\n", "'Sp. 
Atk' : float64 Min.(20.0) 1stQ(50.0) Med.(70.0) Mean(77.3) 3rdQ(100.) Max.(180.)\n", "'Sp. Def' : float64 Min.(20.0) 1stQ(55.0) Med.(75.0) Mean(75.4) 3rdQ(95.0) Max.(154.)\n", "'Speed' : category 3 categs. 'low'(212, 0.5261) 'mid'(165, 0.4094) 'high'(26, 0.0645) \n", "'Total' : float64 Min.(190.) 1stQ(352.) Med.(474.) Mean(455.) 3rdQ(530.) Max.(780.)\n", "'Type 1' : object 18 categs. 'Grass'(51, 0.1259) 'Fire'(50, 0.1235) 'Bug'(37, 0.0914) 'Normal'(36, 0.0889) ...\n", "'Type 2' : object 18 categs. 'Poison'(97, 0.2395) 'Flying'(33, 0.0815) 'Dragon'(32, 0.0790) 'Ground'(32, 0.0790) ...\n", "'Generation_1': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.18) 3rdQ(0.0) Max.(1.0)\n", "'Generation_2': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(1.0)\n", "'Generation_3': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.20) 3rdQ(0.0) Max.(1.0)\n", "'Generation_4': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.16) 3rdQ(0.0) Max.(1.0)\n", "'Generation_5': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.20) 3rdQ(0.0) Max.(1.0)\n", "'Generation_6': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(1.0)\n", "'Legendary' : object 2 categs. 'False'(394, 0.9728) 'True'(11, 0.0272) \n" ] } ], "source": [ "pokemon.discretize('Speed', [(10, 60),(60, 110),(110, 150)], \n", " category_names=['low', 'mid', 'high']).summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skewness\n", "\n", "Another option is to correct the skewness of the features that exhibit it. 
We can do it in one go by calling `fix_skewness()`, or, if we first want to check which features are skewed, we can call `skewed_features()`.\n", "\n", "Let's start by finding out which features are skewed (if any):\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "Generation_6 2.324424\n", "Generation_2 2.221659\n", "Generation_4 1.800833\n", "Generation_1 1.663678\n", "Generation_3 1.480842\n", "Generation_5 1.480842\n", "Sp. Atk 0.673112\n", "Attack 0.533859\n", "HP 0.498398\n", "Defense 0.471578\n", "Sp. Def 0.438338\n", "Total 0.070545\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.skewed_features()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fix the skewness and plot the histogram of `HP` before and after, side by side." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAeYAAADCCAYAAACc2WFbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAUD0lEQVR4nO3db6xk9X3f8fenENsYxzKI7u0GUJdI2ySYbRzrynVjKb3pmpoa10seUGHhaLGRtpGI7URbxbv2AypFltZKcIqUptIWCBsV4xBiCxQnDpttrqw+AAcw6fLHFGq2eM2aderSZEmFe91vH8zZ+gJz/82fO787835JVzPzm3Nmvt97zpnv+fs7qSokSVIb/s6kA5AkST9kYZYkqSEWZkmSGmJhliSpIRZmSZIaYmGWJKkh5046AICLLrqoduzYseL7L7/8Mueff/7mBdQI854t68n7kUce+auq+rubFNJA1lqe1zIt039a8oDpyaW1PFZanpsozDt27ODhhx9e8f3FxUUWFhY2L6BGmPdsWU/eSf775kQzuLWW57VMy/SfljxgenJpLY+Vlmd3ZUuS1BALsyRJDbEwS5LUEAuzJEkNWbMwJ7kjyekkjy9ruzDJ0STPdI8XLHvvYJJnkzyd5H3jClySpGm0nrOy7wR+G/i9ZW0HgGNVdSjJge71J5NcDlwHvB34MeDPkvyDqvrBaMOePjsOfPl1bft3LXFDn/bVnDh09ahCkjQl+v2+bMTy3yJ/Y8ZvzS3mqvoq8L3XNO8BjnTPjwDXLGv/QlW9UlXPAc8C7xpRrJIkTb1Br2Oeq6pTAFV1Ksm2rv1i4MFlw53s2l4nyT5gH8Dc3ByLi4srftmZM2dWfX8a7N+19Lq2ufP6t69mGv5PszC9+9mMvJPcAXwAOF1VV3RtvwH8C+D7wH8DPlJVL3XvHQRuBH4AfLyq/nSsAUoaeQcj6dNW/QasqsPAYYD5+fla7aLv1i4KH4d+u6z371riluMbm0Qnrl8YUUSTMwvTu59NyvtOXn9o6ihwsKqWknwWOIiHpqSJGfSs7BeTbAfoHk937SeBS5cNdwnwwuDhSRqlfoemquqBqjq7a+ZBessteGhKmohBt5jvB/YCh7rH+5a1fz7J5+itYe8EvjZskJI2zUeB3++ej+XQ1Fqm5VBGS3ls9JDYay0/rNZKToNoaZqsZs3CnORuYAG4KMlJ4GZ6BfmeJDcCzwPXAlTVE0nuAZ4EloCb3O0lbQ1JPk1vub3rbFOfwYY+NLWWaTmU0VIeG72647WWH1bbyofLWpomq1mzMFfVh1Z4a/cKw38G+MwwQUnaXEn20jspbHdVnS2+HpqSJqCJu0tpdIa9XnE5r1ecDUmuAj4J/JOq+ttlb3loSpoAC7M0Q1Y4NHUQeCNwNAnAg1X1Sx6akibDwizNkBUOTd2+yvAempI2mTexkCSpIRZmSZIaYmGWJKkhFmZJkhpiYZYkqSEWZkmSGmJhliSpIRZmSZIaYmGWJKkhFmZJkhpiYZYkqSEWZkmSGmJhliSpIRZmSZIaYmGWJKkhFmZJkhpiYZZmSJI7kpxO8viytguTHE3yTPd4wbL3DiZ5NsnTSd43mail2WJhlmbLncBVr2k7AByrqp3Ase41SS4HrgPe3o3zO0nO2bxQpdk0VGFO8qtJnkjyeJK7k7xptbVvSZNVVV8Fvvea5j3Ake75EeCaZe1fqKpXquo54FngXZsSqDTDBi7MSS4GPg7MV9UVwDn01q77rn1LatZcVZ0C6B63de0XA99aNtzJrk3SGJ07gvHPS/J/gDcDLwAHgYXu/SPAIvDJIb9H0uZLn7bqO2CyD9gHMDc3x+Li4sBfeubMmaHGb0VLeezftTTU+HPn/fAzWslpEC1Nk9UMXJir6ttJfhN4HvjfwANV9UCSV619J9m26gdJmrQXk2zvltftwOmu/SRw6bLhLqG38v06VXUYOAwwPz9fCwsLAwe
zuLjIMOO3oqU8bjjw5aHG379riVuO98rFiesXRhDRZLQ0TVYzcGHujh3vAS4DXgL+IMmHNzD+utewt8pazjD6rdEuX0udhEn9z2dhevczwbzvB/YCh7rH+5a1fz7J54AfA3YCX5tEgNIsGWZX9nuB56rquwBJvgj8LCuvfb/KRtawt8pazjD6rdEuX0udhEmtGc/C9O5nM/JOcje9Q00XJTkJ3EyvIN+T5EZ6e8CuBaiqJ5LcAzwJLAE3VdUPxhqgpKEK8/PAu5O8md6u7N3Aw8DL9F/7ljRhVfWhFd7avcLwnwE+M76IJL3WMMeYH0pyL/AovbXpr9PbAn4Lfda+JUnS2obaT1pVN9PbFbbcK6yw9i1JklZnz1+SJDXEwixJUkMmd8qvmrdjyGsfzzpx6OqRfI4kzQK3mCVJaoiFWZKkhliYJUlqiIVZkqSGWJglSWqIhVmSpIZYmCVJaoiFWZKkhliYJUlqiIVZkqSGWJglSWqIhVkSAEl+NckTSR5PcneSNyW5MMnRJM90jxdMOk5p2lmYJZHkYuDjwHxVXQGcA1wHHACOVdVO4Fj3WtIYWZglnXUucF6Sc4E3Ay8Ae4Aj3ftHgGsmFJs0MyzMkqiqbwO/CTwPnAL+V1U9AMxV1alumFPAtslFKc0G78csie7Y8R7gMuAl4A+SfHgD4+8D9gHMzc2xuLg4cCxnzpwZavxWtJTH/l1LQ40/d94PP6OVnAbR0jRZjYVZEsB7geeq6rsASb4I/CzwYpLtVXUqyXbgdL+Rq+owcBhgfn6+FhYWBg5kcXGRYcZvRUt53HDgy0ONv3/XErcc78rF8ZdHEBGcOHT1SD5nI1qaJqtxV7Yk6O3CfneSNycJsBt4Crgf2NsNsxe4b0LxSTPDLWZJVNVDSe4FHgWWgK/T2wJ+C3BPkhvpFe9rJxelNBuGKsxJ3gbcBlwBFPBR4Gng94EdwAngX1bV/xwqSkljV1U3Aze/pvkVelvPkjbJsLuybwW+UlU/Cfw0vV1fXvcoSdKABi7MSd4K/BxwO0BVfb+qXsLrHiVJGtgwW8w/DnwX+N0kX09yW5Lz8bpHSZIGNswx5nOBdwIf604cuZUN7LbeyHWPW+Xas2H0u85w+bWDW9lGp90sTO9+ZjVvrWzHkJc5aWsapjCfBE5W1UPd63vpFeaRX/e4Va49G0a/6wxfde3gFnbi+oUNDT8L07ufWc1b0qsNvCu7qr4DfCvJT3RNu4En8bpHSZIGNuzm2MeAu5K8Afgm8BF6xd7rHiVJGsBQhbmqHgPm+7zldY+SpBWN8vj5JLr3HCe75JQkqSEWZkmSGmJhliSpIRZmSZIaYmGWJKkhFmZJkhpiYZYkqSEWZkmSGmJhlgRAkrcluTfJN5I8leQfJ7kwydEkz3SPF0w6TmnaWZglnXUr8JWq+kngp4Gn6N2Y5lhV7QSOsYE7yEkajIVZEkneCvwccDtAVX2/ql4C9gBHusGOANdMJkJpdliYJQH8OPBd4HeTfD3JbUnOB+aq6hRA97htkkFKs2Dr3+xX0iicC7wT+FhVPZTkVjaw2zrJPmAfwNzcHIuLiwMHcubMmaHGb8Uo8ti/a2k0wQxp7rx2Yulnvf/nrTJvWZiHMMq7o0yzjf6f9u9a4oYVxpm2u8g05CRwsqoe6l7fS68wv5hke1WdSrIdON1v5Ko6DBwGmJ+fr4WFhYEDWVxcZJjxWzGKPFZaDjbb/l1L3HK83XJx4vqFdQ23VeYtd2VLoqq+A3wryU90TbuBJ4H7gb1d217gvgmEJ82UdleBJG22jwF3JXkD8E3gI/RW3u9JciPwPHDtBOOTZoKFWRIAVfUYMN/nrd2bHYs0y9yVLUlSQyzMkiQ1xMIsSVJDLMySJDXEwixJUkOGLsxJzum68Puj7rV3o5EkaUCj2GL+BL270Jzl3WgkSRrQUIU5ySXA1cBty5q9G40kSQMatoORfwv8GvCjy9pedTeaJH3vRrORTu9b7Xh83J2
6t95x/LislneL88GotDqfS9pcAxfmJB8ATlfVI0kWNjr+Rjq9b7Xj8XF3MN96x/Hjslre6+2sfitqdT6XtLmG+dV/D/DBJO8H3gS8Ncl/ZJ13o5EkSa838DHmqjpYVZdU1Q7gOuA/VdWH8W40kiQNbBzXMR8CrkzyDHBl91qSJK3DSA5gVtUisNg9/x94NxpJkgZiz1+SJDXEwixJUkMszJL+P7vYlSbPwixpObvYlSbMwiwJsItdqRUWZklnne1i9/8ua3tVF7tA3y52JY3O7PX3KOl1hu1idyN9369lWvoMH0UerfSV33q//ev9P2+VecvCLAmG7GJ3I33fr2Va+gwfRR7j7o9/vVrvt3+9fehvlXnLXdmS7GJXaoiFWdJq7GJX2mTt7puQNBF2sStNllvMkiQ1xMIsSVJD3JUtSSO0ozuTev+upWbOqtbW4hazJEkNsTBLktQQC7MkSQ2xMEuS1BALsyRJDbEwS5LUEAuzJEkNsTBLktSQgQtzkkuT/HmSp5I8keQTXfuFSY4meaZ7vGB04UqSNN2G2WJeAvZX1U8B7wZuSnI5cAA4VlU7gWPda0mStA4DF+aqOlVVj3bP/wZ4CrgY2AMc6QY7AlwzbJCSJM2KkfSVnWQH8DPAQ8BcVZ2CXvFOsm2FcfYB+wDm5uZYXFxc8fPPnDmz6vuTsn/X0lg/f+688X9Hi1bLu8X5YFRanc8lba6hC3OStwB/CPxKVf11knWNV1WHgcMA8/PztbCwsOKwi4uLrPb+pIy7g/r9u5a45fjs3WdktbxPXL+wucFsolbnc0mba6izspP8CL2ifFdVfbFrfjHJ9u797cDp4UKUNG6ezCm1Y5izsgPcDjxVVZ9b9tb9wN7u+V7gvsHDk7RJPJlTasQwW8zvAX4R+KdJHuv+3g8cAq5M8gxwZfdaUsM8mVNqx8AHMKvqPwMrHVDePejnSpqsQU7mlDQ6s3dmkba0HSM64e7EoatH8jnTZtCTOTdylcVatvrZ6WevKJimqypaz2W988tWmbcszJKA1U/m7LaWVzyZcyNXWaxlq5+dfvZqjWm6qqL1XNZ7tcZWmbfsK1uSJ3NKDWl3FUjSZjp7MufxJI91bZ+id/LmPUluBJ4Hrp1QfNLMsDBL8mROqSEWZknSlrbek0L371patcfGVk4K9RizJEkNsTBLktQQC7MkSQ2xMEuS1BALsyRJDZnJs7JH1a2jJEmj5hazJEkNsTBLktSQmdyVLY3ycEYrnRJImg5uMUuS1BALsyRJDbEwS5LUEAuzJEkN8eQvaUijOpHszqvOH8nnSBpMKyeFusUsSVJDxrbFnOQq4FbgHOC2qjo0ru+SND6zsizbI6BaMZYt5iTnAP8O+OfA5cCHklw+ju+SND4uy9LmG9cW87uAZ6vqmwBJvgDsAZ4c5kNdo5U23ViWZUkrG1dhvhj41rLXJ4F/NKbvkjQ+Y1mWV1vJ3r9riRvWuRJur2uaRuMqzOnTVq8aINkH7Otenkny9CqfdxHwVyOKbcv4uHnPlJ//7Lry/vubEcsyay7LsOHleVUbmf757KDfMn7TNB9PSy6bmcc6582+y/O4CvNJ4NJlry8BXlg+QFUdBg6v58OSPFxV86MLb2sw79nSaN5rLsuwseV5LY3+HzZsWvKA6cllq+Qxrsul/gLYmeSyJG8ArgPuH9N3SRofl2Vpk41li7mqlpL8MvCn9C6xuKOqnhjHd0kaH5dlafON7Trmqvpj4I9H9HEj2UW2BZn3bGky7xEvy+vR5P9hANOSB0xPLlsij1S97jwOSZI0IXbJKUlSQ5ovzEmuSvJ0kmeTHJh0POOU5ESS40keS/Jw13ZhkqNJnukeL5h0nMNKckeS00keX9a2Yp5JDnbT/+kk75tM1MNbIe9/k+Tb3TR/LMn7l703FXkPI8m/TlJJLpp0LINI8htJvpHkvyT5UpK3TTqmjZiG398klyb58yRPJXkiyScmHdNami7MM9od4M9X1Tu
WndJ/ADhWVTuBY93rre5O4KrXtPXNs5ve1wFv78b5nW6+2Iru5PV5A/xWN83f0R3Pnba8B5LkUuBK4PlJxzKEo8AVVfUPgf8KHJxwPOs2Rb+/S8D+qvop4N3ATa3n0XRhZll3gFX1feBsd4CzZA9wpHt+BLhmgrGMRFV9Ffjea5pXynMP8IWqeqWqngOepTdfbDkr5L2Sqcl7CL8F/Bp9OjTZKqrqgapa6l4+SO868K1iKn5/q+pUVT3aPf8b4Cl6Pdo1q/XC3K87wKb/oUMq4IEkj3Q9KQHMVdUp6M1gwLaJRTdeK+U5C/PAL3e7Ou9Ytgt/FvJeUZIPAt+uqr+cdCwj9FHgTyYdxAZM3TyYZAfwM8BDk41kdWO7XGpE1tUd4BR5T1W9kGQbcDTJNyYdUAOmfR7498Cv08vp14Fb6P2AT3veJPkz4O/1eevTwKeAf7a5EQ1mtTyq6r5umE/T26V612bGNqSpmgeTvAX4Q+BXquqvJx3PalovzOvqDnBaVNUL3ePpJF+ityvpxSTbq+pUku3A6YkGOT4r5TnV80BVvXj2eZL/APxR93Kq8waoqvf2a0+yC7gM+Msk0Mv90STvqqrvbGKI67JSHmcl2Qt8ANhdW+v61KmZB5P8CL2ifFdVfXHS8ayl9V3ZM9MdYJLzk/zo2ef0thYep5fv3m6wvcB9k4lw7FbK837guiRvTHIZsBP42gTiG4tuJeSsX6A3zWHK815NVR2vqm1VtaOqdtArEO9ssSivJclVwCeBD1bV3046ng2ait/f9NbubgeeqqrPTTqe9Wh6i3nGugOcA77UbSGcC3y+qr6S5C+Ae5LcSO/s1GsnGONIJLkbWAAuSnISuBk4RJ88q+qJJPfQu//vEnBTVf1gIoEPaYW8F5K8g94uwhPAv4LpynvG/TbwRnqHpgAerKpfmmxI6zNFv7/vAX4ROJ7ksa7tU2evgGiRPX9JktSQ1ndlS5I0UyzMkiQ1xMIsSVJDLMySJDXEwixJUkMszJIkNcTCLElSQyzMkiQ15P8BdiPJZ8xbWY4AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(8,3))\n", "plt.subplot(121)\n", "pokemon.features['HP'].hist()\n", "\n", "# Fix skewness\n", "pokemon.fix_skewness()\n", "\n", "plt.subplot(122)\n", "pokemon.features['HP'].hist()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scale\n", "\n", "It depends on the method that you use, but it is normally accepted that scaling your numeric features is a good practice. If you want to do so, you can easily do it by calling:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.scale(method='MinMaxScaler')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you don't specify any parameters, scaling will be applied to numerical features, and the method will return the dataset with those features already scaled using StandardScaler. But you can also specify `MinMaxScaler` as method, if you want a different scaling method to be applied. The result is that numerical features range now between 0 and 1. \n", "\n", "To confirm that scaling worked properly let's plot the histogram of the feature `Totals`, to see that the X-axis is now ranging between 0 and 1, instead of 0 and 800, as in the previous section plot." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAPlElEQVR4nO3df6zdd13H8efLlsnohbWzeNNs4K1mgpOJYVdAUHKvJaEMYmfCkiE/OjLTGAUXM5MV/nB/mMX6B0YUCWmAtAbCzRyLm0zQpniZBjdsYdCNipuAZWW2AlvxTgIW3v5xT8x1u03PPT/u6fmc5yNpzvn+Op/3O+f2db/3c8/3e1NVSJLa8iOjLkCSNHiGuyQ1yHCXpAYZ7pLUIMNdkhq0cdQFAGzdurVmZmZ6Pv7JJ59k06ZNgyvoAjdp/YI9Twp7XpujR49+s6qeu9q2CyLcZ2ZmOHLkSM/HLy4uMjc3N7iCLnCT1i/Y86Sw57VJ8u/n2ua0jCQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNeiCuEJVupDN7L1nJOMe2DlZl+FrsDxzl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBp033JN8KMnpJA+uWHdpkkNJHu48blmx7Z1JHkny5SSvGVbhkqRz6+bM/QCw8ynr9gKHq+oK4HBnmSRXAtcDP9s55n1JNgysWklSV84b7lV1L/Dtp6zeBRzsPD8IXLti/UJVfa+qvgo8Arx0QLVKkrqUqjr/TskM8PGqelFn+Ymq2rxi++NVtSXJe4H7qurDnfUfBD5RVXes8pp7gD0A09PTVy8sLPTcxNLSElNTUz0fP24mrV8Ybc/HTp4ZybjbL9ng+zwB+ul5fn7+aFXNrrZt0HeFzCrrVv3uUVX7gf0As7OzNTc31/Ogi4uL9HP8uJm0fmG0Pd8wwrtC+j63b1g99/ppmVNJtgF0Hk931j8KPG/FfpcD3+i9PElSL3oN97uB3Z3nu4G7Vqy/PsmPJtkOXAF8tr8SJUlrdd5pmSQfBeaArUkeBW4F9gG3J7kROAFcB1BVDyW5HfgScBb47ar6wZBqlySdw3nDvareeI5NO86x/23Abf0UJUnqj1eoSlKDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAb1Fe5JfjfJQ0keTPLRJM9McmmSQ0ke7jxuGVSxkqTu9BzuSS4DfgeYraoXARuA64G9wOGqugI43FmWJK2jfqdlNgIXJ9kIPAv4BrALONjZfhC4ts8xJElrlKrq/eDkJuA24LvA31XVm5I8UVWbV+zzeFU9bWomyR5gD8D09PTVCwsLPdextLTE1NRUz8ePm0nrF0bb87GTZ0Yy7vZLNvg+T4B+ep6fnz9aVbOrbdvYa0GdufRdwHbgCeAvk7y52+Oraj+wH2B2drbm5uZ6LYXFxUX6OX7cTFq/MNqeb9h7z0jGPbBzk+/zBBhWz/1My7wa+GpV/WdV/Q9wJ/AK4FSSbQCdx9P9lylJWot+wv0E8PIkz0oSYAdwHLgb2N3ZZzdwV38lSpLWqudpmaq6P8kdwOeAs8DnWZ
5mmQJuT3Ijy98ArhtEoZKk7vUc7gBVdStw61NWf4/ls3hJ0oh4haokNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqUF/hnmRzkjuS/EuS40l+McmlSQ4lebjzuGVQxUqSutPvmft7gE9W1QuBFwPHgb3A4aq6AjjcWZYkraOewz3Jc4BXAR8EqKrvV9UTwC7gYGe3g8C1/RYpSVqbVFVvByY/D+wHvsTyWftR4CbgZFVtXrHf41X1tKmZJHuAPQDT09NXLyws9FQHwNLSElNTUz0fP24mrV8Ybc/HTp4ZybjbL9ng+zwB+ul5fn7+aFXNrratn3CfBe4DXllV9yd5D/Ad4B3dhPtKs7OzdeTIkZ7qAFhcXGRubq7n48fNpPULo+15Zu89Ixn3wM5Nvs8ToJ+ek5wz3Df2UdOjwKNVdX9n+Q6W59dPJdlWVY8l2Qac7mMMreLYyTPcMKLA+dq+141kXElr0/Oce1X9B/D1JC/orNrB8hTN3cDuzrrdwF19VShJWrN+ztwB3gF8JMlFwFeAt7H8DeP2JDcCJ4Dr+hxDkrRGfYV7VT0ArDbfs6Of15Uk9ccrVCWpQYa7JDXIcJekBhnuktQgw12SGtTvRyGldTHKC7ekceSZuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDNo66AI2Xmb33jGTcm68aybDS2PLMXZIa1He4J9mQ5PNJPt5ZvjTJoSQPdx639F+mJGktBjEtcxNwHHhOZ3kvcLiq9iXZ21m+ZQDjSFono5p+O7Bz00jGbVFfZ+5JLgdeB3xgxepdwMHO84PAtf2MIUlau1RV7wcndwB/CDwb+L2qen2SJ6pq84p9Hq+qp03NJNkD7AGYnp6+emFhoec6lpaWmJqa6vn4cXP622c49d1RV7G+pi9m4nrefsmGkX1dHzt5ZiTjjrLnUeknv+bn549W1exq23qelknyeuB0VR1NMrfW46tqP7AfYHZ2tubm1vwS/2dxcZF+jh83f/aRu3j3scn6oNPNV52duJ4P7Nw0sq/rG0Y4LTNJ/5dhePnVz/+WVwK/muQa4JnAc5J8GDiVZFtVPZZkG3B6EIVKkrrX85x7Vb2zqi6vqhngeuBTVfVm4G5gd2e33cBdfVcpSVqTYfycuw+4PcmNwAnguiGMITXv2MkzI5se0fgbSLhX1SKw2Hn+LWDHIF5XktQbr1CVpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMm668fDNio/s7kzVeNZFhJY8Qzd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWpQz+Ge5HlJ/j7J8SQPJbmps/7SJIeSPNx53DK4ciVJ3ejnz+ydBW6uqs8leTZwNMkh4AbgcFXtS7IX2Avc0n+pklp37OQZbhjBn6/82r7XrfuYw9bzmXtVPVZVn+s8/y/gOHAZsAs42NntIHBtv0VKktYmVdX/iyQzwL3Ai4ATVbV5xbbHq+ppUzNJ9gB7AKanp69eWFjoefylpSWmpqZ6Pr5Xx06eWfcxAaYvhlPfHcnQI2PPk2FUPV912SXrP2hHP/k1Pz9/tKpmV9vWd7gnmQI+DdxWVXcmeaKbcF9pdna2jhw50nMNi4uLzM3N9Xx8r2ZG8OMjwM1XneXdx/qZURs/9jwZRtXzKKdl+s
mvJOcM974+LZPkGcDHgI9U1Z2d1aeSbOts3wac7mcMSdLa9fNpmQAfBI5X1R+v2HQ3sLvzfDdwV+/lSZJ60c/PP68E3gIcS/JAZ927gH3A7UluBE4A1/VXoiRprXoO96r6RyDn2Lyj19eVJPXPK1QlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQU38afVjJ89ww957Rl2GJF0wPHOXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGNXGFqiT1Y2aEV7gf2LlpKK/rmbskNchwl6QGDS3ck+xM8uUkjyTZO6xxJElPN5RwT7IB+HPgtcCVwBuTXDmMsSRJTzesM/eXAo9U1Veq6vvAArBrSGNJkp4iVTX4F03eAOysqt/oLL8FeFlVvX3FPnuAPZ3FFwBf7mPIrcA3+zh+3Exav2DPk8Ke1+Ynquq5q20Y1kchs8q6//ddpKr2A/sHMlhypKpmB/Fa42DS+gV7nhT2PDjDmpZ5FHjeiuXLgW8MaSxJ0lMMK9z/GbgiyfYkFwHXA3cPaSxJ0lMMZVqmqs4meTvwt8AG4ENV9dAwxuoYyPTOGJm0fsGeJ4U9D8hQfqEqSRotr1CVpAYZ7pLUoLEJ9/PdziDL/rSz/YtJXjKKOgepi57f1On1i0k+k+TFo6hzkLq9bUWSX0jyg841FWOtm56TzCV5IMlDST693jUOWhdf25ck+eskX+j0/LZR1DkoST6U5HSSB8+xffD5VVUX/D+Wfyn7b8BPAhcBXwCufMo+1wCfYPkz9i8H7h913evQ8yuALZ3nr52Enlfs9yngb4A3jLrudXifNwNfAp7fWf7xUde9Dj2/C/ijzvPnAt8GLhp17X30/CrgJcCD59g+8PwalzP3bm5nsAv4i1p2H7A5ybb1LnSAzttzVX2mqh7vLN7H8vUE46zb21a8A/gYcHo9ixuSbnr+deDOqjoBUFXj3nc3PRfw7CQBplgO97PrW+bgVNW9LPdwLgPPr3EJ98uAr69YfrSzbq37jJO19nMjy9/5x9l5e05yGfBrwPvXsa5h6uZ9/mlgS5LFJEeTvHXdqhuObnp+L/AzLF/8eAy4qap+uD7ljcTA82tc/hLTeW9n0OU+46TrfpLMsxzuvzTUioavm57/BLilqn6wfFI39rrpeSNwNbADuBj4pyT3VdW/Dru4Iemm59cADwC/AvwUcCjJP1TVd4Zd3IgMPL/GJdy7uZ1Ba7c86KqfJD8HfAB4bVV9a51qG5Zuep4FFjrBvhW4JsnZqvqr9Slx4Lr92v5mVT0JPJnkXuDFwLiGezc9vw3YV8sT0o8k+SrwQuCz61Piuht4fo3LtEw3tzO4G3hr57fOLwfOVNVj613oAJ235yTPB+4E3jLGZ3ErnbfnqtpeVTNVNQPcAfzWGAc7dPe1fRfwy0k2JnkW8DLg+DrXOUjd9HyC5Z9USDLN8p1jv7KuVa6vgefXWJy51zluZ5DkNzvb38/yJyeuAR4B/pvl7/xjq8uefx/4MeB9nTPZszXGd9TrsuemdNNzVR1P8kngi8APgQ9U1aofqRsHXb7PfwAcSHKM5SmLW6pqbG8FnOSjwBywNcmjwK3AM2B4+eXtBySpQeMyLSNJWgPDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXofwEA+3JR7VPn+AAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pokemon.features.Total.hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Speed` is now a category that only presents three possible values (whose labels have been provided)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature selection\n", "\n", "## Information Gain\n", "\n", "To compute the information gain provided by each categorical feature with respect to the target variable (must be set) then, you can use the method `information_gain()`. The result is a dictionary with key-value pairs, where each key corresponds to the categorical variables, and the value is the IG:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name : 0.18\n", "Speed : 0.00\n", "Type 1 : 0.04\n", "Type 2 : 0.03\n" ] } ], "source": [ "ig = pokemon.information_gain()\n", "\n", "for k in ig:\n", " print('{:<7}: {:.2f}'.format(k, ig[k]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this info, it could be safe to remove the variable `Speed`, but must always test your hypothesis first.\n", "\n", "## Stepwise feature selection\n", "\n", "We can force a feature selection process with our features by calling `stepwise_selection()`. We must be sure that the features used in the selection process are all numerical. If that is not the case, call `onehot_encode()` before using this method.\n", "\n", "In our case, we must adapat a little bit our problem to suit the needs of `stepwise_selection()`. We need target variable to be numeric. To achieve that, we must unset the target variable `Legendary` in order to bring it back list of features. 
Then we call the method `onehot_encode()` (only works over the features, not the target variable), and then set the target again.\n", "\n", "Remember that we can chain method calls:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features Summary (all):\n", "'Attack' : float64 Min.(0.0) 1stQ(0.36) Med.(0.48) Mean(0.48) 3rdQ(0.61) Max.(1.0)\n", "'Defense' : float64 Min.(0.0) 1stQ(0.37) Med.(0.50) Mean(0.50) 3rdQ(0.64) Max.(1.0)\n", "'Generation_1' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.18) 3rdQ(0.0) Max.(1.0)\n", "'Generation_2' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(0.99)\n", "'Generation_3' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.20) 3rdQ(0.0) Max.(1.0)\n", "'Generation_4' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.16) 3rdQ(0.0) Max.(1.0)\n", "'Generation_5' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.20) 3rdQ(0.0) Max.(1.0)\n", "'Generation_6' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(0.99)\n", "'HP' : float64 Min.(0.0) 1stQ(0.49) Med.(0.58) Mean(0.58) 3rdQ(0.67) Max.(0.99)\n", "'Name' : object 403 categs. 'Bulbasaur'(1, 0.0025) 'Ivysaur'(1, 0.0025) 'Venusaur'(1, 0.0025) 'VenusaurMega Venusaur'(1, 0.0025) ...\n", "'Sp. Atk' : float64 Min.(0.0) 1stQ(0.37) Med.(0.51) Mean(0.52) 3rdQ(0.68) Max.(1.0)\n", "'Sp. Def' : float64 Min.(0.0) 1stQ(0.37) Med.(0.52) Mean(0.51) 3rdQ(0.66) Max.(1.0)\n", "'Speed' : category 3 categs. 'low'(212, 0.5261) 'mid'(165, 0.4094) 'high'(26, 0.0645) \n", "'Total' : float64 Min.(0.0) 1stQ(0.30) Med.(0.51) Mean(0.47) 3rdQ(0.60) Max.(1.0)\n", "'Type 1' : object 18 categs. 'Grass'(51, 0.1266) 'Fire'(49, 0.1216) 'Bug'(36, 0.0893) 'Normal'(36, 0.0893) ...\n", "'Type 2' : object 18 categs. 
'Poison'(96, 0.2382) 'Flying'(33, 0.0819) 'Dragon'(32, 0.0794) 'Ground'(32, 0.0794) ...\n", "'Legendary_True': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.02) 3rdQ(0.0) Max.(1.0)\n" ] } ], "source": [ "pokemon.unset_target()\n", "pokemon.onehot_encode('Legendary').drop_columns('Legendary_False')\n", "pokemon.set_target('Legendary_True').summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then remove the column `Legendary_False`, since the column `Legendary_True` already behaves numerically the way we want (0.0 means False, and 1.0 means True).\n", "\n", "Now we let the stepwise algorithm decide which features could be safe to remove:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Considering only numerical features\n" ] }, { "data": { "text/plain": [ "['Generation_3', 'Generation_4']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.stepwise_selection()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a one-liner, we drop the features selected by the stepwise algorithm and then print out the summary." 
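, "\n", "\n", "Before running that one-liner, it may help to picture what a stepwise procedure does. The sketch below is a generic greedy forward selection on synthetic data. It is not the algorithm implemented by `stepwise_selection()`, whose selection criterion is not described in this notebook. The helpers `rss` and `forward_selection`, the `min_gain` parameter and the data are all hypothetical.\n", "\n", "```python
import numpy as np


def rss(X, y):
    # Residual sum of squares of a least-squares fit with an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    residuals = y - A @ coef
    return float(residuals @ residuals)


def forward_selection(X, y, names, min_gain=0.01):
    # Greedily add the feature that lowers the RSS the most, and stop when
    # the improvement, relative to the total sum of squares, is too small.
    selected = []
    total = float(((y - y.mean()) ** 2).sum())
    best = total
    while len(selected) < len(names):
        scores = {j: rss(X[:, selected + [j]], y)
                  for j in range(len(names)) if j not in selected}
        j_best = min(scores, key=scores.get)
        if (best - scores[j_best]) / total < min_gain:
            break
        selected.append(j_best)
        best = scores[j_best]
    kept = [names[j] for j in selected]
    dropped = [name for name in names if name not in kept]
    return kept, dropped


# Synthetic data: the target depends on the first two columns only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.01 * rng.normal(size=200)
kept, dropped = forward_selection(X, y, ['f0', 'f1', 'noise'])
print(kept, dropped)
```"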
] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Considering only numerical features\n", "Features Summary (all):\n", "'Attack' : float64 Min.(0.0) 1stQ(0.36) Med.(0.48) Mean(0.48) 3rdQ(0.61) Max.(1.0)\n", "'Defense' : float64 Min.(0.0) 1stQ(0.37) Med.(0.50) Mean(0.50) 3rdQ(0.64) Max.(1.0)\n", "'Generation_1' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.18) 3rdQ(0.0) Max.(1.0)\n", "'Generation_2' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(0.99)\n", "'Generation_5' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.20) 3rdQ(0.0) Max.(1.0)\n", "'Generation_6' : float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.12) 3rdQ(0.0) Max.(0.99)\n", "'HP' : float64 Min.(0.0) 1stQ(0.49) Med.(0.58) Mean(0.58) 3rdQ(0.67) Max.(0.99)\n", "'Name' : object 403 categs. 'Bulbasaur'(1, 0.0025) 'Ivysaur'(1, 0.0025) 'Venusaur'(1, 0.0025) 'VenusaurMega Venusaur'(1, 0.0025) ...\n", "'Sp. Atk' : float64 Min.(0.0) 1stQ(0.37) Med.(0.51) Mean(0.52) 3rdQ(0.68) Max.(1.0)\n", "'Sp. Def' : float64 Min.(0.0) 1stQ(0.37) Med.(0.52) Mean(0.51) 3rdQ(0.66) Max.(1.0)\n", "'Speed' : category 3 categs. 'low'(212, 0.5261) 'mid'(165, 0.4094) 'high'(26, 0.0645) \n", "'Total' : float64 Min.(0.0) 1stQ(0.30) Med.(0.51) Mean(0.47) 3rdQ(0.60) Max.(1.0)\n", "'Type 1' : object 18 categs. 'Grass'(51, 0.1266) 'Fire'(49, 0.1216) 'Bug'(36, 0.0893) 'Normal'(36, 0.0893) ...\n", "'Type 2' : object 18 categs. 'Poison'(96, 0.2382) 'Flying'(33, 0.0819) 'Dragon'(32, 0.0794) 'Ground'(32, 0.0794) ...\n", "'Legendary_True': float64 Min.(0.0) 1stQ(0.0) Med.(0.0) Mean(0.02) 3rdQ(0.0) Max.(1.0)\n" ] } ], "source": [ "pokemon.drop_columns(pokemon.stepwise_selection()).summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset Split\n", "\n", "Dataset adds a method to split your dataset according to the specified proportions between training and test. 
The method is called `split()`, and it accepts the percentage to assign to the test set as an optional parameter.\n", "\n", "We normally split by specifying the seed used by the random number generator. If you plan to repeat the split several times within a cross-validation (CV) process, you need to change the seed each time." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "X, y = pokemon.split(seed=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we obtain in `X` and `y` are two objects that contain the training and test splits as pandas `DataFrames`. To access them, we simply write:\n", "\n", "    X.train\n", "    X.test\n", "    y.train\n", "    y.test\n", " \n", "Let's check it out:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Attack</th><th>Defense</th><th>Generation_1</th><th>Generation_2</th><th>Generation_5</th><th>Generation_6</th><th>HP</th><th>Name</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Total</th><th>Type 1</th><th>Type 2</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>185</th><td>0.575444</td><td>0.335548</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.426154</td><td>Anorith</td><td>0.265005</td><td>0.328587</td><td>mid</td><td>0.309899</td><td>Rock</td><td>Bug</td></tr>\n", " <tr><th>227</th><td>0.778264</td><td>0.611023</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.556204</td><td>LopunnyMega Lopunny</td><td>0.393294</td><td>0.672391</td><td>high</td><td>0.688447</td><td>Normal</td><td>Fighting</td></tr>\n", " <tr><th>245</th><td>0.633856</td><td>0.439686</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.661813</td><td>Toxicroak</td><td>0.609585</td><td>0.453749</td><td>mid</td><td>0.541394</td><td>Poison</td><td>Fighting</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Attack Defense Generation_1 Generation_2 Generation_5 \\\n", "185 0.575444 0.335548 0.0 0.0 0.0 \n", "227 0.778264 0.611023 0.0 0.0 0.0 \n", "245 0.633856 0.439686 0.0 0.0 0.0 \n", "\n", " Generation_6 HP Name Sp. Atk Sp. Def Speed \\\n", "185 0.0 0.426154 Anorith 0.265005 0.328587 mid \n", "227 0.0 0.556204 LopunnyMega Lopunny 0.393294 0.672391 high \n", "245 0.0 0.661813 Toxicroak 0.609585 0.453749 mid \n", "\n", " Total Type 1 Type 2 \n", "185 0.309899 Rock Bug \n", "227 0.688447 Normal Fighting \n", "245 0.541394 Poison Fighting " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.train.head(3)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
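<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Legendary_True</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>360</th><td>0.0</td></tr>\n", " <tr><th>62</th><td>0.0</td></tr>\n", " <tr><th>374</th><td>0.0</td></tr>\n", " </tbody>\n", "</table>\n", "</div>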
" ], "text/plain": [ " Legendary_True\n", "360 0.0\n", "62 0.0\n", "374 0.0" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.test.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on the ML method used, you can directly pass `X.train` and `y.train` as DataFrames, or in some other cases to you need to pass the numpy array of values. In that case you simple use:\n", "\n", " X.train.values, y.train.values\n", " \n", "Same applies to test subsets." ] } ], "metadata": { "kernelspec": { "display_name": "dataset-kernel", "language": "python", "name": "dataset-kernel" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }