dataset package

Submodules

dataset.correlations module

dataset.correlations.conditional_entropy(x, y)[source]

Calculates the conditional entropy of x given y: S(x|y)

Wikipedia: <https://en.wikipedia.org/wiki/Conditional_entropy>

Parameters
  • x – list / NumPy ndarray / Pandas Series. A sequence of measurements.

  • y – list / NumPy ndarray / Pandas Series. A sequence of measurements.

Returns

float
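A minimal usage sketch (the two sequences below are illustrative data, not part of the library):

from dataset.correlations import conditional_entropy

weather = ['rain', 'rain', 'sun', 'sun', 'sun']
activity = ['read', 'read', 'walk', 'walk', 'read']
s = conditional_entropy(activity, weather)  # S(activity|weather), a float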

dataset.correlations.convert(data, to)[source]
dataset.correlations.correlation_ratio(categories, measurements)[source]

Calculates the Correlation Ratio (sometimes denoted by the Greek letter eta) for categorical-continuous association. Answers the question: given a continuous measurement value, is it possible to know which category it is associated with? The value is in the range [0,1], where 0 means a category cannot be determined from a continuous measurement, and 1 means a category can be determined with absolute certainty.

Wikipedia: https://en.wikipedia.org/wiki/Correlation_ratio

Parameters
  • categories – list / NumPy ndarray / Pandas Series. A sequence of categorical measurements.

  • measurements – list / NumPy ndarray / Pandas Series. A sequence of continuous measurements.

Returns

float in the range of [0,1]
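A minimal usage sketch (the sample data is illustrative):

from dataset.correlations import correlation_ratio

categories = ['a', 'a', 'b', 'b', 'b']
measurements = [1.2, 1.0, 3.9, 4.1, 4.0]
eta = correlation_ratio(categories, measurements)  # near 1.0: the category is almost fully determined by the measurement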

dataset.correlations.cramers_v(x, y)[source]

Calculates Cramer’s V statistic for categorical-categorical association. Uses correction from Bergsma and Wicher, Journal of the Korean Statistical Society 42 (2013): 323-328. This is a symmetric coefficient: V(x,y) = V(y,x)

Original function taken from: <https://stackoverflow.com/a/46498792/5863503>

Wikipedia: <https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V>

Parameters
  • x – list / NumPy ndarray / Pandas Series. A sequence of categorical measurements.

  • y – list / NumPy ndarray / Pandas Series. A sequence of categorical measurements.

Returns

float in the range of [0,1]
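A minimal usage sketch illustrating the symmetry (the data is illustrative):

from dataset.correlations import cramers_v

x = ['red', 'red', 'blue', 'blue']
y = ['hot', 'hot', 'cold', 'cold']
v_xy = cramers_v(x, y)
v_yx = cramers_v(y, x)  # same value: the coefficient is symmetric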

dataset.correlations.theils_u(x, y)[source]

Calculates Theil’s U statistic (Uncertainty coefficient) for categorical-categorical association. This is the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x. This is an asymmetric coefficient: U(x,y) != U(y,x)

Wikipedia: <https://en.wikipedia.org/wiki/Uncertainty_coefficient>

Parameters
  • x – list / NumPy ndarray / Pandas Series. A sequence of categorical measurements.

  • y – list / NumPy ndarray / Pandas Series. A sequence of categorical measurements.

Returns

float in the range of [0,1]
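A minimal usage sketch illustrating the asymmetry (the data is illustrative):

from dataset.correlations import theils_u

x = ['a', 'a', 'b', 'b', 'c']
y = [1, 1, 2, 2, 2]
u_xy = theils_u(x, y)  # uncertainty of x given y
u_yx = theils_u(y, x)  # generally a different value: U is asymmetric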

dataset.dataset module

This is the package dataset.

class dataset.dataset.Dataset(data_location=None, data_frame=None, *args, **kwargs)[source]

Bases: object

This class provides a simpler representation of the dataset used to build a model in class. It allows loading a remote CSV by passing a URL to the object's initialization method, and supports the most common tasks related to data preparation and feature engineering:

my_data = Dataset(URL)

my_data = Dataset.from_dataframe(my_dataframe)
add_columns(new_features)[source]

Add a Series or DataFrame as new column(s) to the dataset.

Parameters

new_features – A pandas Series object or a DataFrame with the data to be added to the Dataset. It must contain a valid name not present in the Dataset already.

Examples:

my_data.add_columns(my_series)
my_data.add_columns(pandas.Series().values)
my_data.add_columns(my_dataframe)
aggregate(col_list, new_column, operation='sum', drop_columns=True)[source]

Perform an arithmetic operation on the given columns, placing the result in a new column and optionally removing the original ones.

Parameters
  • col_list – the list of columns over which the operation is done

  • new_column – the name of the new column to be generated from the operation

  • drop_columns – whether to remove the columns used to perform the aggregation

  • operation – the operation to be done over the column values for each row. Examples: ‘sum’, ‘diff’, ‘max’, etc. By default, the operation is the sum of the values.

Returns

The Dataset object

Example:

If we want to sum the values of column1 and column2 into a new column called ‘column3’, we use:

my_data.aggregate(['column1', 'column2'], 'column3', 'sum')

As a result, my_data will contain the new ‘column3’ with the sum of the values of ‘column1’ and ‘column2’, and those two columns will be removed (‘sum’ is also the default operation).

all = None
categorical = None
categorical_correlated(threshold=0.9)[source]

Generates a correlation matrix for the categorical variables in the dataset, computing Cramer’s V statistic for each categorical-categorical association (see cramers_v() above for the original source and the Bergsma-Wicher correction). This is a symmetric coefficient: V(x,y) = V(y,x).

Wikipedia: <http://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V>

Parameters

threshold – Threshold above which a correlation is considered high.

Returns

The list of categorical variables with HIGH correlation and the correlation matrix
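A minimal usage sketch; the two return values can be unpacked, and the matrix passed to plot_correlation_matrix() (the threshold value is illustrative):

high_corr, corr_matrix = my_data.categorical_correlated(threshold=0.8)
my_data.plot_correlation_matrix(corr_matrix)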

categorical_dtypes = ['bool', 'object', 'string', 'category']
property categorical_features
property categorical_features_na
correlated(threshold=0.9)[source]

Return the features that are highly correlated with other variables, either numerical or categorical, based on the threshold. Spearman correlation is used for numerical variables, and Cramer’s V for categorical ones.

Parameters

threshold – correlation limit above which features are considered highly correlated.

Returns

the list of features that are highly correlated, and should be safe to remove.
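A minimal usage sketch; since the returned features should be safe to remove, the result can be passed straight to drop_columns():

redundant = my_data.correlated(threshold=0.9)
my_data.drop_columns(redundant)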

describe(feature_name=None, inline=False)[source]

Wrapper. Calls the proper feature description method, depending on whether the feature is numerical or categorical. If no arguments are passed, the description of the entire dataset is provided.

Parameters
  • feature_name – The feature to be described. Default value is None, which implies that all features are described.

  • inline – whether the output spans multiple lines or is kept inline. Used when describing from the summary() method, or from a console or cell.

Returns

The description string (only returned when inline=True).

TODO: Implement a character limit for each line printed on the screen, so that ‘…’ is printed when the limit is reached.

describe_dataset()[source]

Print out the metadata collected when calling the metainfo() method.

Returns

nothing

discretize(column, bins, category_names=None)[source]

Makes a (normally numerical) feature categorical by binning its contents into the specified buckets.

Args:

column: The name of the feature to be binned.

bins: the list of bins as an array of tuples of the form
[(15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]

category_names: An array with the names or values we want for the new categories. If None, a simple array with the ordinal number of each category is used. In the example above, it would be an array from 1 to 5.

Returns: The modified dataset

Example:

# Variable "x3" contains the number of sons of a person as an
# integer ranging between values 0 and 10. We want to convert
# that numerical value into a categorical one with a list
# of (say) 4 possible values, for the number of sons within
# given ranges:

my_data.discretize('x3',
        [(0, 2), (2, 4), (4, 6), (6, 8)], [1, 2, 3, 4])
drop_columns(columns_list)[source]

Drop one or a list of columns from the dataset.

Parameters

columns_list – An array-like expression with the names of the columns to be removed from the Dataset. If a single string is passed, it is taken as the name of a single column to drop.

Examples:

my_data.drop_columns('column_name')
my_data.drop_columns(['column1', 'column2', 'column3'])
drop_na()[source]

Drop samples with NAs from the features. Any value that is infinite or -infinite is first converted to NA and then removed as well.

Examples:

my_data.drop_na()
Returns

object

drop_samples(index_list)[source]

Remove the list of samples from the dataset.

Parameters

index_list – The list of indices in the DataFrame to be removed from the features and the target DataFrames.

Returns

self
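Example (the index values are illustrative):

my_data.drop_samples([3, 17, 142])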

property feature_names
features = None
features_importance(num_features=None, num_neighbors=None, abs_imp=False)[source]

Computes the importance of the NUMERICAL features, using the ReliefF algorithm as implemented in the rebate library.

Args:

num_features: The number of features we want to display.

num_neighbors: The number of neighbors to consider when computing the features importance.

abs_imp: if True, the absolute value of the importance is displayed.
Returns:

A sorted dictionary with the feature names and their importance.
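A minimal usage sketch (the argument values are illustrative):

ranking = my_data.features_importance(num_features=10, num_neighbors=20)
for feature, importance in ranking.items():
    print(feature, importance)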

fix_skewness(feature_names=None, return_series=False)[source]

Ensures that the numerical features in the dataset fit a normal distribution by applying the Yeo-Johnson transform. If not already scaled, they are scaled as part of the process.

Parameters
  • feature_names – Features to be fixed. If not specified, all numerical features are examined.

  • return_series – Return the normalized series

Returns

The subset fitted to normal distribution, or None
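A minimal usage sketch (the feature names are illustrative):

# Fix every numerical feature
my_data.fix_skewness()

# Fix only the named features
my_data.fix_skewness(['age', 'income'])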

classmethod from_dataframe(df)[source]
property incomplete_features
information_gain()[source]

Computes the information gain (IG) between each categorical feature and the target variable.

Examples:

my_data.information_gain()
Name   : 0.18
Speed  : 0.00
Type 1 : 0.04
Type 2 : 0.03
Returns:

A dictionary with the IG value for each categorical feature name

keep_columns(to_keep)[source]

Keep only one or a list of columns from the dataset.

Parameters

to_keep – A string or array-like expression indicating the columns to be kept in the Dataset. The columns not in the list of names passed are dropped.

Examples:

my_data.keep_columns('column_name')
my_data.keep_columns(['column1', 'column2', 'column3'])
merge_categories(column, old_values, new_value)[source]

Merge a subset of the categories present in one of the columns into a single new category. This is normally done when the categories in the list are not representative enough on their own.

Parameters
  • column – The column with the categories to be merged

  • old_values – The list of categories to be merged

  • new_value – The resulting new category after the merge.

Returns

self.

Example:

my_data.merge_categories(column='color',
                         old_values=['grey', 'black'],
                         new_value='dark')
merge_values(column, old_values, new_value)[source]

Same method as ‘merge_categories’, but for numerical values: merge a subset of the values present in one of the columns into a single new value. This is normally done when the values in the list are not representative enough on their own.

Parameters
  • column – The column with the values to be merged

  • old_values – The list of values to be merged

  • new_value – The resulting new value after the merge.

Returns

self.

Example:

my_data.merge_values(column='years',
                         old_values=['2001', '2002'],
                         new_value='2000')
meta = None
meta_tags = ['all', 'numerical', 'categorical', 'complete', 'numerical_na', 'categorical_na', 'features', 'target']
names(what='all')[source]

Returns the names of the columns of the dataset selected by the argument what. If what is a list, the feature names in that list are returned; if it is a keyword from ‘all’, ‘categorical’, ‘categorical_na’, ‘numerical’, ‘numerical_na’, ‘complete’, the list of features is extracted from the meta-information of the dataset. A usage sketch follows the list of possible values below.

Parameters

what

Possible values are

  • all: (Default) Include every feature, including the target

  • numerical: Only numerical features

  • categorical: Only categorical features

  • complete: Only features without NA

  • numerical_na: Numerical features with NA

  • categorical_na: Categorical features with NA

  • features: Only features, NOT the target variable.

  • target: Only the target variable.
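A minimal usage sketch (the column names are illustrative):

my_data.names('numerical_na')    # numerical features containing NA
my_data.names(['col1', 'col2'])  # an explicit list is returned as-is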

nas()[source]

Returns the list of features that present NA entries

Returns

the list of feature names presenting NA

property num_features
property num_samples
numerical = None
numerical_correlated(threshold=0.9)[source]

Build a correlation matrix between all the numerical features in the dataset.

Parameters

threshold – Threshold above which a correlation is considered high. Default is 0.9.

Returns

The list of columns that are highly correlated and could be dropped from the dataset.
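A minimal usage sketch; the returned columns can be dropped directly (the threshold value is illustrative):

high_corr = my_data.numerical_correlated(threshold=0.85)
my_data.drop_columns(high_corr)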

property numerical_features
property numerical_features_na
onehot_encode(feature_names=None)[source]

Encodes the categorical features in the dataset using one-hot encoding.

Parameters

feature_names – column or list of columns to be one-hot encoded. The only restriction is that the target variable cannot be specified in the list of columns and therefore cannot be one-hot encoded. Default = all categorical features in the dataset.

Returns

self

Example:

# Encodes a single column named 'my_column_name'
my_data.onehot_encode('my_column_name')

# Encodes 'col1' and 'col2'
my_data.onehot_encode(['col1', 'col2'])

# Encodes all categorical features in the dataset
my_data.onehot_encode(my_data.names('categorical'))

or:

my_data.onehot_encode()
outliers(n_neighbors=20)[source]

Find outliers among the numerical features, using the Local Outlier Factor (LOF) criterion. Returns a list of the indices where outliers are present.

Parameters

n_neighbors – Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used.

# TODO: Implement a simple set of methods to select from in order to detect outliers.
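A minimal usage sketch; the returned indices can be fed to drop_samples():

idx = my_data.outliers(n_neighbors=20)
my_data.drop_samples(idx)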

static plot_correlation_matrix(corr_matrix)[source]
plot_covariance()[source]

Plots the covariance matrix as explained by scikit-learn contributor Andreas Mueller in his Columbia lectures, ordering and grouping the (numerical) features with higher correlation.

Returns:

None

plot_density(feature_names=None, category=None)[source]

Double density plot(s) between feature(s) and a reference category.

Parameters
  • feature_names – The name(s) of the feature(s) in the dataset.

  • category – The name of the reference category we want to represent the double density plot against. If None, then the target variable is used.

Returns

None

Example:

# represent multiple density plots, one per unique value of the
# target
my_data.plot_density(my_feature)

# represent double density plots, one per unique value of the
# categorical feature 'my_feature2'
my_data.plot_density(my_feature1, my_feature2)

# Plot double density plots for all numerical features.
my_data.plot_density(my_data.numerical_features)

# or
my_data.plot_density()
plot_histogram(feature_names=None, category=None)[source]

Double histogram plot between a feature and a reference category.

Parameters
  • feature_names – The name(s) of the feature(s) in the dataset.

  • category – The name of the reference category we want to represent the double density plot against. If None, then the target variable is used.

Returns

None

Example:

# represent multiple histograms, one per unique value of the
# target
my_data.plot_histogram(my_feature)

# represent double histograms, one per unique value of the
# categorical feature 'my_feature2'
my_data.plot_histogram(my_feature1, my_feature2)

# or, for all features
my_data.plot_histogram()
plot_importance(num_features=None, num_neighbors=None, abs_imp=False)[source]

Plots the importance of the NUMERICAL features, using the ReliefF algorithm as implemented in the rebate library.

Args:

num_features: The number of features we want to display. Default is all features.

num_neighbors: The number of neighbors to consider when computing the features importance. Default is 20.

abs_imp: if True, the absolute value of the importance is displayed. Default is False.

Returns:

None

replace_na(column, value)[source]

Replace any NA occurrence in the column or list of columns passed, using the value passed as the second argument.

Parameters
  • column – Column name, or list of column names, in which to replace NAs with the value passed as the second argument

  • value – value to be used as replacement

Returns

the object.
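A minimal usage sketch (column names and replacement values are illustrative):

my_data.replace_na('column1', 0)
my_data.replace_na(['col1', 'col2'], 'unknown')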

samples_matching(value=None, feature=None)[source]

Return a list with the indices of the samples matching a given criterion. The match can be set on the target variable, or on any other column name.

Args:

value: the value to match.

feature: the name of the column to match against. If None, the target variable is used.

Returns:

A list with the index values of the matching samples.

Examples:

my_data.samples_matching('red')

returns the indices of the samples whose target matches the value 'red'.

my_data.samples_matching(75, 'column_3')

returns the indices of the samples whose feature 'column_3' equals 75.

scale(features_of_type='numerical', method='StandardScaler', return_series=False)[source]

Scales the numerical features in the dataset, unless the parameter ‘features_of_type’ specifies another subset selection primitive. The method used is scikit-learn’s StandardScaler (default) or MinMaxScaler.

Examples:

# scale all my numerical features
my_data.scale()
Parameters
  • features_of_type – Subset selection primitive

  • method – ‘StandardScaler’, ‘MinMaxScaler’

Returns

the subset scaled.
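A minimal usage sketch of the two supported methods:

my_data.scale()                       # StandardScaler over the numerical features
my_data.scale(method='MinMaxScaler')  # or scale to the [0, 1] range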

select(what)[source]

Returns a subset of the columns of the dataset. The argument what specifies which subset of features to return: if it is a list, the feature names in that list are returned; if it is a keyword from ‘all’, ‘categorical’, ‘categorical_na’, ‘numerical’, ‘numerical_na’, ‘complete’, ‘features’, ‘target’, the list of features is extracted from the meta-information of the dataset. A usage sketch follows below.

Parameters

what

Possible values are

  • all: (Default) Include every feature, including the target

  • numerical: Only numerical features

  • categorical: Only categorical features

  • complete: Only features without NA

  • numerical_na: Numerical features with NA

  • categorical_na: Categorical features with NA

  • features: Only features, NOT the target variable.

  • target: Only the target variable.

Returns

Reference to the columns specified.
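A minimal usage sketch, separating predictors and target:

X = my_data.select('features')
y = my_data.select('target')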

set_target(target_name)[source]

Set the target variable for this dataset. This will create a new property of the object called ‘target’ that will contain the target column of the dataset, and that column will be removed from the list of features.

Parameters

target_name – The name of the column we want to be set as the target variable for this dataset.

Example:

my_data.set_target('SalePrice')
skewed_features(threshold=0.75, fix=False, return_series=True)[source]

Returns the list of numerical features that present skewness. Optionally, this method can fix the detected skewed features whose absolute skew is greater than the threshold passed, using the BoxCox method.

Parameters
  • threshold – The limit above which a feature’s skew() return value is considered to indicate a skewed feature.

  • fix – (Default: False) Boolean indicating whether or not to fix the skewed features. If True, those with values above the threshold will be fixed using BoxCox.

  • return_series – (Default: True) Boolean indicating whether to return the features that present skewness (see Returns).

Returns

A pandas Series with the features and their skewness
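A minimal usage sketch:

skewed = my_data.skewed_features()  # inspect the skewness values
my_data.skewed_features(fix=True)   # or also fix those above the threshold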

split(seed=1024, test_size=0.2, validation_split=False)[source]

From a Dataset, produce splits (with or without validation) for training and test. The returned objects of type Split contain properties named train and test (and validation, when requested) to reference the different splits.

Parameters
  • seed – The seed to be used to generate the random split.

  • test_size – The test size as a percentage of the base dataset.

  • validation_split – Boolean indicating whether it is also needed to generate a third split for validation purposes, same size as the test_size.

Returns

The X and y objects that contain the splits.

Example:

# Generate the splits (80-20)
X, y = my_data.split()

# Create an instance of the model, and use the training set to
# fit it, and the test set to score it.
model = LinearRegression()
model.fit(X.train, y.train)
model.score(X.test, y.test)
stepwise_selection(initial_list=None, threshold_in=0.01, threshold_out=0.05, verbose=False)[source]

Perform a forward/backward feature selection based on the p-values from statsmodels.api.OLS. All features involved must be numerical and of type float, so be sure to onehot_encode them before calling this method; the target variable must also be float (you can convert it back to a categorical type after calling this method). Always set threshold_in < threshold_out to avoid infinite looping.

Parameters
  • initial_list – list of features to start with (column names of X)

  • threshold_in – include a feature if its p-value < threshold_in

  • threshold_out – exclude a feature if its p-value > threshold_out

  • verbose – whether to print the sequence of inclusions and exclusions

Returns

List of selected features

Example:

my_data.stepwise_selection()

See <https://en.wikipedia.org/wiki/Stepwise_regression> for the details

Taken from: <https://datascience.stackexchange.com/a/24823>

summary(what='all')[source]

Print out a summary of each feature.

Parameters

what

Possible values are

  • all: (Default) Include every feature, including the target

  • numerical: Only numerical features

  • categorical: Only categorical features

  • complete: Only features without NA

  • numerical_na: Numerical features with NA

  • categorical_na: Categorical features with NA

  • features: Only features, NOT the target variable.

  • target: Only the target variable.

Returns

N/A

table(what='all', max_width=80)[source]

Print a tabulated version of the selected list of elements, using at most max_width characters per line (default 80).

Parameters
  • what

    Possible values are

    • all: (Default) Include every feature, including the target

    • numerical: Only numerical features

    • categorical: Only categorical features

    • complete: Only features without NA

    • numerical_na: Numerical features with NA

    • categorical_na: Categorical features with NA

    • features: Only features, NOT the target variable.

    • target: Only the target variable.

  • max_width – The max_width used in the display.

Returns

None

target = None
to_categorical(to_convert)[source]

Convert the specified column or columns to categories

Parameters

to_convert – column or column list to be converted

Returns

object

to_float(to_convert=None)[source]

Convert a column or list of columns to float values. The columns must be numerical.

Args:

to_convert: the column name or list of column names that we want to convert. If this argument is empty, every numerical feature in the dataset is converted.

Returns: The dataset

Example:

my_data.to_float(my_data.numerical_features)

# which is equivalent to:
my_data.to_float()

# We can also specify one or multiple features:
my_data.to_float('feature_15')
my_data.to_float(['feature_15', 'feature_21'])
to_int(to_convert=None)[source]

Convert a column or list of columns to integer values. The columns must be numerical.

Args:

to_convert: the column name or list of column names that we want to convert. If none specified, all numerical columns are converted to int type.

Returns: The dataset

Example:

my_data.to_int(my_data.numerical_features)

# which is equivalent to:
my_data.to_int()

# We can also specify one or multiple features:
my_data.to_int('feature_15')
my_data.to_int(['feature_15', 'feature_21'])
to_numerical(to_convert)[source]

Convert the specified column or columns to numbers

Parameters

to_convert – column name or list of column names to be converted

Returns

object

TODO: It must be possible to perform label encoding if specified. For example, one might want to convert a target variable with string values “Yes” and “No” to type “category”, or to type “int” with values 1 and 0.

under_represented_features(threshold=0.98)[source]

Returns the list of categorical features with under-represented categories, or with a clear imbalance between the values they can take.

Parameters

threshold – The upper limit of the most represented category of the feature.

Returns

the list of features with under-represented categories.
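A minimal usage sketch; the returned features can be dropped directly:

unbalanced = my_data.under_represented_features(threshold=0.98)
my_data.drop_columns(unbalanced)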

unset_target()[source]

Undo the set_target() operation. The previously set target column returns to the DataFrame with the rest of the features.

Example:

my_data.unset_target()

dataset.split module

class dataset.split.Split(splits)[source]

Bases: object

This class represents a split of a dataset. It assigns each dataframe partition passed as an argument to a different attribute of the class: ‘train’, ‘test’ (and ‘validation’). The class method ‘split’ performs the splitting of the dataframe passed, according to the parameters given.

Example:

from dataset import split

X, Y = split.Split(my_dataframe, my_target_column)

split_name = ['train', 'test', 'validation']

Module contents

Dataset module for basic machine-learning dataframe manipulation and feature engineering tasks.