dataset package¶
Submodules¶
dataset.correlations module¶
- dataset.correlations.conditional_entropy(x, y)[source]¶
Calculates the conditional entropy of x given y: S(x|y)
Wikipedia: <https://en.wikipedia.org/wiki/Conditional_entropy>
- Parameters
x – list / NumPy ndarray / Pandas Series A sequence of measurements
y – list / NumPy ndarray / Pandas Series A sequence of measurements
- Returns
float
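As an illustration of the quantity described above, here is a minimal standard-library sketch of S(x|y). It is not the module's source: the natural logarithm and the counting approach are assumptions made for the example.

```python
import math
from collections import Counter


def conditional_entropy(x, y):
    """Sketch of S(x|y) = -sum over (x, y) of p(x, y) * ln(p(x, y) / p(y))."""
    n = len(x)
    joint_counts = Counter(zip(x, y))  # occurrences of each (x, y) pair
    y_counts = Counter(y)              # occurrences of each y value
    entropy = 0.0
    for (_, y_value), pair_count in joint_counts.items():
        p_xy = pair_count / n
        p_y = y_counts[y_value] / n
        entropy -= p_xy * math.log(p_xy / p_y)
    return entropy


# y determines x completely, so no uncertainty about x remains.
determined = conditional_entropy(["a", "a", "b", "b"], [1, 1, 2, 2])
# y carries no information about x, so S(x|y) equals S(x) = ln 2.
independent = conditional_entropy(["a", "b", "a", "b"], [1, 1, 2, 2])
print(determined, independent)
```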
- dataset.correlations.correlation_ratio(categories, measurements)[source]¶
Calculates the Correlation Ratio (sometimes marked by the Greek letter Eta) for categorical-continuous association. Answers the question: given a continuous value of a measurement, is it possible to know which category it is associated with? The value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means a category can be determined with absolute certainty.
Wikipedia: https://en.wikipedia.org/wiki/Correlation_ratio
- Parameters
categories – list / NumPy ndarray / Pandas Series A sequence of categorical measurements
measurements – list / NumPy ndarray / Pandas Series A sequence of continuous measurements
- Returns
float in the range of [0,1]
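The sketch below computes the correlation ratio with the standard library only. It assumes the usual definition (square root of the between-category variance over the total variance); the helper is illustrative and may differ in detail from the module's implementation.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, measurements):
    """Sketch of eta: sqrt(between-category variation / total variation)."""
    groups = defaultdict(list)
    for category, value in zip(categories, measurements):
        groups[category].append(float(value))
    values = [v for group in groups.values() for v in group]
    overall_mean = sum(values) / len(values)
    # Weighted squared distance of each category mean from the overall mean.
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    total = sum((v - overall_mean) ** 2 for v in values)
    if total == 0:
        return 0.0  # constant measurements carry no information (assumption)
    return math.sqrt(between / total)


# Each category has its own distinct measurement value.
separable = correlation_ratio(["a", "a", "b", "b"], [1, 1, 5, 5])
# Both categories share the same mean, so the measurement says nothing.
mixed = correlation_ratio(["a", "b", "a", "b"], [1, 1, 5, 5])
print(separable, mixed)
```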
- dataset.correlations.cramers_v(x, y)[source]¶
Calculates Cramer’s V statistic for categorical-categorical association. Uses the bias correction from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013): 323-328. This is a symmetric coefficient: V(x,y) = V(y,x)
Original function taken from: https://stackoverflow.com/a/46498792/5863503
Wikipedia: <https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V>
- Parameters
x – list / NumPy ndarray / Pandas Series A sequence of categorical measurements
y – list / NumPy ndarray / Pandas Series A sequence of categorical measurements
- Returns
float in the range of [0,1]
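For reference, a bias-corrected Cramér's V can be sketched without external libraries as follows. The formulas follow the correction cited above as it is commonly stated; the function is an illustrative re-implementation, assumes at least two observations, and is not the code linked from Stack Overflow.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Sketch of the bias-corrected Cramér's V for two categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))
    x_counts = Counter(x)
    y_counts = Counter(y)
    r, k = len(x_counts), len(y_counts)
    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for x_value, x_count in x_counts.items():
        for y_value, y_count in y_counts.items():
            expected = x_count * y_count / n
            chi2 += (observed[(x_value, y_value)] - expected) ** 2 / expected
    phi2 = chi2 / n
    # Bias correction of phi^2 and of the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denominator = min(k_corr - 1, r_corr - 1)
    if denominator <= 0:
        return 0.0  # a constant variable has no measurable association
    return math.sqrt(phi2_corr / denominator)


perfect = cramers_v(["a", "a", "b", "b", "c", "c"], ["u", "u", "v", "v", "w", "w"])
unrelated = cramers_v(["a", "a", "b", "b"], ["c", "d", "c", "d"])
print(perfect, unrelated)
```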
- dataset.correlations.theils_u(x, y)[source]¶
Calculates Theil’s U statistic (Uncertainty coefficient) for categorical-categorical association. This is the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x. This is an asymmetric coefficient: U(x,y) != U(y,x)
Wikipedia: <https://en.wikipedia.org/wiki/Uncertainty_coefficient>
- Parameters
x – list / NumPy ndarray / Pandas Series A sequence of categorical measurements
y – list / NumPy ndarray / Pandas Series A sequence of categorical measurements
- Returns
float in the range of [0,1]
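The asymmetry is easiest to see on a small example. The sketch below assumes the standard definition U(x|y) = (S(x) - S(x|y)) / S(x) with natural logarithms; the helper names are illustrative and not taken from the module.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy S(x) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """S(x|y), using p(x, y) / p(y) = count(x, y) / count(y)."""
    n = len(x)
    y_counts = Counter(y)
    return -sum(
        (c / n) * math.log(c / y_counts[y_value])
        for (_, y_value), c in Counter(zip(x, y)).items()
    )


def theils_u(x, y):
    """Sketch of the uncertainty coefficient of x given y."""
    s_x = entropy(x)
    if s_x == 0:
        return 1.0  # x is constant, so it is trivially known (assumption)
    return (s_x - conditional_entropy(x, y)) / s_x


x = ["a", "a", "b", "b"]
y = [1, 2, 3, 4]
# y identifies x exactly, but x only narrows y down to two candidates.
print(theils_u(x, y), theils_u(y, x))
```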
dataset.dataset module¶
This is the package dataset.
- class dataset.dataset.Dataset(data_location=None, data_frame=None, *args, **kwargs)[source]¶
Bases: object
This class allows a simpler representation of the dataset used to build a model in class. It can load a remote CSV when a URL is passed to the initialization method of the object, and it supports the most common tasks related to data preparation and feature engineering:
my_data = Dataset(URL)
my_data = Dataset.from_dataframe(my_dataframe)
- add_columns(new_features)[source]¶
Add a Series as a new column to the dataset.
- Parameters
new_features – A pandas Series object or a DataFrame with the data to be added to the Dataset. It must contain a valid name not present in the Dataset already.
Examples:
my_data.add_columns(my_series)
my_data.add_columns(pandas.Series().values)
my_data.add_columns(my_dataframe)
- aggregate(col_list, new_column, operation='sum', drop_columns=True)[source]¶
Performs an arithmetic operation on the given columns and places the result in a new column, removing the original ones unless drop_columns is False.
- Parameters
col_list – the list of columns over which the operation is done
new_column – the name of the new column to be generated from the operation
drop_columns – whether to remove the columns used to perform the aggregation
operation – the operation to be done over the column values for each row. Examples: ‘sum’, ‘diff’, ‘max’, etc. By default, the operation is the sum of the values.
- Returns
The Dataset object
Example:
If we want to sum the values of column1 and column2 into a new column called ‘column3’, we use:
my_data.aggregate(['column1', 'column2'], 'column3', 'sum')
As a result, my_data will remove column1 and column2, and the operation will be the sum of the values, as it is the default operation.
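The behaviour described above can be approximated on a plain pandas DataFrame. The function below is a hypothetical stand-in written for illustration, not the Dataset method itself, and it only covers row-wise reductions that pandas knows by name (such as 'sum', 'max', 'min' or 'mean').

```python
import pandas as pd


def aggregate(frame, col_list, new_column, operation="sum", drop_columns=True):
    """Hypothetical stand-in for Dataset.aggregate, working on a DataFrame copy."""
    result = frame.copy()
    # With axis=1, DataFrame.agg applies the named reduction across each row.
    result[new_column] = result[col_list].agg(operation, axis=1)
    if drop_columns:
        result = result.drop(columns=col_list)
    return result


frame = pd.DataFrame({"column1": [1, 2], "column2": [10, 20], "other": ["a", "b"]})
summed = aggregate(frame, ["column1", "column2"], "column3", "sum")
print(summed)
```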
- all = None¶
- categorical = None¶
Generates a correlation matrix for the categorical variables in the dataset. Calculates Cramer’s V statistic for categorical-categorical association. Uses the bias correction from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013): 323-328. This is a symmetric coefficient: V(x,y) = V(y,x). Original function taken from:
- Parameters
threshold – Limit above which a correlation is considered high.
- Returns
The list of categorical variables with HIGH correlation and the correlation matrix
- categorical_dtypes = ['bool', 'object', 'string', 'category']¶
- property categorical_features¶
- property categorical_features_na¶
Returns the features that are highly correlated with other variables, either numerical or categorical, based on the threshold. For numerical variables Spearman correlation is used; for categorical variables, cramers_v.
- Parameters
threshold – correlation limit above which features are considered highly correlated.
- Returns
the list of features that are highly correlated, and should be safe to remove.
- describe(feature_name=None, inline=False)[source]¶
Wrapper. Calls the proper feature description method, depending on whether the feature is numerical or categorical. If no arguments are passed, the description of the entire dataset is provided.
- Parameters
feature_name – The feature to be described. Default value is None, which implies that all features are described.
inline – whether the output is multiple lines or inline. This is used when describing from the summary() function or from a console or cell.
- Returns
The string, only when inline=True, that contains the description.
- TODO: Implement a limit of characters for each line that is printed
out in the screen, so that when reaching that limit ‘…’ is printed.
- describe_dataset()[source]¶
Prints out the metadata information collected when calling the metainfo() method.
- Returns
nothing
- discretize(column, bins, category_names=None)[source]¶
Makes a feature, which is normally numerical, categorical by binning its contents into the specified buckets.
- Args:
column: The name of the feature to be binned
bins: the list of bins as an array of values of the form
[(15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]
category_names: An array with names or values we want for our new categories. If None, a simple array with the ordinal number of each category is used. In the example above, it should be an array from 1 .. 5.
Returns: The dataset modified
Example:
# Variable "x3" contains the number of sons of a person as an
# integer ranging between values 0 and 10. We want to convert
# that numerical value into a categorical one with a list
# of (say) 4 possible values, for the number of sons within
# given ranges:
my_data.discretize('x3', [(0, 2), (2, 4), (4, 6), (6, 8)], [1, 2, 3, 4])
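A comparable binning can be sketched directly with pandas. The helper below is hypothetical and makes one assumption loudly: each (low, high) tuple is treated as the half-open interval [low, high). The documentation above does not say which side is closed, so the real method may differ.

```python
import pandas as pd


def discretize(series, bins, category_names=None):
    """Hypothetical sketch: bin a numerical Series into named categories."""
    # Assumption: (low, high) means low <= value < high.
    intervals = pd.IntervalIndex.from_tuples(bins, closed="left")
    binned = pd.cut(series, bins=intervals)
    if category_names is None:
        # Ordinal numbers 1..N, as described for category_names=None.
        category_names = list(range(1, len(bins) + 1))
    return binned.cat.rename_categories(category_names)


x3 = pd.Series([0, 1, 2, 5, 7, 9], name="x3")
binned_x3 = discretize(x3, [(0, 2), (2, 4), (4, 6), (6, 8)], [1, 2, 3, 4])
print(binned_x3)
```

In this sketch the value 9 falls outside every bin and is therefore left as a missing value.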
- drop_columns(columns_list)[source]¶
Drop one or a list of columns from the dataset.
- Parameters
columns_list – An array-type expression with the names of the columns to be removed from the Dataset. If a single string is passed, it is treated as the name of a single column to be dropped.
Examples:
my_data.drop_columns('column_name')
my_data.drop_columns(['column1', 'column2', 'column3'])
- drop_na()[source]¶
Drop samples with NAs from the features. If any value is infinite or -infinite, it is converted to NA, and removed also.
Examples:
my_data.drop_na()
- Returns
object
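On a plain DataFrame, the two steps described for drop_na (turn infinities into NA, then drop the affected samples) might look like the following. This is a sketch of the idea, not the method's source.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.inf, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, -np.inf],
})
# Treat +inf and -inf as missing, then drop every row with a missing value.
cleaned = frame.replace([np.inf, -np.inf], np.nan).dropna()
print(cleaned)
```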
- drop_samples(index_list)[source]¶
Remove the list of samples from the dataset.
- Parameters
index_list – The list of indices in the DataFrame to be removed from the features and the target DataFrames.
- Returns
self
- property feature_names¶
- features = None¶
- features_importance(num_features=None, num_neighbors=None, abs_imp=False)[source]¶
Computes NUMERICAL features importance, using the ReliefF algorithm as implemented in the rebate library.
- Args:
num_features: The number of features we want to display
num_neighbors: The number of neighbors to consider when computing the features importance
abs_imp: if True, importance is displayed taking the ABS()
- Returns:
A sorted dictionary with the feature names and their importance.
- fix_skewness(feature_names=None, return_series=False)[source]¶
Ensures that the numerical features in the dataset fit a normal distribution by applying the Yeo-Johnson transform. If not already scaled, they’re scaled as part of the process.
- Parameters
feature_names – Features to be fixed. If not specified, all numerical features are examined.
return_series – Return the normalized series
- Returns
The subset fitted to normal distribution, or None
- property incomplete_features¶
- information_gain()[source]¶
Computes the information gain between each categorical and target variable.
Examples:
my_data.information_gain()
Name : 0.18
Speed : 0.00
Type 1 : 0.04
Type 2 : 0.03
- Returns:
A dictionary with the IG value for each categorical feature name
- keep_columns(to_keep)[source]¶
Keep only one or a list of columns from the dataset.
- Parameters
to_keep – A string or array-like expression indicating the columns to be kept in the Dataset. The columns not in the list of names passed are dropped.
Example:
my_data.keep_columns('column_name')
my_data.keep_columns(['column1', 'column2', 'column3'])
- merge_categories(column, old_values, new_value)[source]¶
Merges a subset of the categories present in one of the columns into a new single category. This is normally done when the categories in the list are not sufficiently represented.
- Parameters
column – The column with the categories to be merged
old_values – The list of categories to be merged
new_value – The resulting new category after the merge.
- Returns
self.
Example:
my_data.merge_categories(column='color', old_values=['grey', 'black'], new_value='dark')
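The effect of the example above on a single column can be reproduced with pandas alone. The snippet uses Series.replace as a stand-in for illustration; how the Dataset method performs the merge internally is not stated here.

```python
import pandas as pd

colors = pd.Series(["grey", "black", "red", "grey", "blue"], name="color")
# Every occurrence of an old category is rewritten to the new one.
merged = colors.replace(["grey", "black"], "dark")
print(merged.value_counts())
```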
- merge_values(column, old_values, new_value)[source]¶
Same method as ‘merge_categories’ but for numerical values. Merges a subset of the values present in one of the columns into a new single value. This is normally done when the values in the list are not sufficiently represented.
- Parameters
column – The column with the values to be merged
old_values – The list of values to be merged
new_value – The resulting new value after the merge.
- Returns
self.
Example:
my_data.merge_values(column='years', old_values=['2001', '2002'], new_value='2000')
- meta = None¶
- meta_tags = ['all', 'numerical', 'categorical', 'complete', 'numerical_na', 'categorical_na', 'features', 'target']¶
- names(what='all')[source]¶
Returns the names of the columns of the dataset selected by the argument what. If it is a list, it returns the feature names in the list. If it is a keyword from: ‘all’, ‘categorical’, ‘categorical_na’, ‘numerical’, ‘numerical_na’, ‘complete’, then the list of features is extracted from the meta-information of the dataset.
- Parameters
what –
Possible values are
all: (Default) Include every feature, including the target
numerical: Only numerical features
categorical: Only categorical features
complete: Only features without NA
numerical_na: Numerical features with NA
categorical_na: Categorical features with NA
features: Only features, NOT the target variable.
target: Only the target variable.
- nas()[source]¶
Returns the list of features that present NA entries
- Returns
the list of feature names presenting NA
- property num_features¶
- property num_samples¶
- numerical = None¶
Build a correlation matrix between all the features in data set
- Parameters
threshold – Threshold above which a correlation is considered high. Default is 0.9
- Returns
The list of columns that are highly correlated and could be dropped from the dataset.
- property numerical_features¶
- property numerical_features_na¶
- onehot_encode(feature_names=None)[source]¶
Encodes the categorical features in the dataset, with OneHotEncode
- Parameters
feature_names – column or list of columns to be one-hot encoded. The only restriction is that the target variable cannot be specified in the list of columns and therefore cannot be one-hot encoded. Default = all categorical features in the dataset.
- Returns
self
Example:
# Encodes a single column named 'my_column_name'
my_data.onehot_encode('my_column_name')
# Encodes 'col1' and 'col2'
my_data.onehot_encode(['col1', 'col2'])
# Encodes all categorical features in the dataset
my_data.onehot_encode(my_data.names('categorical'))
or:
my_data.onehot_encode()
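To show what one-hot encoding produces, the snippet below uses pandas.get_dummies on a small DataFrame. This is a swapped-in technique chosen for brevity; the Dataset method may rely on a different encoder, and the column naming shown here is the pandas default rather than a documented property of the package.

```python
import pandas as pd

frame = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
# Each distinct value of 'color' becomes its own indicator column.
encoded = pd.get_dummies(frame, columns=["color"])
print(encoded)
```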
- outliers(n_neighbors=20)[source]¶
Find outliers, using LOF criteria, from the numerical features. Returns a list of indices where outliers are present
- Parameters
n_neighbors – Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used.
- # TODO Implement a simple set of methods to select from in order to
detect outliers.
- plot_covariance()[source]¶
Plots the covariance matrix as explained by scikit contributor Andreas Mueller in Columbia lectures, ordering and grouping (numerical) features with higher correlation.
- Returns:
None
- plot_density(feature_names=None, category=None)[source]¶
Double density plot(s) between feature(s) and a reference category.
- Parameters
feature_names – The name of a feature(s) in the dataset.
category – The name of the reference category we want to represent the double density plot against. If None, then the target variable is used.
- Returns
None
Example:
# represent multiple density plots, one per unique value of the
# target
my_data.plot_density(my_feature)
# represent double density plots, one per unique value of the
# categorical feature 'my_feature2'
my_data.plot_density(my_feature1, my_feature2)
# Plot double density plots for all numerical features.
my_data.plot_density(my_data.numerical_features)
# or
my_data.plot_density()
- plot_histogram(feature_names=None, category=None)[source]¶
Double histogram plot between a feature and a reference category.
- Parameters
feature_names – The name(s) of the feature(s) in the dataset.
category – The name of the reference category we want to represent the double histogram against. If None, then the target variable is used.
- Returns
None
Example:
# represent multiple histograms, one per unique value of the
# target
my_data.plot_histogram(my_feature)
# represent double histograms, one per unique value of the
# categorical feature 'my_feature2'
my_data.plot_histogram(my_feature1, my_feature2)
# or
my_data.plot_histogram()
- plot_importance(num_features=None, num_neighbors=None, abs_imp=False)[source]¶
Plots the NUMERICAL features importance, using the ReliefF algorithm as implemented in the rebate library.
- Args:
- num_features: The number of features we want to display. Default is all features.
- num_neighbors: The number of neighbors to consider when computing the features importance. Default is 20.
- abs_imp: if True, importance is displayed taking the ABS(). Default value is False.
- Returns:
None
- replace_na(column, value)[source]¶
Replace any NA occurrence from the column or list of columns passed by the value passed as second argument.
- Parameters
column – Column name or list of column names in which NAs are replaced with the value passed as the second argument
value – value to be used as replacement
- Returns
the object.
- samples_matching(value=None, feature=None)[source]¶
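Returns a list with the indexes of the samples matching a given criterion. The match can be set on the target variable, or on any other column name.
- Args:
value: The value that the samples must match.
feature: The name of the column to match against. If None, the target variable is used.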
- Returns:
A list with the index values of those samples matching.
Examples:
my_data.samples_matching('red')
returns the indices of those samples whose target matches the value red.
my_data.samples_matching(75, 'column_3')
returns the indices of those samples whose feature column_3 has the value 75.
- scale(features_of_type='numerical', method='StandardScaler', return_series=False)[source]¶
Scales numerical features in the dataset, unless the parameter features_of_type specifies any other subset selection primitive. The method used by default is the scikit-learn StandardScaler.
Examples:
# scale all my numerical features
my_data.scale()
- Parameters
features_of_type – Subset selection primitive
method – ‘StandardScaler’, ‘MinMaxScaler’
- Returns
the subset scaled.
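The two method names refer to scikit-learn scalers. Their arithmetic can be sketched with pandas alone, as below; the function is a hypothetical illustration rather than the Dataset method, and it does not guard against constant columns (which would divide by zero).

```python
import pandas as pd


def scale(frame, method="StandardScaler"):
    """Hypothetical sketch of the arithmetic behind the two scaling methods."""
    if method == "StandardScaler":
        # Zero mean and unit variance, using the population
        # standard deviation (ddof=0) as StandardScaler does.
        return (frame - frame.mean()) / frame.std(ddof=0)
    if method == "MinMaxScaler":
        # Rescale each column linearly onto the range [0, 1].
        return (frame - frame.min()) / (frame.max() - frame.min())
    raise ValueError(f"Unknown method: {method}")


frame = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 40.0]})
standardized = scale(frame)
rescaled = scale(frame, method="MinMaxScaler")
print(standardized)
print(rescaled)
```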
- select(what)[source]¶
Returns a subset of the columns of the dataset. The argument what specifies which subset of features to return. If it is a list, it returns the feature names in the list. If it is a keyword from: ‘all’, ‘categorical’, ‘categorical_na’, ‘numerical’, ‘numerical_na’, ‘complete’, ‘features’, ‘target’, then the list of features is extracted from the meta-information of the dataset.
- Parameters
what –
Possible values are
all: (Default) Include every feature, including the target
numerical: Only numerical features
categorical: Only categorical features
complete: Only features without NA
numerical_na: Numerical features with NA
categorical_na: Categorical features with NA
features: Only features, NOT the target variable.
target: Only the target variable.
- Returns
Reference to the columns specified.
- set_target(target_name)[source]¶
Set the target variable for this dataset. This will create a new property of the object called ‘target’ that will contain the target column of the dataset, and that column will be removed from the list of features.
- Parameters
target_name – The name of the column we want to be set as the target variable for this dataset.
Example:
my_data.set_target('SalePrice')
- skewed_features(threshold=0.75, fix=False, return_series=True)[source]¶
Returns the list of numerical features that present skewness. This method optionally can fix detected skewness whose ABS is greater than the threshold passed, using BoxCox method.
- Parameters
threshold – The limit above which the skew() return value marks a feature as skewed.
fix – (Default: False) Boolean indicating whether or not to fix the skewed features. If True, those with values above the threshold will be fixed using BoxCox.
return_series – (Default: True) Boolean indicating whether returning the features (pandas DataFrame) that present skewness.
- Returns
A pandas Series with the features and their skewness
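The detection step can be illustrated with pandas: compute skew() per column and keep the columns whose absolute skewness exceeds the threshold. The DataFrame and variable names below are made up for the example, and the strict comparison is an assumption.

```python
import pandas as pd

frame = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [1.0, 1.0, 1.0, 1.0, 50.0],
})
threshold = 0.75
skewness = frame.skew()
# Keep only the features whose absolute skewness exceeds the threshold.
skewed = skewness[skewness.abs() > threshold]
print(skewed)
```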
- split(seed=1024, test_size=0.2, validation_split=False)[source]¶
From a Dataset, produce splits (with or without validation) for training and test. The objects of type Split will only contain properties with the names train or test to reference the different splits.
- Parameters
seed – The seed to be used to generate the random split.
test_size – The test size as a percentage of the base dataset.
validation_split – Boolean indicating whether it is also needed to generate a third split for validation purposes, same size as the test_size.
- Returns
The X and y objects that contain the splits.
Example:
from sklearn.linear_model import LinearRegression

# Generate the splits (80-20)
X, y = my_data.split()
# Create an instance of the model, and use the training set to
# fit it, and the test set to score it.
model = LinearRegression()
model.fit(X.train, y.train)
model.score(X.test, y.test)
- stepwise_selection(initial_list=None, threshold_in=0.01, threshold_out=0.05, verbose=False)[source]¶
Performs a forward/backward feature selection based on p-values from statsmodels.api.OLS. All features involved must be numerical and of type float, so be sure to onehot_encode them before calling this method. The target variable must also be float; you can convert it back to a categorical type after calling this method. Always set threshold_in < threshold_out to avoid infinite looping.
- Parameters
initial_list – list of features to start with (column names of X)
threshold_in – include a feature if its p-value < threshold_in
threshold_out – exclude a feature if its p-value > threshold_out
verbose – whether to print the sequence of inclusions and exclusions
- Returns
List of selected features
Example:
my_data.stepwise_selection()
See <https://en.wikipedia.org/wiki/Stepwise_regression> for the details
Taken from: <https://datascience.stackexchange.com/a/24823>
- summary(what='all')[source]¶
Prints out a summary of each feature.
- Parameters
what –
Possible values are
all: (Default) Include every feature, including the target
numerical: Only numerical features
categorical: Only categorical features
complete: Only features without NA
numerical_na: Numerical features with NA
categorical_na: Categorical features with NA
features: Only features, NOT the target variable.
target: Only the target variable.
- Returns
N/A
- table(what='all', max_width=80)[source]¶
Print a tabulated version of the list of elements in a list, using a max_width display (default 80).
- Parameters
what –
Possible values are
all: (Default) Include every feature, including the target
numerical: Only numerical features
categorical: Only categorical features
complete: Only features without NA
numerical_na: Numerical features with NA
categorical_na: Categorical features with NA
features: Only features, NOT the target variable.
target: Only the target variable.
max_width – The max_width used in the display.
- Returns
None
- target = None¶
- to_categorical(to_convert)[source]¶
Convert the specified column or columns to categories
- Parameters
to_convert – column or column list to be converted
- Returns
object
- to_float(to_convert=None)[source]¶
Convert a column or list of columns to float values. The columns must be numerical.
- Args:
- to_convert: the column name or list of column names that we want to convert. If this argument is empty, then every numerical feature in the dataset is converted.
Returns: The dataset
Example:
my_data.to_float(my_data.numerical_features)
# which is equivalent to:
my_data.to_float()
# We can also specify a single or multiple features:
my_data.to_float('feature_15')
my_data.to_float(['feature_15', 'feature_21'])
- to_int(to_convert=None)[source]¶
Convert a column or list of columns to integer values. The columns must be numerical
- Args:
- to_convert: the column name or list of column names that we want to convert. If none specified, all numerical columns are converted to int type.
Returns: The dataset
Example:
my_data.to_int(my_data.numerical_features)
# which is equivalent to:
my_data.to_int()
# We can also specify a single or multiple features:
my_data.to_int('feature_15')
my_data.to_int(['feature_15', 'feature_21'])
- to_numerical(to_convert)[source]¶
Convert the specified column or columns to numbers
- Parameters
to_convert – column name or list of column names to be converted
- Returns
object
- TODO: It must be possible to perform label encoding if specified.
For example, I might want to convert a target variable with strings valued “Yes” and “No” to type “category” or to type “int” with values 1 and 0.
- under_represented_features(threshold=0.98)[source]¶
Returns the list of categorical features with under-represented categories or a clear imbalance among the values they can take.
- Parameters
threshold – The upper limit of the most represented category of the feature.
- Returns
the list of features with under-represented categories.
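One plausible reading of the threshold is "flag a feature when its most frequent category covers more than this share of the samples". The sketch below implements that reading with pandas; the function name, the strict comparison and the sample data are assumptions made for illustration.

```python
import pandas as pd


def under_represented(frame, threshold=0.98):
    """Hypothetical sketch: features dominated by a single category."""
    flagged = []
    for name in frame.columns:
        # Share of rows taken by the most frequent value of this feature.
        top_share = frame[name].value_counts(normalize=True).max()
        if top_share > threshold:
            flagged.append(name)
    return flagged


frame = pd.DataFrame({
    "mostly_yes": ["yes"] * 99 + ["no"],
    "balanced": ["a", "b"] * 50,
})
print(under_represented(frame))
```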
dataset.split module¶
- class dataset.split.Split(splits)[source]¶
Bases: object
This class represents a split from a dataset. It assigns each dataframe partition passed as argument to a different attribute of the class: ‘train’, ‘test’ (and ‘validation’). The class method ‘split’ performs the splitting of the dataframe passed, according to the parameters passed.
Example:
from src import split
X, Y = split.Split(my_dataframe, my_target_column)
- split_name = ['train', 'test', 'validation']¶
Module contents¶
Dataset module for machine learning basic dataframe manipulation and feature engineering tasks