TTML
Implements TTML, a tensor train based machine learning estimator.
Uses existing machine learning estimators to initialize a tensor train decomposition on a particular feature space discretization.
Then this tensor train is further optimized with Riemannian conjugate gradient descent. This library also implements much of the functionality related to tensor trains and their Riemannian optimization in general.
One can use TTMLRegressor and TTMLClassifier just like any
scikit-learn estimator. As a parameter, it only needs another scikit-learn-like
estimator. For example, here we fit a TTMLRegressor using
RandomForestRegressor for initialization.
>>> from ttml.ttml import TTMLRegressor
... import numpy as np
... from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
...
... # Try to learn the summation function in 10 dimensions
... X = np.random.normal(size=(1000, 10))
... y = np.sum(X, axis=1)
...
... # Import the class directly so the name ttml can hold the fitted model
... forest = RandomForestRegressor()
... ttml = TTMLRegressor(forest)
... ttml.fit(X, y)
TTMLRegressor(estimator=RandomForestRegressor())
Note that there is no need to fit the random forest to the data separately; this is
done automatically when calling ttml.fit() (but only if the random forest
has not yet been fitted). After fitting, we can use predict()
to predict values:
>>> ttml.predict([[0.3] * 10]) # ten times the same value
array([2.89993562])
For classification problems, only data with 0/1 labels is supported; otherwise the procedure is very similar to regression. For example, below we try to learn the function that returns 1 only if the sum of 10 numbers is positive.
>>> from ttml.ttml import TTMLClassifier
... X = np.random.normal(size=(1000, 10))
... y = (np.sum(X, axis=1) > 0).astype(int)
...
... forest = RandomForestClassifier()
... ttml = TTMLClassifier(forest)
... ttml.fit(X, y)
TTMLClassifier(estimator=RandomForestClassifier())
For prediction, we can use predict() which gives 0/1 labels as output,
predict_proba() which gives a probability, and predict_logit()
which outputs a logit.
>>> ttml.predict([[0.3] * 10])
array([1.])
>>> ttml.predict_proba([[0.3] * 10])
array([0.9999665])
>>> ttml.predict_logit([[0.3] * 10])
array([10.30384462])
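The outputs above are consistent with predict_proba() being the logistic sigmoid of predict_logit(): applying the sigmoid to the logit 10.30384462 gives back a probability close to 0.9999665. The sketch below checks this with plain Python. The helper name sigmoid and the 0.5 cut-off for the label are assumptions made for illustration, not part of the ttml API.

```python
import math


def sigmoid(logit):
    """Logistic sigmoid, mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


logit = 10.30384462         # value printed by predict_logit() above
proba = sigmoid(logit)      # should be close to the predict_proba() output
label = float(proba > 0.5)  # assumed 0.5 cut-off for the 0/1 label
print(proba, label)
```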
Other than the base estimator, the most important hyperparameters for the
ttml are the tensor train rank and the number of thresholds per feature,
controlled by the keywords max_rank and num_thresholds respectively. The
respective default values are 5 and 50. Increasing the max_rank
parameter can cause fitting and inference (prediction) to take significantly
longer. The num_thresholds parameter also affects fitting speed, but does
not affect inference speed. Both parameters affect the accuracy of the model,
and there is no general rule of thumb for their optimal values.
>>> from ttml.ttml import TTMLClassifier
... ttml = TTMLClassifier(forest, max_rank=2, num_thresholds=10)
... ttml.fit(X, y)
TTMLClassifier(estimator=RandomForestClassifier())
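To get a feeling for how these two hyperparameters drive model size, it helps to count the entries a tensor train stores. The sketch below assumes boundary ranks of 1, all internal ranks equal to max_rank, and exactly num_thresholds bins per feature (in practice the ranks and threshold counts can be smaller). tt_num_parameters is a hypothetical helper, not a ttml function.

```python
def tt_num_parameters(num_features, num_thresholds, max_rank):
    """Count entries of a tensor train with cores of shape (r_left, n, r_right).

    Assumes boundary ranks of 1 and all internal ranks equal to max_rank.
    """
    ranks = [1] + [max_rank] * (num_features - 1) + [1]
    return sum(
        ranks[i] * num_thresholds * ranks[i + 1] for i in range(num_features)
    )


# 10 features with the default-sized grid of 50 thresholds each
full_grid = 50 ** 10
compressed = tt_num_parameters(num_features=10, num_thresholds=50, max_rank=5)
print(compressed, full_grid)
```

Under these assumptions the count grows linearly in num_thresholds and quadratically in max_rank, while the full grid grows exponentially in the number of features.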
Another feature that can greatly improve the performance of ttml is early
stopping. During the Riemannian optimization phase, we can monitor the
performance on a validation dataset. This greatly reduces the tendency to
overfit. To do this, we simply pass a validation dataset to the
fit() method:
>>> from sklearn.model_selection import train_test_split
... X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15)
... ttml.fit(X_train, y_train, X_val=X_val, y_val=y_val)
TTMLClassifier(estimator=RandomForestClassifier())
- class ttml.ttml.TTML(tt, thresholds, categorical_features=None)[source]
Implements a TTML.
This stores a TensorTrain and a list of thresholds for each feature. Can be used as an sklearn-like estimator once trained.
Not intended to be initialized directly for most use cases; use
TTMLClassifier and TTMLRegressor instead.
- Parameters
tt (TensorTrain) –
thresholds (list<np.ndarray>) –
categorical_features (tuple<int> or None) –
- expand_thresholds(thresholds)[source]
Add new thresholds to the TT (in place), merging duplicate thresholds.
The TT-cores for the new thresholds are copied from those already present. The predictions of the expanded model are guaranteed to be the same.
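Why copying core slices leaves predictions unchanged can be illustrated on a single core. The sketch below is not the library's implementation. It assumes a core of shape (r_left, n, r_right), sorted thresholds ending in np.inf, and that a value x falls in the bin of the first threshold greater than or equal to x. expand_core is a hypothetical helper.

```python
import numpy as np


def expand_core(core, thresholds, new_threshold):
    """Insert one finite threshold and duplicate the slice of the bin it splits."""
    i = np.searchsorted(thresholds, new_threshold)  # bin containing new_threshold
    if thresholds[i] == new_threshold:
        return core, thresholds  # duplicate threshold: nothing to add
    new_thresholds = np.insert(thresholds, i, new_threshold)
    # Slices 0..i, then slice i again, then the rest:
    # both halves of the split bin keep the same values.
    new_core = np.concatenate([core[:, : i + 1, :], core[:, i:, :]], axis=1)
    return new_core, new_thresholds


thresholds = np.array([0.0, 1.0, np.inf])
core = np.arange(3, dtype=float).reshape(1, 3, 1)
new_core, new_thresholds = expand_core(core, thresholds, 0.5)

# The looked-up slice is the same before and after the expansion
for x in [-2.0, 0.25, 0.75, 3.0]:
    before = core[0, np.searchsorted(thresholds, x), 0]
    after = new_core[0, np.searchsorted(new_thresholds, x), 0]
    assert before == after
```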
- classmethod fit(X, y, estimator, X_val=None, y_val=None, task='regression', num_thresholds=50, tt_cross_its=5, max_rank=10, opt_steps=100, opt_tol=1e-05, tt_cross_method='dmrg', estimator_output='logit', verbose=False, _thresholds=None, _ttls_kwargs=None, _predict_fun=None)[source]
Fit a TTML, using estimator for initialization.
The tensor train is initialized using the DMRG TT-cross algorithm from the function values of the estimator. It is then further optimized to lower training loss using Riemannian conjugate gradient descent.
- Parameters
X (np.ndarray) – Training X data values
y (np.ndarray) – Training truth labels. Should be 0/1 for classification.
estimator – An sklearn-like estimator for fitting. If the estimator is not yet fitted, this calls the estimator’s .fit function. For classification tasks, take note of the estimator_output keyword. If your estimator does not have an appropriate .predict method, then pass _predict_fun as a last resort. For classification we assume this outputs logits.
X_val (np.ndarray (optional)) – Validation set to monitor during training
y_val (np.ndarray (optional)) –
task (str (default: "regression")) – Whether to use “regression” or “classification” as task. In the case of “classification” this estimator will output logits.
num_thresholds (int (default: 50)) – The number of thresholds per feature to pick from the forest. Ignored if the _thresholds kwarg is specified.
tt_cross_its (int (default: 5)) – Number of iterations for the TT-cross algorithm
max_rank (int (default: 10)) – Maximum rank for the tensor train
opt_steps (int (default: 100)) – Number of steps of Riemannian conjugate gradient descent to take
opt_tol (float (default: 1e-5)) – After 3 steps of no relative improvement of at least opt_tol, the Riemannian conjugate gradient descent is stopped. If X_val is supplied then error is monitored on validation set, otherwise on training set.
tt_cross_method (str (default: "dmrg")) – Whether to use “regular” or “dmrg” type TT-cross algorithm for initialization.
estimator_output (str (default: "logit")) – For classification tasks, the output of the estimator’s .predict function. Supported values are “logit” and “proba”. If the estimator has a .predict_proba method, then this is ignored and .predict_proba is used instead.
verbose (bool (default: False)) – If True, print convergence and debug information
_thresholds (list[np.ndarray] (optional)) – Use this list of thresholds instead of inferring from the forest. Should be a list of arrays, one array per feature. The last element of each array is expected to be np.inf. TODO: update
_ttls_kwargs (dict (optional)) – Keyword arguments to pass to the Riemannian conjugate gradient descent optimizer. See TensorTrainLineSearch for details.
_predict_fun (method (optional)) –
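The opt_tol rule above can be read in more than one way. The sketch below shows one plausible interpretation, assuming the monitored quantity is a loss to be minimized, that improvement is measured relative to the best loss seen so far, and that the three non-improving steps must be consecutive. steps_until_stop and its patience argument are hypothetical names, and the actual optimizer may differ in detail.

```python
def steps_until_stop(losses, opt_tol=1e-5, patience=3):
    """Return the step at which training would stop, or None if it never does.

    A step counts as an improvement if it lowers the best loss so far
    by at least opt_tol in relative terms.
    """
    best = losses[0]
    stale = 0
    for step, loss in enumerate(losses[1:], start=1):
        if best - loss >= opt_tol * abs(best):
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return None


# Loss plateaus after step 2, so the third stale step (step 5) triggers the stop
history = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.3]
print(steps_until_stop(history))
```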
- classmethod from_tree(decision_tree)[source]
Initialize from sklearn decision tree.
This creates a lossless encoding of the decision tree as TTML. The thresholds are precisely the decision boundaries of the tree.
- predict(X, task='regression')[source]
Predict values for the TT tt using threshold arrays and input X
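As an illustration of how a tensor train and threshold arrays can produce a prediction, the sketch below discretizes each feature with its thresholds and multiplies the selected core slices. It is a minimal NumPy sketch, not the library's code. It assumes cores of shape (r_left, n, r_right) with boundary ranks of 1, and that each value is assigned to the first threshold greater than or equal to it. tt_predict is a hypothetical helper.

```python
import numpy as np


def tt_predict(cores, thresholds, X):
    """Evaluate a tensor train at the grid cells that the rows of X fall into."""
    X = np.asarray(X, dtype=float)
    out = np.empty(len(X))
    for n, row in enumerate(X):
        vec = np.ones(1)  # running product, starts with left boundary rank 1
        for core, t, value in zip(cores, thresholds, row):
            idx = np.searchsorted(t, value)  # index of first threshold >= value
            vec = vec @ core[:, idx, :]      # contract with the selected slice
        out[n] = vec[0]                      # right boundary rank is 1
    return out


# Rank-1 example with two features and two bins each: (-inf, 0] and (0, inf)
thresholds = [np.array([0.0, np.inf]), np.array([0.0, np.inf])]
cores = [
    np.array([2.0, 3.0]).reshape(1, 2, 1),
    np.array([5.0, 7.0]).reshape(1, 2, 1),
]
predictions = tt_predict(cores, thresholds, [[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
print(predictions)
```

Each sample only touches one slice per core, so in this picture the cost per prediction depends on the ranks and not on the number of thresholds.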
- classmethod random_from_data(X, rank, n_thresholds, backend='numpy', categorical_features=None)[source]
Make a TTML with a random TT and with thresholds determined by
TTML.thresholds_from_data().
- Parameters
X (np.ndarray) –
rank (int or iterable<int>) – The tensor-train rank. If a list, it should be of length one shorter than number of features
n_thresholds (int or iterable<int>) – Number of thresholds to use for each feature. This determines the outer dimensions of the TT
backend (str, optional (default: 'numpy')) –
categorical_features (tuple<int> or None) – The indices of the categorical features if any
- static thresholds_from_data(X, n_thresholds, categorical_features=None)[source]
Compute thresholds from data by binning according to percentile.
The last threshold is always np.inf. For categorical features, all unique values are used as thresholds, and the last value is replaced by np.inf.
- Parameters
X (np.ndarray) –
n_thresholds (int or iterable<int>) – The number of thresholds to use for each feature. Ignored for categorical features.
categorical_features (tuple (optional)) – The indices of the categorical features.
- Returns
thresholds
- Return type
list<np.ndarray>
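The percentile binning described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the library's implementation: it uses evenly spaced percentiles, ignores categorical features, and drops duplicate thresholds, so a feature with few distinct values may end up with fewer than n_thresholds entries. percentile_thresholds is a hypothetical helper.

```python
import numpy as np


def percentile_thresholds(X, n_thresholds):
    """Return one sorted threshold array per feature, each ending in np.inf."""
    X = np.asarray(X, dtype=float)
    # n_thresholds evenly spaced percentiles, the last one being the maximum
    percentiles = np.linspace(0, 100, n_thresholds + 1)[1:]
    thresholds = []
    for column in X.T:
        t = np.unique(np.percentile(column, percentiles))
        t[-1] = np.inf  # the last bin is open-ended
        thresholds.append(t)
    return thresholds


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
thresholds = percentile_thresholds(X, n_thresholds=5)
print([len(t) for t in thresholds])
```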
- class ttml.ttml.TTMLClassifier(estimator, **kwargs)[source]
Wrapper to turn TTML into an sklearn classifier.
- Parameters
X (np.ndarray) – Training X data values
y (np.ndarray) – Training truth labels. Should be 0/1 for classification.
estimator – An sklearn-like estimator for fitting. If the estimator is not yet fitted, this calls the estimator’s .fit function. For classification tasks, take note of the estimator_output keyword. If your estimator does not have an appropriate .predict method, then pass _predict_fun as a last resort. For classification we assume this outputs logits.
X_val (np.ndarray (optional)) – Validation set to monitor during training
y_val (np.ndarray (optional)) –
task (str (default: "regression")) – Whether to use “regression” or “classification” as task. In the case of “classification” this estimator will output logits.
num_thresholds (int (default: 50)) – The number of thresholds per feature to pick from the forest. Ignored if the _thresholds kwarg is specified.
tt_cross_its (int (default: 5)) – Number of iterations for the TT-cross algorithm
max_rank (int (default: 10)) – Maximum rank for the tensor train
opt_steps (int (default: 100)) – Number of steps of Riemannian conjugate gradient descent to take
opt_tol (float (default: 1e-5)) – After 3 steps of no relative improvement of at least opt_tol, the Riemannian conjugate gradient descent is stopped. If X_val is supplied then error is monitored on validation set, otherwise on training set.
estimator_output (str (default: "logit")) – For classification tasks, the output of the estimator’s .predict function. Supported values are “logit” and “proba”. If the estimator has a .predict_proba method, then this is ignored and .predict_proba is used instead.
verbose (bool (default: False)) – If True, print convergence and debug information
_thresholds (list[np.ndarray] (optional)) – Use this list of thresholds instead of inferring from the forest. Should be a list of arrays, one array per feature. The last element of each array is expected to be np.inf. TODO: update
_ttls_kwargs (dict (optional)) – Keyword arguments to pass to the Riemannian conjugate gradient descent optimizer. See TensorTrainLineSearch for details.
_predict_fun (method (optional)) –
- class ttml.ttml.TTMLEstimator(estimator, task=None, max_rank=5, num_thresholds=50, tt_cross_its=5, opt_steps=100, opt_tol=1e-05, verbose=False, **kwargs)[source]
Base class for using TTML as an sklearn estimator. Use the derived classes TTMLClassifier and TTMLRegressor instead.
- class ttml.ttml.TTMLRegressor(estimator, **kwargs)[source]
Wrapper to turn TTML into an sklearn regressor.
- Parameters
X (np.ndarray) – Training X data values
y (np.ndarray) – Training truth labels. Should be 0/1 for classification.
estimator – An sklearn-like estimator for fitting. If the estimator is not yet fitted, this calls the estimator’s .fit function. For classification tasks, take note of the estimator_output keyword. If your estimator does not have an appropriate .predict method, then pass _predict_fun as a last resort. For classification we assume this outputs logits.
X_val (np.ndarray (optional)) – Validation set to monitor during training
y_val (np.ndarray (optional)) –
task (str (default: "regression")) – Whether to use “regression” or “classification” as task. In the case of “classification” this estimator will output logits.
num_thresholds (int (default: 50)) – The number of thresholds per feature to pick from the forest. Ignored if the _thresholds kwarg is specified.
tt_cross_its (int (default: 5)) – Number of iterations for the TT-cross algorithm
max_rank (int (default: 10)) – Maximum rank for the tensor train
opt_steps (int (default: 100)) – Number of steps of Riemannian conjugate gradient descent to take
opt_tol (float (default: 1e-5)) – After 3 steps of no relative improvement of at least opt_tol, the Riemannian conjugate gradient descent is stopped. If X_val is supplied then error is monitored on validation set, otherwise on training set.
estimator_output (str (default: "logit")) – For classification tasks, the output of the estimator’s .predict function. Supported values are “logit” and “proba”. If the estimator has a .predict_proba method, then this is ignored and .predict_proba is used instead.
verbose (bool (default: False)) – If True, print convergence and debug information
_thresholds (list[np.ndarray] (optional)) – Use this list of thresholds instead of inferring from the forest. Should be a list of arrays, one array per feature. The last element of each array is expected to be np.inf. TODO: update
_ttls_kwargs (dict (optional)) – Keyword arguments to pass to the Riemannian conjugate gradient descent optimizer. See TensorTrainLineSearch for details.
_predict_fun (method (optional)) –