CatBoost
Using the CatBoost model
Great quality without parameter tuning
- Reduce time spent on parameter tuning, because CatBoost provides great results with default parameters
Categorical features support
- Improve your training results with CatBoost, which lets you use non-numeric factors directly, instead of having to pre-process your data or spend time and effort turning them into numbers (a minimal sketch follows this list).
Fast and scalable GPU version
- Train your model on a fast GPU implementation of the gradient-boosting algorithm. Use a multi-card configuration for large datasets.
Improved accuracy
- Reduce overfitting when constructing your models with a novel gradient-boosting scheme.
Fast prediction
- Apply your trained model quickly and efficiently, even in latency-critical tasks, using CatBoost's model applier.
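A minimal sketch of these points on a toy DataFrame (the names num_feat, cat_feat, and target are hypothetical): the defaults alone give a working baseline, and cat_features consumes the raw string column with no encoding step.

import pandas as pd
from catboost import CatBoostRegressor

# Toy data: one numeric and one categorical feature (hypothetical values).
df = pd.DataFrame({
    'num_feat': [1.0, 2.5, 3.1, 0.7, 4.2, 2.2],
    'cat_feat': ['a', 'b', 'a', 'c', 'b', 'c'],
    'target':   [10.0, 20.0, 15.0, 8.0, 25.0, 18.0],
})

# Default parameters already give a reasonable baseline;
# cat_features lets CatBoost take the raw string column directly.
model = CatBoostRegressor(verbose=0)
model.fit(df[['num_feat', 'cat_feat']], df['target'], cat_features=['cat_feat'])
print(model.predict(df[['num_feat', 'cat_feat']]))

Passing task_type='GPU' (and, for multiple cards, devices='0-3') to the constructor switches the same code to the GPU implementation.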
Default Parameters (CatBoostRegressor)
class CatBoostRegressor(iterations=None,
learning_rate=None,
depth=None,
l2_leaf_reg=None,
model_size_reg=None,
rsm=None,
loss_function='RMSE',
border_count=None,
feature_border_type=None,
per_float_feature_quantization=None,
input_borders=None,
output_borders=None,
fold_permutation_block=None,
od_pval=None,
od_wait=None,
od_type=None,
nan_mode=None,
counter_calc_method=None,
leaf_estimation_iterations=None,
leaf_estimation_method=None,
thread_count=None,
random_seed=None,
use_best_model=None,
best_model_min_trees=None,
verbose=None,
silent=None,
logging_level=None,
metric_period=None,
ctr_leaf_count_limit=None,
store_all_simple_ctr=None,
max_ctr_complexity=None,
has_time=None,
allow_const_label=None,
one_hot_max_size=None,
random_strength=None,
name=None,
ignored_features=None,
train_dir=None,
custom_metric=None,
eval_metric=None,
bagging_temperature=None,
save_snapshot=None,
snapshot_file=None,
snapshot_interval=None,
fold_len_multiplier=None,
used_ram_limit=None,
gpu_ram_part=None,
pinned_memory_size=None,
allow_writing_files=None,
final_ctr_computation_mode=None,
approx_on_full_history=None,
boosting_type=None,
simple_ctr=None,
combinations_ctr=None,
per_feature_ctr=None,
ctr_target_border_count=None,
task_type=None,
device_config=None,
devices=None,
bootstrap_type=None,
subsample=None,
sampling_unit=None,
dev_score_calc_obj_block_size=None,
max_depth=None,
n_estimators=None,
num_boost_round=None,
num_trees=None,
colsample_bylevel=None,
random_state=None,
reg_lambda=None,
objective=None,
eta=None,
max_bin=None,
gpu_cat_features_storage=None,
data_partition=None,
metadata=None,
early_stopping_rounds=None,
cat_features=None,
grow_policy=None,
min_data_in_leaf=None,
min_child_samples=None,
max_leaves=None,
num_leaves=None,
score_function=None,
leaf_estimation_backtracking=None,
ctr_history_unit=None,
monotone_constraints=None,
feature_weights=None,
penalties_coefficient=None,
first_feature_use_penalties=None,
model_shrink_rate=None,
model_shrink_mode=None,
langevin=None,
diffusion_temperature=None,
posterior_sampling=None,
boost_from_average=None)
To see which features matter most, you can build a feature-importance table in one line:
pd.DataFrame({'feature_importance': model.get_feature_importance(train_pool), 'feature_names': x_val.columns}).sort_values(by=['feature_importance'], ascending=False)
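The one-liner assumes a fitted model, a Pool named train_pool, and a frame x_val whose columns supply the feature names. A self-contained sketch with random toy data:

import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, Pool

# Toy data; in practice x_val would be a held-out frame with the same columns as x_train.
x_train = pd.DataFrame({'f1': np.random.rand(50), 'f2': np.random.rand(50)})
y_train = x_train['f1'] * 2 + np.random.rand(50)
x_val = x_train

# A Pool bundles features, labels, and (optionally) categorical-feature indices.
train_pool = Pool(x_train, y_train)
model = CatBoostRegressor(iterations=50, verbose=0)
model.fit(train_pool)

print(pd.DataFrame({'feature_importance': model.get_feature_importance(train_pool),
                    'feature_names': x_val.columns})
        .sort_values(by=['feature_importance'], ascending=False))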
You can also wrap this in a reusable plotting function:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_feature_importance(importance, names, model_type):
    # Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    # Create a DataFrame using a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)

    # Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    # Define size of bar plot
    plt.figure(figsize=(10, 8))

    # Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])

    # Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
and use it to plot the feature importance from different boosting algorithms:
#plot the xgboost result
plot_feature_importance(xgb_model.feature_importances_,train.columns,'XG BOOST')
#plot the catboost result
plot_feature_importance(cb_model.get_feature_importance(),train.columns,'CATBOOST')
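Both calls assume models that have already been fitted; a sketch of how xgb_model and cb_model might be trained beforehand (train is the feature frame whose columns are used above, and target is a hypothetical label Series aligned with it):

from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# `train` and `target` are assumed to exist from earlier preprocessing.
xgb_model = XGBRegressor(n_estimators=100)
xgb_model.fit(train, target)

cb_model = CatBoostRegressor(iterations=100, verbose=0)
cb_model.fit(train, target)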
Putting it together for a regression task that mixes numeric and categorical features:

x1_train = train[['Numerical variable', ... , 'Categorical variable']]
y1_train = train['예측하고 싶은 값']  # '예측하고 싶은 값' = the target value you want to predict
x1_test = test[['Numerical variable', ... , 'Categorical variable']]  # features of the test set
categorical_features_indices1 = np.where(x1_train.dtypes == object)[0]  # indices of the categorical columns

from catboost import CatBoostRegressor

model1 = CatBoostRegressor(loss_function='MAE')
model1.fit(x1_train, y1_train, cat_features=categorical_features_indices1)  # tell CatBoost which features are categorical
pred1 = model1.predict(x1_test)

submission['예측하고 싶은 값'] = pred1
submission.to_csv('catcategory2.csv', index=False)
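If you carve a validation set out of the training data, the overfitting-detector / early-stopping parameters from the signature above let CatBoost pick the best iteration automatically. A hedged sketch building on the variables defined above:

from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor

# Hypothetical hold-out split of the training data from the step above.
x_tr, x_va, y_tr, y_va = train_test_split(x1_train, y1_train, test_size=0.2, random_state=42)

model = CatBoostRegressor(loss_function='MAE', iterations=2000)
model.fit(
    x_tr, y_tr,
    cat_features=categorical_features_indices1,
    eval_set=(x_va, y_va),
    early_stopping_rounds=100,  # stop if validation MAE doesn't improve for 100 rounds
    use_best_model=True,        # keep the iteration with the best validation score
    verbose=200,
)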