zoofs - variable selection (Python)

Aug 2021 by Francisco Juretig

Feature selection is an important part of every ML pipeline. There are many interesting libraries out there, but zoofs stands out due to the sophistication of its algorithms. It is meant to be used after we select the hyper-parameters for an ML model. Let's quickly review the workflow: for any ML regression or classification problem, we need to do four things:

  1. choose the model type
  2. tune the hyper-parameters
  3. choose the features
  4. use the model for prediction

zoofs is meant to help us with (3). In this example, we will work with a dummy dataset with 5 good features, 5 useless features (just noise), and a 0/1 target. The objective will be to evaluate how well the different algorithms implemented in zoofs work. We will first load the dataset via pandas, build the test/train sets, optimize the hyper-parameters for a Random Forest, and then feed the tuned estimator into 4 different zoofs algorithms.

Short video

Just a quick overview; the specific screenshots are shown below.

Here we load our data from a CSV file. The dataframe that we loaded is named "data". It has NUM1-NUM5 as relevant features, and rand1-rand5 as irrelevant ones (they are just noise, and a good variable selection algorithm should discard them). The target is IS_VIP_Client, which is a 1/0 variable (consequently, we will work with a classification model).
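If you prefer to follow along in plain Python instead of the panels, this step could be sketched as follows. The file name is hypothetical, and the 70/30 split is an assumption; only the column names come from the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dummy dataset: NUM1-NUM5 are informative, rand1-rand5 are pure noise
data = pd.read_csv("vip_clients.csv")  # hypothetical file name

X = data.drop(columns=["IS_VIP_Client"])
y = data["IS_VIP_Client"]  # 0/1 target, so this is a classification problem

# Hold out a test set to evaluate the feature subsets proposed by zoofs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```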

Note that we connected this panel to a panel below that actually trains a Random Forest model. Here we find the best hyper-parameters using the well-known GridSearchCV function, because the hyper-parameters need to be fixed before we do variable selection.
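In plain scikit-learn, the hyper-parameter search could look roughly like this. The parameter grid is purely illustrative, not the one used in the panel.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values actually searched in the panel may differ
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

# This tuned estimator is what we will feed into the zoofs optimizers
best_rf = grid.best_estimator_
```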



After the hyper-parameter optimization, we need to define the objective function that we will optimize. After all, zoofs is an optimization library: we need to build a function that returns a value to be optimized. In our case, we will maximize the area under the ROC curve on the test data.
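Following the pattern shown in the zoofs documentation, the objective function receives the model plus the train/validation splits (restricted to the features currently selected by the optimizer) and returns the value to optimize. A minimal sketch, assuming that five-argument signature:

```python
from sklearn.metrics import roc_auc_score

def objective_function(model, X_train, y_train, X_valid, y_valid):
    # Fit on the candidate feature subset, then score AUC on held-out data
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    return auc  # to be maximized (so we will pass minimize=False below)
```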

We will test 4 algorithms: GreyWolf, ParticleSwarm, a genetic algorithm, and the dragonfly algorithm. To run this, we just go to the panel at the top and click on >>. That runs everything connected to OUT, which in our case is everything.
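In script form, the four optimizers can be driven with the same objective function and the tuned Random Forest. This sketch follows the zoofs README as I recall it; the class names, and constructor arguments such as n_iteration and population_size, may differ slightly between versions, so check the version you have installed.

```python
from zoofs import (
    ParticleSwarmOptimization,
    GreyWolfOptimization,
    DragonFlyOptimization,
    GeneticOptimization,
)

algorithms = {
    "ParticleSwarm": ParticleSwarmOptimization,
    "GreyWolf": GreyWolfOptimization,
    "DragonFly": DragonFlyOptimization,
    "Genetic": GeneticOptimization,
}

results = {}
for name, Algo in algorithms.items():
    # minimize=False because our objective returns an AUC to be maximized
    algo = Algo(objective_function, n_iteration=20, population_size=20, minimize=False)
    algo.fit(best_rf, X_train, y_train, X_test, y_test, verbose=False)
    results[name] = algo
```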

Here we can see the two other algorithms.



We can click on the 7th icon from the left on each panel to create a replica of the output (linked in real time), and then move these replicas next to each other as we did here. This makes it easy to compare multiple models side by side. As we can see, ParticleSwarm got the best AUC, followed by DragonFly. Neither the genetic algorithm nor GreyWolf performed well, as they ended up keeping all the variables.
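The same comparison can be made programmatically by inspecting each fitted optimizer. The best_feature_list attribute is taken from the zoofs README; the attribute name may vary across versions, and the refit-and-score step below is just one way to compare the selected subsets.

```python
for name, algo in results.items():
    selected = algo.best_feature_list  # features kept by this optimizer
    # Refit the tuned Random Forest on the selected columns and score it
    best_rf.fit(X_train[selected], y_train)
    auc = roc_auc_score(y_test, best_rf.predict_proba(X_test[selected])[:, 1])
    print(f"{name:15s} kept {len(selected):2d} features, test AUC = {auc:.3f}")
```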



Remember that a good algorithm here should keep the 5 relevant variables and discard the 5 irrelevant ones.