Logistic Regression Workflow

The XML document 'SimpleML.Samples\Resources\Logistic Regression Regularization Comparison.xml' contains a basic implementation of logistic regression, based on the second programming exercise of the Coursera Machine Learning course. The workflow reads a set of data containing the numeric results of two tests performed on a microchip, plus result data defining whether the microchip failed after manufacture. Logistic regression is used to predict whether future chips will fail or pass based on the same two test results. The data points are shown in the chart below, including the logistic regression decision boundary, which is found using a regularization parameter of 1.0...

Logistic Regression Data
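
For reference, logistic regression models each prediction by applying the sigmoid (logistic) function to a weighted sum of the features. The snippet below is a minimal sketch of the prediction step, illustrating the underlying mathematics rather than the SimpleML implementation (the class and method names are illustrative only)...

using System;

// Illustrative sketch only - not the SimpleML implementation
class LogisticHypothesisSketch
{
    // h(x) = sigmoid(theta' * x), where x includes an initial intercept term of 1
    public static double Predict(double[] theta, double[] x)
    {
        double z = 0.0;
        for (int i = 0; i < theta.Length; i++)
        {
            z += theta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.Exp(-z));  // the sigmoid function maps z to (0, 1)
    }
}

The returned value can be read as the probability of one of the two outcomes, and the decision boundary shown in the chart is the curve along which this probability is exactly 0.5.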

Additionally, the workflow demonstrates...

  • Using MMF to implement a dependency inversion design pattern
  • Modules which run a process multiple times to compare different parameter settings

Workflow Overview

The training data is first read from the file 'SimpleML.Samples\Resources\LogisticRegressionData.csv', then randomized, and polynomial features are generated using the same modules as in the linear regression workflows. The data is split into 70% training and 30% test portions using the 'Matrix Train Test Splitter' module, and the training portion is fed into the 'Multi Parameter Trainer' module. 'Multi Parameter Trainer' runs gradient descent optimization multiple times using an underlying instance of the FunctionMinimizer class, passing a different regularization parameter to the FunctionMinimizer on each run. The regularization parameters to use are defined via the module's 'RegularizationParameterSet' input slot; in this workflow, 5 different regularization parameters are used (0, 0.5, 1.0, 1.5, and 100.0). After running gradient descent with each of the 5 regularization parameters, 5 sets of optimized theta parameters are produced at the 'Multi Parameter Trainer' module's output slot. These theta parameters, together with the test portion of the initial data, are fed into the 'Multi Parameter Error Rate Calculator' module, which calculates the error rate on the test data using each of the 5 theta parameter sets (and hence each of the 5 regularization parameters). The workflow is depicted below...

Logistic Regression Regularization Comparison Module Graph
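
To make the mechanics concrete, the self-contained C# sketch below performs the same sequence of steps outside of MMF: generating degree 2 polynomial features, splitting the data into 70% training and 30% test portions, running gradient descent on the regularized logistic regression cost once per regularization parameter, and reporting the test error rate for each. The dataset is synthetic and all names are illustrative; this is a conceptual sketch of the underlying algorithm, not the module code...

using System;
using System.Linq;

// Self-contained sketch of the workflow's core steps. All class, method, and
// variable names are illustrative only - this is not MMF or SimpleML code.
class RegularizationComparisonSketch
{
    static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

    // Batch gradient descent on the regularized logistic regression cost function
    static double[] Train(double[][] X, double[] y, double lambda, double alpha = 0.1, int iterations = 5000)
    {
        int m = X.Length, n = X[0].Length;
        double[] theta = new double[n];
        for (int iteration = 0; iteration < iterations; iteration++)
        {
            double[] gradient = new double[n];
            for (int i = 0; i < m; i++)
            {
                double error = Sigmoid(X[i].Zip(theta, (xij, tj) => xij * tj).Sum()) - y[i];
                for (int j = 0; j < n; j++)
                {
                    gradient[j] += error * X[i][j] / m;
                }
            }
            for (int j = 1; j < n; j++)
            {
                gradient[j] += lambda * theta[j] / m;  // the intercept term is not regularized
            }
            for (int j = 0; j < n; j++)
            {
                theta[j] -= alpha * gradient[j];
            }
        }
        return theta;
    }

    // Proportion of examples where the 0.5-thresholded prediction disagrees with the label
    static double ErrorRate(double[][] X, double[] y, double[] theta)
    {
        int errors = 0;
        for (int i = 0; i < X.Length; i++)
        {
            double probability = Sigmoid(X[i].Zip(theta, (xij, tj) => xij * tj).Sum());
            if ((probability >= 0.5 ? 1.0 : 0.0) != y[i]) errors++;
        }
        return (double)errors / X.Length;
    }

    static void Main()
    {
        // Synthetic stand-in for the microchip test data: label 1 when the point
        // lies inside a circle, so a polynomial decision boundary is required
        var random = new Random(0);
        double[][] X = new double[100][];
        double[] y = new double[100];
        for (int i = 0; i < 100; i++)
        {
            double x1 = random.NextDouble() * 2 - 1, x2 = random.NextDouble() * 2 - 1;
            // Degree 2 polynomial features, with an initial intercept term of 1
            X[i] = new[] { 1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2 };
            y[i] = (x1 * x1 + x2 * x2 < 0.5) ? 1.0 : 0.0;
        }
        // 70% training / 30% test split (the data is already in random order)
        double[][] trainX = X.Take(70).ToArray(); double[] trainY = y.Take(70).ToArray();
        double[][] testX = X.Skip(70).ToArray(); double[] testY = y.Skip(70).ToArray();

        // Train once per regularization parameter, then evaluate on the test portion
        foreach (double lambda in new[] { 0.0, 0.5, 1.0, 1.5, 100.0 })
        {
            double[] theta = Train(trainX, trainY, lambda);
            Console.WriteLine($"Lambda = {lambda}: test error rate = {ErrorRate(testX, testY, theta):P2}");
        }
    }
}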

Uncomment the following code in the Program class to execute the workflow...

// Uncomment lines below to process the various module graphs...
//ProcessGraph("Basic Linear Regression.xml", "Basic Linear Regression without Feature Scaling", xmlDataSerializer, "Results");
//ProcessGraph("Linear Regression with Feature Scaling.xml", "Linear Regression with Feature Scaling", xmlDataSerializer, "RescaledMatrix");
//ProcessGraph("Linear Regression with Feature Scaling and Test Data Cost.xml", "Linear Regression with Feature Scaling, Cost Calculated against Test Data", xmlDataSerializer, "RescaledMatrix");
//ProcessGraph("Linear Regression with Polynomial Features.xml", "Linear Regression with Polynomial Features", xmlDataSerializer, "RescaledMatrix");
//ProcessGraph("Linear Regression with Polynomial Features and Test Data Cost.xml", "Linear Regression with Polynomial Features, Cost Calculated against Test Data", xmlDataSerializer, "RescaledMatrix");
//ProcessAndCancelGraph("Linear Regression with Polynomial Features.xml", "Cancellation of Processing", xmlDataSerializer);
ProcessLogisticRegressionGraph(xmlDataSerializer);

Code in the ProcessLogisticRegressionGraph() method writes the theta parameter sets and corresponding error rates to the console...

For theta parameters... 4.65315305153921 4.27860883307589 4.21282394334502 -12.3542735575188 -7.73125908322724 -10.4865818184096
Error rate = 14.29%
For theta parameters... 1.4364065194311 0.762817264058104 0.882012172987732 -3.24234910559426 -1.08092946644244 -2.66821930280605
Error rate = 11.43%
For theta parameters... 1.01546646282374 0.398921211688569 0.504292950226816 -2.13786185758041 -0.503349601947032 -1.68969275003891
Error rate = 5.71%
For theta parameters... 0.81573879622489 0.248912041698959 0.343415808544853 -1.6209003026182 -0.298597030972277 -1.24660410723077
Error rate = 11.43%
For theta parameters... 0.184484481892829 -0.00316288714075302 0.0029251702656624 -0.0367693432189964 -0.00193267544339699 -0.0244560353840423
Error rate = 62.86%
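
The 5 error rates above correspond, in order, to the regularization parameters 0, 0.5, 1.0, 1.5, and 100.0. The lowest test error rate (5.71%) is produced by the parameter 1.0, the value used for the decision boundary in the chart above. At the extremes, the unregularized run (parameter 0) produces large theta values and a higher test error rate, suggesting overfitting, while the heavily regularized run (parameter 100.0) drives the theta values close to zero and badly underfits.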

Dependency Inversion

The workflow demonstrates how MMF can support a dependency inversion design pattern, where the specific implementation of a class can be decided by changes to the workflow. The module 'Logistic Regression Cost Function Calculator' does not implement any 'real' functionality in its ImplementProcess() method, other than to instantiate an instance of the LogisticRegressionCostFunctionCalculator class and assign it to the module's output slot.

protected override void ImplementProcess()
{
    GetOutputSlot(costFunctionCalculatorOutputSlotName).DataValue = new SimpleML.LogisticRegressionCostFunctionCalculator();
}

This instance of the LogisticRegressionCostFunctionCalculator class is then passed to the Minimize() method of the FunctionMinimizer class. As the Minimize() method accepts any class which implements the ICostFunctionGradientCalculator interface, other modules similar to 'Logistic Regression Cost Function Calculator' could be written to instantiate and return other objects implementing ICostFunctionGradientCalculator. These modules could then be substituted into the workflow as required.
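
As a minimal sketch of such a substitute module's ImplementProcess() method, assuming the same output slot naming convention as the module above ('MyCustomCostFunctionCalculator' is a hypothetical class implementing ICostFunctionGradientCalculator, not part of SimpleML)...

protected override void ImplementProcess()
{
    // 'MyCustomCostFunctionCalculator' is hypothetical, standing in for any class
    // which implements ICostFunctionGradientCalculator
    GetOutputSlot(costFunctionCalculatorOutputSlotName).DataValue = new MyCustomCostFunctionCalculator();
}

Because the FunctionMinimizer depends only on the ICostFunctionGradientCalculator interface, swapping such a module into the workflow changes the cost function being minimized without any change to the rest of the module graph.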

Parameter Evaluation

Running a process with multiple sets of parameters in order to find the best parameter set is commonplace in machine learning. The 'Multi Parameter Trainer' and 'Multi Parameter Error Rate Calculator' modules show how MMF modules can be built to perform such parameter evaluation easily.