Prerequisites 'Evaluate solution & neighbor on GPU'

A. Evaluation of solution

Before starting any local search algorithms, you must evaluate completely the solution on CPU. As consequence and similarly to Paradiseo-MO you need a class that inherits from "eoEvalFunc" and define by the way the solution evaluation functor.


class moGPUEvalOneMax: public eoEvalFunc<EOT> {

  void operator()(EOT & _bitVector) {
    unsigned sum = 0;
    for (unsigned i = 0; i < _bitVector.size(); i++)
      sum += _bitVector[i];;

B. Evaluation of neighbors

The fitness evaluation method of a large neighborhood may influence on the performance of metaheuristics. The parallel evaluation of the neighborhood (set of neighbors) provides an efficient way to speed-up the search process. The parallel evaluation on GPU requires to take into account the GPU characteristics. The GPU computing is based on hyper-threading (massively parallel multi-threading) and the order in which the threads are executed is not known.Therefore, an efficient mapping has to be defined between each neighboring candidate solution and a thread designated by a unique identifier assigned by the GPU runtime. In our developments,we proposed a different mapping(bit-flipping, swap,K-swap ...).

The evaluation of neighbors can be classified into two ways:

  • Incremental evaluation
  • Full evaluation

It is strongly advised for more efficiency in terms of performance to use the incremental evaluation when it is possible. The Full evaluation may influence on the performance of Local search as it requires to copy a set of solutions(number of threads within a kernel) on GPU.


To implement evaluation of neighbors on GPU, you have to create a class which inherits from "moGPUFuncEval" and define by the way the neighbor evaluation functor,that should be an inline function which is accessible from the host and the takes on arguments :

_solution : the solution which generate the neighborhood. _fitness : the current fitness of the solution which generate the neighborhood _index : an array that contains a set of indexes corresponding to the current neighbor evaluated on GPU.

Taking an example of permutation neighborhood, the index array is represented like :

  • index[0] : The first index of the permutation.
  • index[1] : The second index of the permutation.
  • index[2] : The neighborhood size.
  • index[3] : The GPU thread identifier that evaluate the current neighbor

template<class Neighbor> class moGPUEvalFunc {

virtual inline __host__ __device__ Fitness operator() (T * _solution,Fitness _fitness, unsigned int * _index) {

	return _fitness;



  • T & Fitness are simultaneously the templates of a solution and the fitness type
  • The inline key-word is to notify that it is an inline function
  • The __device__ qualifier declares a function callable only from the GPU and executed on it.
  • The __host__ qualifier declares a function callable from the CPU and executed on it.

The combination between the __host__ & __device__ qualifiers declares that a function is compiled and executed for both the CPU and the GPU.


Based on a highly parallel architecture, CUDA programs work in a SIMD fashion. The threads within kernel execute the same portion of code and they are divided into equal sized threads blocks. The threads in the same thread block are executed on the same multiprocessor, and each multiprocessor on GPU can simultaneously execute many thread blocks (warp). For the One-Max problem we can define the incremental evaluation code as :

inline __host__ __device__ Fitness operator() (T & _bitVector,Fitness _fitness, unsigned int *_index){

      	Fitness tmp=_fitness;
	for(unsigned i=0;i<NB_POS;i++) {

		if (_bitVector[_index[i]] == 0)
		tmp= tmp+1;
		tmp= tmp-1;

	return tmp;