## Problem Statement:

Given a signal, which is regularly sampled over time and is “noisy”, how can the noise be reduced while minimizing the changes to the original signal.

## Discussion:

A common problem in reconstructing data is elimination of noise. Noise can corrupt a signal through many means: quantization, measurement noise, errors in sampling time, sensor bias, sensor nonlinearities, signal cross coupling, etc. There are many methods which can be used to eliminate the noise on a signal. Some common approaches include use of a linear filter, Kalman filtering, Wiener filtering, construction of a custom optimization problem, and any number of ad-hoc approaches. Furthermore, the filtering of the signal can be done causally or noncausally. In a causal filter, the filtering for a particular sample if determined based only on the sample received before that sample. In acausal or noncausal filtering, all of the data before and after the sample can be used to determine the ‘best’ value for each sample.
One way to quickly filter a dataset without much effort is to use a Fourier transform. A Fourier transform is a way to decompose a signal into a sum of sine waves. The amplitude and phase associated with each sine wave is known as the spectrum of a signal. If the spectrum of the noise if away from the spectrum of the original signal, then original signal can be filtered by taking a Fourier transform, filtering the Fourier transform, then using the inverse Fourier transform to reconstruct the signal.
Lets start with a  simple example. Consider a signal consisting of a single sine wave, $$s(t)=sin(w*t)$$. Let the signal be subject to white noise which is added in during measurement, $$s_{measured}(t)=s(t)+n$$. Let the $$F$$ be the Fourier transform of $$s$$. Now by setting the value of $$F$$ to zero for frequencies above and below $$w$$, the noise is reduced. Let $$F_{filtered}$$ be the filtered Fourier transform. Taking the inverse Fourier transform of $$F_{filtered}$$ yields $$s_{filtered}(t)$$. The code sample which follows illustrates these operations.

 """
Script to demonstrate the use of Fourier Transforms to acausally filter
a signal.

"""

__author__ = 'Ed Tate'
__email__  = 'edtategmail-dot-com'
__website__ = 'exnumerus.blogspot.com'

from pylab import *

# setup the problem
num_samples  = 1000 # number of samples

# generate an ideal signal
f_signal  = 6   # signal frequency  in Hz
dt = 0.01 # sample timing in sec
p  = 30   # 30 degrees of phase shift
a  = 1    # signal amplitude
s = [a*sin((2*pi)*f_signal*k*dt) for k in range(0,num_samples)]
s_time = [k*dt for k in range(0,num_samples)]

# simulate measurement noise
from random import gauss
mu = 0
sigma = 2
n = [gauss(mu,sigma) for k in range(0,num_samples)]

# measured signal
s_measured = [ss+nn for ss,nn in zip(s,n)]

# take the fourier transform of the data
F = fft(s_measured)

# calculate the frequencies for each FFT sample
f = fftfreq(len(F),dt)  # get sample frequency in cycles/sec (Hz)

# filter the Fourier transform
def filter_rule(x,freq):
band = 0.05
if abs(freq)>f_signal+band or abs(freq)<f_signal-band:
return 0
else:
return x

F_filtered = array([filter_rule(x,freq) for x,freq in zip(F,f)])

# reconstruct the filtered signal
s_filtered = ifft(F_filtered)

# get error
err = [abs(s1-s2) for s1,s2 in zip(s,s_filtered) ]

figure()
subplot(4,1,1)
plot(s_time,s)
ylabel('Original Signal')
xlabel('time [s]')

subplot(4,1,2)
plot(s_time,s_measured)
ylabel('Measured Signal')
xlabel('time [s]')

subplot(4,1,3)
semilogy(f,abs(F_filtered),'or')
semilogy(f,abs(F),'.')
legend(['Filtered Spectrum','Measured Spectrum',])
xlabel('frequency [Hz]')

subplot(4,1,4)
plot(s_time,s_filtered,'r')
plot(s_time,s,'b')
legend(['Filtered Signal','Original Signal'])
xlabel('time [s]')

show()


The output from this script shows the initial signal, followed by the measured signal which is corrupted with noise. The spectrum of this signal is filtered to recover the original signal.

Test Configuration:
• win7
• PythonXY 2.7.2.1

## Problem Statement

How do you validate a model of a system against a physical system when a controller is necessary to make the system operate and the the operational policies of the controllers were developed independently.

## Discussion

Consider a development process with a well defined operation cost model which drives both the model and physical system to optimization that operations cost. If the model uses full state feedback for control and the physical system uses either full state feedback or implements output feedback with state estimators and relies on certainty equivalence, there is no guarantee that under identical test conditions the trajectories of the cost and the trajectories of the states will be identical even if both system are optimal with respect to the operational costs. This is because optimal operational costs are unique, but there are no guarantees that optimal tratectories are unique. Furthermore, if the controls in both cases are not globally optimal, by only near optimal, then likelihood of non-unique trajectories is even more likely. However, because the operational costs can be unique, the validation exercise can be decomposed into two validation steps.

First, the equations which model the physics can be validated against test data on the physical system by measuring the states in the real system, then substituting the integrator in the model with the state measurements. Ideally, the physical system could execute both policies. The error in costs, and derivative calculations can be compared to quantify the error between the model of the physics and the real physics.

Second, once the errors in the model of the physics are quantified, the error in the costs under the different controllers can be quantified. Ideally, from this step, the optimality of the controls wrt to a globally optimal controller (or minimizing controller) can be established. Once interesting possibility is to use policy improvement to see if the independently developed policies can merged for better performance. Alternatively, if there are unexplained differences, then the constraints respected by the different policies need to be reconciled. Things like robustness may also contribute to differences. Robustness will in general be a driven by different view of noise and risk sensitivity. Following on that, there is the possibility that the equation structure in the different policies lead to different performance limitations. Again, policy improvement may provide a way to identify these structure imposed limitations.

## Wednesday, August 3, 2011

### Random Notes: Thoughts on metrics for engineering tools

• If training is required for successful usage, what is the average half-life of the training. In other words, if a group of users is trained, how long until 50% of the users will forget some key aspect of tool usage which drives them to abandon the tool?
• What is the average time for user to need to go to the help files to complete a task if they do not use the tool constantly?
• Can a user successfully use the tool without training?
• Are the documentation and examples sufficient for self learning?
• How many actions are required to complete a ‘quickstart’ example?
• How many decisions are required to complete a ‘quickstart’ example?
• How many choices are the in each decision in a typical workflow?
• How difficult is it to integrate the tool into automated work flows?
• How difficult is it to customize the tool?
• Can a power user customize the tool?
• How long does it take to introduce a new feature in the tool?
• How many sentences does it take to describe why a user should adopt the tool?
• In the absence of process enforcement, would the users naturally adopt this solution?
• What is the time saving for the individual, team, and organization from the adoption of the tool?
• If the tool reduces error rates, is there feedback to the users to help them understand the improvement?
• Can the input and output to the tool be reused so that the effort can be reapplied?
• What is the ‘activation potential’ to get a new user to adopt the tool?
• In a corporate setting, how difficult are the permissions to manage?
• If a new user if not setup, will the team be able to duplicate the permissions with without calling the developers?

## Deterministic Cost Models

 Description Cost Model Dynamic Programming Equations Restrictions Finite Horizon Total Cost $$J^{\pi}\left(x_{0}\right)=\sum_{k=0}^{K}\alpha^{k}\cdot c_{k}\left(x_{k},\pi\left(x_{k}\right)\right)$$ $$V_{k}^{\pi}\left(x\right)=c_{k}\left(x,\pi\left(x\right)\right)+\alpha\cdot V_{k+1}^{\pi}\left(f\left(x,\pi\left(x\right)\right)\right)$$,$$\forall k\in\left\{ 0,\cdots,K-1\right\}$$ $$V_{K}^{\pi}\left(x\right)=c_{K}\left(x,\pi\left(x\right)\right)$$ $$0\leq\alpha<1$$ Infinite Horizon Total Cost $$J^{\pi}\left(x_{0}\right)=\sum_{k=0}^{\infty}\alpha^{k}\cdot c\left(x_{k},\pi\left(x_{k}\right)\right)$$ $$V^{\pi}\left(x\right)=c\left(x,\pi\left(x\right)\right)+\alpha\cdot V^{\pi}\left(f\left(x,\pi\left(x\right)\right)\right)$$ $$0\leq\alpha<1$$ Finite Horizon Shortest Path $$J^{\pi}\left(x_{0}\right)=\sum_{k=0}^{K}\alpha^{k}\cdot c_{k}\left(x_{k},\pi\left(x_{k}\right)\right)$$ $$V_{k}^{\pi}\left(x\right)=c_{k}\left(x,\pi\left(x\right)\right)+\alpha\cdot V_{k+1}^{\pi}\left(f\left(x,\pi\left(x\right)\right)\right)$$,$$\forall k\in\left\{ 0,\cdots,K-1\right\}$$ $$V_{K}^{\pi}\left(x\right)=c_{K}\left(x,\pi\left(x\right)\right)$$ $$0\leq\alpha\leq1$$ $$\left\{ x\in\chi|c\left(x,\pi\left(x\right)\right)=0\right\} \neq\left\{ \oslash\right\}$$ Infinite Horizon Shortest Path $$J^{\pi}\left(x_{0}\right)=\sum_{k=0}^{\infty}\alpha^{k}\cdot c\left(x_{k},\pi\left(x_{k}\right)\right)$$ $$V^{\pi}\left(x\right)=c\left(x,\pi\left(x\right)\right)+\alpha\cdot V^{\pi}\left(f\left(x,\pi\left(x\right)\right)\right)$$ $$0\leq\alpha\leq1$$ $$\left\{ x\in\chi|c\left(x,\pi\left(x\right)\right)=0\right\} \neq\left\{ \oslash\right\}$$ Average Cost $$J^{\pi}\left(x_{0}\right)=\underset{K\rightarrow\infty}{\lim}\frac{1}{K}\sum_{k=0}^{K}\alpha^{k}\cdot c\left(x_{k},\pi\left(x_{k}\right)\right)$$ $$V^{\pi}\left(x\right)+\lambda=c\left(x,\pi\left(x\right)\right)+V^{\pi}\left(f\left(x,\pi\left(x\right)\right)\right)$$ $$0\leq\alpha<1$$ $$V^{\pi}\left(x_{ref}\right)=0$$ for some $$x_{ref}\in\chi$$

## Stochastic Cost Models

 Description Cost Model Dynamic Programming Equations Restrictions Finite Horizon Total Cost $$J^{\pi}\left(x_{0}\right)=E^{W}\left[\sum_{k=0}^{K}\alpha^{k}\cdot c_{k}\left(x_{k},\pi\left(x_{k}\right),w\right)\right]$$ $$V_{k}^{\pi}\left(x\right)=E^{W}\left[c_{k}\left(x,\pi\left(x\right),w\right)+\alpha\cdot V_{k+1}^{\pi}\left(f\left(x,\pi\left(x\right),w\right)\right)\right]$$ $$V_{K}^{\pi}\left(x\right)=E^{W}\left[c_{K}\left(x,\pi\left(x\right)\right)\right]$$ $$0\leq\alpha<1$$ Infinite Horizon Total Cost $$J^{\pi}\left(x_{0}\right)=E^{W}\left[\sum_{k=0}^{\infty}\alpha^{k}\cdot c\left(x_{k},\pi\left(x_{k}\right),w\right)\right]$$ $$V^{\pi}\left(x\right)=E^{W}\left[c\left(x,\pi\left(x\right),w\right)+\alpha\cdot V^{\pi}\left(f\left(x,\pi\left(x\right),w\right)\right)\right]$$ $$0\leq\alpha<1$$ Finite Horizon Shortest Path $$J^{\pi}\left(x_{0}\right)=E^{W}\left[\sum_{k=0}^{K}\alpha^{k}\cdot c_{k}\left(x_{k},\pi\left(x_{k}\right),w\right)\right]$$ $$V_{k}^{\pi}\left(x\right)=E^{W}\left[c_{k}\left(x,\pi\left(x\right),w\right)+\alpha\cdot V_{k+1}^{\pi}\left(f\left(x,\pi\left(x\right),w\right)\right)\right]$$ $$V_{K}^{\pi}\left(x\right)=E^{W}\left[c_{K}\left(x,\pi\left(x\right)\right)\right]$$ $$0\leq\alpha\leq1$$ $$\left\{ x\in\chi|c\left(x,\pi\left(x\right)\right)=0\right\} \neq\left\{ \oslash\right\}$$ Infinite Horizon Shortest Path $$J^{\pi}\left(x_{0}\right)=E^{W}\left[\sum_{k=0}^{\infty}\alpha^{k}\cdot c\left(x_{k},\pi\left(x_{k}\right),w\right)\right]$$ $$V^{\pi}\left(x\right)=E^{W}\left[c\left(x,\pi\left(x\right),w\right)+\alpha\cdot V^{\pi}\left(f\left(x,\pi\left(x\right),w\right)\right)\right]$$ $$0\leq\alpha\leq1$$ $$\left\{ x\in\chi|c\left(x,\pi\left(x\right)\right)=0\right\} \neq\left\{ \oslash\right\}$$ Average Cost $$J^{\pi}\left(x_{0}\right)=E^{W}\left[\underset{K\rightarrow\infty}{\lim}\frac{1}{K}\sum_{k=0}^{K}\alpha^{k}\cdot c\left(x_{k},\pi\left(x_{k}\right),w\right)\right]$$ $$V^{\pi}\left(x\right)+\lambda=E\left[c\left(x,\pi\left(x\right),w\right)+V^{\pi}\left(f\left(x,\pi\left(x\right),w\right)\right)\right]$$ $$0\leq\alpha<1$$ $$V^{\pi}\left(x_{ref}\right)=0$$ for some $$x_{ref}\in\chi$$

## Risk Aware/Averse Stochastic Cost Models

 Description Cost Model Dynamic Programming Equations Restrictions Certainty Equivalence with exponential utility $$J^{\pi}\left(x_{0}\right)=\underset{K\rightarrow\infty}{\limsup}\frac{1}{K}\cdot\frac{1}{\gamma}\cdot\ln\left(E^{W}\left[\exp\left(\sum_{k=0}^{K-1}c\left(x,\pi\left(x\right),w\right)\right)\right]\right)$$ Mean-Variance

## Cost Models That don’t work or have issues

 Description Cost Model Issues Expected exponential disutility $$J^{\pi}\left(x_{0}\right)=\underset{K\rightarrow\infty}{\limsup}\frac{1}{K}\cdot E^{W}\left[\textrm{sgn}\left(\gamma\right)\cdot\exp\left(\gamma\cdot\sum_{k=0}^{K-1}c\left(x,\pi\left(x\right),w\right)\right)\right]$$ Does not discriminate among policies Different version of expected exponential disutility $$J^{\pi}\left(x_{0}\right)=\underset{K\rightarrow\infty}{\limsup}\frac{1}{\gamma}\cdot\log\left(E^{W}\left[\exp\left(\gamma\cdot\frac{\gamma}{K}\sum_{k=0}^{K-1}c\left(x,\pi\left(x\right),w\right)\right)\right]\right)$$ Generally reduces to cost average

## Sunday, July 31, 2011

### Notes on Policy Improvement and Controlled Dynamic Systems

A controlled dynamic system has inputs which can steer the evolution of the state of the system.

 $$x_{k+1}=f\left(x_{k},u_{k}\right)$$ (1)

The inputs to the dynamic system can be determined by a policy, $$\pi\( that maps the state of a dynamic system to an input of the dynamic system. This policy makes the controlled dynamic system behave like an autonomous dynamic system.  \(x_{k+1}=f\left(x_{k},\pi\left(x_{k}\right)\right)=\widetilde{f}\left(x_{k}\right)$$ (2)

Given a cost of operation for the dynamic system,

 $$J\left(x_{0}\right)=\sum_{k=0}^{\infty}\left(\alpha^{k}\cdot c\left(x_{k}\right)\right)$$, (3)

a value function which is a function of the control policy and the initial state can be found using a variation of dynamic programming. This value function is

 $$V^{\pi}\left(x,\pi\left(x\right)\right)=c\left(x,\right)+\alpha^{k}\cdot V^{\pi}\left(f\left(x,\pi\left(x\right)\right)\right)$$ (4)

One engineering challenge with a controlled dynamic system is optimizing its performance. Policy improvement provides some insight into how to incrementally improve a policy. The key idea in policy improvement, is that if a change can be made in the policy that improves the immediate and future operational costs, then this change improves the policy. If

 $$c\left(x,u\right)+\alpha^{k}\cdot V^{\pi}\left(f\left(x,u\right)\right)\leq V^{\pi}\left(x\right)$$ (5)

then the choice of $$u$$ at $$x$$ is an improvement on the policy $$\pi$$ and will reduce the operating costs.

## Other key ideas:

### Notes on Dynamic Systems and the Value Function

Dynamic systems can be described by differential and difference equations. Without a loss of generality, consider a dynamic system represented by a difference equation: $$x_{k+1}=f\left(x_{k}\right)$$. The state of the system is represented by $$x$$ and the function $$f$$ is the mapping from one state to the next. One way to characterize a dynamic system is with an additive cost: $$J$$. An additive cost summarizes the cost of operation for the system from some initial state, $$x_{0}$$. To ensure that the sums are finite an exponential weighting factor, $$alpha$$, is introduced. This factor has a value between between 0 and 1.  Under some circumstances, $$alpha$$ can be equal to one. One special case is where the system is guaranteed to eventually enter a zero cost state. However, in general, it will need to be less than one. This additive cost is a function of each initial condition:

 $$J\left(x_{0}\right)=\sum_{k=0}^{\infty}\left(\alpha^{k}\cdot c\left(x_{k}\right)\right)$$. (1)

The value of the additive cost can be solved using the dynamic programming equations:

 $$V\left(x\right)=c\left(x\right)+\alpha^{k}\cdot V\left(f\left(x\right)\right)$$. (2)

The function $$V$$ is referred to as the value function.

## Wednesday, July 27, 2011

### Organizational Incentives and Responsibilities: An Observation

In a large organization, properly aligning incentives and responsibilities is always a challenge. However, sometimes it all comes together like a grand plan. Saw a situation today that demonstrated this. A year ago, a particular facility was very undesirable:  old paint, old carpets, intermittent wireless, worn furniture, etc. In a large organization, fixing mundane issues like this can be a royal challenge: forms, approvals, policy, and more. By chance (at least to a distant observer), the machinery of the organization moved people responsible for facilities into the area. One year later… new paint, new carpets, new furniture. The other groups in the area had their productivity and satisfaction improved. Responsibilities met incentives and action was taken.

## Problem Statement:

Where is a grammar for Octave?

## Discussion:

Octave implements an open source language which is very close to Matlab’s language. Octave is not an exact clone, but is close enough to make conversion of Matlab scripts into Octave relatively easy. Since this grammar is defined and available in an open source form, it can be reused in other projects.

See:

• Where can I find a Matlab grammar

## Problem Statement:

Can using OpenCL with Python significantly improve the performance of the Ellipse Fit problem? This problem was introduced here and improved on here using vectorization, and here using parallelization.

## Discussion:

OpenCL is an open, royalty free, cross platform standard for parallel programming on heterogeneous systems. PyOpenCL is a wrapper for OpenCL which allows blending an OpenCL program into a Python script. PyOpenCL requires some setup to incorporate into a script. Any OpenCL program must be configured to work with a Python script. In general, a text representation of the OpenCL function (work task) is created. The device which will execute the program needs to be selected. The function needs to have input and output buffers defined to move data between the main program and the device which will execute the work task. Before the work task is called, the input buffers must be filled with data. The text representation of the work task must be compiled before execution. Finally, the function is called and it loads data into the output buffers when it completes.

One advantage of OpenCL is the ability to move a program among different devices for execution. One of the disadvantages is that there are multiple tuning aspects to tune which can affect performance. The best configuration for one device is usually not optimal for other devices.

## Testing Setup:

To recreate these experiments, recreate the test setup here, then add the objective function from here. Then add the new function objective_openCL.py. Finally, overwrite the main.py script with the one here.

objective_openCL.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations.

'''

from numpy import *
import pyopencl as cl
from string import Template

# local imports
import objective_scalar

class Objective(objective_scalar.Objective):

def __init__(self,parameters):
objective_scalar.Objective.__init__(self,parameters)

precision_dict = {
'float':float32,
'double':float64}

self.platform_name      = parameters.get('platform_name',None)
self.platform_precision = parameters.get('platform_precision','double')
self.partial_sum_loc    = parameters.get('partial_sum_loc','__global')

self.numpy_prec = precision_dict[self.platform_precision]
openCL_prec = self.platform_precision

point_list = self._p

x,y  = zip(*point_list)
self.x  = array(x,dtype=self.numpy_prec)
self.y  = array(y,dtype=self.numpy_prec)
self.f1 = array([0,0],copy=True,dtype=self.numpy_prec)
self.f2 = array([0,0],copy=True,dtype=self.numpy_prec)
self.a  = array([0],copy=True,dtype=self.numpy_prec)

self.num_elements = len(x)

platforms = cl.get_platforms()
if len(platforms)>1 and not self.platform_name is None:
# select the the first device in list
for platform in platforms:
if self.platform_name in platform.name:
# use first device for context
for device in platform.get_devices():
self.ctx = cl.Context([device])
self.device = device
break
break
else:
self.ctx = cl.create_some_context()

if 'NVIDIA' in self.platform_name:
else:
compiler_options = ''

#self.ctx = cl.create_some_context()
self.queue = cl.CommandQueue(self.ctx)

# define the input buffers and transfer data into them
mf = cl.mem_flags
self.x_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.x)
self.y_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.y)
self.f1_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f1)
self.f2_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f2)
self.a_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a)

# define the output buffer

# prep the container to the output
self.partial_sum = empty(self.x.shape,dtype=self.numpy_prec)
self.full_sum    = empty(array([0],dtype=self.numpy_prec).shape,dtype=self.numpy_prec)

# define local memory if needed
if not self.partial_sum_loc=='__global':
self.local_partial_sum = cl.LocalMemory(self.x.nbytes)

code_pattern= Template("""
#pragma OPENCL EXTENSION cl_khr_fp64: enable
__kernel void partial_sum(
__global const ${precision} *x, __global const${precision} *y,
__global const ${precision} *f1, __global const${precision} *f2,
__global const ${precision} *a,${partial_sum_loc} ${precision} *partial_sum) { int gid = get_global_id(0);${precision} x_f1_diff = x[gid]-f1[0];
${precision} x_f2_diff = x[gid]-f2[0];${precision} y_f1_diff = y[gid]-f1[1];
${precision} y_f2_diff = y[gid]-f2[1];${precision} x_f1_diff_sq = x_f1_diff*x_f1_diff;
${precision} x_f2_diff_sq = x_f2_diff*x_f2_diff;${precision} y_f1_diff_sq = y_f1_diff*y_f1_diff;
${precision} y_f2_diff_sq = y_f2_diff*y_f2_diff;${precision} norm_pt_to_f1 = sqrt(x_f1_diff_sq+y_f1_diff_sq);
${precision} norm_pt_to_f2 = sqrt(x_f2_diff_sq+y_f2_diff_sq);${precision} t1 = norm_pt_to_f1+norm_pt_to_f2-2*a[0];

partial_sum[gid]=t1*t1;
}

__kernel void full_sum(
${partial_sum_loc}${precision} *partial_sum,
__global ${precision} *full_sum) { long i;${precision} sum=0;

for(i=0; i<${num_elements}; i++) { sum += partial_sum[i]; } full_sum[0] = sum; } __kernel void full_sum2( __global${precision} *partial_sum)
{
int gid = get_global_id(0);
long i;
long mid_point = (${num_elements}+1)/2; if (gid>mid_point) { return; } else { partial_sum[gid]+=partial_sum[gid+mid_point]; } } """) substitutions = {'precision':openCL_prec, 'num_elements':self.num_elements, 'partial_sum_loc':self.partial_sum_loc} prg = code_pattern.substitute(substitutions) # define the function to be executed self.prg = cl.Program(self.ctx, prg).build(options=compiler_options) def __call__(self,x): ''' Calculate the objective cost in the optimization problem. ''' a = x[0] self.a = array([x[0]],copy=False,dtype=self.numpy_prec) self.f1 = array([x[1],x[2]],copy=False,dtype=self.numpy_prec) self.f2 = array([x[3],x[4]],copy=False,dtype=self.numpy_prec) # trigger transfer of values to the GPU buffers # only those values which changed before this evaluation need to be updated mf = cl.mem_flags self.f1_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f1) self.f2_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f2) self.a_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a) # execute the partial_sum function if self.partial_sum_loc == '__global': self.prg.partial_sum(self.queue, self.x.shape, None, self.x_buf,self.y_buf,self.f1_buf,self.f2_buf,self.a_buf,self.partial_sum_buf) # trigger the copy of the results into the partial_sum container after completion cl.enqueue_read_buffer(self.queue, self.partial_sum_buf, self.partial_sum).wait() # execute the full_sum function self.prg.full_sum(self.queue, (1,), None, self.partial_sum_buf,self.full_sum_buf) else: self.prg.partial_sum(self.queue, self.x.shape, (8,), self.x_buf,self.y_buf,self.f1_buf,self.f2_buf, self.a_buf,self.local_partial_sum) self.prg.full_sum(self.queue, (1,), None, self.local_partial_sum,self.full_sum_buf) # trigger the copy of the results into the partial_sum container after completion cl.enqueue_read_buffer(self.queue, self.full_sum_buf, self.full_sum).wait() # compare sums #print (sum(self.full_sum)-sum(self.partial_sum)) #print (sum(self.full_sum),sum(self.partial_sum)) # finish up the calculations point_list = self._p n = float(len(point_list)) _lambda =0.1 #full_sum = sum(self.partial_sum) total = (self.full_sum/float(n) + _lambda*self._ahat_max*self._sigma*exp((a/self._ahat_max)**4)) return total def _del_(self): pass if __name__=='__main__': import time from random import uniform,normalvariate,seed # local modules import ellipse #################################################################### # setup test conditions num_reps = 1 num_pts = 1e6 precision = 'float' seed(1234567890) # set the random generator to get repeatable results # setup the reference ellipse # define the foci locations foci1_ref = array([2,-1]) foci2_ref = array([-2,1]) # define distance from foci to ellipse circumference a_ref = 2.5 point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref) parameters = { "point_list" : point_list, "platform_name" : 'NVIDIA', "platform_precisions" : precision} # test the function on an NVIDIA card t0 = time.time() my_objective = Objective(parameters) x = my_objective.x0 t1 = time.time() for i in range(0,num_reps): y = my_objective(x) t2 = time.time() print '' print 'NVIDIA initialization took %f sec' % (t1-t0) print 'NVIDIA execution took %f sec' % (t2-t1) print 'Executed %i times on %s' % (num_reps,my_objective.device.name) print '' # test the function on an intel CPU parameters = { "point_list" : point_list, "platform_name" : 'Intel', "platform_precisions" : precision} t0 = time.time() my_objective = Objective(parameters) x = my_objective.x0 t1 = time.time() for i in range(0,num_reps): y = my_objective(x) t2 = time.time() print '' print 'Intel initialization took %f sec' % (t1-t0) print 'Intel execution took %f sec' % (t2-t1) print 'Executed %i times on %s' % (num_reps,my_objective.device.name) print ''  main.py ''' This module tests the performance of different ways of forming the ellipse fit objective function and its impact on the ellipse fit optimization problem. ''' __author__ = 'Ed Tate' __email__ = 'edtate<at>gmail-dot-com' __website__ = 'exnumerus.blogspot.com' __license__ = 'Creative Commons Attribute By - http://creativecommons.org/licenses/by/3.0/us/''' import time from random import uniform, normalvariate, seed from numpy import * import pylab as plt from openopt import NLP import gc # local modules import ellipse import objective_scalar import objective_vectorized import objective_vectorized_parallel import objective_openCL #################################################################### # setup test conditions num_reps = 2 seed(1234567890) # set the random generator to get repeatable results # setup the reference ellipse # define the foci locations foci1_ref = array([2,-1]) foci2_ref = array([-2,1]) # define distance from foci to ellipse circumference a_ref = 2.5 ###################################################### # setup the OpenCL Objective classes for CPU and GPU verions of the # OpenCL version of the objective function class objective_openCL_GPU_Objective(objective_openCL.Objective): def __init__(self,parameters): parameters["platform_name"]="NVIDIA" objective_openCL.Objective.__init__(self,parameters) class objective_openCL_CPU_Objective(objective_openCL.Objective): def __init__(self,parameters): parameters["platform_name"]="Intel" objective_openCL.Objective.__init__(self,parameters) ##################################################### # objective_functions = [objective_scalar.Objective, objective_vectorized.Objective, objective_vectorized_parallel.Objective, objective_openCL_GPU_Objective, objective_openCL_CPU_Objective] legends = ['scalar','vectorized','vectorized parallel', 'OpenCL on NVIDIA GPU', 'OpenCL on Intel i7 CPU'] ###################################################### # measure the timing to setup and call the objective function num_pts_list = [10,100,250,500,1000,2500,5000,10000,50000,1e5,1e6] #num_pts_list = [10,100,250] setup_times = [] call_times = [] for num_pts in num_pts_list: seed(1234567890) # set the random generator to get repeatable results point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref) parameters = { "point_list" : point_list } test_setup_times = [] test_call_times = [] for objective_function in objective_functions: t0 = time.time() my_objective = objective_function(parameters) x0 = my_objective.x0 t1 = time.time() for i in range(0,num_reps): y = my_objective(x0) t2 = time.time() test_setup_times.append(t1-t0) test_call_times.append(t2-t1) del my_objective setup_times.append(test_setup_times) call_times.append(test_call_times) print 'Evaluated %i points' % num_pts # plot the results plt.figure() plt.loglog(num_pts_list,call_times,lw=2) plt.legend(legends,loc='upper left') plt.xlabel('Number of points') plt.ylabel('Execution Time [sec]') plt.title('Times for %i calls to objective function' % num_reps) plt.grid(True) ############################################################# # test the performance when used in an optimizer num_pts_list = [10,100,250,500,1000,2500,5000] #num_pts_list = [10,100,250] setup_times = [] call_times = [] for num_pts in num_pts_list: seed(1234567890) # set the random generator to get repeatable results point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref) parameters = { "point_list" : point_list } test_setup_times = [] test_call_times = [] for objective_function in objective_functions: t0 = time.time() my_objective = objective_function(parameters) x0 = my_objective.x0 p_vectorized = NLP(my_objective, x0, iprint = -1) t1 = time.time() opt_result = p_vectorized.solve('ralg') t2 = time.time() test_setup_times.append(t1-t0) test_call_times.append(t2-t1) print '%s : %s' % (opt_result.xf,objective_function) del my_objective gc.collect() # force release of resources held by the objective setup_times.append(test_setup_times) call_times.append(test_call_times) print 'Evaluated %i points' % num_pts print # plot the results plt.figure() plt.loglog(num_pts_list,call_times,lw=2) plt.legend(legends,loc='upper left') plt.xlabel('Number of points') plt.ylabel('Execution Time [sec]') plt.title('Times to solve the ellipse fit problem') plt.grid(True) plt.show() ## Testing results: Figure 1 shows the difference in time to call the objective function for the various methods of implementing the objective function. From this chart, the worst performing method is to implement the objective function as a scalar. The best performing implementation is to implement the function using OpenCL on the CPU. This variation in performance is not as stark when the objective function is used in the Ellipse Fit problem. In fact, because the ellipse fit problem measures data transfer times and computation times, the ordering of the performance changes. When just measuring the time to evaluate the objective function, the increasing order of performance is scalar, parallelized & vectorized, vectorized, OpenCL on GPU, and the best performance is OpenCL on the CPU. Furthermore, as the problem scale changed, the relative performance ratios stay relatively constant.When the ellipse fit problem is solved, the ordering changes and the relative slopes change. One reason for this is the overhead associated with the optimization routines. Another reason is the increased time which includes time to transfer data into the GPU. Another interesting result is that the execution times do not change much as the problem size increases. This is probably due to the efficiency of the computation and the efficiency of block transfers of data. Figure 2 One other thing to note is that these different methods of computation have different resolutions in the floating point numbers. Because of these differences in the resolution, the optimization problem finds slightly different answers. The console output shows the differences in the results of the optimization. In general, for the larger problems, the difference are about one part in 10,0000. Console Output: Evaluated 10 points Evaluated 100 points Evaluated 250 points Evaluated 500 points Evaluated 1000 points Evaluated 2500 points Evaluated 5000 points Evaluated 10000 points Evaluated 50000 points Evaluated 100000 points Evaluated 1000000 points [ 2.53977781 -1.9791083 1.11795369 1.90068376 -0.97398384] : <class 'objective_scalar.Objective'> [ 2.53978396 -1.97911311 1.11797418 1.90068553 -0.97399285] : <class 'objective_vectorized.Objective'> [ 2.53980864 -1.9791325 1.11805566 1.90069183 -0.97402764] : <class 'objective_vectorized_parallel.Objective'> [ 2.53972748 -1.97906577 1.1177896 1.90067428 -0.9739137 ] : <class '__main__.objective_openCL_GPU_Objective'> [ 2.53977167 -1.97910332 1.11793354 1.90068234 -0.97397521] : <class '__main__.objective_openCL_CPU_Objective'> Evaluated 10 points [ 2.566147 -1.95135142 1.0549304 2.16680656 -1.10160299] : <class 'objective_scalar.Objective'> [ 2.56614697 -1.95134842 1.05492776 2.16681334 -1.10160238] : <class 'objective_vectorized.Objective'> [ 2.56614691 -1.95134982 1.05492914 2.16680969 -1.1016027 ] : <class 'objective_vectorized_parallel.Objective'> [ 2.56614661 -1.95134401 1.05492394 2.16682259 -1.10160128] : <class '__main__.objective_openCL_GPU_Objective'> [ 2.56614672 -1.95134937 1.05492891 2.16680991 -1.10160258] : <class '__main__.objective_openCL_CPU_Objective'> Evaluated 100 points [ 2.56330383 -2.0896741 1.06288154 2.04214077 -1.05030978] : <class 'objective_scalar.Objective'> [ 2.5633038 -2.08967399 1.06288158 2.04214065 -1.05031 ] : <class 'objective_vectorized.Objective'> [ 2.56330231 -2.08967161 1.06288221 2.04213788 -1.05031417] : <class 'objective_vectorized_parallel.Objective'> [ 2.56330174 -2.08967069 1.0628825 2.04213675 -1.05031584] : <class '__main__.objective_openCL_GPU_Objective'> [ 2.56330119 -2.08966978 1.0628827 2.04213573 -1.05031742] : <class '__main__.objective_openCL_CPU_Objective'> Evaluated 250 points [ 2.55213105 -2.06212704 1.05749405 2.02108216 -1.05369719] : <class 'objective_scalar.Objective'> [ 2.55390449 -2.0623902 1.05823288 2.02404738 -1.05682491] : <class 'objective_vectorized.Objective'> [ 2.54907192 -2.05769363 1.05956421 2.01575473 -1.05475975] : <class 'objective_vectorized_parallel.Objective'> [ 2.55212491 -2.06209759 1.05747762 2.02105559 -1.05370312] : <class '__main__.objective_openCL_GPU_Objective'> [ 2.5518505 -2.06217932 1.05803857 2.02239691 -1.05392966] : <class '__main__.objective_openCL_CPU_Objective'> Evaluated 500 points [ 2.61796899 -2.14539556 1.11170214 2.12027043 -1.09253186] : <class 'objective_scalar.Objective'> [ 2.61708437 -2.14479931 1.10835605 2.12087692 -1.090721 ] : <class 'objective_vectorized.Objective'> [ 2.61761133 -2.14669529 1.1089428 2.12139812 -1.09022015] : <class 'objective_vectorized_parallel.Objective'> [ 2.61753131 -2.14662822 1.10869732 2.12152618 -1.09010436] : <class '__main__.objective_openCL_GPU_Objective'> [ 2.61763094 -2.14672541 1.10895434 2.12136288 -1.09017078] : <class '__main__.objective_openCL_CPU_Objective'> Evaluated 1000 points [ 2.60471429 -2.11563461 1.06998647 2.13374471 -1.0697982 ] : <class 'objective_scalar.Objective'> [ 2.60416974 -2.11432462 1.07083334 2.13291023 -1.07026969] : <class 'objective_vectorized.Objective'> [ 2.60416139 -2.1136674 1.0722599 2.13226116 -1.07119697] : <class 'objective_vectorized_parallel.Objective'> [ 2.60527276 -2.11598152 1.07174066 2.13404717 -1.07158798] : <class '__main__.objective_openCL_GPU_Objective'> [ 2.60428855 -2.11418258 1.07163536 2.13309981 -1.07148272] : <class '__main__.objective_openCL_CPU_Objective'> Evaluated 2500 points [ 2.65236805 -2.17113736 1.11832436 2.18090649 -1.10477807] : <class 'objective_scalar.Objective'> [ 2.65236757 -2.17113597 1.11832359 2.18090679 -1.10477991] : <class 'objective_vectorized.Objective'> [ 2.65236655 -2.1711387 1.1183202 2.18090519 -1.10477049] : <class 'objective_vectorized_parallel.Objective'> [ 2.65124389 -2.17148832 1.11786432 2.17945815 -1.10311976] : <class '__main__.objective_openCL_GPU_Objective'> [ 2.65124383 -2.17148872 1.11786446 2.17945781 -1.10311962] : <class '__main__.objective_openCL_CPU_Objective'> Evaluated 5000 points ## Conclusions: • Using PyOpenCl is significantly faster then Numpy in this problem when using the CPU or the GPU to perform the calculation. The OpenCL version of the objective function solved on the GPU was over 10 times faster than the Numpy version of the objective function. The OpenCL version solved on the CPU was almost 100 times faster then the Numpy version. • The OpenCL version of the objective function was able to be redirected from the the GPU to the CPU using a a single line of code. • Using the native CPU with OpenCL is significantly faster then using the GPU for this problem. For this problem, the performance difference is related to the fact that the computational load is roughly proportional the quantity of data which is transferred into the device. When the OpenCL program executes on the CPU, the data transfer times are very low. When the OpenCL program runs on the GPU, global memory is used, which requires a data transfer from the CPU’s memory space to the GPU memory space. This is usually relatively slow. • Because of the differences in the numeric representation of the floating point values, the OpenCL calculations introduce differences in the calculations compared to the scalar, the Numpy, and parallelized Numpy representations. ## Hardware Configuration: • Pythonxy 2.6.6 • PyOpenCL 0.92 • Sony Vaio VPCF1 • Intel i7 Q740 @ 1.73 GHz • NVIDIA GeForce GT 425M @ 1.12 GHz ## References: • Khronos Group OpenCL Standard • NVIDIA OpenCL manuals • Intel OpenCL manuals • PyOpenCL Home Page • PyOpenCL Documents This work is licensed under a Creative Commons Attribution By license. ## Sunday, June 12, 2011 ### First Draft: Solving the ellipse fit problem using PyOpenCL–recoding the objective function ## Problem Statement Can PyOpenCL be used to improve the performance of the ellipse fit problem. ## Discussion The objection function can be recoded using PyOpenCL. This module simplifies experimentation with global and local memory and it simplifies selecting the calculation precision. To test this get the other code snippets from here. Install all of the modules in a directory, then run this module. objective_pyopencl.py ''' This module contains an objective function for the ellipse fitting problem. The objective is coded using vectorized operations. ''' __author__ = 'Ed Tate' __email__ = 'edtate<at>gmail-dot-com' __website__ = 'exnumerus.blogspot.com' __license__ = 'Creative Commons Attribute By - http://creativecommons.org/licenses/by/3.0/us/''' from numpy import * import pyopencl as cl from string import Template # local imports import objective_scalar class Objective(objective_scalar.Objective): def __init__(self,parameters): objective_scalar.Objective.__init__(self,parameters) precision_dict = { 'float':float32, 'double':float64} self.platform_name = parameters.get('platform_name',None) self.platform_precision = parameters.get('platform_precision','double') self.partial_sum_loc = parameters.get('partial_sum_loc','__global') self.numpy_prec = precision_dict[self.platform_precision] openCL_prec = self.platform_precision point_list = self._p x,y = zip(*point_list) self.x = array(x,dtype=self.numpy_prec) self.y = array(y,dtype=self.numpy_prec) self.f1 = array([0,0],copy=True,dtype=self.numpy_prec) self.f2 = array([0,0],copy=True,dtype=self.numpy_prec) self.a = array([0],copy=True,dtype=self.numpy_prec) self.num_elements = len(x) platforms = cl.get_platforms() if len(platforms)>1 and not self.platform_name is None: # select the the first device in list for platform in platforms: if self.platform_name in platform.name: # use first device for context for device in platform.get_devices(): self.ctx = cl.Context([device]) self.device = device break break else: self.ctx = cl.create_some_context() if 'NVIDIA' in self.platform_name: compiler_options = '-cl-mad-enable -cl-fast-relaxed-math' else: compiler_options = '' #self.ctx = cl.create_some_context() self.queue = cl.CommandQueue(self.ctx) # define the input buffers and transfer data into them mf = cl.mem_flags self.x_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.x) self.y_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.y) self.f1_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f1) self.f2_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f2) self.a_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a) # define the output buffer self.partial_sum_buf = cl.Buffer(self.ctx, mf.READ_WRITE, self.x.nbytes) self.full_sum_buf = cl.Buffer(self.ctx, mf.READ_ONLY, array([0],dtype=self.numpy_prec).nbytes) # prep the container to the output self.partial_sum = empty(self.x.shape,dtype=self.numpy_prec) self.full_sum = empty(array([0],dtype=self.numpy_prec).shape,dtype=self.numpy_prec) # define local memory self.local_partial_sum = cl.LocalMemory(self.x.nbytes) code_pattern= Template(""" #pragma OPENCL EXTENSION cl_khr_fp64: enable __kernel void partial_sum( __global const${precision} *x,
__global const ${precision} *y, __global const${precision} *f1,
__global const ${precision} *f2, __global const${precision} *a,
${partial_sum_loc}${precision} *partial_sum)
{
int gid = get_global_id(0);

${precision} x_f1_diff = x[gid]-f1[0];${precision} x_f2_diff = x[gid]-f2[0];
${precision} y_f1_diff = y[gid]-f1[1];${precision} y_f2_diff = y[gid]-f2[1];

${precision} x_f1_diff_sq = x_f1_diff*x_f1_diff;${precision} x_f2_diff_sq = x_f2_diff*x_f2_diff;
${precision} y_f1_diff_sq = y_f1_diff*y_f1_diff;${precision} y_f2_diff_sq = y_f2_diff*y_f2_diff;

${precision} norm_pt_to_f1 = sqrt(x_f1_diff_sq+y_f1_diff_sq);${precision} norm_pt_to_f2 = sqrt(x_f2_diff_sq+y_f2_diff_sq);

${precision} t1 = norm_pt_to_f1+norm_pt_to_f2-2*a[0]; partial_sum[gid]=t1*t1; } __kernel void full_sum(${partial_sum_loc} ${precision} *partial_sum, __global${precision} *full_sum)
{
long i;
${precision} sum=0; for(i=0; i<${num_elements}; i++) {
sum += partial_sum[i];
}
full_sum[0] = sum;
}

__kernel void full_sum2(
__global ${precision} *partial_sum) { int gid = get_global_id(0); long i; long mid_point = (${num_elements}+1)/2;
if (gid>mid_point) {
return;
} else {
partial_sum[gid]+=partial_sum[gid+mid_point];
}
}
""")
substitutions = {'precision':openCL_prec,
'num_elements':self.num_elements,
'partial_sum_loc':self.partial_sum_loc}
prg = code_pattern.substitute(substitutions)

# define the function to be executed
self.prg = cl.Program(self.ctx, prg).build(options=compiler_options)

def __call__(self,x):
'''
Calculate the objective cost in the optimization problem.
'''

a      = x[0]
self.a  = array([x[0]],copy=False,dtype=self.numpy_prec)
self.f1 = array([x[1],x[2]],copy=False,dtype=self.numpy_prec)
self.f2 = array([x[3],x[4]],copy=False,dtype=self.numpy_prec)

# trigger transfer of values to the GPU buffers
# only those values which changed before this evaluation need to be updated
mf = cl.mem_flags
self.f1_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f1)
self.f2_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f2)
self.a_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a)

# execute the partial_sum function
if self.partial_sum_loc == '__global':
self.prg.partial_sum(self.queue,self.x.shape,(8,),
self.x_buf,self.y_buf,self.f1_buf,self.f2_buf,self.a_buf,self.partial_sum_buf)
# trigger the copy of the results into the partial_sum container after completion
# execute the full_sum function
self.prg.full_sum(self.queue,(1,),None,
self.partial_sum_buf,self.full_sum_buf)
else:
self.prg.partial_sum(self.queue,self.x.shape,(8,),
self.x_buf,self.y_buf,self.f1_buf,self.f2_buf,
self.a_buf,self.local_partial_sum)
self.prg.full_sum(self.queue,(1,),None,
self.local_partial_sum,self.full_sum_buf)

# trigger the copy of the results into the partial_sum container after completion

# compare sums
print (sum(self.full_sum)-sum(self.partial_sum))
print (sum(self.full_sum),sum(self.partial_sum))

# finish up the calculations
n = float(len(point_list))
_lambda =0.1
#full_sum = sum(self.partial_sum)
total = (self.full_sum/float(n) +
_lambda*self._ahat_max*self._sigma*exp((a/self._ahat_max)**4))

def _del_(self):
pass

if __name__=='__main__':
import time
from random import uniform,normalvariate,seed
# local modules
import ellipse

####################################################################

# setup test conditions
num_reps = 100
num_pts = 256
precision = 'float'
seed(1234567890)      # set the random generator to get repeatable results

# setup the reference ellipse

# define the foci locations
foci1_ref = array([2,-1])
foci2_ref = array([-2,1])
# define distance from foci to ellipse circumference
a_ref = 2.5

point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)

parameters = { "point_list" : point_list,
"platform_name" : 'NVIDIA',
"platform_precisions" : precision}

# test the function on an NVIDIA card
t0 = time.time()
my_objective = Objective(parameters)
x = my_objective.x0

t1 = time.time()
for i in range(0,num_reps):
y  = my_objective(x)
t2 = time.time()

print ''
print 'NVIDIA initialization took %f sec' % (t1-t0)
print 'NVIDIA execution took %f sec' % (t2-t1)
print 'Executed %i times on %s' % (num_reps,my_objective.device.name)
print ''

# test the function on an intel CPU

parameters = { "point_list" : point_list,
"platform_name" : 'Intel',
"platform_precisions" : precision}

t0 = time.time()
my_objective = Objective(parameters)
x = my_objective.x0

t1 = time.time()
for i in range(0,num_reps):
y  = my_objective(x)
t2 = time.time()

print ''
print 'Intel initialization took %f sec' % (t1-t0)
print 'Intel execution took %f sec' % (t2-t1)
print 'Executed %i times on %s' % (num_reps,my_objective.device.name)
print ''

Next Steps:

• Integrate with the ellipse fit problem
• Explain the behavior of the code example
• Measure performance wrt scalar and vectorized solutions

## Problem Statement

Can Parallel Python improve the performance of the ellipse fit algorithm? Under which conditions will Parallel Python offer performance advantages.

## Discussion

Breaking an algorithm into pieces which are executed in parallel on multiple CPU’s can speed up execution time. One way to estimate the best theoretical improvement is to use Amdahl’s law. This law estimates the performance improvement by breaking an algorithm in a portion which can be parallelized and a portion which is serial in nature. This is an upper estimate of the benefits of parallelization.

In practical parallelization, there may be overheads associated with getting things running on multiple CPUs. In Python, there are several issues to consider. The first is that that C implementation of Python does not natively support true parallelization. This is associated with issues deep in the interpreter (search on Python GIL for more information). Therefore, any library that supports parallelization needs to work around the issues with the GIL. Implementations like JPython and IronPython do not suffer from these issues.

One easy to use library is Parallel Python. It allows a program to establish a set of local and remote servers which are passed  functions and all of the information for successfully call those functions. The relative ease of use is offset by the fact that when the server’s are setup, there is a time cost. Also, then passing the functions, parameters, and everything else, there is a time cost. The experiments here looked at the use of this library in the ellipse fitting problem and compared the execution time to other solutions.

## Testing

To test the use of Parallel Python, the objective class previously developed was reused. An instance of this class behaves like a function by supporting the __call__ method. More importantly, the __init__ method and the __del__ method are overridden to create and destroy the Parallel Python job servers. To use these scripts, install them in the same directory as the scripts from here.

The first script implements the objective function, parallelizes it using Parallel Python. All of the calculations are performed using vectorized math. The objective function is implemented in the Objective class. In this class, when __init__ is called, the parameters used by the objective function are stored with the class instance, the number of parallel processes for execution are determined, and the Parallel Python jobs server is started. This causes a new instance of Python to be started for each process which will be used. When the __call__ method is invoke by a call to the class instance, then the calculation is broken up into pieces and dispatched to the job servers. When the objective function is no longer needed and the garbage collector invokes the __del__ method, the servers are destroyed.

objective_vectorized_parallel.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations which are
parallelized using parallel python.

'''

from numpy import *
from numpy import linalg as LA
import objective_scalar
import pp

class Objective(objective_scalar.Objective):

def __init__(self,parameters):
objective_scalar.Objective.__init__(self,parameters)
self.ncpus = parameters.get('ncpus','autodetect')
self.job_server = pp.Server(self.ncpus)
# because autodetect may have been used, use get_ncpus to
# get physical number of cpus
self.ncpus = self.job_server.get_ncpus()
self.job_server.set_ncpus(ncpus=self.ncpus)

def __call__(self,x):
'''
Calculate the objective cost in the optimization problem using
vectorized equations.
'''

point_list = self._p
foci1 = array([x[1],x[2]])
foci2 = array([x[3],x[4]])
a     = x[0]
n = float(len(point_list))
_lambda = 0.1

def solve_sub_problem(point_list,foci1,foci2,a):

pt_diff1 = point_list - foci1
pt_diff2 = point_list - foci2

x_f1_diff = pt_diff1[:,0]
x_f2_diff = pt_diff2[:,0]
y_f1_diff = pt_diff1[:,1]
y_f2_diff = pt_diff2[:,1]

x_f1_diff_sq = numpy.power(x_f1_diff,2)
x_f2_diff_sq = numpy.power(x_f2_diff,2)
y_f1_diff_sq = numpy.power(y_f1_diff,2)
y_f2_diff_sq = numpy.power(y_f2_diff,2)

norm_pt_to_f1 = numpy.power(x_f1_diff_sq+y_f1_diff_sq,0.5)
norm_pt_to_f2 = numpy.power(x_f2_diff_sq+y_f2_diff_sq,0.5)

temp = numpy.power(norm_pt_to_f1+norm_pt_to_f2-2*a,2)
part_sum = numpy.sum(temp)
return part_sum

jobs = []
numpts = n
sigma    = self._sigma
ahat_max = self._ahat_max
inc = math.ceil(n/float(self.ncpus))
endi = 0
for i in range(0,self.ncpus):
starti = endi
endi = int(min(starti+inc,n))
# make a copy of point list which is smaller
#   to minimize the time in transferring to the
#   parallel processes
local_point_list = array(point_list)[starti:endi,:]
jobs.append(self.job_server.submit(solve_sub_problem,
(local_point_list,foci1,foci2,a),
(),("numpy",)))
total = sum([job() for job in jobs])/n
total += _lambda*ahat_max*sigma*exp((a/ahat_max)**4)

def __del__(self):
self.job_server.destroy()

if __name__=='__main__':
import time
from random import seed

# local modules
import ellipse

####################################################################

# setup test conditions
num_reps = 100
num_pts = 256
precision = 'float'
seed(1234567890)      # set the random generator to get repeatable results

# setup the reference ellipse

# define the foci locations
foci1_ref = array([2,-1])
foci2_ref = array([-2,1])
# define distance from foci to ellipse circumference
a_ref = 2.5

point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)

parameters = { "point_list" : point_list ,
"ncpus"      : 'autodetect'}

# test the function
t0 = time.time()
my_objective = Objective(parameters)
x0 = my_objective.x0
t1 = time.time()
for i in range(0,num_reps):
y  = my_objective(x0)
t2 = time.time()

print ''
print 'Initialization took %f sec' % (t1-t0)
print 'Using %i cpus' % my_objective.ncpus
print 'Execution took %f sec' % (t2-t1)
print 'Executed %i times.' % (num_reps)
print ''

ref_objective = objective_scalar.Objective(parameters)
# compare x0 calculation
print ''
print ('Difference between x0 calculations = %f'
% LA.norm(array(ref_objective.x0)-array(my_objective.x0)))
print ('Difference between objective calcs = %f'
% (ref_objective(x0)-my_objective(x0)))
print ''

The second script measures the execution times for this new objective implementation versus the previous scalar and vectorized implementations. One important item in this script is forcing the garbage collector to run after each test. Without doing this, when new classes are created, additional Python instances are created while the old instances are left running. This could have been solved through a more clever class design. However, for this testing, the simplest solution was to force the garbage collector to handle the issue.

## Results

The first test measured the time to call the objective function once. This showed that the scalar solution was the slowest and the simple vectorized solution was the fastest. Surprisingly, parallelizing the problem (and using vectorization) resulted in a solution which was consistently between the scalar and vectorized solution.

The second test embedded the parallelized version of the objective function in the ellipse fit problem. In this implementation, the results were mixed. For small problems, the scalar solution performed best. As the number of points on the ellipse increased, the vectorized version provided the best results. However, the parallelized version appeared to converge towards the vectorized version for larger problems.

From this test, the conclusion was that parallelization using an approach like Parallel Python is not effective. One of the reasons for this is the large amount of data transfer that needs to happen to setup the function call. In this example, the ratio of computation to data transfer is low enough that the transfer mechanisms make the benefits of parallelization, for a system with 8 cpus, not worth the effort. If the data transfer were faster or the computation times were longer, then parallelizing using Parallel Python might have offered advantages.

## Further Testing/Development

• Evaluate the impact of imported modules
• Evaluate ways to share read only resources among processes efficiently

## Problem Statement:

The optimization problem to fit an ellipse is too slow.

## Discussion:

The easiest thing to improve performance in a script is to try compiling it using a tool like psyco or py2exe. Beyond that, the time critical portions of the code can be written in C/C++ and embedded into the script using a library like scipy.weave or a tool like Cython.

If a Python script has elements which are naturally parallelizable, then there are additional approaches to improving the script’s performance. One approach is to break the code up and execute it among several CPU cores using a library like the multiprocessing library or parallel python. Another approach to try taking advantage of the CPU and GPU hardware using PyOpenCL. Finally, the script can be vectorized, where all of the equations are expressed as vector operations.

Since the objective in the Ellipse fitting problem is easily decomposed into a set of parallel equations, the solution considered here is the use of numpy and vectorization of the code.

## Solution:

The original problem, presented previously, is broken into a couple of modules. The first module generates the ellipse data used in the fitting exercise.

ellipse.py

"""
Module to generate noisy data samples about an ellipse.
"""
from numpy import *
from numpy import linalg as LA
from random import uniform,normalvariate

def generate_noisy_pts(a,foci1,foci2,
num_pts=200, angles=None,
x_noise = None, y_noise=None):

'''
Generate points for an ellipse given
the foci, and
the distance to the intersection of the minor axis and ellipse.

Optionally,
the number of points can be specified,
the angles for the points wrt to the centroid of the ellipse, and
a noise offset for each point in the x and y axis.
'''
c = (1/2.0)*LA.norm(foci1-foci2)
b = sqrt(a**2-c**2)
x1 = foci1[0]
y1 = foci1[1]
x2 = foci2[0]
y2 = foci2[1]
if angles is None:
t = arange(0,2*pi,2*pi/float(num_pts))
else:
t = array(angles)

ellipse_x = (x1+x2)/2 +(x2-x1)/(2*c)*a*cos(t) - (y2-y1)/(2*c)*b*sin(t)
ellipse_y = (y1+y2)/2 +(y2-y1)/(2*c)*a*cos(t) + (x2-x1)/(2*c)*b*sin(t)
try:
# try adding noise to the ellipse points
ellipse_x = ellipse_x + x_noise
ellipse_y = ellipse_y + y_noise
except TypeError:
pass
return (ellipse_x,ellipse_y)

def generate_point_list(num_samples,a_ref,foci1,foci2,sigma=0.2):
# generate list of noisy samples on the ellipse
angles = [uniform(-pi,pi) for i in range(0,num_samples)]
sigma = 0.2
x_noise = [normalvariate(0,sigma) for t in angles]
y_noise = [normalvariate(0,sigma) for t in angles]
x_list,y_list = generate_noisy_pts(a_ref,foci1,foci2,
angles  = angles,
x_noise = x_noise,
y_noise = y_noise)
point_list = [array([x,y]) for x,y in zip(x_list,y_list)]
return point_list

if __name__=='__main__':
import matplotlib.pylab as plt

# define the foci locations
foci1 = array([2,-1])
foci2 = array([-2,1])
# define distance from foci to ellipse circumference
a_ref = 2.5

num_pts = 200
point_list = generate_point_list(num_pts,a_ref,foci1,foci2)

plt.figure()
x,y = zip(*point_list)
plt.plot(x,y,'.b')
plt.show()

The next two modules define the objective function in two different ways. In the first, the code calculates the objective function using scalar math. In the second, the objective function is calculated using numpy vector math.

The first module, objective_scalar.py, uses a class to define the objective function. This is done for several reasons. The first is that the class allows the parameters, the values which will not be changed by the optimization search, to be bound to the objective without a need to pass them through the optimizer. Additionally, the __init__ and __del__ methods in the class provide a convenient place to allocate special purpose services and hardware which may be used when evaluating the objective. To simplify the use of the class, the __call__ method is used so the instance of the class looks like a function to the optimizer.

objective_scalar.py

'''
This module contains an objective function and logic to determine the starting
point for the ellipse fitting problem.

This objective is coded using operations on scalar quantities.

'''

from numpy import *
from numpy import linalg as LA

class Objective(object):
def __init__(self,parameters):
# setup parameters which will be used later
self._p = parameters['point_list']
self._sigma = parameters.get('sigma',0.2)
self.x0 = self._find_x0()
self._ahat_max = self.x0[0]

def _find_x0(self):
'''
Determine the initial value, x0, for the optimization problem.
'''
points = array(self._p)
x_list,y_list = zip(*points)
# find x mean
x_mean = array(x_list).mean()
# find y mean
y_mean = array(y_list).mean()
# find point farthest away from mean
center = array([x_mean,y_mean])
distances = zeros((len(x_list),1))
for i,point in enumerate(points):
distances[i,0]=LA.norm(point-center)
ind = where(distances==distances.max())
max_pt = points[ind[0],:][0]
# find point between mean and max point
foci1 = (max_pt+center)/2.0
# find point opposite from
foci2 = 2*center - max_pt
return [distances.max(), foci1[0],foci1[1],foci2[0],foci2[1]]

def __call__(self,x):
'''
Calculate the objective cost in the optimization problem using scalar
equations which are summed.
'''

point_list = self._p

foci1 = array([x[1],x[2]])
foci2 = array([x[3],x[4]])
a     = x[0]
n = float(len(point_list))
lambda_   = 0.1
sigma    = self._sigma
ahat_max = self._ahat_max
total = 0
for point in point_list:
total += ((LA.norm(point-foci1,2)+LA.norm(point-foci2,2)-2*a)**2)/n
total += lambda_*ahat_max*sigma*exp((a/ahat_max)**4)

if __name__=='__main__':
import time
from random import uniform, normalvariate, seed

# local modules
import ellipse

####################################################################

# setup test conditions
num_reps = 100
num_pts = 256
precision = 'float'
seed(1234567890)      # set the random generator to get repeatable results

# setup the reference ellipse

# define the foci locations
foci1_ref = array([2,-1])
foci2_ref = array([-2,1])
# define distance from foci to ellipse circumference
a_ref = 2.5

point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)

parameters = { "point_list" : point_list }

# test the function on an NVIDIA card
t0 = time.time()
my_objective = Objective(parameters)
x0 = my_objective.x0
t1 = time.time()
for i in range(0,num_reps):
y  = my_objective(x0)
t2 = time.time()

print ''
print 'Initialization took %f sec' % (t1-t0)
print 'Execution took %f sec' % (t2-t1)
print 'Executed %i times.' % (num_reps)
print ''



The next module vectorizes the codes for faster execution. To simplify the coding, the vectorized objective inherits the interfaces and initialization from the scalar objective code module.

objective_vectorized.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations.

'''

from numpy import *
from numpy import linalg as LA
import objective_scalar

class Objective(objective_scalar.Objective):

def __call__(self,x):
'''
Calculate the objective cost in the optimization problem using
vectorized equations.
'''

point_list = self._p
foci1 = array([x[1],x[2]])
foci2 = array([x[3],x[4]])
a     = x[0]
n = float(len(point_list))
_lambda = 0.1
sigma    = self._sigma
ahat_max = self._ahat_max

pt_diff1 = point_list - foci1
pt_diff2 = point_list - foci2

x_f1_diff = pt_diff1[:,0]
x_f2_diff = pt_diff2[:,0]
y_f1_diff = pt_diff1[:,1]
y_f2_diff = pt_diff2[:,1]

x_f1_diff_sq = power(x_f1_diff,2)
x_f2_diff_sq = power(x_f2_diff,2)
y_f1_diff_sq = power(y_f1_diff,2)
y_f2_diff_sq = power(y_f2_diff,2)

norm_pt_to_f1 = power(x_f1_diff_sq+y_f1_diff_sq,0.5)
norm_pt_to_f2 = power(x_f2_diff_sq+y_f2_diff_sq,0.5)

temp = power(norm_pt_to_f1+norm_pt_to_f2-2*a,2)/n
total = sum(temp)
total += _lambda*ahat_max*sigma*exp((a/ahat_max)**4)

if __name__=='__main__':
import time
from random import seed

# local modules
import ellipse

####################################################################

# setup test conditions
num_reps = 100
num_pts = 256
precision = 'float'
seed(1234567890)      # set the random generator to get repeatable results

# setup the reference ellipse

# define the foci locations
foci1_ref = array([2,-1])
foci2_ref = array([-2,1])
# define distance from foci to ellipse circumference
a_ref = 2.5

point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)

parameters = { "point_list" : point_list }

# test the function on an NVIDIA card
t0 = time.time()
my_objective = Objective(parameters)
x0 = my_objective.x0
t1 = time.time()
for i in range(0,num_reps):
y  = my_objective(x0)
t2 = time.time()

print ''
print 'Initialization took %f sec' % (t1-t0)
print 'Execution took %f sec' % (t2-t1)
print 'Executed %i times.' % (num_reps)
print ''


Executing objective_scalar.py and objective_vectorized.py will provide some sense of the difference in speed between the two approach. However, it is a good idea to measure the performance under different conditions. The next script, main.py, evaluates the implementation of the ellipse fit objective when different numbers of points are available for the fit. It also evaluates how the time to solve the ellipse fit problem changes between the scalar and the vectorized objectives when the number of points changes. To reduce the variation due to changes in the data points, the random seed is set to the same value each time points are generated.

main.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations.

'''

__author__ = 'Ed Tate'
__email__  = 'edtate<at>gmail-dot-com'
__website__ = 'exnumerus.blogspot.com'

import time
from random import uniform, normalvariate, seed
from numpy import *
import pylab as plt
from openopt import NLP

# local modules
import ellipse
import objective_scalar
import objective_vectorized

####################################################################

# setup test conditions
num_reps = 100

seed(1234567890)      # set the random generator to get repeatable results

# setup the reference ellipse

# define the foci locations
foci1_ref = array([2,-1])
foci2_ref = array([-2,1])
# define distance from foci to ellipse circumference
a_ref = 2.5

######################################################
# measure the timing to setup and call the objective function

num_pts_list = [10,100,250,500,1000,2500,5000,10000,50000]

objective_functions = [objective_scalar,objective_vectorized]
setup_times = []
call_times  = []
for num_pts in num_pts_list:
seed(1234567890)      # set the random generator to get repeatable results
point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)

parameters = { "point_list" : point_list }

test_setup_times = []
test_call_times = []
for objective_function in objective_functions:
t0 = time.time()
my_objective = objective_function.Objective(parameters)
x0 = my_objective.x0
t1 = time.time()
for i in range(0,num_reps):
y  = my_objective(x0)
t2 = time.time()
test_setup_times.append(t1-t0)
test_call_times.append(t2-t1)

setup_times.append(test_setup_times)
call_times.append(test_call_times)
print 'Evaluated %i points' % num_pts

# plot the results
plt.figure()
plt.loglog(num_pts_list,call_times)
plt.legend(['scalar','vectorized'],loc='upper left')
plt.xlabel('Number of points')
plt.ylabel('Execution Time [sec]')
plt.title('Times for %i calls to objective function' % num_reps)
plt.grid(True)

#############################################################
# test the performance when used in an optimizer

num_pts_list = [10,100,250,500,1000,2500,5000]

objective_functions = [objective_scalar,objective_vectorized]
setup_times = []
call_times  = []
for num_pts in num_pts_list:
seed(1234567890)      # set the random generator to get repeatable results
point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)

parameters = { "point_list" : point_list }

test_setup_times = []
test_call_times = []
for objective_function in objective_functions:
t0 = time.time()
my_objective = objective_function.Objective(parameters)
x0 = my_objective.x0
p_vectorized = NLP(my_objective, x0, iprint = -1)
t1 = time.time()
opt_result = p_vectorized.solve('ralg')
t2 = time.time()
test_setup_times.append(t1-t0)
test_call_times.append(t2-t1)

print opt_result.xf

setup_times.append(test_setup_times)
call_times.append(test_call_times)
print 'Evaluated %i points' % num_pts
print

# plot the results
plt.figure()
plt.loglog(num_pts_list,call_times)
plt.legend(['scalar','vectorized'],loc='upper left')
plt.xlabel('Number of points')
plt.ylabel('Execution Time [sec]')
plt.title('Times to solve the ellipse fit problem')
plt.grid(True)

plt.show()

## Testing Results and Conclusions

To easily see the results which cover a large range of numbers, the results are plotted on a log-log graph. Comparing the scalar calculations against the vectorized calculations of the objective function shows almost an order of magnitude improvement in computing time.

When the vectorized objective is used in the ellipse fit problem, the performance improvement is almost an order of magnitude. For a small number of points, the overhead from the optimization routine is significant and reduces the improvement (as should be expected).