Friday, February 24, 2012

Lesson Learned: Managing Large Numbers of Plots, with an Example in Python

28056200-5ef7-11e1-8741-08002700a056

Problem Statement:

A project generates hundreds, thousands, or more graphs over its life. These graphs are copied and pasted into e-mail, PowerPoint slides, etc. The plots become divorced from the documents they were originally distributed with. Invariably, at some point in the project, a plot is brought back and someone asks what assumptions were used to generate it. With only the graph available, it can be difficult or impossible to answer this question.

To complicate matters, the plots are generated using legacy code, and modifying all of the existing code base would be a substantial undertaking.

How can this situation be improved?

Discussion:

There are two problems here. First, a given graph is not traceable to its origin. This can be remedied in one of several ways. If the source data is well controlled and can be described using a short phrase, then adding that phrase somewhere on the chart is helpful. If the source data is constantly changing or requires too much information to describe with a short phrase, then something else is needed. A hash of the input data can help identify and verify the source data set used to generate the graph. A universally unique ID (UUID) can be used to give a graph a unique name. If the source data, assumptions, etc. are stored using that same UUID, then when a graph is brought back for review, all of the necessary parts can be found.
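As a minimal sketch of the hash-plus-UUID idea, the following uses only the Python standard library. The names `data_hash` and `save_provenance`, the JSON sidecar file, and the `assumptions` field are hypothetical choices made for illustration; they are not part of the script shown later in this post.

```python
import hashlib
import json
import os
import uuid

def data_hash(x, y):
    """Return an MD5 hex digest of the paired data points."""
    xy = list(zip(x, y))
    return hashlib.md5(str(xy).encode('utf-8')).hexdigest()

def save_provenance(x, y, assumptions, folder='.'):
    """Store the data hash and assumptions in <UUID>.json; return the UUID."""
    plot_id = str(uuid.uuid1())
    record = {'uuid': plot_id,
              'hash': data_hash(x, y),
              'assumptions': assumptions}
    # hypothetical layout: one JSON file per plot, named after the UUID
    with open(os.path.join(folder, plot_id + '.json'), 'w') as f:
        json.dump(record, f, indent=2)
    return plot_id
```

If the plot image is saved under the same UUID, the image and its record can be matched up later by file name alone.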

The second problem is handling legacy code. There are at least three choices.

  • The first is probably the easiest. A function that adds the hash and UUID could be created and inserted at the appropriate location in each of the major pieces of plotting code. This is problematic because there are several interfaces and actions in the scripts which could make this work poorly. Also, every plotting routine would need to be modified.
  • A second choice is to wrap the plotting routines in functions, then pass each function and its data to a wrapper function which adds a hash and UUID as the last step after the plotting routine runs. If the plotting functions already exist, then this can be done without changing any of the plotting code.
  • A third choice is to create a decorator which wraps plotting routines, adding the hash and UUID. This has the same issues as using a wrapper call and requires changes to the plotting routines' source. However, the changes consist only of an import statement and application of the decorator at the correct location.
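The second choice, a wrapper call, might be sketched as follows. To keep the sketch runnable on its own it avoids any plotting library: `tagged_call`, `fake_plot`, and `fake_write_tag` are hypothetical names, and in real use the `write_tag` argument would be something that draws text on the figure.

```python
import hashlib
import uuid

def tagged_call(plot_fn, write_tag, x, y, **kwargs):
    """Call plot_fn(x, y), then pass a hash/UUID tag to write_tag."""
    plot_fn(x, y, **kwargs)          # existing plotting code, unchanged
    xy = list(zip(x, y))
    digest = hashlib.md5(str(xy).encode('utf-8')).hexdigest()
    plot_id = str(uuid.uuid1())
    # tagging is the last thing done, after the plot itself is drawn
    write_tag('hash=' + digest + ',UUID=' + plot_id)
    return plot_id

# stand-ins for a legacy plotting routine and a figure-text call
calls = []

def fake_plot(x, y):
    calls.append(('plot', len(x)))

def fake_write_tag(tag):
    calls.append(('tag', tag))

plot_id = tagged_call(fake_plot, fake_write_tag, [0, 1, 2], [5, 6, 7])
```

The plotting function itself is untouched; only the call site changes, which is the trade-off against the decorator approach.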

For this problem, the decorator solution is used. The code that follows implements a decorator that creates a hash of the plot data and a UUID, both of which are added to the right side of the plot. This way, no matter where the plot goes, there is a high likelihood that its pedigree can be preserved.

'''
Script to demonstrate the use of decorators to add a unique identifier to
   a plot. The identifier includes a hash of input data, to help see if
   versions of a plot really have different data, and a UUID to uniquely
   identify this plot independent of when or where it was generated.
'''

__author__  = 'Ed Tate'
__email__   = 'edtate<AT>gmail-dot-com'
__website__ = 'exnumerus.blogspot.com'
__license__ = 'Creative Commons Attribute By - http://creativecommons.org/licenses/by/3.0/us/'

from matplotlib.pylab import *
import random
import hashlib   # replaces the deprecated md5 module
import uuid

def identifiable_plot(fn):
    def newfun(*args,**kwargs):
        # draw the plot by calling the decorated function
        fn(*args,**kwargs)
        # after the plot is drawn, create the tag string from a hash
        #    of the data and a universally unique ID
        x = args[0]
        y = args[1]
        # list() so the data itself is hashed, not a zip object's repr
        xy = list(zip(x,y))
        m = hashlib.md5()
        m.update(str(xy).encode('utf-8'))
        this_uuid = str(uuid.uuid1())
        this_tag = 'hash=' + m.hexdigest() + ',' + 'UUID=' + this_uuid
        # write the tag to the figure for future reference
        figtext(1,0.5,this_tag ,rotation='vertical',
                horizontalalignment='right',
                verticalalignment='center',
                size = 'x-small',
                )
        # return the UUID so the caller can use it as a file name
        return this_uuid

    return newfun

###############################
    
@identifiable_plot
def my_plot(x,y):
    plot(x,y,'o')
    grid(True)
    
###############################
    
x = [random.random() for i in range(100)]
y = [random.random() for i in range(100)]
    
plot_uuid = my_plot(x,y)
savefig(plot_uuid+'.png')

show()
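When a tagged plot comes back for review, the tag printed on it can be checked against a candidate data set. The sketch below assumes the tag has the `hash=...,UUID=...` layout written by the script above and that the candidate data is hashed with the same string convention; `parse_tag` and `matches_data` are hypothetical helper names.

```python
import hashlib

def parse_tag(tag):
    """Split 'hash=<md5>,UUID=<uuid>' into a dict with keys 'hash' and 'UUID'."""
    return dict(part.split('=', 1) for part in tag.split(','))

def matches_data(tag, x, y):
    """Return True if the tag's hash equals the hash of the candidate data."""
    xy = list(zip(x, y))
    digest = hashlib.md5(str(xy).encode('utf-8')).hexdigest()
    return parse_tag(tag)['hash'] == digest
```

A mismatch only shows that the data differs in some way; it does not say how, so the stored source data is still needed to investigate further.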

 


Test Configuration:
  • Windows 7
  • PythonXY 2.7.2.1

References:
This work is licensed under a Creative Commons Attribution By license.

Using HTML to View Large Sets of Plots - An Example in Python



The embedded example doesn't work here because of Blogger limitations. However, if you run the script you will be able to select graphs in the generated HTML page.

Problem Statement

You have a program which generates lots of similar plots that end users would like to compare and explore. The end users may not be able to install any code. You can't set up a web server to navigate the data set. You cannot install any new programs on their Windows desktop. How do you provide a solution?

Discussion

You can assume that any modern computer has at least a copy of Firefox, Safari, or Internet Explorer. Since these browsers support JavaScript (except under the strictest security settings), you can build a very lightweight data viewer using a few simple methods. The most important design decision when generating the plots is to name them so that the file names are easy to recreate from the selections an end user might make.

Example

The following snippet of Python code generates 9 graphs that have random numbers plotted on two axes using different colors and markers. There are three choices of colors and three choices of markers. After generating the plots and saving them, the script creates an HTML file which simplifies the navigation of the images. A user can open the HTML page and select a graph by changing the form selections at the top of the page.

There are a few key concepts that help make this work:
  • The plot names can be created from selections using JavaScript. In this example there are three colors and three markers, and every plot file name is formed by concatenating the color and marker descriptions.
  • When a user changes their choice of color or marker, a JavaScript function rebuilds the plot file name and causes the browser to reload the image by changing the img source.
  • The Python script uses templates to set up the bulk of the HTML page, then substitutes in specific options for the user after the plots have been generated.
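The naming convention and the template substitution can be sketched without generating any plots. In this sketch `plot_filename` and `option_tags` are hypothetical helper names, and the one-line template is a cut-down stand-in for the full page; the `color,marker.png` pattern is the one used by the script in this post.

```python
from itertools import product
from string import Template

color_names = ['Red_Plot', 'Blue_Plot', 'Green_Plot']
marker_names = ['Circle_Plot', 'Square_Plot', 'Diamond_Plot']

def plot_filename(color_name, marker_name):
    """Build the file name from the two selections."""
    return color_name + ',' + marker_name + '.png'

# every combination of selections maps to exactly one file name
all_files = [plot_filename(c, m)
             for c, m in product(color_names, marker_names)]

def option_tags(names):
    """Return one <option> element per name, one per line."""
    return '\n'.join('<option>' + n + '</option>' for n in names)

# $color_options is the placeholder filled in by substitute()
page = Template('<select id="color_name">\n$color_options\n</select>')
html = page.substitute(color_options=option_tags(color_names))
```

Because the browser rebuilds the same string from the selected option text, the option labels and the file names have to stay in step; generating both from the same lists is one way to keep them aligned.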

import pylab as plt
from random import random
from string import Template

colors  = {'Red_Plot':'r',
           'Blue_Plot':'b',
           'Green_Plot':'g',
           }
markers = {'Circle_Plot':'o',
           'Square_Plot':'s',
           'Diamond_Plot':'d',
           } 

plt.figure()
for c_key in colors.keys():
    for m_key in markers.keys():
        plt.clf()
        plot_name = c_key + ',' + m_key + '.png'
        x = [random() for i in range(0,100)]
        y = [random() for i in range(0,100)]
        color = colors[c_key]
        marker = markers[m_key]
        plt.plot(x,y,color+marker,markersize=15)
        plt.savefig(plot_name)
        
HTML_template = Template('''<head>
   <script language="JavaScript"><!--
      function sel_plot() {
         // only do this if the browser supports images
         if(document.images) {
            // get plot color name
            var e=document.getElementById("color_name");
            var c_name = e.options[e.selectedIndex].text;
            // get plot marker name
            var e=document.getElementById("marker_name");
            var m_name = e.options[e.selectedIndex].text;
            // build the filename from these selections
            var plot_filename = c_name + "," + m_name + ".png";
            // cause the correct plot to be loaded
            document["plot"].src = plot_filename;
            }
         }
      // select the plot to display initially after loading document
      window.onload = function() { sel_plot() };
      // silence errors: returning true suppresses the default error handling
      window.onerror = function() { return true; };
   </script>
</head>
<body>
   <center>
      <form name="Plot Select Form" id="plot_select_form">
         <select id="color_name" size="3"
            onchange="sel_plot()">
            $color_options
         </select>
         <select id="marker_name" size="3"
            onchange="sel_plot()">
            $marker_options
         </select>
      </form>
      <img name="plot" src="dummy.png" height="200"
         width="500">
   </center>
</body>
''')
        

# build the option strings
def build_options(opt_dict):
    s = ''
    for i,key in enumerate(opt_dict.keys()):
        if i==1:
            s += '<option selected>'
        else:
            s += '<option>'
        s += key
        s += '</option>\n'
    return s    

color_options = build_options(colors)
marker_options = build_options(markers)

# build the HTML page        
HTML = HTML_template.substitute(color_options=color_options,
                     marker_options=marker_options)
    

# write the HTML page (text mode, closed automatically on exit)
with open('example.html','w') as f:
    f.write(HTML)

Test Configuration

  • PythonXY 2.7.2.1
  • IE 9


This work is licensed under a Creative Commons Attribution By license.