« Simulation Optimization | Main
Monday
Aug172009

Gathering and Preparing a Data Set 

Use the right tools. Data analysis software has come a long way in recent years. If you haven't researched what's available lately, you might be surprised at how much power you can get for relatively low cost. At a bare minimum, you should invest in a tool to help with distribution fitting. SIMUL8 Professional includes the Stat:Fit plug-in. We think that ExpertFit is another excellent choice. Excel is good for some simple tabulation, but there is a wide array of techniques out there that are invaluable for exploratory data analysis. If you are working on projects involving data sets of any substantial size or complexity, you will appreciate the features of a complete statistical analysis package. There are many choices available on the market. For full-fledged data analyses, we use Statistica, from StatSoft, because it offers an impressive array of features at a very reasonable price. The right choice for you will depend upon your needs, background and budget.

Request data as early in the study as you can. Data is commonly tucked away in a corporate data warehouse where there is one poor overworked soul who must, on top of his 47 other duties, pull data for anyone wanting to perform an analysis. It is well worth the effort to get in line as soon as you can. Be sure to treat this person well because chances are pretty good that you'll need more data from him/her again  before too long. We recently heard the story of a colleague whose first visit to the data specialist involved a box of Krispy Kreme donuts. Far be it from us to judge such methods, but we do hear that he's not having any trouble getting the data he needs!

Be specific in the type of data you request. If possible, specify the source of the data, the time period, and even some of the values you expect to appear. This information can help the data gatherer verify that you are both talking about the same field.

Be prepared to merge multiple data sources. In a recent project, we needed to have information on order characteristics (such as height, width and quantity) along with production data (which machine ran the order, how long it took, etc.). The corporate data warehouse contained lots of excellent information on these topics, but they lived in two different files: an order history file, and a production history file. In order to perform the analysis, we needed to have a combined file. This was easy enough to create by merging the two files based on order numbers.

Don't take anything at face value. Unless you are in the fortunate (and rare) position of pulling your own data, carefully check any data sets that you receive. Is it from the correct time period? Do you have consistent keys (such as Order ID)? Is it from the correct plant? Do the numbers seem reasonable? Does the set seem to have approximately the right number of records?

Play with the data to get comfortable with it. Run some basic tabulations. Add up the number of weekly orders and check to see whether that matches the general estimates you've been hearing from your local experts (if you have them, read this article if you don't). The people who spend all day every day with a process usually have a good sense of what's going on. If the data is telling you something different, investigate why. Are you seeing different arrival patterns by day of week? There are plenty of fancy statistical tests available to test this formally, but for now, just look at the results to see if it is something you will likely have to deal with.

Beware of missing data. If you don't have a complete set of fields for a given record, treat that record with caution. Throw out cases that don't make sense.

Check for duplicates. Before you use that production data to determine order volume, be sure that you've only got one record per order. Often production data will contain one line per process. If an order goes through four production stages, there may be one record for each process. Simply adding up the number of orders would lead to seriously inaccurate results.

References (1)

References allow you to track sources for this article, as well as articles that were written in response to this article.

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>