Much (maybe too much) of the work of modern epidemiology involves spending time in front of a computer screen. Injury and disaster epidemiology are no different. There are many options for statistical software, database management systems, text editing, and presentation tools, and those options evolve in response to technology and preference. My suggestions are just that, though they are informed by years of trial and error.
Public health practitioners and researchers too often use the software they first learned. I can understand why. There are just so many hours in a day and synapses in a brain. But there are advantages to change, and to learning more than one program that appears to do the same thing. Some procedures and data manipulations are most easily and effectively performed by certain packages. Perhaps more importantly, different tools allow us to look at the same problem differently. What follows is a by no means exhaustive annotated list, in no particular order, of some of the more popular programs I’ve used. For my purposes, the choice has increasingly come down to R and SAS, with R very much in the ascendant. While I find that R is the single best choice for the work I do, SAS still simply cannot be beaten for its ability to slice and dice enormous data sets of the kind I frequently find myself confronting.
Microsoft Excel – This humble and ubiquitous spreadsheet program has a quick learning curve, many useful statistical functions and the advantage of being readily understood and shared for collaborative purposes. If cost is an issue, Open Office and Google Spreadsheets are fine alternatives. Many researchers will find, though, that they will soon outgrow spreadsheets when they develop a need for more advanced and specialized procedures, data manipulation, and graphics.
EpiInfo – Once upon a time, when there was little else, there was EpiInfo. Developed by CDC for its own investigators, this program has something for everyone, including sample size calculators, simple and intuitive 2x2 table calculations, questionnaire-development and analysis programs for case-control studies, and even a mapping module. Every epidemiologist in practice owes it to themselves to download this free program.
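To make the 2x2 table calculation mentioned above concrete, here is a minimal Python sketch of the arithmetic involved: an odds ratio and a risk ratio with approximate 95% confidence intervals computed on the log scale (the Woolf method for the odds ratio). This is not EpiInfo's code or interface. The function name `two_by_two`, its argument layout, and the counts are hypothetical, chosen only for illustration, and the sketch assumes no cell is zero.

```python
import math


def two_by_two(a, b, c, d, z=1.96):
    """Odds ratio and risk ratio for a 2x2 table (hypothetical helper).

    a = exposed cases,   b = exposed non-cases
    c = unexposed cases, d = unexposed non-cases
    Assumes every cell is greater than zero.
    """
    # Odds ratio with a Woolf (log-scale) confidence interval
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    or_ci = (
        math.exp(math.log(odds_ratio) - z * se_log_or),
        math.exp(math.log(odds_ratio) + z * se_log_or),
    )

    # Risk ratio with a log-scale confidence interval
    risk_ratio = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    rr_ci = (
        math.exp(math.log(risk_ratio) - z * se_log_rr),
        math.exp(math.log(risk_ratio) + z * se_log_rr),
    )

    return {
        "odds_ratio": odds_ratio,
        "or_ci": or_ci,
        "risk_ratio": risk_ratio,
        "rr_ci": rr_ci,
    }


# Made-up counts: 20 of 100 exposed and 10 of 100 unexposed became cases
result = two_by_two(20, 80, 10, 90)
print(result)
```

With these invented counts the odds ratio is 2.25 and the risk ratio is 2.0. Tables with small or zero cells call for exact methods, which this sketch does not attempt.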
SPSS – A flexible and reasonable alternative if you prefer graphical user interfaces and a menu-driven approach to data analysis. Depending on the type of work you do, this may very well be the only statistical software package you will need. It is particularly well-suited to procedures such as factor analysis and is very popular among social scientists. (In fact, SPSS stands for Statistical Package for the Social Sciences.) The spreadsheet-like interface provides a relatively painless transition from a program like Excel. An added advantage for many users is that the package can be purchased during one’s graduate education, avoiding yearly licensing fees. Some researchers see a disadvantage in the primarily menu-driven interface, and prefer syntax-based analyses like those in R and SAS.
SAS – A powerful suite of programs that is well-known, well-documented and flexible enough to handle most any statistical procedure. While a menu-driven version is available, most researchers and epidemiological analysts will use the syntax-driven, SAS-language-based approach. Writing syntax presents a greater learning curve, but is amply rewarded by the increased control, flexibility and ability to document and recall analyses. SAS is configured to work across most any computer platform, perhaps contributing to its popularity among US government agencies. Its principal disadvantage is the yearly licensing fee, which may be prohibitively expensive for some researchers. Multiple-user site license agreements with academic institutions may offset the expense to some extent. It is particularly powerful at handling the extremely large datasets that are increasingly available to practitioners and that constitute the databases for many surveillance activities. I developed and taught a course on SAS for master’s-level epidemiology students for a number of years, so many of my statistical notes use SAS examples.
R – I have, over the last few years, become an unabashed convert to this open-source program. It involves a bit of a learning curve, but I have yet to find anything it can’t do. It is a calculator, a spreadsheet, a graphics creator, a simulation lab, a full programming language and (increasingly) the lingua franca of statistical scientific discourse. The language is similar to S-Plus. Researchers and statisticians contribute to its development and documentation, it is actively maintained, and it allows a level of customization and scalability that I find uniquely suited to epidemiological analysis. Many folks fall in love with the graphics, which are as good as or (in many cases) better than those found in expensive packages. It even includes mapping and spatial analytic tools to rival those found in packages costing thousands of dollars. It has a lot to recommend it, including an unbeatable cost: free.
GRASS – And speaking of spatial analysis, this freeware program is in some respects the spatial bookend to R. Again, it involves a bit of initial effort to learn, but all effort is amply rewarded.
SUDAAN – A software program developed expressly for the purpose of analyzing survey data. The procedures themselves are familiar, but the analyses take account of complex sampling strategies, nesting of variables within strata, and correlations among data elements. Available in either a stand-alone version or as a SAS-callable version, the program is used for the analysis of many large, ongoing US government health surveys such as CDC’s Behavioral Risk Factor Surveillance System. Within the last few years SAS has introduced survey procedures, and R (as usual) has similarly effective tools, which have in many cases obviated the pressing need for SUDAAN in an epidemiologist’s armamentarium. But if your business primarily involves surveying folks, this is still the standard tool.
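As a rough illustration of why survey data need special handling, the Python sketch below compares a naive prevalence with a weighted one, where each respondent's sampling weight is the number of people in the population that respondent stands for. It is not SUDAAN's method or syntax. The function name `weighted_prevalence` and the data are hypothetical, and the sketch covers only the point estimate. Correct standard errors also require the strata and cluster structure of the design, which is the harder part that dedicated survey software handles.

```python
def weighted_prevalence(outcomes, weights):
    """Design-weighted proportion (hypothetical helper).

    outcomes: 1 if the respondent has the condition, else 0
    weights:  sampling weight for each respondent
    """
    if len(outcomes) != len(weights):
        raise ValueError("outcomes and weights must be the same length")
    total_weight = sum(weights)
    weighted_cases = sum(y * w for y, w in zip(outcomes, weights))
    return weighted_cases / total_weight


# Made-up sample: the two cases come from an oversampled group,
# so each carries a small weight
outcomes = [1, 1, 0, 0, 0]
weights = [50, 50, 300, 300, 300]

naive = sum(outcomes) / len(outcomes)
weighted = weighted_prevalence(outcomes, weights)
print(f"naive: {naive:.2f}, weighted: {weighted:.2f}")
```

In this invented sample the naive prevalence is 0.40 while the weighted estimate is 0.10, which shows how far an unweighted analysis of an oversampled group can drift from the population value.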
ArcGIS – A suite of software packages from ESRI that is the tool of choice among geographers, spatial analysts and government planners. If your work calls on you to do more than create simple choropleths of health-related or risk data, you will undoubtedly come across this software. The program comes bundled with more cartographic and geostatistical tools than most anyone will ever have time or need to learn. One can also marry the statistical power of SAS to ArcGIS through the integration offered by the SAS bridge for ESRI add-on module. Like SAS, it comes with a not inconsiderable and continual financial commitment, and just as R is a free alternative to SAS, there is a free open-source alternative to ArcGIS in GRASS.
SaTScan – This program was initially developed by Martin Kulldorff and his colleagues for use in detecting cancer clusters. The easy-to-use software will identify locations of clusters of any outcome of interest across both time and space. It is an excellent public health surveillance tool, available for free download after registration.
WinBUGS/OpenBUGS – BUGS stands for Bayesian inference Using Gibbs Sampling. This freeware program allows sophisticated iterative sample-based analyses that were unavailable just a few years ago. It comes with mapping tools that allow hierarchical Bayesian modeling to control for and address many of the difficulties and drawbacks of performing spatial analyses using a frequentist approach. It has been a game changer in terms of my epidemiological practice and has introduced me to the world of Bayesian analysis, which is a place that I like. The R2WinBUGS library clears away the awkward point-and-clickiness of the original software and allows you to work within the powerful confines of R. The only issue I have (if anyone can have an issue with such a remarkable contribution to the science of statistics) is that there is as yet no native OS X version, which (at least in my case) means using a virtual machine like Parallels to access its wonderfulness.