title: "Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting"
authors: Philip J. Guo and Dawson Engler
venue: International Symposium on Software Testing and Analysis (ISSTA)
year: 2011
links:
- <a href="incpy-paper.htm">Blog post</a>

tweet: IncPy speeds up scientists' iteration times by automatically caching results of long computations

abstract: >
  Programmers across a wide range of disciplines (e.g., bioinformatics,
  neuroscience, econometrics, finance, data mining, information retrieval,
  machine learning) write scripts to parse, transform, process, and
  extract insights from data.  To speed up iteration times, they split
  their analyses into stages and write extra code to save the intermediate
  results of each stage to files so that those results do not have to be
  re-computed in every subsequent run.  As they explore and refine
  hypotheses, their scripts often create and process lots of intermediate
  data files.  They need to properly manage the myriad of dependencies
  between their code and data files, or else their analyses will produce
  incorrect results.
  <br/><br/>
  To enable programmers to iterate quickly without needing to manage
  intermediate data files, we added a set of dynamic analyses to the
  programming language interpreter so that it automatically memoizes
  (caches) the results of long-running pure function calls to disk,
  manages dependencies between code and on-disk data, and later re-uses
  memoized results, rather than re-executing those functions, when
  guaranteed safe to do so.  We created an implementation for Python and
  show how it enables programmers to iterate faster on their data analysis
  scripts while writing less code and not having to manage dependencies
  between their code and datasets.

bibtex: >
  @inproceedings{GuoIncPy2011,
   author = {Guo, Philip J. and Engler, Dawson},
   title = {Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting},
   booktitle = {Proceedings of the 2011 International Symposium on Software Testing and Analysis},
   series = {ISSTA '11},
   year = {2011},
   isbn = {978-1-4503-0562-4},
   location = {Toronto, Ontario, Canada},
   pages = {287--297},
   numpages = {11},
   url = {http://doi.acm.org/10.1145/2001420.2001455},
   doi = {10.1145/2001420.2001455},
   acmid = {2001455},
   publisher = {ACM},
   address = {New York, NY, USA},
   keywords = {dependency management, scientific workflows},
  }