title: "Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting"
authors: Philip J. Guo and Dawson Engler
venue: International Symposium on Software Testing and Analysis (ISSTA)
year: 2011
links:
- Blog post
tweet: IncPy speeds up scientists' iteration times by automatically caching results of long computations
abstract: >
Programmers across a wide range of disciplines (e.g., bioinformatics,
neuroscience, econometrics, finance, data mining, information retrieval,
machine learning) write scripts to parse, transform, process, and
extract insights from data. To speed up iteration times, they split
their analyses into stages and write extra code to save the intermediate
results of each stage to files so that those results do not have to be
re-computed in every subsequent run. As they explore and refine
hypotheses, their scripts often create and process lots of intermediate
data files. They need to properly manage the myriad of dependencies
between their code and data files, or else their analyses will produce
incorrect results.

To enable programmers to iterate quickly without needing to manage
intermediate data files, we added a set of dynamic analyses to the
programming language interpreter so that it automatically memoizes
(caches) the results of long-running pure function calls to disk,
manages dependencies between code and on-disk data, and later re-uses
memoized results, rather than re-executing those functions, when
guaranteed safe to do so. We created an implementation for Python and
show how it enables programmers to iterate faster on their data analysis
scripts while writing less code and not having to manage dependencies
between their code and datasets.
bibtex: >
@inproceedings{GuoIncPy2011,
author = {Guo, Philip J. and Engler, Dawson},
title = {Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting},
booktitle = {Proceedings of the 2011 International Symposium on Software Testing and Analysis},
series = {ISSTA '11},
year = {2011},
isbn = {978-1-4503-0562-4},
location = {Toronto, Ontario, Canada},
pages = {287--297},
numpages = {11},
url = {http://doi.acm.org/10.1145/2001420.2001455},
doi = {10.1145/2001420.2001455},
acmid = {2001455},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {dependency management, scientific workflows},
}