title: "Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting"
authors: Philip J. Guo and Dawson Engler
venue: International Symposium on Software Testing and Analysis (ISSTA)
year: 2011
links:
  - Blog post
tweet: IncPy speeds up scientists' iteration times by automatically caching results of long computations
abstract: >
  Programmers across a wide range of disciplines (e.g., bioinformatics, neuroscience, econometrics, finance, data mining, information retrieval, machine learning) write scripts to parse, transform, process, and extract insights from data. To speed up iteration times, they split their analyses into stages and write extra code to save the intermediate results of each stage to files so that those results do not have to be re-computed in every subsequent run. As they explore and refine hypotheses, their scripts often create and process lots of intermediate data files. They need to properly manage the myriad of dependencies between their code and data files, or else their analyses will produce incorrect results.

  To enable programmers to iterate quickly without needing to manage intermediate data files, we added a set of dynamic analyses to the programming language interpreter so that it automatically memoizes (caches) the results of long-running pure function calls to disk, manages dependencies between code and on-disk data, and later re-uses memoized results, rather than re-executing those functions, when guaranteed safe to do so. We created an implementation for Python and show how it enables programmers to iterate faster on their data analysis scripts while writing less code and not having to manage dependencies between their code and datasets.
bibtex: >
  @inproceedings{GuoIncPy2011,
    author = {Guo, Philip J. and Engler, Dawson},
    title = {Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting},
    booktitle = {Proceedings of the 2011 International Symposium on Software Testing and Analysis},
    series = {ISSTA '11},
    year = {2011},
    isbn = {978-1-4503-0562-4},
    location = {Toronto, Ontario, Canada},
    pages = {287--297},
    numpages = {11},
    url = {http://doi.acm.org/10.1145/2001420.2001455},
    doi = {10.1145/2001420.2001455},
    acmid = {2001455},
    publisher = {ACM},
    address = {New York, NY, USA},
    keywords = {dependency management, scientific workflows},
  }
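
The abstract describes memoizing long-running pure function calls to disk and invalidating results when the code changes. The paper does this inside the interpreter, with no annotations from the programmer; the sketch below is not that implementation. It is a minimal approximation of the same idea as an ordinary Python decorator. The names `persistent_memoize`, `cache_dir`, and `min_seconds` are hypothetical, and the sketch assumes the decorated function is pure and that its arguments and result can be pickled. It does not check purity or track dependencies on files, globals, or callees.

```python
import functools
import hashlib
import marshal
import pickle
import tempfile
import time
from pathlib import Path


def persistent_memoize(cache_dir, min_seconds=1.0):
    """Decorator factory (hypothetical): cache results of a pure function on disk.

    Only calls that run for at least `min_seconds` are saved, loosely mirroring
    the abstract's focus on long-running calls.
    """
    cache_dir = Path(cache_dir)

    def decorator(func):
        # Hash the compiled code so that editing the function body changes the
        # cache key. This is a coarse stand-in for real dependency tracking.
        code_hash = hashlib.sha256(marshal.dumps(func.__code__)).hexdigest()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumption: arguments are picklable and pickle to stable bytes.
            arg_bytes = pickle.dumps((args, sorted(kwargs.items())))
            key = hashlib.sha256(code_hash.encode() + arg_bytes).hexdigest()
            path = cache_dir / f"{func.__name__}-{key}.pickle"

            if path.exists():
                # Re-use the saved result instead of re-executing the function.
                with path.open("rb") as f:
                    return pickle.load(f)

            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start

            if elapsed >= min_seconds:
                cache_dir.mkdir(parents=True, exist_ok=True)
                with path.open("wb") as f:
                    pickle.dump(result, f)
            return result

        return wrapper

    return decorator


if __name__ == "__main__":
    # Demo: min_seconds=0.0 caches every call so the example runs instantly.
    @persistent_memoize(tempfile.mkdtemp(), min_seconds=0.0)
    def total_length(words):
        return sum(len(w) for w in words)

    print(total_length(("parse", "transform")))  # computed, then saved to disk
    print(total_length(("parse", "transform")))  # loaded from disk
```

Because the cache key combines the function's compiled code with its arguments, changing either one produces a new key and the old file is simply ignored. Unlike the approach in the abstract, nothing here guarantees that re-use is safe: a function that reads a file or a global variable would return stale results if those change.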