Large-Scale System Debugging

tech, wip, debugging, cloud computing


This is a reading list I have prepared on debugging and monitoring large-scale systems. The main focus is Section 2 (Analytics); Section 1 (Monitoring) covers the instrumentation mechanisms that enable it, and Section 3 touches on Testing for completeness.

Section 1: Monitoring

  • Buck, Bryan, and Jeffrey K. Hollingsworth. “An API for runtime code patching.” International Journal of High Performance Computing Applications 14.4 (2000): 317-329.

  • Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. “Dynamic instrumentation of production systems.” USENIX Annual Technical Conference. 2004. (DTrace)

  • Luk, Chi-Keung, et al. “Pin: building customized program analysis tools with dynamic instrumentation.” ACM SIGPLAN Notices. Vol. 40. No. 6. ACM, 2005.

  • Nethercote, Nicholas, and Julian Seward. “Valgrind: a framework for heavyweight dynamic binary instrumentation.” ACM Sigplan Notices 42.6 (2007): 89-100.

  • Sigelman, Benjamin H., et al. “Dapper, a large-scale distributed systems tracing infrastructure.” Google Research (2010).

  • Laurenzano, Michael A., et al. “PEBIL: Efficient static binary instrumentation for linux.” Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 2010.

  • Erlingsson, Úlfar, et al. “Fay: extensible distributed tracing from kernels to clusters.” ACM Transactions on Computer Systems (TOCS) 30.4 (2012): 13.

Section 2: Analytics

  • Chen, Mike Y., et al. “Pinpoint: Problem determination in large, dynamic internet services.” Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on. IEEE, 2002.

  • Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. 2003. Performance debugging for distributed systems of black boxes. In Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP ‘03).

  • Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. 2004. Using magpie for request extraction and workload modelling. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI’04).

  • Chanda, Anupam, Alan L. Cox, and Willy Zwaenepoel. “Whodunit: Transactional profiling for multi-tier applications.” ACM SIGOPS Operating Systems Review. Vol. 41. No. 3. ACM, 2007.

  • Sapan Bhatia, Abhishek Kumar, Marc E. Fiuczynski, and Larry Peterson. 2008. “Lightweight, high-resolution monitoring for troubleshooting production systems.” In Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI’08). USENIX Association, Berkeley, CA, USA, 103-116. (Chopstix)

  • Tak, Byung Chul, et al. “vPath: precise discovery of request processing paths from black-box observations of thread and network activities.” USENIX ATC. 2009.

  • Sambasivan, Raja R., et al. “Diagnosing performance changes by comparing request flows.” Proceedings of the 8th USENIX conference on Networked systems design and implementation. USENIX Association, 2011.

  • Ravindranath, Lenin, et al. “AppInsight: mobile app performance monitoring in the wild.” Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. USENIX Association, 2012.

  • Tudor Marian, Hakim Weatherspoon, Ki-Suh Lee, and Abhishek Sagar. 2012. Fmeter: extracting indexable low-level system signatures by counting kernel function calls. In Proceedings of the 13th International Middleware Conference (Middleware ‘12), Priya Narasimhan and Peter Triantafillou (Eds.). Springer-Verlag New York, Inc., New York, NY, USA, 81-100.

  • Chengwei Wang, Infantdani Abel Rayan, Greg Eisenhauer, Karsten Schwan, Vanish Talwar, Matthew Wolf, and Chad Huneycutt. 2012. VScope: middleware for troubleshooting time-sensitive data center applications. In Proceedings of the 13th International Middleware Conference (Middleware ‘12).

Section 3: Testing

  • Monica S. Lam, John Whaley, V. Benjamin Livshits, Michael C. Martin, Dzintars Avots, Michael Carbin, and Christopher Unkel. 2005. Context-sensitive program analysis as database queries. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS ‘05). ACM, New York, NY, USA. (bddbddb tool)

  • Cadar, Cristian, et al. “EXE: A system for automatically generating inputs of death using symbolic execution.” Proceedings of the ACM Conference on Computer and Communications Security. 2006.

  • Cadar, Cristian, Daniel Dunbar, and Dawson R. Engler. “KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs.” OSDI. Vol. 8. 2008.

  • Pacheco, C., Lahiri, S. K., Ernst, M. D., & Ball, T. (2007, May). Feedback-directed random test generation. In Software Engineering, 2007. ICSE 2007. 29th International Conference on (pp. 75-84). IEEE.

  • Garg, Pranav, et al. “Feedback-directed unit test generation for C/C++ using concolic execution.” Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013.

  • Liblit, Ben, et al. “Scalable statistical bug isolation.” ACM SIGPLAN Notices 40.6 (2005): 15-26.

  • Chilimbi, T. M., Liblit, B., Mehra, K., Nori, A. V., & Vaswani, K. (2009, May). HOLMES: Effective statistical debugging via efficient path profiling. In Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on (pp. 34-44). IEEE.

  • Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. 2011. FATE and DESTINI: a framework for cloud recovery testing. In Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI’11).