Apparently, it is a term used by the folks at Netflix. This term refers to running some tests or benchmarks to see changes in performance and calculate the breaking point of the application. Then see if that last change was inefficient or determine the recommended auto-scaling parameters before deploying it.
For more automated tests: