Stress Testing FastAPI ML Apps with Locust

KDnuggets

In the rapidly evolving landscape of artificial intelligence, deploying machine learning models via web APIs has become commonplace. Yet, the true test of such a system often comes not from its accuracy in a single prediction, but from its ability to perform under immense pressure. Stress testing, a critical practice in software development, offers a window into an application’s behavior when confronted with heavy user loads, making it indispensable for CPU-intensive machine learning APIs. By simulating a multitude of simultaneous users, developers can pinpoint performance bottlenecks, ascertain system capacity, and bolster overall reliability.

To demonstrate this crucial process, a practical setup leverages several powerful Python tools. FastAPI, renowned for its speed and modern architecture, serves as the web framework for building the API. Uvicorn, an ASGI server, is the engine that runs the FastAPI application. For simulating user traffic, Locust, an open-source load testing tool, allows for defining user behavior through Python code and then swarming the system with hundreds of concurrent requests. Finally, Scikit-learn provides the machine learning capabilities for the example model.
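All four tools are published on PyPI, so the stack can be installed with a single command (a minimal setup; pin versions as needed for your environment):

```
pip install fastapi uvicorn locust scikit-learn
```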

The core of this stress test is a FastAPI application that predicts California housing prices using a Random Forest Regressor from Scikit-learn. To keep resource usage efficient, the application uses a singleton pattern for the machine learning model, guaranteeing that only one instance is loaded into memory. On startup, it either loads a pre-trained model from disk or, if none is found, trains a new one on the California housing dataset. Data validation and serialization for API requests and responses are handled by Pydantic models, ensuring data integrity.
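A minimal sketch of how such an application might be structured; the file name (main.py), model path, class names, and hyperparameters are illustrative assumptions rather than the original tutorial's exact code:

```python
# main.py -- illustrative sketch; file name, model path, and class names are assumptions
import os

import joblib
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

MODEL_PATH = "house_price_model.joblib"  # assumed location of the persisted model


class ModelSingleton:
    """Loads the Random Forest model once and reuses the same instance for every request."""

    _model = None

    @classmethod
    def get_model(cls) -> RandomForestRegressor:
        if cls._model is None:
            if os.path.exists(MODEL_PATH):
                # Reuse a previously trained model
                cls._model = joblib.load(MODEL_PATH)
            else:
                # Otherwise train a new one on the California housing dataset
                data = fetch_california_housing()
                cls._model = RandomForestRegressor(n_estimators=50, random_state=42)
                cls._model.fit(data.data, data.target)
                joblib.dump(cls._model, MODEL_PATH)
        return cls._model


class PredictionRequest(BaseModel):
    features: list[float]  # the 8 California-housing features, in dataset order


class PredictionResponse(BaseModel):
    predicted_price: float


app = FastAPI(title="California Housing Price API")
```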

A pivotal architectural decision within the FastAPI application is the strategic use of asyncio.to_thread. This is vital because Scikit-learn’s prediction methods are CPU-bound and synchronous, meaning they could block FastAPI’s asynchronous event loop, hindering its ability to handle multiple requests concurrently. By offloading these CPU-intensive tasks to a separate thread, the server’s main event loop remains free to process other incoming requests, significantly enhancing concurrency and responsiveness. The API exposes three key endpoints: a basic health check, a /model-info endpoint providing metadata about the deployed machine learning model, and a /predict endpoint that accepts a list of features to return a house price prediction. The application is configured to run with multiple Uvicorn workers, further boosting its capacity for parallel processing.
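Continuing main.py from the sketch above, the endpoints might look like this; the exact route signatures, returned metadata fields, and worker count are assumptions:

```python
import asyncio

import numpy as np


@app.get("/health")
async def health():
    # Lightweight liveness check
    return {"status": "ok"}


@app.get("/model-info")
async def model_info():
    model = ModelSingleton.get_model()
    # Metadata about the deployed model
    return {
        "model_type": type(model).__name__,
        "n_estimators": model.n_estimators,
        "n_features": model.n_features_in_,
    }


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    model = ModelSingleton.get_model()
    features = np.array(request.features).reshape(1, -1)
    # Offload the CPU-bound, synchronous scikit-learn call to a worker thread
    # so the event loop stays free to accept other requests.
    prediction = await asyncio.to_thread(model.predict, features)
    return PredictionResponse(predicted_price=float(prediction[0]))


# Run with multiple Uvicorn workers for parallel processing, e.g.:
#   uvicorn main:app --workers 4 --port 8000
```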

To truly push the application’s limits, Locust is employed to orchestrate the stress test. A dedicated Locust script defines the behavior of simulated users, which includes generating realistic, random feature data for prediction requests. Each simulated user is configured to make a mix of requests to the /model-info and /predict endpoints, with a higher weighting given to the prediction requests to simulate real-world usage patterns more accurately. The script also includes robust error handling to identify and report any failures during the test.
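A minimal locustfile along these lines is sketched below; the task weights, wait times, and feature value ranges are assumptions, not the tutorial's exact numbers:

```python
# locustfile.py -- illustrative sketch; weights, wait times, and ranges are assumptions
import random

from locust import HttpUser, between, task


class HousePriceUser(HttpUser):
    wait_time = between(1, 3)  # simulated think time between requests

    @task(1)
    def model_info(self):
        self.client.get("/model-info")

    @task(4)  # weight prediction traffic more heavily than metadata lookups
    def predict(self):
        # Generate plausible random values for the 8 California-housing features
        features = [
            round(random.uniform(0.5, 15.0), 2),       # median income
            round(random.uniform(1.0, 52.0), 1),       # house age
            round(random.uniform(2.0, 10.0), 2),       # average rooms
            round(random.uniform(0.5, 2.0), 2),        # average bedrooms
            round(random.uniform(100, 5000), 0),       # population
            round(random.uniform(1.0, 6.0), 2),        # average occupancy
            round(random.uniform(32.0, 42.0), 2),      # latitude
            round(random.uniform(-124.0, -114.0), 2),  # longitude
        ]
        # catch_response lets the script mark unexpected statuses as failures
        with self.client.post(
            "/predict", json={"features": features}, catch_response=True
        ) as response:
            if response.status_code != 200:
                response.failure(f"Unexpected status: {response.status_code}")
```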

The stress test itself involves a straightforward two-step process. First, the FastAPI application is launched, ensuring the machine learning model is loaded and the API endpoints are operational. Developers can then interact with the API documentation to verify functionality. Following this, Locust is initiated, either via its intuitive web UI for real-time monitoring or in a headless mode for automated reporting. The test can be configured with specific parameters, such as the total number of simulated users, the rate at which new users are spawned, and the duration of the test. As the test progresses, Locust provides real-time statistics on request counts, failure rates, and response times for each endpoint, culminating in a comprehensive HTML report upon completion.
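Concretely, the two steps might look like the following, assuming the file names from the sketches above; the user count, spawn rate, and duration are example values:

```
# 1. Start the API (multiple workers for parallelism)
uvicorn main:app --workers 4 --port 8000

# 2a. Run Locust with its web UI (open http://localhost:8089), or
locust -f locustfile.py --host http://localhost:8000

# 2b. run headless with fixed parameters and an HTML report
locust -f locustfile.py --host http://localhost:8000 \
       --headless --users 100 --spawn-rate 10 --run-time 2m \
       --html report.html
```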

Initial test observations often reveal useful insights into application performance. For instance, in some scenarios the /model-info endpoint may show a slightly longer response time than the prediction endpoint. While seemingly counter-intuitive, this outcome highlights how fast the core prediction path can be, even with a relatively simple machine learning model. The whole exercise offers an invaluable opportunity to rigorously test an application locally, identifying and mitigating potential performance bottlenecks long before it reaches production, and thereby helping ensure a reliable user experience.