Performance Testing and Artificial Intelligence (1/2)
If you believed many of the articles online, you would think that automation in testing will soon be defined, managed and executed by Artificial Intelligence (AI). AI is embedded in many organisations' technology landscapes, and to think this will change is shortsighted. AI is undoubtedly here to stay in one form or another, but should it be responsible for the automated testing of your applications under test?
Clearly, the level of involvement AI has in your automation is up to you, but depending on it exclusively will reduce its effectiveness and leave you with little more than a series of tests that offer little insight into how your application truly behaves. This post will look at performance testing and the automation that surrounds this discipline, but many of the observations also hold for functional automation.
The many articles that advocate the wholesale use of AI in automation fail to see the bigger picture: most organisations are complex, and any technology change is difficult, involving many applications and many stakeholders. So while simply getting AI to build you a performance test suite in a language of your choice might be possible in some organisations, for the vast majority it is not a viable option.
Contrary to what this introduction might imply, AI can be advantageous in performance testing when used sensibly; this post will discuss how in more detail.
Approach to using AI¶
To explore this topic in more detail, we will use ChatGPT and compare its output with how we would normally work across these aspects of performance testing:
- Requirements Gathering
- Risk Assessment
- Script Creation
- Results Analysis
To demonstrate the differences between the two approaches, we will ask ChatGPT once for the information we want and then follow up with a single further prompt. If we were to continually refine our prompts based on what we know as performance testers, we would not be replicating the conditions under which an AI-based performance test capability would operate, because the testing would then be driven by a performance tester rather than defined by AI.
To keep this fair, we will make our prompts as concise as we can. The post is split into two parts, because compressing everything into one would mean missing the opportunity to explore the subject in detail. This first part covers Requirements Gathering and Risk Assessment; the second will cover Script Creation and Results Analysis.
Fictional Application¶
Technology stack¶
We will use a fictional application for this experiment.
This is a new application that is not yet live: a web-based application hosted in AWS that sells computer games. Its URL is:
https://dummy-application.co.uk
It will have five services written in Spring Boot:
- User service
- Search service
- Application service
- Payment service
- Reporting service
These services will have these endpoints:
- User service
- POST /user/login
- POST /user/logout
- Search service
- GET /search/search-item
- Application service
- POST /application/add-to-basket
- GET /application/list-all-games
- GET /application/in-stock
- Payment service
- POST /payment/add-card
- POST /payment/add-shipping-address
- POST /payment/confirm-purchase
- Reporting service
- GET /reporting/sales-by-date
The database will be PostgreSQL. Kafka will be used to distribute sales messages for reporting.
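Script creation itself is the subject of part two, but a minimal sketch of a shopper journey helps make this endpoint surface concrete. The example below uses Locust (a Python load-testing tool); the payload fields, credentials and task weights are invented for illustration, as the application is fictional:

```python
from locust import HttpUser, task, between

class Shopper(HttpUser):
    """One fictional shopper journey across the services defined above.

    All payload fields and credentials are invented for illustration;
    the application and its URL are fictional.
    """
    host = "https://dummy-application.co.uk"
    wait_time = between(1, 3)

    def on_start(self):
        # User service: each simulated shopper logs in once.
        self.client.post("/user/login",
                         json={"username": "player1", "password": "secret"})

    @task(2)
    def search(self):
        # Search service
        self.client.get("/search/search-item", params={"term": "strategy"})

    @task(1)
    def purchase(self):
        # Application and Payment services, in journey order.
        self.client.post("/application/add-to-basket", json={"gameId": 42})
        self.client.post("/payment/add-card",
                         json={"cardNumber": "4111111111111111"})
        self.client.post("/payment/add-shipping-address",
                         json={"postcode": "AB1 2CD"})
        self.client.post("/payment/confirm-purchase")

    def on_stop(self):
        self.client.post("/user/logout")
```

The task weights mirror the peak ratio of two searches to every purchase described in the next section.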
Expected response times and volumes¶
This is a seasonal organisation with peaks at Christmas and at the time of new game releases.
At peak times you can expect:
- 500 logons per minute
- 500 searches per minute
- 250 purchases per minute
In addition:
- Reporting runs overnight between 2am and 3am UTC
- The site is available 23 hours a day
All API response times must be under 1000ms at the 95th percentile. Reporting must not exceed its one-hour window. CPU utilisation must not exceed 60% on any container. Kafka ingress and egress rates must match. The application is expected to have 10,000 registered users in its first year and to grow at 10% annually.
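As a quick sanity check, these per-minute volumes translate into fairly modest per-second arrival rates for an open workload model. A minimal sketch in Python, using only the figures stated above:

```python
# Convert the stated per-minute peak volumes into per-second arrival
# rates for an open workload model. The figures come from the
# requirements above; nothing else is assumed.
PEAK_PER_MINUTE = {"logons": 500, "searches": 500, "purchases": 250}
SOAK_FRACTION = 0.6  # the 60%-of-peak level used for the soak test later

for name, per_minute in PEAK_PER_MINUTE.items():
    print(f"{name}: {per_minute}/min "
          f"= {per_minute / 60:.2f}/sec at peak, "
          f"{per_minute * SOAK_FRACTION:.0f}/min for the soak test")
```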
Topics to compare¶
As discussed earlier, for each topic we will first define our own approach and then ask ChatGPT, using no more than two prompts. We will have to keep the analysis of each topic quite short, as otherwise this post would become far too long. Each of these topics could be a blog post in its own right; perhaps we will consider that in the future.
Requirements Gathering¶
Performance test requirements need to be testable: you need to combine response times with load volumes. See this post on requirements gathering for guidance on how performance requirements should be approached to make them testable. The first prompt we provided to ChatGPT was:
"Given this definition of a theoretical application can you define a set of testable non-functional requirements specifically for performance testing."
Here the definition supplied was exactly the content of the Fictional Application and Expected response times and volumes sections above. The second prompt was:
"Can you make the requirements testable by combining response time, load and utilization for example"
Comparison table¶
The table below sets out how we and ChatGPT each defined the non-functional requirements; we have tried to line up the responses where possible.
| Performance Tester Analysis | ChatGPT Analysis |
|---|---|
| The application must support 500 logons per minute, and the logon request must take no longer than 1000ms to respond at the 95th percentile while not consuming more than 60% CPU on all containers against a database that contains 10,000 existing users; all errors must be accounted for. | When 500 login requests per minute are executed concurrently: 95th percentile response time ≤ 1000 ms. Average CPU utilisation per container ≤ 60%. Error rate ≤ 0.5%. Memory utilisation ≤ 70%. Measured over a sustained 15-minute steady-state period. |
| The application must support 500 searches per minute, and the search request must take no longer than 1000ms to respond at the 95th percentile while not consuming more than 60% CPU on all containers against a database that contains 10,000 existing users; all errors must be accounted for. | When 500 searches per minute are executed concurrently: 95th percentile response time ≤ 1000 ms. Average CPU utilisation ≤ 60% across search containers. PostgreSQL query latency ≤ 200 ms (95th percentile). Kafka not used in this path remains idle (no backlog). Error rate ≤ 0.5%. |
| The application must support 250 purchases per minute, and each endpoint in the purchase process request must take no longer than 1000ms to respond at the 95th percentile while not consuming more than 60% CPU on all containers against a database that contains 10,000 existing users; all errors must be accounted for. | When 250 full purchase workflows per minute (add-to-basket >> add-card >> add-shipping >> confirm-purchase) are executed concurrently: End-to-end transaction time ≤ 2000 ms (95th percentile). Each involved microservice maintains CPU ≤ 60%. Kafka message production for sales is completed within 2 seconds of transaction completion. No backlog in Kafka topics (lag < 10 s). PostgreSQL write latency ≤ 300 ms (95th percentile). Error rate ≤ 0.5%. |
| The application must be able to support an average load of transactions, representative of 60% of peak volumes, for a period of 23 hours with no degradation in response time or CPU utilisation; all errors must be accounted for. | Over a 4-hour sustained load (70% of peak rates): Response times remain within ±10% of initial baseline. No unplanned container restarts or memory leaks. CPU ≤ 60%, memory ≤ 70%. Error rate ≤ 1%. |
| The Kafka queues must be able to support the processing of messages under peak load conditions without backlogging, so that the sales reporting endpoint is always current; any DLQ'ed messages must be accounted for. | For every purchase message published to Kafka: 100% are consumed and reflected in reporting tables within 5 minutes. Message loss rate = 0%. End-to-end sales record accuracy = 100%. |
| The application must be able to support growth of 10% and still meet its response time and CPU utilisation metrics; all errors must be accounted for. | When load is doubled (1000 logins/min, 1000 searches/min, 500 purchases/min): Application auto-scales within 2 minutes of CPU exceeding 60%. After scaling, response times return to ≤ 1000 ms (95th). Kafka lag remains < 10 s. No database deadlocks or slow queries > 1 s. |
| The reporting endpoint must be able to extract a full day's worth of transactional data and process it within the 1-hour processing window while not consuming more than 60% CPU on all containers against a database that contains 10,000 existing users; all errors must be accounted for. | During the 2 am - 3 am UTC reporting run: Full report generation completes ≤ 3600 s. CPU utilisation ≤ 60% on all reporting and DB nodes. Kafka ingress = egress (no message loss; difference < 0.1%). PostgreSQL read queries average ≤ 300 ms (95th percentile). Memory consumption stable (no increase > 5% over run). |
| This is a ChatGPT requirement for which we provided no equivalent. ChatGPT terms it an "Availability and Reliability" test, but it is closer to our definition of a Soak Test. | Over a 24-hour period, excluding 1-hour maintenance: Uptime ≥ 99.86% (23 h availability). Average request success ratio ≥ 99.5%. Automatic recovery from service restarts ≤ 30 s. |
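Requirements phrased this way can be checked mechanically once a test run completes. The sketch below, in Python, evaluates one endpoint's results against the p95 and error-rate thresholds in the table above; the sample response times and counts are invented for illustration:

```python
import statistics

# Thresholds taken from the requirements table above.
P95_LIMIT_MS = 1000
MAX_ERROR_RATE = 0.005  # ChatGPT's suggested 0.5% ceiling

def evaluate(response_times_ms, error_count, total_requests):
    """Return (passed, p95, error_rate) for one endpoint's test results."""
    # statistics.quantiles with n=100 yields the 1st-99th percentile
    # cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(response_times_ms, n=100)[94]
    error_rate = error_count / total_requests
    return (p95 <= P95_LIMIT_MS and error_rate <= MAX_ERROR_RATE,
            p95, error_rate)

# Invented sample results for a hypothetical login test.
times = [120, 200, 300, 340, 450, 560, 610, 875, 980, 1100]
passed, p95, err = evaluate(times, error_count=1, total_requests=500)
print(f"p95={p95:.0f}ms, errors={err:.2%} -> {'PASS' if passed else 'FAIL'}")
```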
Comparison¶
The results are fairly consistent; most of the tests are broadly similar. ChatGPT seems to want to quantify acceptable error-rate limits, which we did not define and which probably vary by application, organisation and functionality. Had we specified an error percentage in the requirements we would have included a value, but we just stated "accounted for", which is a bit of a catch-all.
ChatGPT specified 4 hours for the Soak Test requirement, where we wanted to run for the full duration the application needs to be available, which is 23 hours. ChatGPT also specified an Availability and Reliability test of the same duration as our Soak Test. Those small differences aside, the approaches of a performance tester and ChatGPT are quite similar.
Risk Assessment¶
Risk assessments for performance need to be considered in much the same way as our requirements are defined. They are linked to our requirements in that each risk is the consequence of failing to meet a non-functional requirement. For the purposes of this post, we will look at risks from a business perspective rather than a technical one.
See this post on risk assessment for how to score your performance risks; a simple impact vs likelihood method is as good as any. Using this process we will quantify our risks below, although we will avoid scoring them, as scoring is very organisation specific and will vary. The first prompt we provided to ChatGPT was:
"Using these requirements can you risk assess them."
ChatGPT initially provided a set of technical risks, which is a little beyond the scope of this post as we have not defined the application from an architectural perspective, so for this post we will assess the risks from a business perspective. The second prompt was:
"Can you quantify these from a business perspective"
Comparison table¶
The table below sets out how we and ChatGPT each defined the risks; we have tried to line up the responses where possible.
| Performance Tester Analysis | ChatGPT Analysis |
|---|---|
| The application is unable to support 500 logons per minute resulting in loss of revenue and reputation. | Business Impact: Slow or failed logins increase abandonment; every 1 s delay in login can reduce conversions by ~7 %. Customer Reputation: High frustration >> increased support tickets and negative reviews. |
| The logon request takes longer than 1000ms to respond at the 95th percentile, resulting in loss of revenue and a poor customer experience, making site revisits unlikely. | See above |
| Logon, at peak volumes, consumes more than 60% CPU on all containers, meaning that our system will alert, resulting in support needing to address the alerts, possible auto-scaling of the services, and potential slowdown of response times if the CPU continues to climb. | See above |
| Response times cannot be met when running load against a database that contains 10,000 existing users, meaning that as the site grows in popularity performance will degrade and customers may leave. | Business Impact: Downtime directly stops revenue; AWS restart mis-config can extend outages. Customer Reputation: High - customer trust & SLAs |
| The application is unable to support 500 searches per minute resulting in loss of revenue and reputation. | Business Impact: Core user journey: poor search UX directly reduces product discovery and conversion. Customer Reputation: Very high - visible to all users. |
| Search requests take longer than 1000ms to respond at the 95th percentile resulting in loss of revenue and reputation. | See above |
| Searching, at peak volumes, consumes more than 60% CPU on all containers, meaning that our system will alert, resulting in support needing to address the alerts, possible auto-scaling of the services, and potential slowdown of response times if the CPU continues to climb. | See above |
| The application is unable to support 250 purchases per minute resulting in loss of revenue and reputation. | Business Impact: Slow checkout causes basket drop-offs; payment timeouts risk double charges/refunds. Customer Reputation: Severe trust/reputation hit; negative social media. |
| Endpoints in the purchase process take longer than 1000ms to respond at the 95th percentile, resulting in loss of revenue and a poor customer experience, making site revisits unlikely. | See above |
| Purchasing, at peak volumes, consumes more than 60% CPU on all containers, meaning that our system will alert, resulting in support needing to address the alerts, possible auto-scaling of the services, and potential slowdown of response times if the CPU continues to climb. | See above |
| The application degrades in terms of response times and resource consumption when run for a protracted period, consistent with the interval between application restarts in production, resulting in loss of revenue and reputation. | Business Impact: If memory leaks cause slowdowns after hours of use, productivity and uptime drop. Customer Reputation: Low. |
| The overnight reporting takes longer than 1 hour to process data generated under peak levels of load and concurrency, resulting in unknown stock levels and an inability to balance orders. | Business Impact: Delayed or inconsistent reports impact business decisions and inventory planning. Customer Reputation: Low customer visibility. |
| The Kafka queues are unable to support the processing of messages under peak load conditions and backlog, so the sales reporting endpoint is not always current, resulting in unknown stock levels and an inability to balance orders. | Business Impact: Incorrect data affects business reports, reconciliation, and compliance. Customer Reputation: Low. |
| The application is unable to support growth of 10% and still meet its response time and CPU utilisation metrics. | Business Impact: Failure leads to outages during promos; downtime cost = lost sales + brand damage. Customer Reputation: Very visible, especially during campaign peaks. |
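Although we have avoided scoring the risks here, the impact vs likelihood method referred to above is simple to mechanise once an organisation has agreed its scales. A minimal sketch in Python, where the 1-5 scales and the example scores are purely illustrative:

```python
# Impact x likelihood scoring, as mentioned in the risk assessment
# section above. The 1-5 scales and the example scores are purely
# illustrative; real values are organisation specific.
RISKS = {
    "Logon cannot support 500 logons per minute": (5, 2),  # (impact, likelihood)
    "Search p95 response time exceeds 1000ms": (4, 3),
    "Overnight reporting exceeds its 1-hour window": (3, 2),
}

# Rank risks by score so the highest-priority items surface first.
for name, (impact, likelihood) in sorted(
        RISKS.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True):
    print(f"score {impact * likelihood:>2}: {name}")
```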
Comparison¶
There are a lot of commonalities between the two methods. The Performance Tester Analysis is a bit more verbose and is tailored towards the business audience we know we will have to face and convince of the risks. We have tended to split risks out and adopt a common approach to the way they are worded, whereas ChatGPT has combined them.
All in all, the output from ChatGPT was well organised; that is not fully evident from the way we have conveyed it here, but it was nonetheless impressive. There is not much that ChatGPT missed, but its formulation is very factual, and that will not be enough in some organisations, where convincing stakeholders requires a lot more explanation.
Conclusion¶
This is the end of the first part of our comparison of how AI and a professional performance tester would approach performance testing for our theoretical application.
Look out for the second part of this post, where we will pick up with Script Creation and Results Analysis and draw a conclusion across both posts on this intriguing subject.