Performance Testing and Artificial Intelligence (2/2)
If you recall part one of this blog post, we set out to use ChatGPT in parallel with our own way of working to cover these aspects of performance testing:
- Requirements Gathering
- Risk Assessment
- Script Creation
- Results Analysis
We left the first part of this blog post at the point where we had compared Requirements Gathering and Risk Assessment; we will pick up by looking at Script Creation before concluding with Results Analysis.
Script Creation
Our performance testing tool of choice will be JMeter. We will drive our test from a flat file that holds all the load profile values, which means we can update the test duration, throughput, concurrency or load by changing the values in the file rather than the test itself.
For the purposes of this blog post, we will build a single peak hour load test that comprises:
- Logon
- Search
- Purchase
- Logoff
See this blog post that outlines modularisation, and this one that outlines how to drive the load from a file.
Clearly, we could integrate with Jenkins and run JMeter from a container to support a CI/CD approach to performance testing, but for the sake of this comparison we will keep it simple.
Performance Tester Analysis
This is the JMeter test we could build to satisfy the peak hour load requirements for the user journey defined above. There are many ways to do this; this is simply a straightforward test to compare against the one created by ChatGPT.

We have set an HTTP Request Defaults config element to store the protocol, server name and port number.

We have used properties for our thread count and ramp-up period.

We have done the same in the Throughput Controller.

These values are held in a flat file that we can pass into the test on the command line to determine the load profile. The flat file looks like this.

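As an illustration of what such a flat file might contain (the key names below are our own assumptions, not necessarily the ones in the screenshot), JMeter can load an additional properties file with `-q load.properties` and read the values in the test plan via `${__P(threads)}`. A minimal Python sketch of writing-style content and parsing it back:

```python
# Hypothetical load-profile properties file; the key names are
# illustrative assumptions, not taken from the original test.
PROFILE = """\
threads=81
rampup=300
duration=3600
throughput=3250
"""

def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, ignoring blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

profile = parse_properties(PROFILE)
print(profile["threads"])  # -> 81
```

Changing a value in this one file adjusts the load profile without touching the test plan itself.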
These values were determined by the following requirements:
We need to generate 500 logons per minute, 500 searches per minute and 250 purchases per minute, and we then log off. The full journey contains nine samplers, of which purchase makes up five; therefore, we need to generate:
- 4 x 500 = 2,000 transactions per minute
- 5 x 250 = 1,250 transactions per minute
- 3,250 transactions per minute x 60 minutes = 195,000 transactions in total
To determine a sensible number of threads, we assume that each transaction will take 1000ms to respond, so one thread completes roughly one transaction per second, and we then add contingency for slow response times:
- 3,250 / 60 ≈ 54 threads
- 54 threads + 50% = 81 threads
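The arithmetic above can be sketched in a few lines of Python (the 1000ms response time and 50% contingency are the assumptions stated in the post):

```python
import math

logons_per_min, purchases_per_min = 500, 250
logon_samplers, purchase_samplers = 4, 5   # nine samplers in total

tpm = logon_samplers * logons_per_min + purchase_samplers * purchases_per_min
total = tpm * 60                           # one peak hour of load

# Assume each transaction responds in 1,000ms, so one thread
# completes roughly one transaction per second.
base_threads = round(tpm / 60)
threads = math.ceil(base_threads * 1.5)    # add 50% contingency

print(tpm, total, base_threads, threads)   # -> 3250 195000 54 81
```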
In order to run the purchase part of the journey for only 50% of iterations, we will use an If Controller. Its condition will evaluate to true for 50% of iterations, which is a simple way to implement the logic.
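In JMeter this kind of condition is typically built on a random number; as a sanity check, the same 50% gating logic can be simulated in Python (this illustrates the logic only, not the actual JMeter expression):

```python
import random

random.seed(42)  # deterministic for the example

iterations = 10_000
purchases = 0
for _ in range(iterations):
    # Equivalent of the If Controller condition: run the purchase
    # branch when a random draw falls in the lower half of the range.
    if random.random() < 0.5:
        purchases += 1

ratio = purchases / iterations
print(f"purchase branch ran on {ratio:.1%} of iterations")
```

Over many iterations the purchase branch executes close to half the time, which is exactly the behaviour we want from the If Controller.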
ChatGPT Analysis
For ChatGPT the first prompt we provided was:
"Can you build a JMeter test plan to cover all endpoints and run at a level that will support the transaction throughput volumes."
ChatGPT built a theoretical model from this and asked if we wanted it to build a .jmx file, so the second prompt was:
"Yes generate the .jmx file"
This is what ChatGPT built:

The Thread Properties for each Thread Group were variables populated by reading a flat file.

The flat file was read using a CSV Data Set Config Element.
The flat file that held the input values was defined like this.

Comparison
For a simple two-prompt test, ChatGPT's creation is pretty good. It lacks some flexibility when it comes to the volumes, and it has grouped the endpoints into categories, which is fine if they can be tested in isolation but not if they need to run as an end-to-end journey. As we did not explicitly specify this requirement, we will ignore it.
It built a flat file to input data from and included a method to get this data into the test in a way that can be easily adjusted. With a few more prompts, we believe we could get close to the test we created.
Results Analysis
We could write our results to a database such as Prometheus or Snowflake, or even SQLite or Oracle (any flavour of database, really), and visualise the data using any number of tools, such as Grafana or OAS. Here are some links to blog posts that show how you can accomplish this:
For the purposes of our simple test, we are going to use the JMeter standard .jtl file output and write to Excel to perform our comparison.
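As a sketch of what this analysis step can look like in code, the following parses a small inline sample in JMeter's default CSV .jtl layout (only a subset of the real columns is shown) and aggregates elapsed times per label against the 1000ms requirement:

```python
import csv
import io
from collections import defaultdict

# A tiny sample in JMeter's CSV .jtl format (subset of columns;
# the timings here are invented for illustration).
SAMPLE_JTL = """\
timeStamp,elapsed,label,success
1700000000000,420,/user/login,true
1700000001000,380,/user/login,true
1700000002000,1350,/application/confirm-purchase,true
1700000003000,1420,/application/confirm-purchase,true
"""

elapsed = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE_JTL)):
    elapsed[row["label"]].append(int(row["elapsed"]))

for label, times in elapsed.items():
    avg = sum(times) / len(times)
    flag = "FAIL" if avg > 1000 else "ok"   # 1,000ms requirement
    print(f"{label}: avg={avg:.0f}ms max={max(times)}ms [{flag}]")
```

With a real test, you would point `csv.DictReader` at the .jtl file on disk instead of the inline sample.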
We will analyze the results in two ways: firstly against our response time requirements, and secondly by providing ten sets of results and looking for patterns and trends. The table below provides a fictional set of response times.

Performance Tester Analysis
The colours show how we can compare the response times against the non-functional requirements, and the table below is an example of how we can assess the results and provide some analysis.

| Transaction | Analysis |
|---|---|
| /user/login | Meets the Response Time of 1000ms and has not shown any regression in any release. |
| /application/list-all-games | The latest releases have seen improvement in response times where the earlier ones were regularly exceeding their non-functional requirement. |
| /search/search-item | We have seen regression in Releases 4 - 7 but the recent releases have seen response time meet their non-functional requirements and match the earlier releases. |
| /application/in-stock | Meets the Response Time of 1000ms and has not shown any regression in any release. |
| /application/add-to-basket | Meets the Response Time of 1000ms and has not shown any regression in any release. |
| /application/add-card | Since Release 6 we have seen much higher response times. |
| /application/add-shipping-address | Meets the Response Time of 1000ms and has not shown any regression in any release. |
| /application/confirm-purchase | Has always failed to meet its non-functional requirement of 1000ms. |
| /user/logout | Meets the Response Time of 1000ms and has not shown any regression in any release. |
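The kind of release-over-release analysis in the table can also be automated. A hedged sketch with invented numbers (not the fictional dataset from the post) that flags NFR breaches and recent regressions:

```python
from statistics import mean

NFR_MS = 1000  # non-functional requirement for response time

# Illustrative response times across ten releases (invented values).
results = {
    "/user/login":                   [420, 430, 410, 440, 425, 415, 430, 420, 435, 428],
    "/application/add-card":         [600, 620, 610, 605, 615, 1400, 1500, 1480, 1520, 1490],
    "/application/confirm-purchase": [1300, 1350, 1400, 1380, 1420, 1390, 1450, 1410, 1430, 1440],
}

for label, times in results.items():
    breaches = [t for t in times if t > NFR_MS]
    early, recent = mean(times[:3]), mean(times[-3:])
    regressed = recent > early * 1.2  # >20% slower than the early releases
    print(f"{label}: breaches={len(breaches)} regressed={regressed}")
```

On this data, add-card shows a regression from release 6 onwards, confirm-purchase breaches the NFR in every release, and login is consistently healthy, mirroring the manual analysis above.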
It is always a good idea to present a graphical representation of data, as it makes it easier to see the results:


We have shown the data in a Bar Chart and a Box and Whisker Chart, which both show the long-running response times very clearly and allow us to see how these change with each release.
ChatGPT Analysis
For ChatGPT the first prompt we provided was:
"Given this results table can you analyze the data against the second column Response Times and also the ten set of results for each endpoint."
Here, the results table was the raw, non-colour-coded table above. ChatGPT generated the analysis below. We will discuss the second prompt later.




After the analysis was produced ChatGPT asked if we would like a visual chart to be produced from the data. Our second prompt was:
"Yes generate a chart"
This was the chart that was generated:

Comparison
Probably as expected, ChatGPT provided a really good analysis of some raw data and, having kept the context of the application throughout the history of our prompts, was able to make some useful insights. One difference was that we applied colour coding to the table, making it easier to spot where the high response times were. We also feel the graphs we provided presented the data in a much more readable format, but we could ask ChatGPT to produce these types of graphs explicitly.
Conclusion
The results we have achieved are broadly comparable. ChatGPT did a really good job of understanding what was required and of articulating its analysis. However, we think it would be wrong to say that you could dispense with a performance tester to manage your performance testing requirements.
You need the expertise to make AI produce what you need from it; complementing the mindset of a performance tester with AI is a very powerful combination. The subtleties and nuances are where the differences lie, and these differences make the difference between good and poor performance in your application under test. We have compared a simple example; more complex examples do not necessarily mean more complicated applications, they mean more complicated architecture, the involvement of third parties, and the complex needs of the application's users and customers.
Performance is so important in a world where choice is unparalleled and competition is stiff, whatever your site offers. You need to ensure that your application performs and scales and does not degrade under load, or you will soon start to lose sales to your competitors.
To trust all this to AI without the intervention of a skilled and experienced performance tester would almost certainly result in performance issues for your application under test.