Generating Quality Data

The problem with test data is that it can become stale very quickly, either through being consumed by testing or because it naturally ages in the test environments.

This is not just an issue for performance testing, although the volumes of data sometimes required for performance testing do make it harder there. It also affects functional testing, batch testing and business acceptance testing, amongst others.

We have previously written posts on how, after performance testing completes, you leave behind data created by the execution of your tests that may be of use to other members of your test or development community. We have also discussed how to make the most effective use of existing data in your performance testing environment.

But in both cases, this happens during or after your performance testing takes place; for a performance test to be executed you need quality data in your test environments in the first place. This is also true for functional, batch and user acceptance testing; really, it is true for any activity that wants to use data in the test environments.


Commercial products

Many commercial products exist that provide a mechanism to populate your test environments, either by generating data dynamically or by copying it from production. In my experience and opinion, the success rate of these tools is low and depends heavily on the complexity of your application architecture. The tools vary in how they work, but many add data directly to your test environment databases, either by interrogating and profiling your database schemas or by cloning from your production environment and then obfuscating sensitive data. They work well for simple applications with no integration with other systems, where the data needs to exist in only one database and does not require complementary data in other systems.

Complex applications with many integration points, some of which may be legacy systems that themselves integrate with other systems, are a different proposition. Generating dummy, consistent data across these types of systems is extremely complicated. Sometimes even understanding where your system touches other systems, and what data is shared or needs to be consistent, is a challenge.

Another option is to clone data from your production environments across all integrated systems and then obfuscate the sensitive data. The challenge here can be the variety of technologies in use: as discussed, if you have legacy systems that need the same data as your primary application under test, they may use an alternative database technology, which your data generation tool then needs to support. Another issue with using production data is the way the data is masked; this needs to be consistent across all integrated environments, otherwise your application may become unusable.

With any approach it can take a while to get the test data correct, and the reality is that you want this data to be refreshed on a regular basis, perhaps daily or weekly; if the process is complex and time-consuming, this is not going to work for you. The statements above are a generalisation based on my experience of tools that offer data creation for test environments. If you have a test data shortage, it is always in your best interests to engage these companies and determine for yourself the best approach for your organisation and the services they can offer.

Possible solution

The best way to generate test data that is valid, accurate and integrated across all applications and integrations for your system under test is to create the data in the way that your end users do in production. Sounds simple, right? It is, although there are a couple of limitations, which will be discussed later. Many of the OctoPerf blog posts discuss how you can re-use your performance testing assets; we are all for making use of our tests outside of their primary purpose of executing performance testing scenarios, and this philosophy will again be relevant in our data load approach.

Dummy application

Before we discuss how we can generate data using our performance testing assets, and how we build volume and diversity, let's look at a theoretical application. This application has integrations with other systems and services, and it will help us understand how we can use our performance tests to build quality test data for all forms of testing in your environments.

[Figure: dummy application architecture]

Consider that Application A is the primary focus of your performance tests and is the application that has been developed or updated. It has a single database, DB A, which would naturally consist of multiple tables and schemas.

If this were the extent of your application, then using a commercial product to generate data in this database would be possible and probably quite effective. As we have already discussed, the likelihood of this being the case is low, as most applications are distributed and rely on multiple systems with their own databases.

The above theoretical application demonstrates the issues discussed in the previous section: ensuring consistency of data across multiple databases, and keeping primary and foreign keys consistent, is difficult and time-consuming. Getting consistent, valid data into databases A, B and C, correctly referenced from one database to the other, is no easy feat, and the levels of success with the available commercial tools are limited.

Data load

We have discussed the complexity of using commercial products, we have looked at a dummy application that we want to load data for, and we have theorised that providing quality data, possibly at scale to support performance testing, would not be easy. Our suggestion is to use your performance testing scripts to generate the data you need to execute your performance testing scripts.

Let’s explain this in a bit more detail.

When an application is first made available to its end users it will effectively have no customer data, unless it is being migrated from another system, in which case it will be seeded at go-live with migrated data. You may think that, if seeding data was the approach you were taking, you could use these routines to populate the test environments with production data and solve the problem.

The issues you face with this approach are, firstly, that you need to consistently mask sensitive data in test environments, which is difficult, and secondly, that you are assuming the routines to move the data from one system to the other are ready when you want to test. Customer data, regardless of whether you start with an empty application database or not, is built up over time by users of your application using your systems. The processes your customers follow to add data to your production environment should be included in your performance tests anyway, as these application journeys will be among the most frequently used and subject to high levels of load.

Therefore, you should already have scripted the web service requests or user journey flows for the purposes of performance testing, and we can repurpose them. Let's go back to our theoretical application diagram and consider the critical functionality that generates the application data.

[Figure: dummy application process flow]

We have assumed that the user journeys or web service requests for the four user interactions that generate and persist data have been scripted. In our simple example these four user interactions are:

  • Add customer
  • Add policy
  • Add location
  • Add payment details

These have been colour coded to show which theoretical databases they populate; your application under test is likely to be more complex and result in more user interactions that persist data to the database. If you were to run a high-volume load test simulating these user interactions, you would start to build valid, integrated data in your databases that you can then re-use. This data can be used for performance testing or by any other member of the project team who requires it.

These tests should be portable, so you can run them against any environment you want at any frequency you want. Before we look at how you can implement a data loading strategy, note that you may want to store details of the data you have created in a database so that it can be shared and used by others.

There is a blog post on API testing which, while not particularly relevant to this subject, does outline a method of storing data from your JMeter tests in a SQLite database.
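To illustrate the idea, here is a minimal sketch of a JSR223 PostProcessor, written in Groovy, that records the id of each customer created by the "Add customer" request. It assumes a JSON Extractor has already stored the new id in a JMeter variable called customerId, that the sqlite-jdbc driver is on JMeter's classpath, and that the file path and table name are illustrative:

```groovy
// JSR223 PostProcessor (Groovy) attached to the "Add customer" sampler.
// Assumes sqlite-jdbc is in JMeter's lib folder and that a JSON Extractor
// has stored the newly created id in the variable "customerId".
import groovy.sql.Sql

def db = Sql.newInstance('jdbc:sqlite:/data/testdata.db', 'org.sqlite.JDBC')
try {
    // Create the store on first use; table and columns are illustrative.
    db.execute('CREATE TABLE IF NOT EXISTS customers (id TEXT, created TEXT)')
    db.executeInsert('INSERT INTO customers (id, created) VALUES (?, ?)',
            [vars.get('customerId'), new Date().format('yyyy-MM-dd HH:mm:ss')])
} finally {
    db.close()
}
```

Anyone on the team who needs a valid customer can then query this store rather than hunting through the application databases.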

Data load strategy

We have looked at how we can generate legitimate data, so if you want to generate data for a particular performance test, or for other members of your team, you now have a mechanism to do so. What you should consider is running your tests to load data daily: data is input into your production environment daily and then naturally ages, and if you load data into your test environment daily it will age in the same way, giving you data at all stages of its lifecycle through your application.

Your daily data load volumes do not need to be excessive; consider loading a modest amount of data for thirty days. If we execute the tests that simulate each of the data-generating user interactions fifty times a day for thirty days, then we end up with:

| Test Interaction    | Daily Iteration Count | No. of Days | Total Requests |
| ------------------- | --------------------- | ----------- | -------------- |
| Add customer        | 50                    | 30          | 1500           |
| Add policy          | 50                    | 30          | 1500           |
| Add location        | 50                    | 30          | 1500           |
| Add payment details | 50                    | 30          | 1500           |

So, over the course of a month, we have generated 1500 samples of data for each of our user interactions, and this data can then be used by anyone wanting data in the test environments. This data is all aged differently, and some of it will have been subject to processing by overnight, daily, weekly or monthly batch cycles.

What you are effectively doing is replicating what happens in production. We mentioned earlier in this post the limitations of this approach; the primary one is that if you want data that is diverse and spread across several days and months, you will need to load data daily and consistently. This is not a huge limitation, and there are ways to overcome it.

Firstly, you could consider building yourself a pipeline and scheduling your tests to run overnight; if you use pipelines for other deployment activities, you can easily adopt similar principles to do this. There is a blog post on Continuous Integration and Continuous Delivery that gives an overview of how you might accomplish regular execution.
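As a sketch of what that scheduling might look like, here is a minimal declarative Jenkins pipeline that runs a data-load test plan in non-GUI mode every night at 01:00; the plan name, paths and schedule are assumptions for this example:

```groovy
// Jenkinsfile sketch: run the JMeter data-load plan nightly.
// Assumes JMeter is installed on the agent and data-load.jmx exists.
pipeline {
    agent any
    triggers {
        cron('0 1 * * *')  // every night at 01:00
    }
    stages {
        stage('Load test data') {
            steps {
                // -n: non-GUI mode, -t: test plan, -l: results log
                sh 'jmeter -n -t data-load.jmx -l results/data-load.jtl'
            }
        }
    }
}
```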

The other thing to consider is making updates to existing data to further increase the complexity of the data.

Let’s look at this in a bit more detail.

[Figure: dummy application update process flow]

If we add tests that update the data we create, then we increase the complexity and diversity of the data we are persisting, and it is likely that the tests covering the update processes have already been built as part of your performance testing development. Even if you do not have performance tests that update data, building them should be straightforward, and you would have enhanced your performance testing assets at the same time.
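To show how an update test can reuse the data created earlier, here is a minimal JSR223 PreProcessor sketch, again in Groovy, that picks a previously created customer id at random from the SQLite store populated above; the file path, table and variable names are again illustrative:

```groovy
// JSR223 PreProcessor (Groovy) attached to an "Update customer" sampler.
// Picks a random id from the store written by the data-load tests.
import groovy.sql.Sql

def db = Sql.newInstance('jdbc:sqlite:/data/testdata.db', 'org.sqlite.JDBC')
try {
    def row = db.firstRow('SELECT id FROM customers ORDER BY RANDOM() LIMIT 1')
    if (row != null) {
        // Referenced as ${customerId} in the update request.
        vars.put('customerId', row.id as String)
    }
} finally {
    db.close()
}
```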

Regular execution

If we execute the tests that simulate each of the data-generating user interactions fifty times a day for a year, then we end up with:

| Test Interaction    | Daily Iteration Count | No. of Days | Total Requests |
| ------------------- | --------------------- | ----------- | -------------- |
| Add customer        | 50                    | 365         | 18,250         |
| Add policy          | 50                    | 365         | 18,250         |
| Add location        | 50                    | 365         | 18,250         |
| Add payment details | 50                    | 365         | 18,250         |

We would have generated 18,250 data samples for each of the four user interactions that persist data, and that is from loading the relatively small number of 50 a day. You could easily increase this to generate many more.

If you continually generate data, then while you will consume some as part of your testing cycles, you should be generating it at a greater rate than you consume it, meaning you will always have a healthy amount of accurate data available. With some simple calculations you could mirror the volumes added to production, so that your test environment grows at the same rate as production, making your data load policy even more accurate; for example, if production gains roughly 200 new customers a day and your test environment is sized at a quarter of production, a daily load of 50 would keep pace. Clearly, to really see the benefits of this you need to wait a while until you have enough data at different stages of its application lifecycle, but eventually you will reach a point where test data is no longer an issue.

Data values

When generating data, you need to ensure that the values you use as input to your tests are valid and random. If you are adding postcodes, email addresses, or even surnames and dates of birth, it is a good idea to make sure that they differ rather than being hardcoded into your tests.

You can do this using flat files and a JMeter CSV Data Set Config, which allows you to vary your data input values. If you are using OctoPerf SaaS to generate load, there is a really good feature that simplifies the creation of fake datasets for input into your tests, which is exactly what you need for this data creation activity.
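As a simple illustration of the flat-file approach, here is a small standalone Groovy sketch that writes a randomised customers.csv for a CSV Data Set Config to read; the value lists, file name and row count are all assumptions for the example:

```groovy
// Standalone Groovy sketch: build a flat file of randomised input values
// for a JMeter CSV Data Set Config. Value lists are illustrative only.
def surnames  = ['Smith', 'Jones', 'Taylor', 'Brown', 'Wilson']
def postcodes = ['LS1 4DY', 'M1 1AE', 'B33 8TH', 'CR2 6XH', 'DN55 1PT']
def rnd = new Random()

new File('customers.csv').withWriter { w ->
    w.writeLine('surname,postcode,email,dob')
    1000.times { i ->
        def surname = surnames[rnd.nextInt(surnames.size())]
        def postcode = postcodes[rnd.nextInt(postcodes.size())]
        // Random date of birth between 1950 and 1999.
        def dob = String.format('%04d-%02d-%02d',
                1950 + rnd.nextInt(50), 1 + rnd.nextInt(12), 1 + rnd.nextInt(28))
        w.writeLine("${surname},${postcode},user${i}@example.com,${dob}")
    }
}
```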

Conclusion

Test data creation in your test environments is difficult, and commercial products do not always give you the outcomes you desire, especially if you have a complex technology landscape.

We are great advocates of reusing and re-purposing your performance testing assets, and this is another example of how you can use them: not only to solve your data load issues, but to provide a framework for good quality data in your test environments at all times, data that represents your naturally aging production data.
