Use JMeter to create a website crawler

The idea behind this blog post originated when we updated our documentation after the release of our new UI. We had to identify all links used in the OctoPerf website and update them from https://doc.octoperf.com to https://api.octoperf.com/doc. With more than 250 blog posts at the time I'm writing this one, you can see how this could prove challenging. And of course the twist is that we also took this opportunity to reorganize the documentation so it's not as simple as a search and replace of the domain.

This got us thinking about ways to automate the check, especially since a lot of third-party links could be broken as well. One of those ways is to use OctoPerf itself to execute JMeter tests that report all the broken links.

Build list of unique URLs

List all pages

So basically we need to parse all the website pages and just open every URL. Seems simple, right? Right?

Well, first, parsing all the website pages can prove to be a challenge if you do not have a list of all unique pages. Thankfully our website has a sitemap, so we'll be using it as a starting point.

In case your website doesn't, you'll have to consider another level of parsing to iterate and identify all the unique pages. But the biggest difficulty in that case is to avoid cycling through the same pages indefinitely. Such a thing can happen, in particular if your website contains a lot of links.
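Without a sitemap, the usual guard against endless cycling is to remember every page already visited. Here is a minimal sketch of that idea in Groovy. It runs against a hypothetical in-memory map of pages and links instead of real HTTP calls; the `site` map and the `crawl` closure are assumptions made for illustration only:

```groovy
// Hypothetical site: each page maps to the links found on it (note the cycles).
def site = [
    "/"            : ["/blog", "/docs"],
    "/blog"        : ["/", "/blog/post-1"],
    "/docs"        : ["/", "/blog"],
    "/blog/post-1" : ["/blog"]
]

// Breadth-first walk that never visits the same page twice.
def crawl = { Map<String, List<String>> graph, String start ->
    def visited = new LinkedHashSet<String>()
    def queue = new ArrayDeque<String>([start])
    while (!queue.isEmpty()) {
        def page = queue.poll()
        // add() returns false when the page was already seen, which breaks cycles
        if (!visited.add(page)) continue
        (graph[page] ?: []).each { queue.add(it) }
    }
    visited
}

def found = crawl(site, "/")
found.each { println it }
```

In a real crawler the map lookup would be replaced by a request plus a link extraction, but the visited set stays the same.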

Our starting point is much easier since we already have the list of all the unique pages. First we start with a request to the sitemap:

Under this request we place a regex. OctoPerf creates it automatically for us on the regex configuration screen:

We end up with this configuration: <loc>(.+?)</loc> and we select All as a match number because we want to extract all the occurrences:

A quick look at the debug panel shows us 280 results, which seems correct.
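To see what the <loc>(.+?)</loc> pattern does outside of the tool, here is a small Groovy sketch applied to a hypothetical sitemap excerpt (the example.com URLs are made up, a real run would use the response body of the sitemap request):

```groovy
// Hypothetical sitemap excerpt; a real one would come from the HTTP response body.
def sitemap = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
  <url><loc>https://example.com/blog/post-1/</loc></url>
</urlset>'''

// Same pattern as the extractor; the lazy (.+?) stops at the first closing tag.
def matcher = sitemap =~ /<loc>(.+?)<\/loc>/

// Keep capture group 1 of every match, similar to using All as a match number.
def links = matcher.collect { it[1] }

println "Found ${links.size()} pages"
links.each { println it }
```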

Iterate on all pages

The first step is to use a For loop to iterate on each result from our regex:

Now inside this loop there are several things we will have to do:

  • Execute the request to the URL from before,
  • Extract every URL inside the response,
  • Write all URLs to a file,
  • Remove duplicates.

We will perform the last step after the loop in order to reduce computing time, so let's focus on the other three first.

Execute page

The issue with our URLs from before is that they need to be broken down as follows: protocol://domain:port/path

To achieve that we will use a groovy preprocessor and the Java URL class:

Here is a snippet of the script if you want to reuse it:

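// "link" holds the URL put there by the loop for this iteration
def link = vars.get("link")
def url = new URL(link)

// Override the request (sampler) with each component of the URL
sampler.setDomain(url.getHost())
sampler.setPath(url.getPath())          // path only, the query string is not included
sampler.setProtocol(url.getProtocol())
sampler.setPort(url.getPort())          // -1 when the URL has no explicit port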

In summary, we take what the loop has put into link for this iteration, break it down into its various components and use that to override the configuration of the request (referred to as sampler in this situation).
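To make the breakdown more concrete, here is a standalone Groovy sketch that runs the same Java URL class on two hypothetical links (the example.com addresses are assumptions, and there is no JMeter sampler involved). It also shows two details worth knowing: the port is -1 when the link does not specify one, and the path does not contain the query string.

```groovy
// Hypothetical links; the second one has an explicit port and a query string.
def links = [
    "https://example.com/blog/post-1/",
    "http://example.com:8080/search?q=jmeter"
]

links.each { link ->
    def url = new URL(link)
    println "protocol=${url.protocol} host=${url.host} port=${url.port} path=${url.path} query=${url.query}"
}

def first = new URL(links[0])
def second = new URL(links[1])
```

If the links you crawl rely on query strings, getFile() returns the path and the query together, which may be a better fit than getPath() depending on how your request is configured.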

A naive implementation of the regex post processor would go like this: href="(.+?)". In most situations it would be OK. However, you'll probably find that some href attributes contain things other than URLs (for instance, in-page links composed of only a #). That's why I recommend using the href="http(.+?)" regex instead:

This way we make sure this is a URL; we will just have to remember to add "http" back when we store it. Again we use All as match number in order to get all the matches.

To store the URLs I chose to use a CSV file. That will make our life easier down the line since we can feed it back to JMeter and let it hand out each line only once.

Note: our regexp link extraction is href="http(.+?)". We must not forget to add the missing http when building links.
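Here is a short Groovy sketch of that extraction, run on a hypothetical page fragment (the HTML and the domains are made up for the example). It shows which href values the pattern keeps and how the "http" prefix is restored:

```groovy
// Hypothetical page fragment mixing absolute links, an in-page anchor and a relative link.
def html = '''<a href="https://example.com/doc/">Docs</a>
<a href="#">Top</a>
<a href="/pricing/">Pricing</a>
<a href="http://example.org/page?id=1">Partner</a>'''

// Same pattern as the post processor: only values starting with http are captured.
def matcher = html =~ /href="http(.+?)"/

// Group 1 lacks the "http" prefix, so it is added back before storing the link.
def urls = matcher.collect { "http" + it[1] }

urls.each { println it }
```

Keep in mind that this pattern also skips relative links such as /pricing/, so only absolute links end up in the list.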

The JSR223 post processor script goes like this:

Here is a snippet of the script if you want to reuse it:

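def csv = new File("resources/urls.csv")
// href_matchNr is 0 when the page has no match, and upto() throws in that case
def count = vars.get("href_matchNr") as int
if (count > 0) {
    1.upto(count) {
        // Add back the "http" prefix that the regex left out of the capture group
        csv << "http" << vars.get("href_$it") << System.getProperty("line.separator")
    }
}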

We write to resources/urls.csv because this is where OctoPerf expects the CSV files to be. That allows us to use it in a CSV variable, but more on this later.

Other than that, we iterate on all extracted values, from href_1 up to the count stored in href_matchNr, which marks the end of the loop.

Remove duplicates

This step could be part of the loop but I found it only makes the computation time longer. Probably because a lot of the duplicates removed each time will be added back from the next page parsed.

We use a JSR223 script action this time since it must run on its own (and not under a request like the ones from before):

Here is a snippet of the script if you want to reuse it:

def file = new File('resources/urls.csv')
def lines = file.readLines().unique()
file.withWriter { writer ->
    lines.each {line ->
        writer.writeLine(line)
    }
}

It's a pretty simple script: we read all the lines with the Java File class, call Groovy's .unique() on the resulting list to drop duplicates, and write the lines back to the same file.

Validate results

You can either check the contents of the CSV file or if you are in OctoPerf, use an after-test.sh script to upload this CSV to the test artifacts.

1. Upload after-test.sh

As an example:

#!/usr/bin/env bash
uploadFile resources/urls.csv

2. Run a validation

After you execute a virtual user validation you will see this in the logs panel:

Execute URLs

We will use another virtual user/threadgroup to do the actual request. This will make our life easier when we want to automate everything in the next section.

CSV variable

First of all we want to create a CSV variable that will use our urls.csv file from earlier. We've configured this variable to:

  • Share lines between virtual users, that way we can launch concurrent virtual users and each will execute a different line of the file,
  • Stop VU when the file is finished, that way the test will end automatically when we reach the end of the file.

I also made sure to upload a file that has at least one line (make it point to the homepage of your website). Otherwise the first user, the one supposed to populate the file, would stop immediately because it has already reached the end of it. That's just a trick to get JMeter to still execute the first user so that it can fill the file with more values.

Execute request

This one is going to look very similar to the execute page from before:

More interesting is the Ignore codes post processor I am using:

// Mark these response codes as successful so they stay out of the error log.
// Note that setResponseOK() also resets the response code to 200.
if (prev.getResponseCode() in ["999", "403", "429"]) {
    prev.setResponseOK()
}

I'm marking some response codes as successful because I don't want to see them in the error log: they are either OK or irrelevant. For instance:

  • LinkedIn always responds with code 999, so all links to the OctoPerf team's LinkedIn profiles would be listed as errors otherwise,
  • Some third parties like CloudFront could be blocking your calls because you're running a lot of automatic requests, like a bot would. That results in a 403 when calling these services,
  • If you have a lot of different links to the same website, it may answer with 429 Too Many Requests,
  • Etc.

Of course, make sure to adapt this part to your website/requirements.

Automate the execution

We're going to make use of setup thread groups and JTL result files in this section. The goal is to be able to simply execute the test and have a result file with all the various problems listed.

The setup goes like this:

  • Use a setup thread to execute the Sitemap and extract all the URLs into the CSV file we configured in our variable,
  • Run a bunch of concurrent users iterating on the overwritten CSV while writing only errors to the JTL file,
  • Test ends when the end of the file is reached (already ensured by the CSV configuration).

Setup thread

The most important part will be to properly configure the Setup thread:

We want to execute a single Setup thread of our Sitemap script from earlier and run it only once.

Load policy

In OctoPerf I'm setting up a test with 100 concurrent users max since I usually have a few thousand requests to execute in my CSV. I've set up a one-minute ramp-up to avoid any critical CPU usage, but for lower levels of load you can probably go faster:

Having a large number of concurrent users is not important since they can quickly iterate on all requests.

JTL file configuration

We want the JTL to only log errors (this can be achieved through properties in JMeter). That way it only shows the requests that are relevant to us.

I also find it helpful to only write the response code and the URL. It keeps the file very easy to read, and none of the other metrics matter here anyway.
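If the JTL still ends up containing more than needed, it can also be trimmed after the test. This is a different technique from the configuration described above, shown here as a minimal Groovy sketch. It assumes a CSV-format JTL with a header row that includes the responseCode, success and URL columns; the sample rows are made up, and the naive comma split would need a proper CSV parser for real files containing quoted commas:

```groovy
// Hypothetical JTL content in CSV format, reduced to a few columns.
def jtl = '''timeStamp,elapsed,label,responseCode,success,URL
1700000000000,120,Execute request,200,true,https://example.com/
1700000000100,95,Execute request,404,false,https://example.com/missing
1700000000200,30000,Execute request,Non HTTP response code: java.net.SocketTimeoutException,false,https://gone.example.org/'''

def rows = jtl.readLines()

// Locate the columns by name so the column order does not matter.
def header = rows.head().split(",") as List
def codeIdx = header.indexOf("responseCode")
def okIdx = header.indexOf("success")
def urlIdx = header.indexOf("URL")

// Keep failed samples only, reduced to "code url".
def cells = rows.tail().collect { it.split(",") }
def failed = cells.findAll { it[okIdx] == "false" }
def errors = failed.collect { "${it[codeIdx]} ${it[urlIdx]}".toString() }

errors.each { println it }
```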

Final notes

Results

Once you run the test you will see something like this:

As you can see, once the setup is finished, all of our other virtual users start ramping up, until the end of the file triggers the end of the test. We didn't even reach 100 concurrent users, meaning the whole process took less than a minute. Feel free to play around with more or fewer users to see what's best in your situation (spoiler: more is not always better here).

And at the end of the test we see results in the JTL:

Timeouts

It can be painful to have to wait for timeouts on websites that do not exist anymore. In OctoPerf you can change the timeouts on the servers page:

In JMeter you can configure them inside the requests or configure HTTP request defaults for the entire project.

Automate further

The next step could be to execute this test as part of a CI chain or simply run it using our scheduler.

You can then get the JTL back using our REST API and a call to /analysis/logs/zip/{id}. As usual, try experimenting in our UI with the dev tools open in your browser: the calls our UI makes will serve as examples.

Conclusion

This whole process shows once more how powerful JMeter can be when you know how to use it. It is a good example of how to use files, setup threads and CSV variables, which can benefit anyone using JMeter.

And I hope you can see how OctoPerf makes it even faster to create, maintain and execute than having to do everything in a local JMeter.
