####################
Performance Testing
####################

OpenLMIS focuses on performance metrics that are typical of web applications:

- Calls to the server - how many milliseconds a single operation takes, and whether its memory usage is reasonable.
- Network load - how large the resources returned from the server are. OpenLMIS is typically designed to work in network-constrained locations, so the size, in bytes, of each resource is important.
- The number of calls the Reference UI makes - again, networks being what they are, we want to minimize the number of connections made to accomplish a user workflow, as each connection adds overhead.
- Size of the "working" data set. Here working data is defined as the data that's needed for a user to accomplish a task. Examples are typically Reference Data: # of Products, # of Facilities, # of Users, etc. The # of Requisitions or # of Stock Cards might also factor into a user's working data. Since OpenLMIS typically manages countries, it's important that we're efficient in managing country-level data sets.

There are some areas of performance, however, that OpenLMIS typically doesn't focus on as much:

- Scaling - typically we're not concerned with tens of thousands of people needing to use the system concurrently. Likewise, we don't yet worry about surges or dips in user activity requiring more or fewer resources to serve those users.

Getting Started
----------------

OpenLMIS uses Apache JMeter_ to test RESTful endpoints. We use Taurus_, and its YAML format, to write our test scenarios and generate reports, which our CI server can present as an artifact of every successful deployment to our CD test server.

Keeping to our conventions, Taurus_ is used through a Docker image, with a simple script located at `./performance/test.sh` and tests in the directory `./performance/tests/` of a Service. Any `*.yml` file in that test directory will be fed to Taurus to be used against `https://test.openlmis.org`.


Running `test.sh` will place JMeter output as well as Taurus output under `./build/performance-artifacts/`. The file `stats.xml` has the final summary performance metrics. Files of note when developing test scenarios:

* `error-N.jtl` - Contains errors and requests that led to those errors from the HTTP server.
* `JMeter-N.err` - Contains JMeter errors where JMeter didn't understand the test scenario.
* `modified_requests-N.jmx` - Contains the generated JMeter requests (after Taurus generation).
* `kpi-N.jtl` - Individual metrics of a test scenario.

Running in CI
--------------

Tests run in a Jenkins job whose name ends in `-performance`. This job is run as part of each Service's build pipeline *that results in a deployment to the test server*.

The reports are presented using `Performance Plugin`_. When looking at this report you'll see:

* A graph that shows all of the endpoints (requests) over time.
* A report for a build which includes an average over time, as well as a table showing KPIs of each request.

A simple Scenario (with authentication)
----------------------------------------

Nearly all of our RESTful resources require authentication, so in this example we'll show a basic test scenario that includes authentication. The syntax and features used here are documented at Taurus' page on the `JMeter executer`_.

.. code-block:: yaml

    execution:
      - concurrency: 1
        hold-for: 1m
        scenario: users-get-one

    scenarios:
      get-user-token:
        requests:
          - url: ${__P(base-uri)}/api/oauth/token
            method: POST
            label: GetUserToken
            headers:
              Authorization: Basic ${__base64Encode(${__P(basic-auth)})}
            body:
              grant_type: password
              username: ${__P(username)}
              password: ${__P(password)}
            extract-jsonpath:
              access_token:
                jsonpath: $.access_token
      users-get-one:
        requests:
          - include-scenario: get-user-token
          - url: ${__P(base-uri)}/api/users/a337ec45-31a0-4f2b-9b2e-a105c4b669bb
            method: GET
            label: GetAdministratorUser
            headers:
              Authorization: Bearer ${access_token}

The `execution` block specifies that our test scenario `users-get-one` runs with 1 concurrent user for one minute. Notice that this definition is for the simplest of test executions: 1 user, run enough times to get a useful sampling. We use this sort of test execution to first get a sense of what our endpoint's single-user characteristics are.

Next notice that we have two scenarios defined:

#. get-user-token - this is a reusable scenario, which gets a basic user authentication token and, through `extract-jsonpath`, saves it to a variable named `access_token`.
#. users-get-one - this is the test scenario we're primarily interested in: it exercises the `/api/users/{a specific user's UUID}` endpoint. We pass the previously obtained `access_token` through the HTTP request's headers.

Summary
^^^^^^^^

* First test the most basic of environments: 1 user, run enough times to get a useful sample size.
* Re-use the scenario that obtains an access_token by using `include-scenario`.
* It's generally OK to use demo-data identifiers (the user's UUID). Though this couples the test to the demo-data, it will provide consistent results.
* Give each request a clear, semantic `label`. This will be used later in pass-fail criteria.

Testing collections
--------------------

Building on the simple Scenario, we'll now test the performance of returning a collection of a resource:

.. code-block:: yaml

    users-search-one-page:
      requests:
        - include-scenario: get-user-token
        - url: ${__P(base-uri)}/api/users/search?page=1&size=10
          method: POST
          label: GetAUserPageOfTen
          body: '{}'
          headers:
            Authorization: Bearer ${access_token}
            Content-Type: application/json

Here we're testing the Users resource by asking for 1 page of 10 users.

Summary
^^^^^^^

* When testing the performance of collections, the result will be influenced by the number of results returned. Because of this, prefer to test a paginated resource, and always ask for a number that exists (i.e. don't ask for 50 when demo-data only has 40).
* Searching often requires a POST; in this case the query parameters must be in the URL.

Testing complex workflows
-------------------------

A complex workflow might be:

#. GET a list of periods for which requisitions may be initiated.
#. Create a new Requisition resource by POSTing with one of the previously returned available periods.
#. DELETE the previously created Requisition resource, so that we may test again.

.. code-block:: yaml

    initiate-requisition:
      requests:
        - url: ${__P(base-uri)}/api/oauth/token
          method: POST
          label: GetUserToken
          headers:
            Authorization: Basic ${__base64Encode(${__P(user-auth)})}
          body:
            grant_type: password
            username: ${__P(username)}
            password: ${__P(password)}
          extract-jsonpath:
            access_token:
              jsonpath: $.access_token
        # program = family planning, facility = comfort health clinic
        - url: ${__P(base-uri)}/api/requisitions/periodsForInitiate?programId=10845cb9-d365-4aaa-badd-b4fa39c6a26a&facilityId=e6799d64-d10d-4011-b8c2-0e4d4a3f65ce&emergency=false
          method: GET
          label: GetPeriodsForInitiate
          headers:
            Authorization: Bearer ${access_token}
          extract-jsonpath:
            periodUuid:
              jsonpath: $.[:1]id
          jsr223:
            script-text: |
              String uuid = vars.get("periodUuid");
              uuid = uuid.replaceAll(/"|\[|\]/, "");
              vars.put("periodUuid", uuid);
        - url: ${__P(base-uri)}/api/requisitions/initiate?program=10845cb9-d365-4aaa-badd-b4fa39c6a26a&facility=e6799d64-d10d-4011-b8c2-0e4d4a3f65ce&suggestedPeriod=${periodUuid}&emergency=false
          method: POST
          label: InitiateNewRequisition
          headers:
            Authorization: Bearer ${access_token}
            Content-Type: application/json
          extract-jsonpath:
            reqUuid:
              jsonpath: $.id
          jsr223:
            script-text: |
              // remove quotes and [] from the extracted value
              String uuid = vars.get("reqUuid");
              uuid = uuid.replaceAll(/"|\[|\]/, "");
              vars.put("reqUuid", uuid);
        - url: ${__P(base-uri)}/api/requisitions/${reqUuid}
          method: DELETE
          label: DeleteRequisition
          headers:
            Authorization: Bearer ${access_token}

Summary
^^^^^^^

* When creating a new RESTful resource (e.g. PUT or POST), we may need to clean up after ourselves in order to run more than one test.
* JSR223 blocks allow us to execute basic scripts, in Groovy by default. This can be especially useful when you need to clean up a JSON result from a previous response, such as a UUID, to use in the next request.

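
Because the DELETE step depends on the `reqUuid` extracted from the initiate response, it can help to fail early when that response carries no `id`. The following is only a sketch of the `InitiateNewRequisition` request from above with one addition: it assumes Taurus' `assert-jsonpath` option, as described on the Taurus JMeter page, and the check itself is not part of the existing OpenLMIS tests.

.. code-block:: yaml

    # Sketch only: the InitiateNewRequisition request shown above, plus one check.
    - url: ${__P(base-uri)}/api/requisitions/initiate?program=10845cb9-d365-4aaa-badd-b4fa39c6a26a&facility=e6799d64-d10d-4011-b8c2-0e4d4a3f65ce&suggestedPeriod=${periodUuid}&emergency=false
      method: POST
      label: InitiateNewRequisition
      headers:
        Authorization: Bearer ${access_token}
        Content-Type: application/json
      # Assumed addition: mark this sample as failed when the response body
      # has no `id`, so the problem is reported here rather than on the DELETE.
      assert-jsonpath:
        - jsonpath: $.id
      extract-jsonpath:
        reqUuid:
          jsonpath: $.id

With a check like this, a failed initiate should show up under the `InitiateNewRequisition` label in `error-N.jtl` instead of surfacing later as a confusing `DeleteRequisition` error.
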

Simple stress testing
---------------------

As mentioned, OpenLMIS performance tests tend to focus first on basic execution environments where we're only testing 1 user interaction at a time. However, there is a need to do basic stress testing, especially for endpoints which are used frequently. For example, we've seen the authentication resource used repeatedly in all our previous examples. Let's stress test it.

.. code-block:: yaml

    modules:
      local:
        sequential: true

    execution:
      - concurrency: 10
        hold-for: 2m
        scenario: get-user-token
      - concurrency: 50
        hold-for: 2m
        scenario: get-service-token

    scenarios:
      get-user-token:
        requests:
          - url: ${__P(base-uri)}/api/oauth/token
            method: POST
            label: GetUserToken
            headers:
              Authorization: Basic ${__base64Encode(${__P(user-auth)})}
            body:
              grant_type: password
              username: ${__P(username)}
              password: ${__P(password)}
      get-service-token:
        requests:
          - url: ${__P(base-uri)}/api/oauth/token
            method: POST
            label: GetServiceToken
            headers:
              Authorization: Basic ${__base64Encode(${__P(service-auth)})}
            body:
              grant_type: client_credentials

Here we've defined 2 tests:

#. Authenticate as if you're a person.
#. Authenticate as if you're another Service (a Service token).

The stress testing here introduces important changes in our `execution` block:

.. code-block:: yaml

    - concurrency: 10
      hold-for: 2m
      scenario: get-user-token

Instead of defining 1 user, here we'll have 10 concurrent ones. Instead of running the test for 1 minute, we're going to run the test as many times as we can for 2 minutes. For further options see Taurus' `Execution doc`_.

When stress testing, it's important to remember that too much load simply isn't useful and only slows down the test. We also don't presently have a test infrastructure in place that allows for tests to originate from multiple hosts.

Summary
^^^^^^^

- You can define multiple execution definitions for the same scenario, so the first might give us the basic performance characteristics and the second might be a stress test.
- By default the tests defined in the `execution` block are run in parallel. This can be changed to run them sequentially with `sequential: true`.
- Choose a reasonable number of concurrent users. Typically fewer than a dozen is enough.
- Choose a reasonable time to hold the test for. Typically 1-2 minutes is enough, and no more than 5 minutes unless justifiable.
- Remember that we don't have a performance testing infrastructure in place that can concurrently send requests to our application from multiple hosts. OpenLMIS performance testing typically only requires the most basic stress testing.

Testing file uploads
--------------------

In this short example we're going to send a request to the catalog items endpoint and upload some items as a CSV file.

.. code-block:: yaml

    upload-catalog-items:
      requests:
        - include-scenario: get-user-token
        - url: ${__P(base-uri)}/api/catalogItems?format=csv
          method: POST
          label: UploadCatalogItems
          headers:
            Authorization: Bearer ${access_token}
          upload-files:
            - param: file
              path: /tmp/artifacts/catalog_items.csv

Summary
^^^^^^^

* When uploading a file we don't have to worry about setting the correct content header, as Taurus takes care of it on its own when the `upload-files` block is used. This behavior is described in the `HTTP Requests`_ section of the Taurus user manual.

Pass-fail criteria
------------------

With the above tests defined, we can now write pass-fail criteria. This is especially useful if we want our test to fail when the performance is worse than what we've defined.

.. code-block:: yaml

    reporting:
      - module: passfail
        criteria:
          - avg-rt of GetUserToken>300ms, continue as failed
          - avg-rt of GetServiceToken>300ms, continue as failed

This allows us to fail the test if the average response time for either of the two tests was greater than 300ms. See the `Taurus Passfail doc`_ for more.

Summary
^^^^^^^

* Write the pass-fail criteria within the test definition.

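
To illustrate keeping the criteria within the test definition, the following sketch recombines pieces already shown above (the `get-user-token` stress test and its pass-fail criterion) into a single file. The file name in the comment is hypothetical, and nothing here goes beyond the earlier examples.

.. code-block:: yaml

    # Hypothetical file name: ./performance/tests/auth.yml
    execution:
      - concurrency: 10
        hold-for: 2m
        scenario: get-user-token

    scenarios:
      get-user-token:
        requests:
          - url: ${__P(base-uri)}/api/oauth/token
            method: POST
            label: GetUserToken        # the label the criterion below refers to
            headers:
              Authorization: Basic ${__base64Encode(${__P(user-auth)})}
            body:
              grant_type: password
              username: ${__P(username)}
              password: ${__P(password)}

    reporting:
      - module: passfail
        criteria:
          # Must match the request label exactly.
          - avg-rt of GetUserToken>300ms, continue as failed

Keeping the `label` and the criterion in one file makes it easier to notice when a renamed label would leave a criterion pointing at a request that no longer exists.
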

Performance Acceptance Criteria
================================

With Taurus we can now add basic acceptance criteria when working on new issues. For example the acceptance criteria might say:

- the endpoint to retrieve 10 users should complete in 500ms for 90% of users

This would lead us to write a performance test for this new GET operation to retrieve 10 users, and we'd add a pass-fail criterion such as:

.. code-block:: yaml

    reporting:
      - module: passfail
        criteria:
          Get 10 Users is too slow: p90 of Get10Users>500ms, continue as failed

Read the `Taurus Passfail doc`_ for more.

Next Steps (WIP)
================

We've covered basic performance testing, stress testing, and pass-fail criteria. Next we'll be adding:

* Loading performance-oriented data sets (e.g. what happens to these requests when there are 10,000 products).
* Using Selenium to mimic browser interactions, to give us:

  * How many HTTP requests a page incurs.
  * Network payload size.

* Failing deployments based on performance results.

.. _Taurus: http://gettaurus.org/
.. _Execution doc: http://gettaurus.org/docs/ExecutionSettings/#Load-Profile
.. _Taurus Passfail doc: http://gettaurus.org/docs/PassFail/
.. _JMeter: http://jmeter.apache.org/
.. _JMeter executer: http://gettaurus.org/docs/JMeter/
.. _Performance Plugin: https://wiki.jenkins.io/display/JENKINS/Performance+Plugin
.. _HTTP Requests: https://gettaurus.org/docs/JMeter/#HTTP-Requests