For users wondering how to capture server-side diagnostic data for applications running on the Cloud Foundry platform: it’s now possible to trigger and capture that data using basic shell scripts and the CF CLI, which can be quite useful for analyzing performance or runtime problems.
If you suspect an application is performing poorly or misbehaving, it’s often helpful to gather thread dumps, heap dumps, environment details and other diagnostic information on demand. Examples include debug logs and CPU, I/O or other system metrics captured at that instant from the OS and from the process hosting the application. Since I have worked with numerous customers for whom thread dumps and heap dumps were critical to analyzing and resolving problems in middleware, I felt this would be a good Hackathon Day project for the Cloud Foundry platform.
Using a set of shell scripts and the CF CLI, it is now possible to trigger data collection (such as thread dumps or system metrics) across all instances of an application running on Cloud Foundry. Two actors orchestrate the remote trigger actions: one on the client side and the other on the application container (agent) side, as shown in the following figure.
The application container can be configured to spawn a shell script, running in the background, that monitors the access timestamp of a designated file. The script can be started either by the Buildpack release step (which starts the application or its container) or from within the application through a runtime execution call in Java or Ruby. The script itself can ship as part of the Buildpack or of the application bits, as needed.
The script creates the designated target file (using the Unix touch command) and then runs stat in an endless loop to check the file’s last access time. When it detects a recent access of the target file (by comparing the access timestamp against the time of its own last touch), it kicks off the relevant action (such as dumping threads or gathering system stats/metrics), re-touches the file and saves the new timestamp.
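The polling loop described above can be sketched roughly as follows. This is a minimal illustration, not the bundled sample script: the function names (access_time, was_accessed_since, watch_trigger) and the default 5 second interval are assumptions, and it relies on GNU stat and on a filesystem that actually records access times (a noatime mount would defeat it).

```shell
#!/bin/bash
# Hypothetical agent-side watcher; all names here are illustrative only.

# Print the last access time of a file in seconds since the epoch.
# Assumes GNU coreutils stat (as found in Linux containers).
access_time() {
  stat -c %X "$1"
}

# Succeed if the file was accessed after the recorded timestamp.
was_accessed_since() {
  local file="$1" last_seen="$2"
  [ "$(access_time "$file")" -gt "$last_seen" ]
}

# Poll the trigger file forever and run the action whenever it was read.
watch_trigger() {
  local file="$1" action="$2" interval="${3:-5}" last_seen
  touch "$file"
  last_seen="$(access_time "$file")"
  while true; do
    sleep "$interval"
    if was_accessed_since "$file" "$last_seen"; then
      "$action"                            # e.g. dump threads or collect metrics
      touch "$file"                        # re-touch so the next read registers
      last_seen="$(access_time "$file")"   # remember our own touch time
    fi
  done
}
```

A start script could launch it in the background with something like `watch_trigger /home/vcap/tmp/dumpThread dump_threads &`, where dump_threads is whatever action function the script defines. Timestamps have one-second resolution here, so a read that lands in the same second as the touch would be missed.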
The client side accesses the file using the cf files CLI command. CF’s loggregator is used to gather the information on the file inside the application container running on the Cloud Foundry platform. Because the CF log machinery reads the file the first time it is requested and fetches it again only after it has been modified, the agent script in the application container has to re-touch the target file after each trigger. That modification makes the next cf files call pick up the change and re-read the file, thereby updating the file’s access time. The files that are generated by the script can be retrieved through
The mechanism for passing the trigger is analogous to carrying out pre-arranged actions whenever we receive a missed call from a pre-designated number. This listener-and-trigger pairing can be used for all kinds of applications (Java/Node/Go) to trigger thread dumps or heap dumps, or simply to collect system metrics (top, mpstat, vmstat, ps, environment variables, etc.).
Each trigger action has its own trigger file that is polled on the application container side. There can be multiple such pairs of target files and associated scripts that monitor them and kick off the data gathering. For example, /home/vcap/tmp/dumpThread can be the target file monitored by a script that handles thread dumps; similar pairs can exist for metrics or heap dumps. The bundled sample scripts should provide some guidance on triggering, capturing and retrieving the data using this approach.
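One way to picture the pairing of trigger files and actions is a small lookup like the sketch below. Only the dumpThread file name comes from the example above; dumpHeap, dumpStats and the action names are hypothetical placeholders, and the bundled scripts may organize this differently.

```shell
# Hypothetical mapping from a trigger file to the action it stands for.
# Only "dumpThread" appears in the text; the other names are made up.
action_for_trigger() {
  case "$(basename "$1")" in
    dumpThread) echo "dump_threads" ;;
    dumpHeap)   echo "dump_heap" ;;
    dumpStats)  echo "collect_stats" ;;
    *)          return 1 ;;   # unknown trigger file: no action
  esac
}
```

Keeping one file per action means a client only has to read the right file to choose what gets collected, with no arguments passed to the container.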
Other approaches are possible, such as opening additional ports on the application container (currently not supported), running a web server instance or other additional processes to handle the request, or using JMX. But almost all of them require users to implement and manage the additional service, expose it in some form on Cloud Foundry, and deal with the related routing so that users can reach or invoke it. Also, relying purely on the running application being responsive is not always possible: the Java VM might be hanging or thrashing for memory and can therefore appear unresponsive to remote JMX calls.
Ruby applications can dump threads if they include the xray gem as part of the application. A kill -QUIT signal makes the Ruby application dump its threads to stderr. For Java applications, one can use jstack (its output can be redirected to a file) or kill -QUIT (the stack traces go to the server process’s console output) to produce thread dumps. For more details on thread dumps and related analysis, please check this blog post and ThreadLogic for automatic analysis of Java thread dumps.
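A thread dump action for a Java process might look like the sketch below. The function names and the output file naming scheme are assumptions made for illustration; jstack ships only with a JDK, so the sketch falls back to kill -QUIT when it is not on the PATH.

```shell
# Build a per-dump output path such as <dir>/threaddump.<pid>.<stamp>.txt
# (hypothetical naming scheme).
dump_file_name() {
  local dir="$1" pid="$2" stamp="$3"
  printf '%s/threaddump.%s.%s.txt\n' "$dir" "$pid" "$stamp"
}

# Dump the threads of a Java process: prefer jstack, else send SIGQUIT.
dump_java_threads() {
  local pid="$1" dir="$2" out
  out="$(dump_file_name "$dir" "$pid" "$(date +%Y%m%d%H%M%S)")"
  if command -v jstack >/dev/null 2>&1; then
    jstack "$pid" > "$out" 2>&1   # dump is written to the generated file
  else
    kill -QUIT "$pid"             # dump goes to the JVM's own console output, not $out
  fi
}
```

With the fallback, the dump ends up wherever the server process’s console output is captured, so it would have to be retrieved from the server log rather than from a separate file.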
The CF files CLI command (cf files) only goes against the very first running instance of the application (often instance index 0). For the triggers to reach all instances, it’s possible to use the cf curl CLI command to issue individual requests to each instance of the application. The sample scripts achieve this fan-out by building a URL that contains the application GUID and the instance index, and using it to kick off the trigger or the data collection. The sample trigger script sends the triggers, while the capture script can be used to download copies of the generated dumps. Heap dumps and the outputs of thread dumps and system stats/metrics can be saved to a local folder for every running instance using generated file paths.
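The fan-out could be sketched along these lines. It assumes the Cloud Controller v2 endpoint for instance files (/v2/apps/GUID/instances/INDEX/files/PATH) and a file path given relative to the container’s home directory; the function names and the explicit instance count argument are hypothetical, and the real sample scripts may differ.

```shell
# Build the (assumed) Cloud Controller v2 path for a file on one instance.
instance_file_url() {
  local guid="$1" index="$2" path="$3"
  # Strip a leading slash so the path stays relative to the container home.
  printf '/v2/apps/%s/instances/%s/files/%s\n' "$guid" "$index" "${path#/}"
}

# Read the trigger file on every instance; the read itself is the trigger.
trigger_all_instances() {
  local app="$1" count="$2" path="$3" guid i
  guid="$(cf app "$app" --guid)"   # look up the application GUID
  for ((i = 0; i < count; i++)); do
    cf curl "$(instance_file_url "$guid" "$i" "$path")" > /dev/null
  done
}
```

For example, `trigger_all_instances myapp 3 tmp/dumpThread` would touch off a thread dump on instances 0 through 2 of a hypothetical application named myapp. A capture script could reuse instance_file_url and redirect the cf curl output to a local file per instance instead of discarding it.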
The WebLogic Buildpack (created and maintained by the author of this post) will bundle these sample scripts and kick them off at server startup, so users running Oracle WebLogic Server applications on Cloud Foundry will have direct support for triggering and collecting data remotely from all running instances of a given application.
Sample Reference Scripts: