Troubleshooting Vault on Kubernetes
Operating Vault in a Kubernetes environment brings a new set of challenges, operational patterns, and practices.
Troubleshooting Vault in Kubernetes resembles the methodology detailed in the Troubleshooting Vault tutorial. The primary differences concern the tools and techniques required to troubleshoot, which are specific to Kubernetes.
In this tutorial, you will use the command line in a terminal session to learn about fundamental techniques and tools for troubleshooting Vault in a Kubernetes deployment.
Checking cluster status
A good starting point for troubleshooting Vault is to gather information about the server status for each cluster member. This includes helpful information such as whether Vault is actively deployed, when it was deployed, whether it is in active or standby mode, and more.
Vault status
One way to immediately determine the Vault cluster status is to examine the output of vault status executed in each of the cluster member pods.
For example, in a Vault deployment using the Helm chart, you might find a list of pods like this example.
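A minimal sketch of listing the pods follows. It assumes the pods carry the label app.kubernetes.io/name=vault and that your current kubectl context points at the namespace where Vault runs; both are assumptions, so adjust the selector to match your deployment. The helper name list_vault_pods is hypothetical.

```shell
# Hypothetical helper: list Vault pods by label.
# Assumption: pods are labeled app.kubernetes.io/name=vault; pass a
# different selector as the first argument if your labels differ.
list_vault_pods() {
  kubectl get pods --selector "${1:-app.kubernetes.io/name=vault}"
}
```

Calling list_vault_pods with no arguments uses the assumed default selector.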
To get the status of a single Vault server, you can address the pod by name with kubectl. For example, the following command gets the status of Vault in the pod vault-0.
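One way to express this is a small wrapper around kubectl exec. The function name vault_status is hypothetical, and the sketch assumes the pod lives in the namespace of your current kubectl context.

```shell
# Hypothetical helper: run `vault status` inside one pod (default: vault-0).
# Note: `vault status` exits with status 2 when the server is sealed, so a
# non-zero exit code does not necessarily mean the command itself failed.
vault_status() {
  kubectl exec "${1:-vault-0}" -- vault status
}
```

For example, vault_status vault-1 would query the second pod instead of the default.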
You can get the status of all Vault servers in a cluster by looping through the pods and addressing each one by name with the kubectl command, using the pod names vault-0, vault-1, and vault-2 from the previous examples.
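The loop can be sketched as follows. The function name vault_status_all is hypothetical; the pod names are passed as arguments rather than hard-coded.

```shell
# Hypothetical helper: print `vault status` for every pod named on the
# command line, with a header line before each pod's output.
vault_status_all() {
  for pod in "$@"; do
    printf '%s\n' "== $pod =="
    # A sealed server makes `vault status` exit with status 2; keep going
    # so that one sealed member does not hide the rest.
    kubectl exec "$pod" -- vault status || true
  done
}
```

With the pod names from this tutorial, the call would be vault_status_all vault-0 vault-1 vault-2.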
An explanation of each field in the status output follows.
- Seal Type: The type of seal in use. This value should match across cluster members.
- Initialized: Whether the underlying storage has been initialized. This value should be true on every server except a new, uninitialized one.
- Sealed: Whether the server is in a sealed or unsealed state. A sealed server cannot participate in cluster membership or otherwise be used until it is unsealed. All members of a healthy cluster should report a value of false.
- Total Shares: The number of key shares made from splitting the root key (previously known as master key); this value can only be defined during initialization.
- Threshold: The number of key shares required to compose the root key; this value can only be defined during initialization.
- Version: The version of Vault in use on the server.
- Storage Type: The type of storage in use.
- Cluster Name: The cluster name string; this value should match on all members of a healthy cluster.
- Cluster ID: The cluster identification string; this value is dynamically generated by default, and should match on all members of a healthy cluster.
- HA Enabled: Whether this cluster is using high availability (HA) coordination functionality.
- HA Cluster: The cluster address used in client redirects.
- HA Mode: The HA mode. Expected values are Active and Standby. There should be one active leader in every healthy cluster. In the example output, the pod vault-0 is the Active cluster leader.
- Active Node Address: The address of the active HA cluster leader, used in request forwarding.
- Raft Committed Index: The index of the most recent entry committed to the Raft log. In a healthy Vault cluster, this value and Raft Applied Index should be equal or very close.
- Raft Applied Index: The index of the most recent committed entry that the server has applied to its storage. Entries are committed before they are applied, so this value trails or equals Raft Committed Index.
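When comparing cluster members, the Sealed and HA Mode fields are often the first ones to check. The following sketch condenses them to one line per pod. It assumes vault status prints its default table format, with the value in the last column; the function name vault_status_summary is hypothetical.

```shell
# Hypothetical helper: print "<pod> sealed=<value> ha_mode=<value>" per pod.
# Assumption: `vault status` uses its default table output. A sealed or
# unreachable server may not report HA Mode, leaving that value empty.
vault_status_summary() {
  for pod in "$@"; do
    kubectl exec "$pod" -- vault status 2>/dev/null |
      awk -v pod="$pod" '
        $1 == "Sealed"             { sealed = $NF }
        $1 == "HA" && $2 == "Mode" { mode = $NF }
        END { printf "%s sealed=%s ha_mode=%s\n", pod, sealed, mode }
      '
  done
}
```

A healthy three-node cluster would be expected to show sealed=false on every line and exactly one line with ha_mode=active.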
Helm chart status
If your Vault cluster is deployed with a Helm chart, you can begin a troubleshooting session by getting the basic deployment status from the helm status command.
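A minimal sketch of that check, assuming the release is named vault as in this tutorial. The function name helm_release_status and the optional namespace argument are illustrative additions.

```shell
# Hypothetical helper: show the Helm release status.
# $1: release name (assumed default: vault)
# $2: optional namespace; omitted means the current context's namespace
helm_release_status() {
  release="${1:-vault}"
  if [ -n "${2:-}" ]; then
    helm status "$release" --namespace "$2"
  else
    helm status "$release"
  fi
}
```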
You can learn several things from this output, including a few important details.
- The deployment name is vault.
- The last time the cluster was deployed; in this case, Tue Jan 12 13:22:50 2021.
- The current deployment status; in this case, deployed.
Integrated Storage cluster member status
If your Vault cluster uses Integrated Storage (Raft), you can use the vault operator raft command to list the cluster member peers and immediately determine cluster leadership status.
Here is an example of the command vault operator raft list-peers and its output.
From this output, you can determine the cluster leader. In this example, it is the node with identifier 9bc24771-64bb-2ac5-fee2-35a80f61e810 or vault-0.
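Picking out the leader can also be scripted. The sketch below assumes the default table output of vault operator raft list-peers, with the columns Node, Address, State, and Voter, and assumes a token with sufficient permissions is already available inside the pod. The function name raft_leader is hypothetical.

```shell
# Hypothetical helper: print the node identifier of the Raft leader.
# Assumptions: default table output (Node, Address, State, Voter) and a
# usable Vault token inside the pod; the command fails without one.
raft_leader() {
  kubectl exec "${1:-vault-0}" -- vault operator raft list-peers |
    awk '$3 == "leader" { print $1 }'
}
```

An empty result suggests that no leader was reported, or that the command could not authenticate.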
Server operational logs
Vault servers log information about operations to standard output and standard error. This information can then be captured by the systemd journal or routed to a static file depending on operational requirements.
When Vault is deployed in Kubernetes, you can retrieve the logs from pods with the kubectl logs command.
For example, here are the first 105 lines of a Vault server operational log, which shows important server startup information that can be useful for troubleshooting.
To capture the entire log into the file vault-0.log, use a command like this example.
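A sketch of capturing the log to a file named after the pod follows. The function name save_vault_logs is hypothetical.

```shell
# Hypothetical helper: write a pod's full log to <pod>.log in the current
# directory (default pod: vault-0, producing vault-0.log).
save_vault_logs() {
  pod="${1:-vault-0}"
  kubectl logs "$pod" > "$pod.log"
}
```

To view only the beginning of the log instead, the output of kubectl logs can be piped through head.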
Server audit device logs
In an ideal production environment, each Vault server should have one or more audit devices enabled to trace requests and responses.
If your troubleshooting scenario requires examining the details of requests and responses at this level, then you need access to these logs. In many production deployments, audit device logs are sent to an external system for processing.
If a file audit device is enabled in the Kubernetes pods, however, you can access the file to examine this information. For example, if the cluster was deployed with the Helm chart and auditing was enabled according to the Production Deployment Checklist, then the audit file is expected to be located in the pod at a path like /vault/audit/vault_audit.log.
You can confirm this by using the tail command to display the last line in the file.
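As a sketch, the check can be run through kubectl exec. The path default matches the one mentioned above; the function name audit_log_last_line is hypothetical.

```shell
# Hypothetical helper: print the last line of the audit log in a pod.
# $1: pod name (default: vault-0)
# $2: audit file path (default: /vault/audit/vault_audit.log)
audit_log_last_line() {
  kubectl exec "${1:-vault-0}" -- tail -n 1 "${2:-/vault/audit/vault_audit.log}"
}
```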
The file is present; use kubectl cp to copy it from the pod to the host like this.
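The copy step might look like the following sketch, which assumes the audit file path given earlier. The function name copy_audit_log is hypothetical.

```shell
# Hypothetical helper: copy the audit log from a pod into the current
# directory. kubectl cp relies on a tar binary being present in the
# container image.
copy_audit_log() {
  pod="${1:-vault-0}"
  kubectl cp "$pod:/vault/audit/vault_audit.log" ./vault_audit.log
}
```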
You can now examine the vault_audit.log file as described in the Querying Audit Device Logs tutorial.
Gather debugging data with vault debug
Introduced in version 1.3, the debug command starts a process that monitors a Vault server, probing for information about it during a specified collection duration and generating a compressed archive of the resulting information.
You can use vault debug in a Kubernetes environment with a multi-step process that involves entering the pod, changing into a writable directory, and executing the command. You can then exit the pod, and copy the resulting debug archive from the pod to the host.
First, access the required pod; in this example, it is vault-0.
Your system prompt is replaced with a new prompt / $ that includes the present working directory name. In the following examples, the prompt is shortened to only $.
Change into the /tmp directory so that the debug file can be successfully written.
Now you can execute the debug command; the example command runs for 2 minutes, takes a sample at the 1-minute and 2-minute marks, and then exits.
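A sketch of such an invocation, meant to be run inside the pod from the writable directory, is shown here. It uses the -duration and -interval flags of vault debug; the wrapper name run_vault_debug and its argument defaults are illustrative.

```shell
# Hypothetical wrapper, intended to run inside the pod from /tmp.
# $1: total collection duration (default: 2m)
# $2: interval between samples (default: 1m)
run_vault_debug() {
  vault debug -duration="${1:-2m}" -interval="${2:-1m}"
}
```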
Note
The debug command honors the VAULT_TOKEN environment variable and uses its value as the token. This can have the unintended effect of limiting the gathered debugging information. If you need the maximum amount of information possible, set VAULT_TOKEN to a highly privileged token before executing the debug command.
After two minutes elapse, the process exits and a file is now present in the /tmp directory.
Note
The output is written to a gzipped tar file with a filename made up of the string vault-debug and a timestamp; in this example, the filename is vault-debug-2021-01-22T14-51-12Z.tar.gz.
Now you can exit the pod and copy the file from the pod to the host.
Exit the pod.
Use the kubectl cp command to copy the debug output file from the vault-0 pod to the local host.
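The copy can be sketched as follows, assuming the archive was written to /tmp in the pod as described above. The function name copy_debug_archive is hypothetical.

```shell
# Hypothetical helper: copy a debug archive out of a pod.
# $1: pod name
# $2: archive file name, as written under /tmp inside the pod
copy_debug_archive() {
  kubectl cp "$1:/tmp/$2" "./$2"
}
```

Pass the name of the pod in which you ran the debug command, together with the archive file name that the command reported.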
Now the file is present locally and you can use the tar command to extract its contents.
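Because the archive is a gzipped tar file, extraction needs only the standard tar flags. The function name extract_debug_archive is hypothetical.

```shell
# Hypothetical helper: extract a gzipped debug archive into the current
# directory. -x extracts, -z decompresses gzip, -f names the archive file.
extract_debug_archive() {
  tar -xzf "$1"
}
```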
Here is the tree output of the example extracted file's top-level directory.
The contents consist of goroutine and heap profiling data, along with sanitized server configuration, host information, data index information, metrics, and status information.
You can learn more about the vault debug command and the data contained within the archives it generates from the debug documentation.
Next steps
This tutorial demonstrated tools and techniques for troubleshooting a Vault cluster running on Kubernetes. You learned how to get server and cluster status, gather operational and audit device logs, and use the vault debug tool.
You can continue learning about Vault troubleshooting in the Troubleshooting Vault tutorial and might find related content such as Querying Audit Device Logs and Monitor Telemetry & Audit Device Log Data with Splunk helpful.