This is not a bug report against Seq or its Helm chart, but my attempt to discuss an issue I've been observing while evaluating Seq in a fairly typical Kubernetes cluster created with Azure Kubernetes Service (AKS). Normally I wouldn't bother the Seq community with a general AKS or Kubernetes issue, but I've been running a number of clusters with a variety of deployments mounting PVs from Azure Storage (Azure Files and Azure Disks), and I have not observed such an issue with any of the applications I run. Although my problem could be caused by a bug in the Azure CSI driver, or by the fairly recent Kubernetes version I'm using on AKS, I thought I'd try to brainstorm it here first.
Context
To the point, here is my test environment where I'm evaluating Seq:
AKS cluster with Kubernetes 1.31.5
Hybrid node pools running both Linux and Windows
Azure Automation with a scheduled runbook that stops AKS every evening and starts it every morning - this may be considered an uncommon set-up
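Here is the manifest with the Helm release of Seq, using the official Helm chart, which Flux then observes in my GitOps repository and reconciles. Below is only a minimal sketch of it; the chart source, intervals, and values keys are illustrative placeholders rather than the exact manifest:

```yaml
# Sketch of the Flux HelmRelease for Seq. Names, intervals and values keys
# are assumptions, not the exact manifest from the GitOps repository.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: datalust
  namespace: common
spec:
  url: https://helm.datalust.co
  interval: 1h
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: seq
  namespace: common
spec:
  interval: 10m
  chart:
    spec:
      chart: seq
      sourceRef:
        kind: HelmRepository
        name: datalust
  values:
    acceptEULA: "Y"
    persistence:
      enabled: true
      storageClass: managed-csi
      size: 8Gi
```

where managed-csi is one of the built-in AKS storage classes.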
Problem
The first time Seq is deployed, everything is perfectly fine.
Then, almost every time after AKS is started in the morning, the cluster comes back with all pods running except Seq, which fails due to this PV issue:
MountVolume.MountDevice failed for volume "pvc-11ab2543-35e2-44cc-88a4-2640bca6396e": rpc error: code = Internal
desc = could not format /dev/disk/azure/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/56dd34d1e64485d92930f8e0a3873a31e4030b01790c2c2f45d2de222c3a52b0/globalmount,
failed with mount failed: exit status 32 Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun0 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/56dd34d1e64485d92930f8e0a3873a31e4030b01790c2c2f45d2de222c3a52b0/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/56dd34d1e64485d92930f8e0a3873a31e4030b01790c2c2f45d2de222c3a52b0/globalmount:
wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call.
kubectl describe pod -n common seq-7d575bc88b-n76rq
Name: seq-7d575bc88b-n76rq
Namespace: common
Priority: 0
Service Account: default
Node: aks-default-29286985-vmss00000j/10.3.0.4
Start Time: Mon, 24 Feb 2025 08:07:31 +0100
Labels: app=seq
pod-template-hash=7d575bc88b
release=seq
Annotations:
Status: Pending
IP:
IPs:
Controlled By: ReplicaSet/seq-7d575bc88b
Containers:
seq:
Container ID:
Image: datalust/seq:2024.3
Image ID:
Ports: 5341/TCP, 80/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: http-get http://:ui/ delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:ui/ delay=0s timeout=1s period=10s #success=1 #failure=3
Startup: http-get http://:ui/ delay=0s timeout=1s period=10s #success=1 #failure=30
Environment:
ACCEPT_EULA: Y
Mounts:
/data from seq-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-96n6k (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
seq-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: seq
ReadOnly: false
kube-api-access-96n6k:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 3m30s (x306 over 10h) kubelet MountVolume.MountDevice failed for volume "pvc-11ab2543-35e2-44cc-88a4-2640bca6396e" : rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/56dd34d1e64485d92930f8e0a3873a31e4030b01790c2c2f45d2de222c3a52b0/globalmount, failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun0 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/56dd34d1e64485d92930f8e0a3873a31e4030b01790c2c2f45d2de222c3a52b0/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/56dd34d1e64485d92930f8e0a3873a31e4030b01790c2c2f45d2de222c3a52b0/globalmount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
I also captured kubectl describe pvc -n common seq and kubectl describe pv pvc-11ab2543-35e2-44cc-88a4-2640bca6396e for the claim and volume involved.
Brainstorm
If we take Kubernetes out of the picture and focus on the common Linux error wrong fs type, bad option, bad superblock on /dev/sdb, then I suspect some of the following might be happening during the scheduled stop of my AKS cluster:
(Azure) disk not properly unmounted (similar issue discussed here)
(Azure) disk not properly detached from the node (similar issue described here)
Seq container not terminated gracefully but forcibly, while Seq is still writing data to the volume being unmounted, leading to disk corruption
This issue could be caused by a bug in the CSI driver, as I mentioned earlier, but web searching for "disk.csi.azure.com"+"wrong fs type" does not bring up any helpful results.
Perhaps the root of this issue lies in the hybrid node pools in my cluster, which may lead to this peculiar situation: at AKS restart, the Azure Disk PV pvc-11ab2543-35e2-44cc-88a4-2640bca6396e is mounted to a Windows node because Kubernetes (randomly) tries to schedule the Seq pod there. That can happen since I do not specify a nodeSelector in my Helm values above, and the Helm chart does not set one as a reasonable default to ensure Seq is scheduled only to Linux nodes (see charts/seq/values.yaml, line 131 at 4608f51 in the helm.datalust.co repository).
And, perhaps, such an unexpected mounting of the pvc-11ab2543-35e2-44cc-88a4-2640bca6396e disk to a Windows node somehow leads to corruption of its ext4 filesystem, making it unusable later. A long shot, I do realise :)
Outro
I have not tried to fsck the disk from the node. Since I'm deploying Seq to a test cluster for evaluation, I simply re-deploy it to trigger re-creation of the Azure Disk, PV, and PVC, which works around the problem until another AKS start/stop cycle breaks it again.
Next, I am going to try the following:
Specify an explicit nodeSelector with kubernetes.io/os: linux
Use static provisioning of the Azure Disk to see if it makes any difference (rough sketches of both are just below)
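For the first change, a minimal sketch of the extra Helm values, assuming the chart exposes a standard top-level nodeSelector value (note the well-known label value is lowercase "linux"):

```yaml
# Sketch: pin Seq to Linux nodes via the chart's nodeSelector value.
nodeSelector:
  kubernetes.io/os: linux
```

For the second, static provisioning with the Azure Disk CSI driver means pre-creating the managed disk and binding it through a PV/PVC pair, roughly like the sketch below. The disk resource ID, sizes, and names are placeholders, not values from my cluster:

```yaml
# Sketch of a statically provisioned Azure Disk for Seq's /data volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: seq-data
spec:
  capacity:
    storage: 8Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    # Resource ID of a pre-created managed disk (placeholder).
    volumeHandle: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/disks/<disk-name>
    volumeAttributes:
      fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: seq
  namespace: common
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: managed-csi
  volumeName: seq-data
  resources:
    requests:
      storage: 8Gi
```

The chart would then need to be pointed at the existing claim (for example via a persistence.existingClaim-style value, if the chart supports one) rather than letting it provision its own PVC.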
I'm sharing my experience in the hope that either Seq users who have seen similar issues, or Seq team folks who know the Seq internals, may be able to provide feedback that helps diagnose the problem better.
I'd appreciate any ideas.
Regarding, "Seq container not being terminated gracefully, but forcibly, while Seq still writing data to volume that is being unmounted leading to disk corruption". You could corrupt Seq's storage this way, but I don't think you can break the filesystem.