The Watcher

movq

www.uninformativ.de

26 Jul 22 14:34 UTC

Google support just told me: "Sure, if your k8s pod does $thing, then that can corrupt the ext4 on attached volumes." I'm sure this is just a silly misunderstanding.

prologic

twtxt.net

26 Jul 22 14:52 UTC

@movq whaaat?! 😳

movq

www.uninformativ.de

26 Jul 22 15:02 UTC

@prologic Yeah, we got corrupted disks. Basically, some pods died too "suddenly" or "abruptly" and that confused the shit out of k8s ... I don't claim to understand the details here (and they didn't share them, either).

I really hope it *is* just a misunderstanding. I mean, if some pod can just do the wrong thing and thus corrupt an ext4 disk, then ... dude, what the heck. :D

prologic

twtxt.net

26 Jul 22 21:34 UTC

@movq Dude what the actual fuck 😳🤦‍♂️ That's like saying if Docker dies too suddenly it'll corrupt the file system 😆

movq

www.uninformativ.de

27 Jul 22 18:04 UTC

@prologic Got another mail: They literally compared a pod dying at the wrong time to removing a USB stick without unmounting it. They said I should talk to our devs to find out what that pod is doing.

Okay, seriously, am I misunderstanding something here? How can a quitting pod cause that? I mean, the processes in the containers of a pod are just ... processes. When they quit, then they quit, end of story. They don't automatically unmount anything -- that's the job of k8s, isn't it? The applications running in pods have nothing to do with this layer.

Sure, when a pod dies at the wrong time, it might corrupt data *inside* of a filesystem -- but not the filesystem *itself*. (We were getting "bad superblock" messages in dmesg and all that.) Maybe the support guys think that's what was happening ...
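
To make that distinction concrete, here is a minimal Python sketch of a process dying mid-write. It is an illustration under assumptions, not a reproduction of the GKE incident: the file name `record.json`, the record contents, and the helper `crash_mid_write` are all made up, and `os._exit` stands in for a pod being killed abruptly. The file ends up truncated and unparseable, but listing the directory and calling `stat` still work, because the kernel (not the dying process) maintains the filesystem metadata.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical child process: writes half of a JSON record, then dies
# without any cleanup (a stand-in for a container being killed abruptly).
CHILD = r'''
import os, sys
f = open(sys.argv[1], "w")
f.write('{"user": "alice", "bal')  # first half is handed to the kernel
f.flush()
f.write('ance": 42}')              # second half stays in the userspace buffer
os._exit(1)                        # abrupt exit: no flush, no close, no atexit
'''

def crash_mid_write(directory):
    """Run the crashing child and report what is left behind."""
    path = os.path.join(directory, "record.json")
    subprocess.run([sys.executable, "-c", CHILD, path], check=False)
    with open(path) as f:
        content = f.read()
    try:
        json.loads(content)
        parses = True
    except json.JSONDecodeError:
        parses = False
    return {
        "content": content,                 # the truncated application data
        "parses": parses,                   # the data inside the file is broken
        "listing": os.listdir(directory),   # directory metadata is still readable
        "size": os.stat(path).st_size,      # inode metadata is still consistent
    }

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(crash_mid_write(d))
```

The broken record is an application-level problem. Nothing in this sketch can produce a "bad superblock" message, since the process never touches the block device underneath the mount.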

lyse

lyse.isobeef.org

27 Jul 22 23:30 UTC+0200

@movq Whaaaaat… O_o No offence, but there's often a reason that first level support works at first level support. I'm not helpful, I know.

prologic

twtxt.net

27 Jul 22 23:52 UTC

I'm with @lyse on this. Level 1 support are morons, push back and escalate.

The Pod(s) are supposed to be managed by Google's GKE service, no? Or is that the Node(s)? 🤔

In any case it's the responsibility of the CSI driver to deal with mounting and unmounting the file system into the Pod's namespace somewhere.

retrocrash

twtxt.net

28 Jul 22 01:36 UTC

@prologic @movq @lyse

they are right in one sense but wrong in their delivery.

if a pod had a pv attached on storage plane and the pod does something (could be anything) to corrupt the pv, the scheduler will continue to kill the pod since its pvc pointing to pv cannot be fulfilled.

there is no proper time for pod death. kubernetes will kill pods for whatever reason it deems necessary at any given time. its purpose is to ensure declared state is met at all times.

now all that being said ext4 corruption can and does happen on the underlying storage that supports your storage plane (ceph+rook, talos, nfs, iscsi, etc) but a pod cannot directly cause this.

if the csi driver/storage plane had some bug or takes a flaming shit sure it can corrupt the blob storage but not a pod.

basically it means they gave you the right answer to the wrong question.

if you need help im happy to discuss.

prologic

twtxt.net

28 Jul 22 02:02 UTC

@retrocrash I think you said the same thing as me but you said it much better as you're way more experienced with k8s 😂

movq

www.uninformativ.de

28 Jul 22 16:18 UTC

Yeah, I’m beginning to think this support guy probably doesn’t understand the difference between “a corrupted filesystem” and “corrupted files on an intact filesystem”. That’s the only explanation.

I’m just too naive for this. 🤣 I always take replies from support people too literally and I always assume that they know what they’re doing. 🤣 I mean, the guy even said he talked to a team of experts, so …
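
For what "the filesystem itself" means at the lowest level, here is a rough Python sketch that looks for the ext2/3/4 magic number in the primary superblock of a volume or image file. The function name `has_ext4_magic` and the path argument are hypothetical; the offsets (superblock at byte 1024, `s_magic` at 0x38 within it, value 0xEF53, little-endian) follow the documented ext4 on-disk layout. It is nowhere near a real consistency check, which is what `e2fsck` is for, and the same magic is shared by ext2 and ext3.

```python
import struct

EXT4_SUPERBLOCK_OFFSET = 1024  # primary superblock starts 1024 bytes into the volume
EXT4_MAGIC_OFFSET = 0x38       # position of the s_magic field inside the superblock
EXT4_MAGIC = 0xEF53            # shared by ext2, ext3 and ext4

def has_ext4_magic(path):
    """Return True if the primary superblock carries the ext2/3/4 magic number."""
    with open(path, "rb") as f:
        f.seek(EXT4_SUPERBLOCK_OFFSET + EXT4_MAGIC_OFFSET)
        raw = f.read(2)
    if len(raw) < 2:
        return False  # file too short to contain a superblock at all
    (magic,) = struct.unpack("<H", raw)  # the field is a little-endian 16-bit value
    return magic == EXT4_MAGIC
```

A truncated or garbage file *inside* the filesystem never changes this result; only damage to the filesystem's own metadata on the block device does.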

prologic

twtxt.net

28 Jul 22 19:19 UTC

@movq He talked to a team of experts?! 😳 Did they find evidence, do root cause analysis? Produce a repro? 🤔

movq

www.uninformativ.de

28 Jul 22 19:28 UTC

@prologic 😂 I hope I’ll find out soon! 😂

lyse

lyse.isobeef.org

28 Jul 22 23:15 UTC+0200

@movq From my limited experience in two companies, I can tell you anecdotally that what we developers told our support work mates after analyzing things and what they replied back to the enquirers was not always the same. That also happened when we gave them answers in written form. Always super nice support folks, not a single doubt, but their basic technical knowledge was pretty much non-existent. And plenty of them didn't even really know the software they were supposed to support. Granted, those were not easy programs, one was indeed super complex. But if they use them on a daily basis for years one would expect that they know them quite well. At least the main features and workflows. We also often had to tell them basic stuff several times, which was quite frustrating for both sides.

But, I was super glad, that we had them in the front row. You wouldn't believe what crap queries they had to deal with and what utter bullshit they kept off our shoulders. Sometimes people wrote really offensive e-mails for no reason. Holy moly. I wouldn't want to trade with them, not in a hundred years. Lots of my developer work mates, however, didn't value our first level support at all. I mean, I totally understand, that after telling the same things over and over and over and over again it pisses you off, but treating them in a way they feel like shit, doesn't help either. It only makes things worse. I had the impression that there was a slight war between development and support.

One thing that was totally stupid is that the POs didn't listen to improvements and suggestions on how to make things easier for the support team and also all our users. I mean, support has to deal with this software all day long and also gets the same questions about workflows and stuff that's too complicated or unintuitive. So a lot of things were really low-hanging fruit to improve everybody's life. But when they suggested anything, the POs always declined it, nah, it's the support's job. Period. A few times I teamed up with the support work mates and told the POs the same things the support team was suggesting, and then it was accepted without hesitation. So that clearly shows there really was a two-tier society.

In my current project we don't have a support team, so we need to handle all the support queries ourselves. In that regard I miss the old project. But luckily, it's basically just other developers who are needing our help, so that's fairly okay.

prologic

twtxt.net

28 Jul 22 23:59 UTC

@lyse Yeah, first level support guys and gals are undervalued really. The good ones are great and have awesome people skills. Still thick as bricks but they're there to quell the idiots at the front door 😆

lyse

lyse.isobeef.org

29 Jul 22 19:15 UTC+0200

@prologic Hahahahaha, very nicely put, mate! :-D

movq

www.uninformativ.de

29 Jul 22 18:06 UTC

@lyse Hm, yeah. I’m probably a bit spoiled. 😅 (Aside from being too naive and too trusting.) In my current company, there is no traditional “first level support” that just talks to the customers and has basically no idea what they’re saying. Sure, there are different “tiers” and different sets of skills among the teams, but there are no “support monkeys”. When customers open tickets, they pretty much immediately get to tech-savvy people, who are actual devs/sysadmins (or at least worked as such in the past, as far as I know).

Probably quite unusual in this field. 🤔 But I wouldn’t really know, I’ve only seen three companies in the IT field and I’ve been with the current one for a good decade, so …

> I had the impression that there was a slight war between development and support. […] the POs didn't listen to improvements and suggestions on how to make things easier for the support team

Oof, that’s harsh. 😳

lyse

lyse.isobeef.org

29 Jul 22 23:30 UTC+0200

@movq Yeah, it's also a bit of a chicken-and-egg problem. If you have unqualified people, they can't do a lot of stuff but they have to do something, so then they're shunted off to support. And there they can't really improve because they're always overloaded. And not getting the respect they deserve also doesn't help their motivation, so the downward spiral continues. There's more to it, but in my opinion that's one key factor.