fedora-infrastructure

#9051 Various services are not available

Closed: Fixed 3 years ago by zlopez. Opened 3 years ago by zlopez.

Describe what you would like us to do:

Various Fedora services are not available right now. The following either time out or report that the service is not available:
https://src.fedoraproject.org/
https://kojipkgs.fedoraproject.org/
https://apps.fedoraproject.org/packages/
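
For anyone who wants to re-check these endpoints, a minimal sketch of a probe follows. It is only an illustration, not a tool used by the infrastructure team: the helper names `classify` and `probe` and the 10-second timeout are assumptions, and it uses only the Python standard library.

```python
import socket
import urllib.error
import urllib.request

# URLs reported as failing in this ticket.
SERVICES = [
    "https://src.fedoraproject.org/",
    "https://kojipkgs.fedoraproject.org/",
    "https://apps.fedoraproject.org/packages/",
]


def classify(status=None, error=None):
    """Map an HTTP status code or a connection error to a short label."""
    if error is not None:
        # socket.timeout is an alias of TimeoutError on recent Python versions.
        if isinstance(error, (TimeoutError, socket.timeout)):
            return "timeout"
        return "unreachable"
    if status == 503:
        return "service unavailable"
    if status is not None and status < 400:
        return "ok"
    return f"http {status}"


def probe(url, timeout=10):
    """Fetch `url` once and return a label describing what happened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(status=resp.status)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return classify(status=exc.code)
    except (urllib.error.URLError, OSError) as exc:
        # URLError wraps the underlying cause in .reason; plain OSError does not.
        return classify(error=getattr(exc, "reason", exc))
```

Calling `probe(url)` for each entry in `SERVICES` separates "timeout" from "service unavailable", which are the two symptoms described above.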

When do you need this to be done by? (YYYY/MM/DD)

zlopez commented 3 years ago

Another service:
https://koji.fedoraproject.org/

zlopez commented 3 years ago

It looks like some of the services are up again, but those running in OpenShift are still unavailable.

zlopez commented 3 years ago

It looks like the whole openshift cluster was reset and it's getting back up.

Edited 3 years ago by zlopez

zlopez commented 3 years ago

According to a discussion with @pingou in #fedora-admin, this looks like a networking issue in IAD2.

zlopez commented 3 years ago

The issue is still ongoing. Other services identified as not working:
https://release-monitoring.org/
dl.fedoraproject.org

And everything running in our OpenShift is unable to deploy a new pod right now. Here is the list of projects hosted in our OpenShift; some of them could still be running:
asknot
bodhi
compose-tracker
coreos-cincinnati
coreos-koji-tagger
coreos-ostree-importer
distgit-bugzilla-sync
docsbuilding
elections
fas
fedora-ostree-pruner
greenwave
ipsilon
koschei
kube-public
kube-service-catalog
kube-system
management-infra
mdapi
message-tagging-service
messaging-bridges
monitor-gating
release-monitoring
review-stats
silverblue
the-new-hotness
transtats
waiverdb
websites

Edited 3 years ago by zlopez

jkonecny commented 3 years ago

COPR builds do not work either.

praiskup commented 3 years ago

Yes, copr issues reported here:
https://lists.fedoraproject.org/archives/list/copr-devel@lists.fedorahosted.org/thread/45GSVWNLZ2P4LJ4TMCIZLERMYWGISZXK/

Metadata Update from @smooge:
- Issue assigned to smooge

3 years ago

Metadata Update from @smooge:
- Issue priority set to: None (was: Needs Review)
- Issue tagged with: high-gain, high-trouble

3 years ago

smooge commented 3 years ago

All services are down. Routers, firewalls and switches in IAD2 are in a critical state. Work is being done on them but there is no ETA or known cause at this time.

jkonecny commented 3 years ago

> All services are down. Routers, firewalls and switches in IAD2 are in a critical state. Work is being done on them but there is no ETA or known cause at this time.

Wow, what happened there? Looks like some deluge or earthquake.

kevin commented 3 years ago

We appear to be back. The issue seems to be around a failed switch or switching (still being investigated).

Please report anything you see still down and we will work to make sure it's back up.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

adrian commented 3 years ago

I still see the following error when using fedpkg new-sources:
Could not execute new_sources: Error occurs inside the server.

kevin commented 3 years ago

@adrian should be fixed now.

mbooth commented 3 years ago

Is mbs.fedoraproject.org okay? Builds stuck in "init" or should I just wait longer?

zlopez commented 3 years ago

I still see issues in the OpenShift cluster: pod deployment fails with Error: ImagePullBackOff.
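
To list which pods are affected, here is a rough sketch that scans a pod list for containers waiting on an image pull. It assumes the input has the shape of a Kubernetes PodList, such as the output of `oc get pods -o json`. The function name and the sample data are made up for illustration and are not taken from the cluster.

```python
import json

# Waiting reasons that indicate the image could not be pulled.
PULL_FAILURES = {"ImagePullBackOff", "ErrImagePull"}


def pods_failing_image_pull(pod_list):
    """Return sorted names of pods with a container stuck on an image pull."""
    failing = set()
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unnamed>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container in statuses:
            # "waiting" is absent (or null) for running and terminated containers.
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PULL_FAILURES:
                failing.add(name)
    return sorted(failing)


# Hand-written sample in PodList shape; pod names are hypothetical.
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "example-web-1-abcde"},
      "status": {"containerStatuses": [
        {"state": {"running": {"startedAt": "2020-01-01T00:00:00Z"}}}
      ]}
    },
    {
      "metadata": {"name": "example-worker-2-fghij"},
      "status": {"containerStatuses": [
        {"state": {"waiting": {"reason": "ImagePullBackOff"}}}
      ]}
    }
  ]
}
""")

print(pods_failing_image_pull(SAMPLE))
```

With real data, the parsed JSON from the cluster would replace `SAMPLE`.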

mbooth commented 3 years ago

> Is mbs.fedoraproject.org okay? Builds stuck in "init" or should I just wait longer?

Ah, I think I just needed to wait longer, sorry for the noise

kevin commented 3 years ago

@zlopez so, turns out our storage for our openshift registry had the wrong perms and it couldn't write to it. ;( So, it came back, but there were 0 images there.

I have fixed the perms and started new builds of everything that was waiting in imagepullbackoff.

Should be back to normal in a bit and also have the images actually stored.
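
A check along these lines could catch unwritable storage earlier. This is only a sketch under assumptions: the function name `can_write` is hypothetical, the real registry storage path is not given in this ticket, and the check simply tries to create a temporary file in the given directory.

```python
import os
import tempfile


def can_write(directory):
    """Return True if a file can actually be created in `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers permission errors, a missing directory, read-only mounts, etc.
        return False
```

Creating a real file is used instead of `os.access` because it exercises the same path a writer would take. The check has to run as the same user as the service that writes to the storage, otherwise the result says little.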

zlopez commented 3 years ago

It looks like the issue is gone now.

@kevin Do we know the root cause of this issue? Can we do anything to prepare for this in the future?

Metadata Update from @zlopez:
- Issue status updated to: Open (was: Closed)

3 years ago

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

kevin commented 3 years ago

We don't have exact details... but the cause was a faulty switch. It sent corrupted information to other switches and caused a cascading failure.

The faulty switch was turned off on Friday and replaced on Saturday. Hopefully this was just a rare one-off issue...

Metadata

Assignee

smooge

Tags

high-gain, high-trouble

Blocking

None

Depending on

None

Priority

🔥 URGENT 🔥
