#9051 Various services are not available
Closed: Fixed 3 years ago by zlopez. Opened 3 years ago by zlopez.

Describe what you would like us to do:


Various Fedora services are not available right now. I found that the following either time out or report that the service is unavailable:
https://src.fedoraproject.org/
https://kojipkgs.fedoraproject.org/
https://apps.fedoraproject.org/packages/
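
A quick way to re-check the endpoints listed above is a small script along these lines. This is only a sketch: the `classify` and `check` names, the 10-second timeout, the state labels, and the choice to treat any 5xx status as "unavailable" are assumptions for illustration, not anything taken from Fedora's own monitoring.

```python
import socket
import urllib.error
import urllib.request

# URLs reported in this ticket.
URLS = [
    "https://src.fedoraproject.org/",
    "https://kojipkgs.fedoraproject.org/",
    "https://apps.fedoraproject.org/packages/",
]


def classify(status=None, error=None):
    """Map an HTTP status or an exception to a coarse state (hypothetical labels)."""
    if error is not None:
        # socket.timeout is an alias of TimeoutError on Python 3.10+.
        if isinstance(error, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    # Assumption: any 5xx answer counts as "service not available";
    # anything below 500 is treated as reachable.
    if status is not None and status >= 500:
        return "unavailable"
    return "ok"


def check(url, timeout=10, opener=urllib.request.urlopen):
    """Fetch url and return its state; opener is injectable so it can be faked."""
    try:
        with opener(url, timeout=timeout) as resp:
            return classify(status=resp.status)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; exc.code holds the status.
        return classify(status=exc.code)
    except urllib.error.URLError as exc:
        # URLError wraps the underlying cause (e.g. a timeout) in .reason.
        reason = exc.reason if isinstance(exc.reason, Exception) else exc
        return classify(error=reason)
    except OSError as exc:
        # Covers read timeouts and other socket-level failures.
        return classify(error=exc)
```

Calling `check(url)` for each entry in `URLS` gives one of "ok", "unavailable", "timeout" or "unreachable" per service.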

When do you need this to be done by? (YYYY/MM/DD)



It looks like some of the services are up again, but those running in OpenShift are still unavailable.

It looks like the whole OpenShift cluster was reset and is coming back up.

According to a discussion with @pingou in #fedora-admin, this looks like a networking issue in IAD2.

The issue is still ongoing. More services identified as not working:
https://release-monitoring.org/
dl.fedoraproject.org

And everything running in our OpenShift is unable to deploy a new pod right now. Here is the list of projects hosted in our OpenShift; some of them may still be running:
asknot
bodhi
compose-tracker
coreos-cincinnati
coreos-koji-tagger
coreos-ostree-importer
distgit-bugzilla-sync
docsbuilding
elections
fas
fedora-ostree-pruner
greenwave
ipsilon
koschei
kube-public
kube-service-catalog
kube-system
management-infra
mdapi
message-tagging-service
messaging-bridges
monitor-gating
release-monitoring
review-stats
silverblue
the-new-hotness
transtats
waiverdb
websites

COPR builds do not work either.

Metadata Update from @smooge:
- Issue assigned to smooge

3 years ago

Metadata Update from @smooge:
- Issue priority set to: None (was: Needs Review)
- Issue tagged with: high-gain, high-trouble

3 years ago

All services are down. Routers, firewalls and switches in IAD2 are in a critical state. Work is being done on them but there is no ETA or known cause at this time.

Wow, what happened there? Looks like some deluge or earthquake.

We appear to be back. The issue seems to be around a failed switch or switching (still being investigated).

Please report anything you see still down and we will work to make sure it's back up.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

I still see the following error when using fedpkg new-sources:
Could not execute new_sources: Error occurs inside the server.

I still see issues in the OpenShift cluster; pod deployment fails with Error: ImagePullBackOff.

Is mbs.fedoraproject.org okay? Builds stuck in "init" or should I just wait longer?

Ah, I think I just needed to wait longer, sorry for the noise

@zlopez so, it turns out the storage for our OpenShift registry had the wrong perms and the registry couldn't write to it. ;( So, it came back, but there were 0 images there.

I have fixed the perms and started new builds of everything that was waiting in ImagePullBackOff.

Should be back to normal in a bit and also have the images actually stored.
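
To confirm nothing is still stuck after the rebuilds, the pod list can be filtered for containers waiting on an image pull. A minimal sketch, assuming the input is the parsed JSON from `oc get pods --all-namespaces -o json`; the helper name `pods_waiting_on_image` is made up for this example, and it is not how the rebuilds above were actually triggered.

```python
# Waiting reasons reported when an image cannot be pulled.
PULL_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def pods_waiting_on_image(pod_list):
    """Return sorted (namespace, pod) pairs with a container stuck on an image pull."""
    stuck = set()
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # containerStatuses can be missing for pods that were never scheduled.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PULL_REASONS:
                stuck.add((meta.get("namespace", ""), meta.get("name", "")))
    return sorted(stuck)
```

For example, piping the `oc` output into a script that calls `pods_waiting_on_image(json.load(sys.stdin))` should print an empty list once every image has been rebuilt and pulled.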

It looks like the issue is gone now.

@kevin Do we know the root cause of this issue? Can we do anything to prepare for this in the future?

Metadata Update from @zlopez:
- Issue status updated to: Open (was: Closed)

3 years ago

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

We don't have exact details... but the cause was a faulty switch. It sent corrupted information to other switches and caused a cascading failure.

The faulty switch was turned off on Friday and replaced on Saturday. Hopefully this was just a rare one-off issue...
