Something seems to be going on: our monitoring is reporting very slow responses today on:
[TFT] Blackbox probe of tft-blackbox https://koji.fedoraproject.org failed is firing (critical)
[TFT] Blackbox probe of tft-blackbox https://kojipkgs.fedoraproject.org failed is firing (critical)
[TFT] Blackbox probe of tft-blackbox https://src.fedoraproject.org/rpms/setup/
Could we be on the verge of an outage?
Even pulling from dist git fails with:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 162, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3.6/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/usr/lib/python3.6/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 315, in connect
    conn = self._new_conn()
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 171, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f258fe72828>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='src.fedoraproject.org', port=443): Max retries exceeded with url: /pv/ssh/checkaccess/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f258fe72828>: Failed to establish a new connection: [Errno 110] Connection timed out',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/libexec/pagure/aclchecker.py", line 73, in <module>
    resp = requests.post(url, data=data, headers=headers)
  File "/usr/lib/python3.6/site-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='src.fedoraproject.org', port=443): Max retries exceeded with url: /pv/ssh/checkaccess/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f258fe72828>: Failed to establish a new connection: [Errno 110] Connection timed out',))
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
Note: That appears to be a server-side traceback as I am not running anything on Python 3.6 locally.
yep, this is getting more pressing now ...
escalated.
So, this was largely caused by proxy01 and proxy10 (our main iad2 proxies). They would simply stop getting any connections. A restart of httpd seemed to help for a short time, but did not really fix anything.
I updated/rebooted them now and they seem much more stable.
I am thinking this was an odd kernel networking bug somehow stalling incoming connections. ;(
I'm going to leave this ticket open for a while longer to make sure things are stable, and I'm going to try to dig through the logs some more for any root cause hints.
Metadata Update from @mohanboddu: - Issue assigned to kevin - Issue tagged with: high-gain, high-trouble, ops
Visiting koji.fedoraproject.org and "fedpkg clone"ing from pkgs.fedoraproject.org are again timing out for me :(
Interestingly enough, traceroute doesn't seem to be able to get a response at all, but if I turn on ICMP mode (traceroute -I koji.fedoraproject.org), I'm getting responses (way fast and reliable) from proxy-iad02. Maybe this helps pinpoint the issue? (Or maybe ICMP messages just get routed to a different proxy that's still working...)
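Since ICMP replies say little about whether the proxies actually accept HTTPS connections, a direct TCP connect test on port 443 may be a more telling check. Here is a minimal sketch, assuming Python 3 is available. The host list comes from the alerts in this ticket; the port, the 10-second timeout, and the helper names tcp_connect_time and format_result are illustrative assumptions, not anything used by Fedora infrastructure.

```python
import socket
import time

# Hosts taken from the alerts in this ticket; port and timeout below are assumptions.
HOSTS = [
    "koji.fedoraproject.org",
    "kojipkgs.fedoraproject.org",
    "src.fedoraproject.org",
]


def tcp_connect_time(host, port=443, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and DNS failures alike.
        return None


def format_result(host, elapsed):
    """Render one probe result as a single human-readable line."""
    if elapsed is None:
        return f"{host}: FAILED"
    return f"{host}: connected in {elapsed:.3f}s"


if __name__ == "__main__":
    for host in HOSTS:
        print(format_result(host, tcp_connect_time(host)))
```

If this reports FAILED while traceroute -I still gets replies, the problem is more likely in how the proxies accept TCP connections than in routing to them.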
ok, I hope it's stable now...
Can everyone retry and confirm if it's back to normal?
I think it's good now. Though I'm still sometimes getting HTTP 500 errors from PDC when pushing things to git.
LGTM, no issues spotted on our CI systems consuming all koji builds
I hope this is solved. Let's keep our fingers crossed. :)
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)