#11726 Setup RISC-V builder(s) VM in Fedora Infrastructure
Opened 4 months ago by zlopez. Modified 2 months ago

Describe what you would like us to do:


There is a plan to support RISC-V architecture in Fedora. For this we need to setup VM(s) and prepare the infrastructure/releng to support this architecture as well.

When do you need this to be done by? (YYYY/MM/DD)


No specific date yet


This is waiting for hardware to be available.

Metadata Update from @kevin:
- Issue tagged with: blocked

4 months ago

Metadata Update from @t0xic0der:
- Issue assigned to t0xic0der

2 months ago

First time I've seen this bug ...

One thing we do really need, which is not dependent on hardware availability, is a unified Koji instance, hosted by Fedora and connected to FAS. We currently have two externally hosted Koji instances which are not connected to FAS.

http://fedora.riscv.rocks/koji/
http://openkoji.iscas.ac.cn/koji/

We already have plenty of RISC-V builders (mix of VF2, HiFive Unmatched, and qemu) which could be connected to this.

So, I think there is a lot of confusion around people saying 'hardware' without specifying what exactly they are talking about. ;)

My understanding of things:

  • We are waiting for x86_64 hardware to build a koji hub, db, and compose vm on.
  • We are waiting for new netapp storage to be installed so we can provide storage for the new instances.
  • The x86_64 hardware also might include space to run risc-v vm's for builders (but perhaps we now don't want to do that in the end?)

Once the (x86_64 and storage) hardware shows up we can stand up a hub/db/composer; I guess we should revisit the builder plans at that point.

CC: @smilner

The x86_64 hardware also might include space to run risc-v vm's for builders (but perhaps we now don't want to do that in the end?)

qemu is really slow so I wouldn't bother with this one. Between David and the folks in China we have a huge pile of real RISC-V machines we can connect, and we'll get even more in the next few months.

@kevin @rjones do we foresee any issues with builders being far away from a dedicated Koji instance/scheduler?

In the past Fedora/RISCV Koji used to be in Fremont, US while the majority of builders were in Europe. I have never found any major issue with the distance. To my knowledge there shouldn't be anything latency sensitive (in milliseconds). The only limit is basically your bandwidth / line for external users. That could also be improved by using a local cache. Richard did something like that some time ago for his boards IIRC.

What David says basically. It's kind of amazing that it works to be honest as I don't think Koji was designed with this in mind.

Mock passes the http_proxy, ftp_proxy, https_proxy, and no_proxy variables from the user environment. Thus it's designed to do this.
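
To illustrate that convention, a minimal sketch: a process on a builder picks up the same *_proxy variables from its environment. The proxy URL and no_proxy list here are purely hypothetical; a real deployment would set them in the builder environment rather than in a script.

```python
#!/usr/bin/env python3
# Minimal sketch of the *_proxy environment convention that mock passes
# through to the build environment. urllib honors the same variables, so
# this prints whatever proxies a process started with this environment
# would end up using. The proxy URL below is purely illustrative.
import os
import urllib.request

# Illustrative values only; a real builder would have these set in its
# environment rather than hard-coded in a script.
os.environ.setdefault("http_proxy", "http://cache.example.org:3128")
os.environ.setdefault("https_proxy", "http://cache.example.org:3128")
os.environ.setdefault("no_proxy", "localhost,127.0.0.1")

# getproxies() reads the *_proxy variables from the environment.
print(urllib.request.getproxies())
```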

I am not sure that the problems are with koji and building, but rather with the parts of the build system that need access to a central NFS directory (I forget exactly which, but I know they are important) and require the same arch. This is where things have seen the most problems with s390x. At first various VPNs were tried, but in the end the only reliable system was via fuse_ssh, because it can deal with the very high latencies, misordered packets, and other things which can happen with long-distance writes.

For the s390x it has been something like:

[fedora NFS netapp] <-> [site network eqt] <-> [internal firewall] <-> [internal long haul network connection] <-> [ internal firewall] <-> [s390x network eqt] <-> [s390x dedicated boxes]

Any of those can cause problems (latency, bandwidth blockage, packet problems, etc) with transmission or may need additional IT resources to debug.

While I think that CN or EU may not be a problem for builds, the NFS sections are probably the parts that would best be kept close to the main server.
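
For reference, a minimal sketch of the fuse_ssh/sshfs approach mentioned above. The host, export path, and mountpoint are hypothetical, and the options shown are generic sshfs/ssh options, not Fedora's actual configuration.

```python
#!/usr/bin/env python3
# Minimal sketch (not the actual Fedora setup): mount a remote koji volume
# over sshfs, which copes with high latency and flaky long-haul links
# better than NFS. Host, export path, and mountpoint are hypothetical.
import subprocess

REMOTE = "storage.example.org:/mnt/koji"  # hypothetical remote export
MOUNTPOINT = "/mnt/koji"                  # local mountpoint on the builder

subprocess.run(
    [
        "sshfs", REMOTE, MOUNTPOINT,
        "-o", "reconnect",               # re-establish dropped connections
        "-o", "ServerAliveInterval=15",  # plain ssh keepalive options,
        "-o", "ServerAliveCountMax=3",   # passed through by sshfs
        "-o", "ro",                      # read-only is enough for createrepo-style tasks
    ],
    check=True,
)
```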

I think we may be talking about different issues, since we've been running Koji in this configuration for years with relatively few issues. In our setup builders don't need access to NFS. AIUI they upload the finished artifacts over HTTPS back to the kojihub.

@kevin @rjones do we foresee any issues with builders being far away from a dedicated Koji instance/scheduler?

Of course they are then subject to network slowness/issues, but as noted that has not been too much of a problem to date.

There are 2 reasons (at least that I can think of off the top of my head) why builders need a direct koji mount:

  1. builders doing createrepos/newrepos need to have a read-only mount of the koji volume in order to do those. This could be accomplished with a local x86_64 vm or two that do those. They normally don't need to be the same arch as the repos they're making.

  2. builders doing runroot tasks need to be able to mount the koji volume (rw) because they write results directly to the koji volume. This can be an issue here, but only when we start doing composes. These typically do need to be the same arch as the thing they are making. With s390x we use an sshfs mount. It's slow, but functional. Typically in primary koji we set builders with the rw mount to be 'compose' channel only, that is, they don't run normal jobs, only compose jobs (to avoid any chance of a build doing something with the koji volume); a read-only channel query like the sketch below makes this easy to check. So, once we are doing composes we could set up a builder or two with an sshfs mount. ;( Or we could possibly emulate in an x86_64 vm for this part, or we could look at adding some small number of riscv SOCs at the datacenter just for this.
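
As a side note to point 2, a minimal read-only sketch of auditing which hosts sit in each channel over the hub's XML-RPC API. The hub URL is hypothetical; listChannels/listHosts are standard hub calls, though the returned fields can vary slightly between koji versions.

```python
#!/usr/bin/env python3
# Minimal read-only sketch: list which builders are in each koji channel,
# e.g. to confirm only dedicated hosts end up in the 'compose' channel.
# The hub URL is hypothetical.
import koji  # provided by the python3-koji package

HUB = "https://koji.example.org/kojihub"  # hypothetical hub URL
session = koji.ClientSession(HUB)         # anonymous, read-only session

for channel in session.listChannels():
    hosts = session.listHosts(channelID=channel["id"])
    names = ", ".join(host["name"] for host in hosts) or "(no hosts)"
    print(f"{channel['name']}: {names}")
```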

Do note that the current setup is entirely about a 'secondary' hub to help focus and coordinate efforts. Once we try to move the arch into primary, things are different. There we definitely do want to control all hosts that do builds, and ideally have them local to avoid network issues and such.

One final note... koji upstream changed the scheduler in 1.34.0. It used to be that builders connected to the hub and asked for tasks. Now in 1.34.0, the hub assigns things more directly. I am not sure how this might affect a deployment with builders across the network, but our s390x resources have been fine with it.
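
If it helps when planning, a minimal sketch of checking which koji release a hub reports, so you know whether the 1.34.0 scheduler is in play. The hub URL is hypothetical, and the getKojiVersion call is assumed to be available on the hub (older hubs may not provide it, hence the fallback).

```python
#!/usr/bin/env python3
# Minimal sketch: ask a hub which koji release it runs, to tell whether the
# 1.34.0 scheduler change applies. The hub URL is hypothetical, and the
# getKojiVersion call is assumed to be available (older hubs may not have it).
import koji

HUB = "https://koji.example.org/kojihub"  # hypothetical hub URL
session = koji.ClientSession(HUB)

try:
    print("hub version:", session.getKojiVersion())
except Exception as exc:  # hubs without this call will raise a fault
    print("could not determine hub version:", exc)
```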

OK. I will try to make a short description on what we do now:

  1. We do have an x86_64 builder in the farm with a very large maxjobs value in kojid.conf (see the kojid.conf sketch below). It takes all the jobs like newRepo, createrepo, build, and similar. There is no value in running these on riscv64 builders (it would be slower too). Almost all riscv64 builders are maxjobs=1. Thus we don't want actual "builders" to take jobs that do nothing, or don't need to run on riscv64. These builders don't have direct access to /mnt/koji (i.e. NFS).
  2. Technically even buildSRPMFromSCM doesn't need to run on riscv64, but we still do that, as that alone is a good test that nothing is significantly broken.
  3. You are right. Some builders (i.e. some tasks) require NFS RO and/or RW access, especially for Pungi composes (we haven't finished work on them in Fedora/RISCV Koji). Those could be x86_64 builders and/or riscv64 VMs on x86_64. There is no physical riscv64 machine capable enough to deliver the required performance for Pungi composes. I was experimenting/working on some changes to convert some things from arch specific to basically any arch. IIRC the createImage task spawns a libvirt VM to run an anaconda install into a disk image. That cannot run on riscv64 and provide enough performance to complete the task. I was forcing these to run on x86_64 (which did its job), but there were a few other Red Hat artifacts being generated that required mounting the generated disk image and running the rpm binary to dump installed-package information to cook some XML. Somehow binfmt_misc didn't work via the libguestfs tools. The funny thing was that anaconda was already providing that information in the logs IIRC.

TL;DR there will be some x86_64 machines and/or libvirt VMs (riscv64) involved in this.
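
To make the maxjobs point in item 1 concrete, a minimal sketch that reads the value from a builder's kojid.conf. The path and section name match stock koji builder packaging, but the fallback value is only for this sketch.

```python
#!/usr/bin/env python3
# Minimal sketch: read maxjobs from a builder's kojid.conf (the stock koji
# builder config is an ini-style file with a [kojid] section). This just
# illustrates the split described above: a large maxjobs on the x86_64
# "farm" builder vs. maxjobs=1 on the riscv64 builders.
import configparser

CONF = "/etc/kojid/kojid.conf"  # stock path on a koji builder

cfg = configparser.ConfigParser()
cfg.read(CONF)
# The fallback here is only for this sketch, used when the option is unset.
maxjobs = cfg.getint("kojid", "maxjobs", fallback=1)
print(f"{CONF}: maxjobs={maxjobs}")
```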

