fedora-infrastructure

#11446 Datanommer DB Replica

Opened 9 months ago by rwright. Modified 2 months ago

Describe what you would like us to do:

As a part of the Metrics conversations being had by CommOps / Council / few teams, we would like to request to have a clone / replica of the Datanommer database made available to allow SQL analytics on the Fedora Message Bus data.

Ideally, we would like to approach this as two parts of a single PostgreSQL server:
1. A replica in a Postgres database of the Datanommer DB (real time would be nice, but open for investigation with Fedora Infra on possibilities - maybe a daily restore?)
2. A secondary Postgres database for analytics with a Foreign Data Wrapper connected to the restored database. This secondary database would be where data analytics could occur, including parsing JSON messages from the Fedora Messaging bus, creating views, etc.

The server would not need a full production SLA (i.e. we could withstand a week outage or more as this wouldn't be a critical application, but would empower several community conversations around our data).

Ideally, we would like to have a set of users provisioned who could connect to this database, perform analysis, and while we examine the right visualization layer, allow us to begin the data analysis piece and start to understand how we could use this data to awnser some basic questions around our long term contributor health.

If this replica could be hooked up to FAS for Login to the Postgres DB, that would be ideal (https://www.postgresql.org/docs/current/client-authentication.html - Could be configured to use FAS auth if makes sense).

When do you need this to be done by? (YYYY/MM/DD)

No ETA date needed.

t0xic0der commented 9 months ago

@rwright and @jflory7 - Just wanted to check with you if it is this project https://fedora-arc.readthedocs.io/en/latest/fcas/index.html that you are looking into this for. If yes, we would be better off joining efforts rather than replicating the efforts made before in the investigations.

rwright commented 9 months ago

@t0xic0der - It looks similar, but I think the item is more just making the data available to be examined to ask questions we haven't thought yet to ask. A lot of the above is around contributor stats, and the Initiative is more around designing a service that would provide consistent stats. To enable data science activities, we would want to leave the data open to be examined to understand what we could ask for which would probably feed this initiative. I would think of this like a pre-step to understand what we want to ask.

@jflory7 - your thoughts?

t0xic0der commented 9 months ago

Hey Robert,

I think the request can be one of the initial phases in achieving the said
solution proposed in the feasibility investigation.

Depending on how the initial phase goes, we can definitely let the result
shape the end goal of the proposed initiative because I do think both of
these lead to the same place and end up solving the same problem statement.

We intentionally keep the outcomes of the feasibility investigations open
to ensure that the proposed solutions act as "suggestions" at best to
people who would go on to work on implementing the solution.

Thanks and regards,
Akashdeep Dhar
Red Hat Community Platform Engineering
t0xic0der@fedoraproject.org
akashdeep@redhat.com

On Thu, Aug 3, 2023 at 2:04=E2=80=AFPM Robert Wright pagure@pagure.io wro=
te:

rwright added a new comment to an issue you are following:
``
@t0xic0der - It looks similar, but I think the item is more just making
the data available to be examined to ask questions we haven't thought yet
to ask. A lot of the above is around contributor stats, and the Initiativ=
e
is more around designing a service that would provide consistent stats. T=
o
enable data science activities, we would want to leave the data open to b=
e
examined to understand what we could ask for which would probably feed th=
is
initiative. I would think of this like a pre-step to understand what we
want to ask.

@jflory7 - your thoughts?
``

To reply, visit the link below or just reply to this email
https://pagure.io/fedora-infrastructure/issue/11446

rwright commented 9 months ago

@t0xic0der, @amoloney and I all met in person at Flock to discuss. A summary:

Since the initiative is still in the definition stage, this ticket's request for resources is to more enable community to start the investigation and should move forward with Fedora Infra resources to host and run
CommOps will use this database as a community database though SQL to help streamline the requests instead of analytics scripts done though the production REST API for Datagrepper (which has scaling issues)

Thanks!

Metadata Update from @phsmoura:
- Issue assigned to kevin
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

9 months ago

kevin commented 9 months ago

Sorry for taking so long to reply here. :(

First, one thing to note is that we actually provide a db dump daily:

https://infrastructure.fedoraproject.org/infra/db-dumps/datanommer2.dump.xz

it's really gigantic when uncompressed though. The current db server is using about 1.2TB.

We could perhaps in theory setup a replica. This db is using timescaledb, and I am unsure if it allows for all the replication options that postgres does. Would need to investigate.

Currently the database server is pretty idle. Would it be worth exploring just running them against the current db server first and see if there's any impact?

rwright commented 9 months ago

Hey @kevin - I am open to this! (I've had struggles restoring that database backup online and wasn't able to confirm the full compressed size but 1.2TB now makes sense why!).

My initial thoughts - Could we add a secondary DB on the same server with foreign data wrappers to the main database so we could enforce a "users can't touch the other DB?". Ideally, we would want to create some database views / mapping tables to use to join to the Datanommer data to process it so having a workspace would be great.

Just want to be safe and not cause any locks on the main database.

Edited 9 months ago by rwright

kevin commented 9 months ago

I wonder... do you need the entire history? Or just "recent" things?

If it was only recent, perhaps we could provide a db dump that had just the last say year or something, which might make it much more managable.

Of course if you need the entire thing that doesn't work.

Understood on being safe... we definitely want a solution thats safe here.

rwright commented 9 months ago

Hey Kevin

Only recent - like the past 13 months maybe? What would it look like if we did that.

kevin commented 9 months ago

Good question. I've not used timescaledb much... ;(

@abompard is there a way to dump just the last 13months from it ?

rwright commented 8 months ago

Just checking back in! I was away for a moment due to a personal issue!
@abompard - Not sure if you've got an update or not on this.

I'm looking at the Timescale Docs (https://docs.timescale.com/self-hosted/latest/replication-and-ha/) and it really doesn't seem to do partitioned recovery. If we wanted to restore parts of it - we might need to look at another approach.

rwright commented 7 months ago

@kevin - while we wait on this, would it be possible either via a new request or otherwise to request a read only account to the database? Would be open to discuss safeguards to protect system stability as well

kevin commented 7 months ago

Yeah, sorry its been so blocked here... yes, we can definitely get something setup for you. Would just a shell where you can call 'psql' with your read-only username/password work?
I suppose if you needed to use your own client you could ssh tunnel to the db server port?

If that sounds ok, I can create a read-only user and get you the login/password out of band?

rwright commented 7 months ago

Today I learned my email replies come out as garbage - so manually editing this comment...

@kevin - works for me! Let's give it a go!

Edited 7 months ago by rwright

kevin commented 7 months ago

ok. I setup a read-only db user.

I added you for now to the fi-apprentice group, which will provide you shell access.

Take a look at: https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/sshaccess/

On how to set that up. Once you have that working and can login to batcave01, I can put the password for the account in a file there in your homedir and you can then use it to access the db.

You can either use psql from batcave01, or if you prefer create a ssh tunnel to access it from your home machine(s). The db server is db-datanommer02.iad2.fedoraproject.org

Let me know once you have logged into batcave01 and your homedir exists. :)

rwright commented 6 months ago

@kevin all logged in! :)

kevin commented 6 months ago

Great. For now, lets keep this open to track the replica plans...

rwright commented 6 months ago

@kevin I didn't see a Postgres account, but I did get into batcave. Is there supposed to be a file on the homedir?

kevin commented 6 months ago

Yeah, sorry... it should be there now. I needed you to login for your homedir to get created. :)

kevin commented 4 months ago

So, this has this access worked out for your needs?

Do you still need something more?