Enormous load on DB - looking for help diagnosing

Our 1.8.2 CMS began experiencing major performance problems after we did a bulk deploy of Layout changes using the API. We performed a good number of troubleshooting steps which I’m happy to detail if desired.

When we shut down all the players, the system returned to normal performance. Yesterday we brought up the players one by one and the system continued to perform well throughout the day. This morning, it’s performing poorly again (though not quite as bad as before).

Using ‘mytop’ to monitor MySQL, we see 200+ queries per second - 19M+ queries total since yesterday morning. The QPS rate is similar to what we saw when the problem first occurred. Running ‘top’ also shows a large number of Apache threads.
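
For reference, this is roughly how the raw counters can be pulled from MySQL itself to cross-check what mytop reports (the container name xibo_cms-db_1 is an assumption based on the stock docker-compose naming, so adjust as needed):

    # Total queries and uptime since the server started; QPS is roughly Questions / Uptime
    docker exec -it xibo_cms-db_1 mysql -u root -p \
      -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Questions','Uptime');"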

I suspect the load we put on the system backed it up and it’s having trouble clearing queues/caches/etc.

My question is: Is there a Xibo-appropriate way to troubleshoot this or do I need to continue using MySQL and Apache performance tools? If the latter, any suggestions on how to monitor Apache when it’s in Docker?

Thanks much! If I can figure out the root cause, I’ll post it here for knowledge base purposes.

I should mention: Host system resources look OK - CPU runs 3-30% and memory doesn’t touch swap at all.

How many Players, and what is their collection interval? For a busy CMS with hundreds of Players and a low collection interval, 200 QPS isn’t a lot in my experience. If there are only 10 Players, though, that figure is very high.

You should be able to look at MySQL and see what queries are being run. My guess would be that it’s all INSERTs from Players uploading stats. Make sure the CMS itself isn’t auditing or logging at anything above Error level, as that will only compound the problem. It’s very unlikely to be anything querying the database that frequently, since the CMS caches most of the content returned to the Players.
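
A minimal sketch of that check, assuming the stock Docker container naming (xibo_cms-db_1) and root access to MySQL:

    # What is executing right now, with full query text
    docker exec -it xibo_cms-db_1 mysql -u root -p -e "SHOW FULL PROCESSLIST;"

    # Cumulative counts of SELECTs vs INSERTs since the server started
    docker exec -it xibo_cms-db_1 mysql -u root -p \
      -e "SHOW GLOBAL STATUS LIKE 'Com_select'; SHOW GLOBAL STATUS LIKE 'Com_insert';"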

That would be where I would start looking.

If it is all stats being uploaded, then really there’s not a lot you can do other than check that the Players are making progress (i.e. they’re uploading faster than they’re generating stats), and if so, wait it out. They will catch up and the load will drop away.

Equally, if the new media you pushed out is all very short duration, the Players may be generating more stats than they were before, in which case you’ll need to start tuning MySQL to improve its INSERT performance. If you’re not already running on SSDs, that would be the first quick win. After that, there are lots of tunables that can increase INSERT performance.
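
Purely as an illustration of the sort of knobs involved (the values are placeholders, not recommendations for this particular system; size them against your own RAM and durability requirements):

    # my.cnf override -- illustrative values only
    [mysqld]
    innodb_buffer_pool_size        = 2G        # as large as spare RAM comfortably allows
    innodb_log_file_size           = 512M      # a bigger redo log absorbs bursts of INSERTs
    innodb_flush_log_at_trx_commit = 2         # trade a little durability for write throughput
    innodb_flush_method            = O_DIRECT  # avoid double-buffering through the OS page cache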

Thanks, Alex!

There are around 90 players. The Collection Interval is 1 minute. According to ‘mytop’ (a top-like utility for MySQL), it’s 85% SELECT statements. We’ve set the log level to “None”.

Another notable data point we’ve realized since the original post: the DB server doesn’t actually seem to be bogged down. When we connect to it via MySQL Workbench or the command-line client, it’s quite responsive. So we’re now focusing on the Apache side, thinking there may be some sort of backlog of DB connections. Our guess is that if Apache can’t connect or times out, the request gets buffered and retried, and that backlog grows over time. Just a guess, though.
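
For anyone following along, this is the sort of thing that can be checked from the host while the CMS web container is up (the container name xibo_cms-web_1 is assumed from the stock docker-compose naming):

    # Apache worker processes currently running inside the CMS web container
    docker top xibo_cms-web_1

    # Live CPU / memory / network usage for the container
    docker stats xibo_cms-web_1

    # Open a shell inside the container to inspect Apache's config and logs
    docker exec -it xibo_cms-web_1 sh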

We’ll report back when we learn more…

Just curious as to why you have a 1-minute collection interval?

We generally recommend 5 minutes as a minimum on a production system, since with XMR, content changes are pushed to the Players straight away. There’s no need to have the Players hammer away that often (unless there’s a specific reason to).

Apache needs a huge amount of tuning for a large production setup. If you’re running with a stock config there, then there will be gains to be had for sure.

Thanks again! We originally set it that low for instant gratification on content changes (we update playlists dynamically). I actually didn’t realize XMR allowed for that.

On the Apache tuning – we’re using whatever settings you put into the Docker image. We’ll investigate further on that.

The Docker image largely uses the defaults. I don’t know what resources will be available where it’s deployed, for example, so it’s tuned to consume at most 1GB of RAM under extreme load.

I’d certainly be looking at increasing the number of concurrent Apache processes you’re running if you have RAM available to do so.
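
As an illustration of what that tuning looks like (assuming the image is running Apache’s prefork MPM, which is typical with mod_php; the numbers are placeholders to be sized against your available RAM and per-process footprint):

    # Illustrative mpm_prefork overrides -- not recommendations for this system
    <IfModule mpm_prefork_module>
        StartServers            10
        MinSpareServers         10
        MaxSpareServers         25
        MaxRequestWorkers      150      # hard cap on concurrent Apache processes
        MaxConnectionsPerChild 1000     # recycle workers to limit PHP memory growth
    </IfModule>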

Dropping your collection to 5 minutes will have already cut your background traffic by four fifths, so that’s a big win!

We changed the collection interval to 30 minutes and… it’s fixed!

We’re going to set up a sandbox system (some day) so we can do performance testing. We have 100 players now, but we expect to reach 3000+ eventually. Looking forward to providing the Xibo community with any results we get and settings we have to make for large-scale production.

Cheers…

Great. The only thing to watch with 30 minutes is that the stats being uploaded don’t backlog.

Each collection, the Player uploads 100 stats (if there are 100 ready), so if your content creates more than 100 records in 30 minutes, it will backlog and never catch up. How many stat records there are depends entirely on the content you’re showing.
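
A back-of-envelope check, assuming roughly one stat record per item displayed (both that assumption and the numbers below are illustrative, not measured from your system):

    # Will stat uploads keep pace with stat generation?
    interval_min=30      # collection interval in minutes
    avg_item_sec=15      # average duration of each displayed item, in seconds
    generated=$(( interval_min * 60 / avg_item_sec ))   # 120 records per interval
    uploaded=100                                        # upper bound uploaded per collection
    if [ "$generated" -gt "$uploaded" ]; then
        echo "backlog grows by $(( generated - uploaded )) records per interval"
    else
        echo "player keeps up"
    fi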

Hi Alex!

After a few weeks of recovering from our load issue, we discovered we’re missing stats using a 15-minute collection interval. We have a few questions, please:

  1. Does the player EVER delete stats files before sending them to the CMS? If so, on what schedule does this happen? If not, we’ll have to figure out why we’re seeing 5 minutes’ worth of stats and missing 10 minutes’ worth for every 15-minute period.

  2. You mentioned it will upload 100 stats if 100 are ready. We’re only seeing batches of 30. We would have expected to see batches of 100 with a continuously-increasing backlog. Unless, as I was asking above, the player deletes stats files before they’re uploaded.

Any thoughts? Many thanks!!

UPDATE: We set up a test player tonight and watched the file system (Xibo Library folder). The player is creating a new Stats file every 1-2 minutes with 10 “rows” per file. Every 15 minutes (our Collection Interval), it deletes 3 files. We would have expected it to delete 10, based on Alex’s indication that 100 records are collected if available, but it’s only grabbing 3.

So there are approximately 10 files created in a 15-minute period, but only 3 processed/deleted.

Not 100% sure yet how this explains the gap in stats we’re seeing (5 mins “on” and 10 mins “off”) as described above.

You can tune the number of files that get sent each collection in the CMS Display Settings profile active on your Display. The setting is under Advanced and is snappily called “Limit the number of log files uploaded concurrently” - the default is 3.

The 100 records per collection figure Alex mentioned is applicable only to Xibo for Android.

Thanks, Dan! Couple of followup questions, please:

  1. The label in the Display Setting says: “The lower the number the longer it will take, but the better for memory usage.” Is this Player memory, CMS memory, or both?

  2. Is there a practical maximum for this? I’d like to set it to 12 or 15, but I don’t want to severely burden our systems.

  3. Might you consider changing the label in the Display Setting to “Limit the number of log/status files uploaded concurrently”? I saw this setting when we were doing our investigation and believed that it only pertained to log entries, not stats. I did do some searching to see if stats were included and came up dry - apologies if it’s documented somewhere and I missed it!