How to do X every ~Y requests in PHP?
There are many situations where you need to do something sometimes but not every time.
Spoiler: this entire article is about this line of code:
if (mt_rand(0, 9) === 0)
Most common scenarios in typical PHP fashion that come to mind are:
- I need to clear the active sessions in my app every now and then. So you need a form of garbage collection that you could run on every request, but that obviously would not scale very well.
- I want to cache this information for about 10 requests. Most caching scenarios work better with a TTL, but let's say you have no fancy system in place where you can track that, and you want to keep it stupidly simple.
- I need to track the user's last active time, but I do not want to connect to the database's master node on every request just to update the timestamp (a sketch of this one follows the list).
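To make that last scenario concrete, here is a minimal sketch of the approach this article is about. The $pdo connection, $userId, the users table with its last_active_at column, and the 1-in-50 rate are all assumptions for illustration:

// Update "last active" on roughly 1 in 50 requests instead of all of them.
// $pdo and $userId are assumed to exist; the schema is hypothetical.
if (mt_rand(0, 49) === 0) {
    $statement = $pdo->prepare('UPDATE users SET last_active_at = NOW() WHERE id = ?');
    $statement->execute([$userId]);
}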
Common Solutions
There is a plethora of possible solutions to tackle this type of issue, but most revolve around tracking some sort of state somewhere to determine if Action X should be executed or not.
- Redis: Using an in-memory key/value store like Redis, where reads & writes are cheap (a small sketch follows this list).
- Scheduling: Using plain old cron jobs or some other form of scheduled process manager can do the trick just fine when your timings are mostly static.
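For illustration, the stateful counter variant might look like this. A minimal sketch, assuming the phpredis extension and a key name I made up; note that every single request pays a round trip to Redis:

// Every request increments a shared counter; every 1,000th increment
// triggers the action. The counter is the state we have to maintain.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

if ($redis->incr('request-counter') % 1_000 === 0) {
    // Do something every 1,000th request.
}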
All of these solutions are perfectly valid and can be the proper fit for your problem.
Probability is Your Friend
In this blog post, I want to highlight a stateless approach that comes in incredibly handy in two seemingly opposing situations:
- When you want to keep something really simple.
- When you want a system (yes, in PHP) to scale with massive amounts of traffic.
They are only seemingly opposing because, if you think about it for a second, something really simple takes less work; the less a system has to do, the faster it's going to be.
This is already a lot of text for one line of code, but here it is:
if (mt_rand(0, 24) === 0) {
    // Do something approximately every 25th time.
}
Wait, that's it?
Yup.
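If you want the intent to be obvious at the call site, you can wrap the line in a tiny helper. The function name oneIn is my own invention, not from any library:

// Returns true approximately once every $n calls.
function oneIn(int $n): bool
{
    return mt_rand(0, $n - 1) === 0;
}

if (oneIn(25)) {
    // Do something approximately every 25th time.
}

Note the upper bound of $n - 1: mt_rand(0, $n - 1) can produce exactly $n different values, so the chance of hitting 0 is 1 in $n.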
Precision and consistency are not that important in all parts of your application; let me explain.
Let's say:
- you serve approximately a million requests/s on average.
- you want to somehow track how many requests you process.
Does it really matter to have a perfectly exact number?
For many cases, it obviously does not.
Perfectly counting them across multiple servers, potentially multiple regions, can become a nightmare.
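As a rough sketch of what such an approximate counter could look like (again assuming phpredis; $redis is a connected client and the key name is made up):

// Stateless sampling: only ~1 in 1,000 requests touches shared state,
// and each of those accounts for 1,000 requests at once.
$sampleRate = 1_000;

if (mt_rand(0, $sampleRate - 1) === 0) {
    // $redis is assumed to be a connected phpredis client.
    $redis->incrBy('requests:total', $sampleRate);
}

Unlike the stateful counter from earlier, no request has to read or write anything unless the lottery hits; the trade-off is that the total becomes an estimate.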
Let's Simulate
Okay, maybe you don't believe how close a probabilistic method can come to the actual count. And sure, there is no "true" randomness on a computer, but let's simulate it with a simple script:
<?php

$epochs = 10; // How many times to run the simulation.
$errorValues = [];

for ($e = 0; $e < $epochs; $e++) {
    $runForSeconds = 300; // 5 minutes
    $averageReqPerSecond = 1_000_000;
    $simulatedRequests = $runForSeconds * $averageReqPerSecond;
    $countEvery = 1_000;
    $countedRequest = 0;

    for ($i = 0; $i < $simulatedRequests; $i++) {
        // Hit roughly once every $countEvery iterations and
        // account for all of them in one go.
        if (mt_rand(0, $countEvery - 1) === 0) {
            $countedRequest += $countEvery;

            // Progress indicator, roughly every 10 million counted requests.
            if ($countedRequest % 10_000_000 === 0) {
                echo '.';
            }
        }
    }

    $error = round(abs(($simulatedRequests - $countedRequest) / $simulatedRequests * 100), 2);
    $errorValues[] = $error;

    echo "\nEpoch: $e\n";
    echo "Counted requests: $countedRequest\n";
    echo "Expected requests: $simulatedRequests\n";
    echo "Difference: " . ($simulatedRequests - $countedRequest) . "\n";
    echo "Error: " . $error . "%\n";
    echo "---------------------------------\n";
}

echo "Average error: " . array_sum($errorValues) / count($errorValues) . "%\n";
In this scenario, we run 300 million iterations 10 times and only count roughly every 1,000th iteration.
If we stick with our example from above, this would reduce the number of synchronization operations needed to store the count from a million per second to just a thousand per second.
Error
Running this for 10 epochs gives me an average error of 0.221%:
.
Epoch: 8
Counted requests: 299888000
Expected requests: 300000000
Difference: 112000
Error: 0.04%
---------------------------------
.............................
Epoch: 9
Counted requests: 299376000
Expected requests: 300000000
Difference: 624000
Error: 0.21%
---------------------------------
Average error: 0.221%
In theory, the error should shrink over a larger timeframe, given perfect randomness: with a sampling factor of k and N total requests, the relative standard deviation of the estimate is roughly √(k/N), so the more requests you collect, the smaller the relative error gets.
Here, with a 1-hour timeframe instead of 5 minutes:
Counted requests: 3599651000
Expected requests: 3600000000
Difference: 349000
Error: 0.0097%
In the simulation, this seems to be true, but again, this is all just probability.