Rate limits at Cloudflare
Hi Magnus,
When downloading via a script, I'm rate-limited after a few (40?) requests. Instead of HTTP 200, I receive a 429 whose body is `error code: 1015`, which is documented here: https://support.cloudflare.com/hc/en-us/articles/360029779472-Troubleshooting-Cloudflare-1XXX-errors#error1015, so it looks like this is something you configured.
I'm downloading one file at a time, i.e. nothing in parallel. Mostly smaller boundaries, but also a few big ones. I expect to do ~6000 per run: one run with water and one without, so around 12000 in total. I'm planning to refresh my data roughly once a year.
Do you have any recommendation on how long I should wait between requests? Or should I just wait and retry repeatedly whenever I get limited?
I'd rather follow your recommendations than push your great service to its limits (and thus possibly trigger you to impose stricter limits).
Hi, thank you for asking.
The rate limiting was primarily added to reduce the load from bad client implementations using the site. At times we had huge load peaks because scripts without proper error handling just kept retrying for hours, especially when they used multi-threading or equivalent techniques.
So you are correct to ask, if only to keep us from imposing even stricter policies. To be honest, using Cloudflare's WAF for this was the lazy (fast) solution rather than implementing something of our own. If we had implemented it ourselves, we could have given better feedback.
The rate limit is set to 70 requests per minute (per client IP), which should normally be enough, but it can perhaps be exceeded even with proper use when downloading a lot of small (fast-to-generate) polygons. When you hit the 429, Cloudflare blocks the client for 60 seconds.
Each download is at least two HTTP requests, but normally three: first the initial request, which redirects to a wait URL if the file isn't ready. Depending on whether the file is ready, each wait URL request then redirects either back to the wait URL or to a direct download link. A wait URL request lasts up to 20 seconds, unless the file becomes ready sooner (a rough sketch of the flow follows below).
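To illustrate that flow, here is a rough sketch assuming Python's `requests` library; the function name and the idea of following each hop manually are just for illustration, not how your script has to work:

```python
# Rough sketch of the request/redirect flow described above, assuming the
# requests library; redirects are followed explicitly so each hop
# (initial request, wait URL, final download link) is visible.
import requests

def follow_download(start_url, timeout=60):
    url = start_url
    while True:
        resp = requests.get(url, timeout=timeout, allow_redirects=False)
        if resp.is_redirect:
            # Either the wait URL again (file not ready) or the download link.
            url = requests.compat.urljoin(url, resp.headers["Location"])
            continue
        resp.raise_for_status()
        return resp.content  # the file itself
```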
So what I am thinking is that these things are important:
* Make sure your code has a long enough timeout on its HTTP requests, at least 30 seconds; I would recommend 60. It's common for requests to take 20-30 seconds, and that's by design.
* Look into each error code and handle it properly. If you get 4xx or 5xx errors, try to figure out why. You have already looked into 429, which shouldn't really happen. The important thing is not to just retry on every weird error, though I doubt that's what's happening here. From what I can recall, you should only see 200, and possibly 429 (a minimal sketch follows this list).
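To make the error handling concrete, here is a minimal sketch, again assuming Python's `requests`; the function name, retry count and backoff are just illustrative choices of mine:

```python
# Minimal error-handling sketch: treat 200, 429 and everything else
# differently instead of retrying blindly on any failure.
import time
import requests

def fetch(url, timeout=60, max_429_retries=3):
    for _ in range(max_429_retries):
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
        if resp.status_code == 200:
            return resp.content
        if resp.status_code == 429:
            # Rate limited: Cloudflare blocks the IP for 60 s, so back off.
            time.sleep(70)
            continue
        # Any other 4xx/5xx: surface it and investigate, don't retry blindly.
        resp.raise_for_status()
    raise RuntimeError("still rate limited after retries: " + url)
```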
If you already allow 30-60 second timeouts on your HTTP requests, I can imagine a few solutions:
* Sleep 3 seconds between each download. Since each download is at most three fast requests, this should end up at no more than 60 requests/minute. This isn't the best solution though.
* The better solution is to keep track of how many requests you are making per minute and throttle accordingly (see the sketch after this list). But that's more work, especially if you don't already have code for similar tasks.
* Alternatively, just sleep 70 seconds upon every 429 you see; that's the easiest solution. You could even keep track of when you last got a 429 and sleep 60 seconds minus the time elapsed since then, but that sounds like too much hassle for little gain.
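For the request-tracking approach, something like a rolling 60-second window would do. A sketch, with the cap and names chosen by me (60/minute leaves headroom below the 70/minute limit):

```python
# Rolling-window pacer: never exceed max_per_minute requests in any
# 60-second window, staying under Cloudflare's 70 requests/minute limit.
import time
from collections import deque

class RequestPacer:
    def __init__(self, max_per_minute=60):
        self.max_per_minute = max_per_minute
        self.timestamps = deque()

    def wait(self):
        now = time.monotonic()
        # Drop requests that have fallen out of the 60-second window.
        while self.timestamps and now - self.timestamps[0] > 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_per_minute:
            # Sleep until the oldest request leaves the window.
            time.sleep(60 - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

# Call pacer.wait() before every HTTP request the script makes,
# including the wait-URL requests.
```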
I hope and assume that the issue is that your HTTP requests have too-short timeouts and that you just retry on curl timeouts. If that isn't the case, I'm curious why we are having this issue; it would then be nice to know how many HTTP requests you actually produce per minute (per 70 seconds, to be more precise). If you could generate a log file with `<timestamp> <url>` for each request so we can analyze this, that would be great.
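Something as simple as this would be enough for the log (a sketch using Python's logging module; the file name is arbitrary):

```python
# Write "<timestamp> <url>" to a log file for every request made.
import logging

logging.basicConfig(
    filename="requests.log",
    level=logging.INFO,
    format="%(asctime)s %(message)s",
)

def log_request(url):
    logging.info(url)
```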