Donloading EWKB format is larger than GeoJSON?

Avatar
  • updated
  • Not a bug

I'm trying to find the most efficient way to download boundaries and thought it would naturally be EWKB, since this stores each geometry coordinate as binary floats, whereas GeoJSON stores everything as text with one byte per numeric digit. However, when trying this out, it turns out the EWKB downloads are almost twice as large as the GeoJSON downloads, so I'm thinking something must be wrong with the way the EWKB is generated. Here's an example of downloading all boundaries for Afghanistan: 

GeoJSON (1.79MB)

https://osm-boundaries.com/Download/Submit?apiKey=<apikey>&db=osm20210712&osmIds=-303427&recursive&format=GeoJSON&srid=4326&landOnly

EWKB (3.28MB)

https://osm-boundaries.com/Download/Submit?apiKey=

<apikey>&db=osm20210712&osmIds=-303427&recursive&format=EWKB&srid=4326&landOnly

Pinned replies
Avatar
Magnus
  • Answer
  • Not a bug

The GeoJSON is limited to 9 decimals while the EWKB (and EWKT) format uses whatever is in the database, which can be a lot more. For example the first coordinate in this polygon is 61.5622549375 29.6585979362252 (10 decimals in longitude).

PS! Edited your post and removed your token.

Avatar
Karim Bahgat

Aha, so based on what you're saying, I think it's starting to make sense now. I wasn't originally aware that the EWKB download was wrapped inside a JSON, probably in order to store the non-geometry properties/attributes. From what you're saying and after inspecting the JSON contents, in order to store the EWKB binary data in a pure-text format like JSON the ROW_TO_JSON function converts the EWKB data to a HEX-encoded character strings. HEX-encoded strings use two bytes for every one byte of data, so the EWKB ends up at twice the size it should be, which would explain the size discrepancy. I guess there's really no obvious way to store a collection of EWKB geometries along with properties that preserves them in a binary format. 

Maybe it would be possible to allow yet another option to download the queried data as a binary database dump? That would keep all the data in its most efficient binary form. 

Avatar
Magnus

I don't know the EWKB format in detail. But if it's always 8 bytes, it will at times also be longer than the GeoJSON which potentially could be a single byte for a lat/lon. But that's a very rare case and definitely not the cause.

I ran a few SQL queries to see what's happening. First I compared the length (number of bytes), GeoJSON vs EWKB.

SELECT LENGTH(ST_AsGeoJSON(way_landonly))
FROM administrative_boundaries
WHERE osm_id = -303427
;
-- 985130

SELECT LENGTH(ST_AsEWKB(way_landonly))
FROM administrative_boundaries
WHERE osm_id = -303427
;
-- 739466

As we can see the EWKB is shorter, just as you suggest that it should be.


I then looked at what the site actually is doing (while stripping of some columns).

SELECT LENGTH(foo::text) FROM (
SELECT JSON_BUILD_OBJECT(
'type', 'FeatureCollection',
'features', JSON_AGG(ST_AsGeoJSON(t.*, 'geom')::json)
)
FROM (
SELECT
osm_id AS osm_id,
way_landonly AS geom
FROM administrative_boundaries
WHERE osm_id IN (-303427)
) AS t(osm_id, geom)
) AS f(foo);
-- 985245


SELECT LENGTH(foo::text) FROM (
SELECT JSON_AGG(ROW_TO_JSON(t.*))
FROM (
SELECT
osm_id AS osm_id,
ST_AsEWKB(way_landonly) AS geom
FROM administrative_boundaries
WHERE osm_id IN (-303427)
) AS t(osm_id, geom)
) AS f(foo);
-- 1478965

Suddenly the EWKB is larger. And I then assumed it's because it's wrapped into a JSON object which needs to escape the bytes. The fact that it does this is probably not very well documented, if even mentioned. But it's done to be able to return a single file when given more than one OSM-ID.

But, when hex-dumping it, I can't see that it's being escaped. I honestly don't know why it's growing. But from what I know there aren't any issues, besides the size then. We built this site to feed another site with data, and we use the EWKB format to do that. We have not seen a single issue with the end result, and we have transferred over 35,000 polygons to the other site. I still believe that it has to do with the EWKB to GeoJSON though, I just can't see how.

Avatar
Karim Bahgat

Thanks for the prompt response. But the number of decimals shouldn't really apply to EWKB since each number is always stored as a single 8-byte double (15 decimal precision). Compare that to GeoJSON, where even with only 9 decimals the above example of 61.5622549375 would be 9 bytes for the decimals + 1 byte for the comma + 2 bytes for the whole number = 12 bytes in total. GeoJSON would also require additional bytes for the commas between coordinates and the enclosing brackets around each coordinate, ring, and polygon, so should be substantially larger than EWKB. And yet we're seeing the opposite, that EWKB is almost twice the size of GeoJSON. I'll see if I can dig into the byte contents, but was hoping maybe there's something obvious on your end? 

Avatar
Magnus
  • Answer
  • Not a bug

The GeoJSON is limited to 9 decimals while the EWKB (and EWKT) format uses whatever is in the database, which can be a lot more. For example the first coordinate in this polygon is 61.5622549375 29.6585979362252 (10 decimals in longitude).

PS! Edited your post and removed your token.