An educational debugging session

This morning we realised that a test case failed on Fedora 34 only (the link is in Italian) and we set to debugging.

The initial analysis

This is the initial reproducer:

$ PROJ_DEBUG=3 python setup.py test
test_recipe (tests.test_litota3.TestLITOTA3NordArkimetIFS) ... pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: Open of /lib64/../share/proj/proj.db failed
pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: no database context specified
Cannot instantiate source_crs
EXCEPTION in py_coast(): ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
ERROR

Note that opening /lib64/../share/proj/proj.db sometimes succeeds, sometimes fails. It's some kind of Schrödinger path, which works or not depending on how you observe it:

# ls -lad /lib64
lrwxrwxrwx 1 1000 1000 9 Jan 26  2021 /lib64 -> usr/lib64

$ ls -la /lib64/../share/proj/proj.db
-rw-r--r-- 1 root root 8925184 Jan 28  2021 /lib64/../share/proj/proj.db

$ cd /lib64/../share/proj/

$ cd /lib64
$ cd ..
$ cd share
-bash: cd: share: No such file or directory

And indeed, stat(2) finds it, and sqlite doesn't (the file is a sqlite database):

$ stat /lib64/../share/proj/proj.db
  File: /lib64/../share/proj/proj.db
  Size: 8925184     Blocks: 17432      IO Block: 4096   regular file
Device: 33h/51d Inode: 56907       Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-11-08 14:09:12.334350779 +0100
Modify: 2021-01-28 05:38:11.000000000 +0100
Change: 2021-11-08 13:42:51.758874327 +0100
 Birth: 2021-11-08 13:42:51.710874051 +0100

$ sqlite3 /lib64/../share/proj/proj.db
Error: unable to open database "/lib64/../share/proj/proj.db": unable to open database file

A minimal reproducer

Later on we started stripping layers of code towards a minimal reproducer: here it is. It works or doesn't work depending on whether proj is linked explicitly, or via MagPlus:

$ cat tc.cc
#include <magics/ProjP.h>

int main() {
    magics::ProjP p("EPSG:4326", "+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs");
    return 0;
}

$ g++ -o tc  tc.cc -I/usr/include/magics  -lMagPlus
$ ./tc
proj_create: Open of /lib64/../share/proj/proj.db failed
proj_create: no database context specified
terminate called after throwing an instance of 'magics::MagicsException'
  what():  ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
Aborted (core dumped)

$ g++ -o tc  tc.cc -I/usr/include/magics -lproj  -lMagPlus
$ ./tc

What is going on here?

A difference between the two is the path used to link to libproj.so:

$ ldd ./tc | grep proj
    libproj.so.19 => /lib64/libproj.so.19 (0x00007fd4919fb000)
$ g++ -o tc  tc.cc -I/usr/include/magics   -lMagPlus
$ ldd ./tc | grep proj
    libproj.so.19 => /lib64/../lib64/libproj.so.19 (0x00007f6d1051b000)

Common sense screams that this should not matter, but we chased an intuition and found that one of the ways proj looks for its database is relative to its shared library.

Indeed, gdb in hand, that dladdr call returns /lib64/../lib64/libproj.so.19.

From /lib64/../lib64/libproj.so.19, proj strips two paths from the end, presumably to pass from something like /something/usr/lib/libproj.so to /something/usr.

So, dladdr returns /lib64/../lib64/libproj.so.19, which becomes /lib64/../, which becomes /lib64/../share/proj/proj.db, which exists on the file system and is used as a path to the database.

But depending how you look at it, that path might or might not be valid: it passes the stat(2) check that stops the lookup for candidate paths, but sqlite is unable to open it.

Why does the other path work?

By linking libproj.so in the other way, dladdr returns /lib64/libproj.so.19, which becomes /share/proj/proj.db, which doesn't exist, which triggers a fallback to a PROJ_LIB constant defined at compile time, which is a path that works no matter how you look at it.

Why that weird path with libMagPlus?

To complete the picture, we found that libMagPlus.so is packaged with a rpath set, which is known to cause trouble

# readelf -d /usr/lib64/libMagPlus.so|grep rpath
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib64]

The workaround

We found that one can set PROJ_LIB in the environment to override the normal proj database lookup. Building on that, we came up with a simple way to override it on Fedora 34 only:

    if distro is not None and distro.linux_distribution()[:2] == ("Fedora", "34") and "PROJ_LIB" not in os.environ:
         self.env_overrides["PROJ_LIB"] = "/usr/share/proj/"

This has been a most edifying and educational debugging session, with only the necessary modicum of curses and swearwords. Working in a team of excellent people really helps.