libeio - truly asynchronous POSIX I/O
#include <eio.h>
The newest version of this document is also available as an html-formatted web page you might find easier to navigate when reading it for the first time: http://pod.tst.eu/http://cvs.schmorp.de/libeio/eio.pod.
Note that this library is a by-product of the IO::AIO perl module, and many of the subtler points regarding request lifetime and so on are only documented in its documentation at the moment: http://pod.tst.eu/http://cvs.schmorp.de/IO-AIO/AIO.pm.
This library provides fully asynchronous versions of most POSIX functions dealing with I/O. Unlike most asynchronous libraries, this not only includes read and write, but also open, stat, unlink and similar functions, as well as less commonly used ones such as mknod, futime or readlink.
It also offers wrappers around sendfile (Solaris, Linux, HP-UX and FreeBSD, with emulation on other platforms) and readahead (Linux, with emulation elsewhere).
The goal is to enable you to write fully non-blocking programs. For example, in a game server, you would not want to freeze for a few seconds just because the server is running a backup and you happen to call readdir.
Libeio represents time as a single floating point number, representing the (fractional) number of seconds since the (POSIX) epoch (somewhere near the beginning of 1970, details are complicated, don't ask). This type is called eio_tstamp, but it is guaranteed to be of type double (or better), so you can freely use double yourself.
Unlike the name component stamp might indicate, it is also used for time differences throughout libeio.
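As a minimal sketch of this time representation, the following shows how a struct timeval style pair can be folded into a single double, and how a difference between two such values is again just a double. The names my_tstamp, my_tstamp_from and my_tstamp_now are hypothetical helpers invented for this illustration and are not part of libeio.

```c
#define _XOPEN_SOURCE 700 /* ask for gettimeofday under strict -std= modes */

#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* hypothetical stand-in for eio_tstamp: fractional seconds since the epoch */
typedef double my_tstamp;

/* combine whole seconds and microseconds into one floating point value */
my_tstamp
my_tstamp_from (long sec, long usec)
{
  return (my_tstamp)sec + (my_tstamp)usec / 1e6;
}

/* current wall-clock time in the same representation */
my_tstamp
my_tstamp_now (void)
{
  struct timeval tv;
  int rc = gettimeofday (&tv, NULL);

  assert (rc == 0);
  (void)rc;

  return my_tstamp_from ((long)tv.tv_sec, (long)tv.tv_usec);
}
```

Subtracting two such values, for example my_tstamp_now () - start, yields an interval in seconds, which matches the way the same type doubles as a time difference.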
Usage of pthreads in a program changes the semantics of fork considerably. Specifically, only async-signal-safe functions can be called after fork. Libeio uses pthreads, so this applies, and makes using fork hard for anything but relatively simple fork + exec uses.
This library only works in the process that initialised it: Forking is fully supported, but using libeio in any other process than the one that called eio_init is not.
You might get around this by not using libeio before (or after) forking in the parent, and using it in the child afterwards. You could also try to call the eio_init function again in the child, which will brutally reinitialise all data structures, which isn't POSIX conformant, but typically works.
Otherwise, the only recommendation you should follow is: treat fork code the same way you treat signal handlers, and only ever call eio_init in the process that uses it, and only once ever.
Before you can call any eio functions you first have to initialise the library. The library integrates into any event loop, but can also be used without one, including in polling mode.
You have to provide the necessary glue yourself, however.
The eio_init function initialises the library. On success it returns 0, on failure it returns -1 and sets errno appropriately.
It accepts two function pointers specifying callbacks as arguments, both of which can be 0, in which case the callback isn't called.
There is currently no way to change these callbacks later, or to "uninitialise" the library again.
The want_poll callback is invoked whenever libeio wants attention (i.e. it wants to be polled by calling eio_poll). It is "edge-triggered", that is, it will only be called once when eio wants attention, until all pending requests have been handled.
This callback is called while locks are being held, so you must not call any libeio functions inside this callback. That includes eio_poll. What you should do is notify some other thread, or wake up your event loop, and then call eio_poll.
The done_poll callback is invoked when libeio detects that all pending requests have been handled. It is "edge-triggered", that is, it will only be called once after want_poll. To put it differently, want_poll and done_poll are invoked in pairs: after want_poll you have to call eio_poll () until either eio_poll indicates that everything has been handled or done_poll has been called, which signals the same.
Note that eio_poll might return after done_poll and want_poll have been called again, so watch out for races in your code.
As with want_poll, this callback is called while locks are being held, so you must not call any libeio functions from within this callback.
The eio_poll function has to be called whenever there are pending requests that need finishing. You usually call this after want_poll has indicated that you should do so, but you can also call this function regularly to poll for new results.
If any request invocation returns a non-zero value, then eio_poll () immediately returns with that value as return value.
Otherwise, if all requests could be handled, it returns 0. If for some reason not all requests have been handled, i.e. some are still pending, it returns -1.
For libev, you would typically use an ev_async watcher: the want_poll callback would invoke ev_async_send to wake up the event loop. Inside the callback set for the watcher, one would call eio_poll ().
If eio_poll () is configured to not handle all results in one go (i.e. it returns -1) then you should start an idle watcher that calls eio_poll until it returns something != -1.
A full-featured connector between libeio and libev would look as follows (if eio_poll is handling all requests, it can of course be simplified a lot by removing the idle watcher logic):
   #include <ev.h>
   #include <eio.h>

   static struct ev_loop *loop;
   static ev_idle repeat_watcher;
   static ev_async ready_watcher;

   /* idle watcher callback, only used when eio_poll */
   /* didn't handle all results in one call */
   static void
   repeat (EV_P_ ev_idle *w, int revents)
   {
     if (eio_poll () != -1)
       ev_idle_stop (EV_A_ w);
   }

   /* eio has some results, process them */
   static void
   ready (EV_P_ ev_async *w, int revents)
   {
     if (eio_poll () == -1)
       ev_idle_start (EV_A_ &repeat_watcher);
   }

   /* wake up the event loop */
   static void
   want_poll (void)
   {
     ev_async_send (loop, &ready_watcher);
   }

   void
   my_init_eio (void)
   {
     loop = EV_DEFAULT;

     ev_idle_init (&repeat_watcher, repeat);
     ev_async_init (&ready_watcher, ready);
     ev_async_start (loop, &ready_watcher);

     eio_init (want_poll, 0);
   }
For most other event loops, you would typically use a pipe - the event loop should be told to wait for read readiness on the read end. In want_poll you would write a single byte, in done_poll you would try to read that byte, and in the callback for the read end, you would call eio_poll.
You don't have to take special care in the case eio_poll doesn't handle all requests, as the done callback will not be invoked, so the event loop will still signal readiness for the pipe until all results have been processed.
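The pipe technique described above can be sketched with plain POSIX calls. This is only an illustration under assumptions: the names wake_init, wake_want_poll, wake_done_poll and wake_is_readable are hypothetical, the two callbacks are what one might pass to eio_init, and wake_is_readable merely stands in for whatever readiness check your event loop performs on the read end.

```c
#define _XOPEN_SOURCE 700 /* ask for POSIX declarations under strict -std= modes */

#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* wake_fd[0] is the read end watched by the event loop, wake_fd[1] the write end */
static int wake_fd[2] = { -1, -1 };

/* create the pipe and make both ends non-blocking; returns 0 or -1 */
int
wake_init (void)
{
  if (pipe (wake_fd) < 0)
    return -1;

  for (int i = 0; i < 2; ++i)
    {
      int fl = fcntl (wake_fd[i], F_GETFL);

      if (fl < 0 || fcntl (wake_fd[i], F_SETFL, fl | O_NONBLOCK) < 0)
        return -1;
    }

  return 0;
}

/* candidate want_poll callback: make the read end readable */
void
wake_want_poll (void)
{
  char byte = 0;
  ssize_t n;

  assert (wake_fd[1] >= 0);
  n = write (wake_fd[1], &byte, 1);
  (void)n; /* a full pipe already signals readiness, so errors are ignored */
}

/* candidate done_poll callback: consume the byte again */
void
wake_done_poll (void)
{
  char byte;
  ssize_t n;

  assert (wake_fd[0] >= 0);
  n = read (wake_fd[0], &byte, 1);
  (void)n; /* nothing to read is harmless on a non-blocking fd */
}

/* returns 1 if the read end is readable right now, 0 otherwise */
int
wake_is_readable (void)
{
  struct pollfd pfd = { .fd = wake_fd[0], .events = POLLIN };

  return poll (&pfd, 1, 0) == 1 && (pfd.revents & POLLIN) != 0;
}
```

Neither callback calls into libeio, which fits the rule that both are invoked while locks are held. The handler your event loop runs for the read end would then be the place to call eio_poll.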
Libeio has both a high-level API, which consists of calling a request function with a callback to be called on completion, and a low-level API where you fill out request structures and submit them.
This section describes the high-level API.
You submit a request by calling the relevant eio_TYPE function with the required parameters, a callback of type int (*eio_cb)(eio_req *req) (called eio_cb below) and a freely usable void *data argument.
The return value will either be 0, in case something went really wrong (which can basically only happen on very fatal errors, such as malloc returning 0, which is rather unlikely), or a pointer to the newly-created and submitted eio_req *.
The callback will be called with an eio_req * which contains the results of the request. The members you can access inside that structure vary from request to request, except for:
ssize_t result
This contains the result value from the call (usually the same as the syscall of the same name).
int errorno
This contains the value of errno after the call.
void *data
The void *data member simply stores the value of the data argument.
The return value of the callback is normally 0, which tells libeio to continue normally. If a callback returns a nonzero value, libeio will stop processing results (in eio_poll) and will return the value to its caller.
Memory areas passed to libeio must stay valid as long as a request executes, with the exception of paths, which are being copied internally. Any memory libeio itself allocates will be freed after the finish callback has been called. If you want to manage all memory passed to libeio yourself you can use the low-level API.
For example, to open a file, you could do this:
   static int
   file_open_done (eio_req *req)
   {
     if (req->result < 0)
       {
         /* open() returned -1 */
         errno = req->errorno;
         perror ("open");
       }
     else
       {
         int fd = req->result;
         /* now we have the new fd in fd */
       }

     return 0;
   }

   /* the first three arguments are passed to open(2) */
   /* the remaining are priority, callback and data */
   if (!eio_open ("/etc/passwd", O_RDONLY, 0, 0, file_open_done, 0))
     abort (); /* something went wrong, we will all die!!! */
Note that you additionally need to call eio_poll when the want_poll callback indicates that requests are ready to be processed.
Sometimes the need for a request goes away before the request is finished. In that case, one can cancel the request by a call to eio_cancel:
Cancel the request (and all its subrequests). If the request is currently executing it might still continue to execute, and in other cases it might still take a while till the request is cancelled.
Even if cancelled, the finish callback will still be invoked - the callbacks of all cancellable requests need to check whether the request has been cancelled by calling EIO_CANCELLED (req):
   static int
   my_eio_cb (eio_req *req)
   {
     if (EIO_CANCELLED (req))
       return 0;

     /* the request was not cancelled, handle the result here */
     return 0;
   }
In addition, cancelled requests will either have req->result set to -1 and req->errorno set to ECANCELED, or otherwise they were successfully executed, despite being cancelled (e.g. when they have already been executed at the time they were cancelled).
EIO_CANCELLED is still true for requests that have successfully executed, as long as eio_cancel was called on them at some point.
The following request functions are available. All of them return the eio_req * on success and 0 on failure, and all of them have the same three trailing arguments: pri, cb and data. The cb is mandatory, but in most cases, you pass in 0 as pri and 0 or some custom data value as data.
These requests simply wrap the POSIX call of the same name, with the same arguments. If a function is not implemented by the OS and cannot be emulated in some way, then all of these return -1 and set errorno to ENOSYS.
These have the same semantics as the syscall of the same name; their return value is available as req->result later.
These two requests are called read and write, but actually wrap pread and pwrite. On systems that lack these calls (such as cygwin), libeio uses lseek/read_or_write/lseek and a mutex to serialise the requests, so all these requests run serially and do not disturb each other. However, they still disturb the file offset while they run, so it's not safe to call these functions concurrently with non-libeio functions on the same fd on these systems.
Not surprisingly, pread and pwrite are not thread-safe on Darwin (OS/X), so it is advised not to submit multiple requests on the same fd on this horrible pile of garbage.
Like mlockall, but the flag value constants are called EIO_MCL_CURRENT and EIO_MCL_FUTURE.
Just like msync, except that the flag values are called EIO_MS_ASYNC, EIO_MS_INVALIDATE and EIO_MS_SYNC.
If successful, the path read by readlink(2) can be accessed via req->ptr2 and is NOT null-terminated, with the length specified as req->result.
   if (req->result >= 0)
     {
       /* copy the path into a null-terminated string */
       char *target = strndup ((char *)req->ptr2, req->result);

       free (target);
     }
Similar to the realpath libc function, but unlike that one, req->result is -1 on failure. On success, the result is the length of the returned path in ptr2 (which is NOT 0-terminated) - this is similar to readlink.
Stats a file - if req->result indicates success, then you can access the struct stat-like structure via req->ptr2:
EIO_STRUCT_STAT *statdata = (EIO_STRUCT_STAT *)req->ptr2;
Stats a filesystem - if req->result indicates success, then you can access the struct statvfs-like structure via req->ptr2:
EIO_STRUCT_STATVFS *statdata = (EIO_STRUCT_STATVFS *)req->ptr2;
Reading directories sounds simple, but can be rather demanding, especially if you want to do stuff such as traversing a directory hierarchy or processing all files in a directory. Libeio can assist these complex tasks with its eio_readdir call.
This is a very complex call. It basically reads through a whole directory (via the opendir, readdir and closedir calls) and returns either the names or an array of struct eio_dirent, depending on the flags argument.
The req->result indicates either the number of files found, or -1 on error. On success, null-terminated names can be found as req->ptr2, and struct eio_dirents, if requested by flags, can be found via req->ptr1.
Here is an example that prints all the names:
   int i;
   char *names = (char *)req->ptr2;

   for (i = 0; i < req->result; ++i)
     {
       printf ("name #%d: %s\n", i, names);

       /* move to next name */
       names += strlen (names) + 1;
     }
Pseudo-entries such as . and .. are never returned by eio_readdir.
flags can be any combination of:
If this flag is specified, then, in addition to the names in ptr2, also an array of struct eio_dirent is returned, in ptr1. A struct eio_dirent looks like this:
   struct eio_dirent
   {
     int nameofs;            /* offset of null-terminated name string in (char *)req->ptr2 */
     unsigned short namelen; /* size of filename without trailing 0 */
     unsigned char type;     /* one of EIO_DT_* */
     signed char score;      /* internal use */
     ino_t inode;            /* the inode number, if available, otherwise unspecified */
   };
The only members you normally would access are nameofs, which is the byte-offset from ptr2 to the start of the name, namelen and type.
type can be one of:
EIO_DT_UNKNOWN if the type is not known (very common), in which case you have to stat the name yourself if you need to know it; one of the "standard" POSIX file types (EIO_DT_REG, EIO_DT_DIR, EIO_DT_LNK, EIO_DT_FIFO, EIO_DT_SOCK, EIO_DT_CHR, EIO_DT_BLK); or some OS-specific type (currently EIO_DT_MPC, a multiplexed char device (v7+coherent); EIO_DT_NAM, a Xenix special named file; EIO_DT_MPB, a multiplexed block device (v7+coherent); EIO_DT_NWK, an HP-UX network special; EIO_DT_CMP, VxFS compressed; EIO_DT_DOOR, a Solaris door; or EIO_DT_WHT).
This example prints all names and their type:
   int i;
   struct eio_dirent *ents = (struct eio_dirent *)req->ptr1;
   char *names = (char *)req->ptr2;

   for (i = 0; i < req->result; ++i)
     {
       struct eio_dirent *ent = ents + i;
       char *name = names + ent->nameofs;

       printf ("name #%d: %s (type %d)\n", i, name, ent->type);
     }
When this flag is specified, then the names will be returned in an order where likely directories come first, in optimal stat order. This is useful when you need to quickly find directories, or you want to find all directories without having to stat() each entry.
If the system returns type information in readdir, then this is used to find directories directly. Otherwise, likely directories are names beginning with ".", or otherwise names with no dots, of which the shorter names are tried first.
When this flag is specified, then the names will be returned in an order suitable for stat()'ing each one. That is, when you plan to stat() all files in the given directory, then the returned order will likely be fastest.
If both this flag and EIO_READDIR_DIRS_FIRST are specified, then the likely directories come first, resulting in a less optimal stat order.
This flag should not be specified when calling eio_readdir. Instead, it is set by eio_readdir (you can access the flags via req->int1) when any of the types found were EIO_DT_UNKNOWN. The absence of this flag therefore indicates that all types are known, which can be used to speed up some algorithms.
A typical use case would be to identify all subdirectories within a directory - you would ask eio_readdir for EIO_READDIR_DIRS_FIRST. If then this flag is NOT set, then all the entries at the beginning of the returned array of type EIO_DT_DIR are the directories. Otherwise, you should start stat()'ing the entries starting at the beginning of the array, stopping as soon as you found all directories (the count can be deduced by the link count of the directory).
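The remark about deducing the number of subdirectories from the link count can be sketched as follows. This assumes the traditional Unix convention where a directory's link count is 2 plus the number of its subdirectories; some filesystems do not follow it and report a link count below 2, in which case the count cannot be deduced this way. Both function names are hypothetical and not part of libeio.

```c
#define _XOPEN_SOURCE 700 /* ask for POSIX declarations under strict -std= modes */

#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* expected number of subdirectories for a given link count, */
/* or -1 if the filesystem does not follow the "2 + subdirs" convention */
long
subdir_count_from_nlink (nlink_t nlink)
{
  if (nlink < 2)
    return -1;

  return (long)nlink - 2;
}

/* same, but stats the directory first; -1 on error or for non-directories */
long
subdir_count (const char *path)
{
  struct stat st;

  assert (path != NULL);

  if (stat (path, &st) < 0 || !S_ISDIR (st.st_mode))
    return -1;

  return subdir_count_from_nlink (st.st_nlink);
}
```

With such a count in hand, a scan over entries of unknown type could stop stat()'ing as soon as that many directories have been found, and would fall back to checking every entry when the result is -1.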
These wrap OS-specific calls (usually Linux ones), and might or might not be emulated on other operating systems. Calls that are not emulated will return -1 and set errno to ENOSYS.
Wraps the sendfile syscall. The arguments follow the Linux version, but libeio supports and will use similar calls on FreeBSD, HP-UX, Solaris and Darwin.
If the OS doesn't support some sendfile-like call, or the call fails in a way that indicates missing support for the given file descriptor type (for example, Linux's sendfile might not support file to file copies), then libeio will emulate the call in userspace, so there are almost no limitations on its use.
Calls readahead(2). If the syscall is missing, then the call is emulated by simply reading the data (currently in 64kiB chunks).
Calls Linux' syncfs syscall, if available. If the call is missing, it returns -1 and sets errno to ENOSYS, but still calls sync() if the fd is >= 0, so you can probe for the availability of the syscall by passing a negative fd argument and checking for -1/ENOSYS.
Calls sync_file_range. If the syscall is missing, then this is the same as calling fdatasync.
Flags can be any combination of EIO_SYNC_FILE_RANGE_WAIT_BEFORE, EIO_SYNC_FILE_RANGE_WRITE and EIO_SYNC_FILE_RANGE_WAIT_AFTER.
Calls fallocate (note: NOT posix_fallocate!). If the syscall is missing, then it returns failure and sets errno to ENOSYS.
The mode argument can be 0 (for behaviour similar to posix_fallocate), or EIO_FALLOC_FL_KEEP_SIZE, which keeps the size of the file unchanged (but still preallocates space beyond end of file).
These requests are specific to libeio and do not correspond to any OS call.
Reads (flags == 0) or modifies (flags == EIO_MT_MODIFY) the given memory area, page-wise, that is, it reads (or reads and writes back) the first octet of every page that spans the memory area.
This can be used to page in some mmapped file, or dirty some pages. Note that dirtying is an unlocked read-write access, so races can ensue when some other thread modifies the data stored in that memory area.
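To illustrate what page-wise touching means, here is a small synchronous sketch. It is not libeio's implementation: touch_pages is a hypothetical helper, it assumes the page size reported by sysconf is a power of two, and it touches the first octet of each page that lies inside the given area (rather than the very first octet of the page), which has the same paging effect without reading outside the area.

```c
#define _XOPEN_SOURCE 700 /* ask for POSIX declarations under strict -std= modes */

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* touches one octet per page in [addr, addr + len) and */
/* returns the number of pages touched */
size_t
touch_pages (void *addr, size_t len, int modify)
{
  long sz = sysconf (_SC_PAGESIZE);
  size_t page = sz > 0 ? (size_t)sz : 4096; /* the fallback is an assumption */
  volatile unsigned char *p = addr;
  size_t ofs = 0;
  size_t count = 0;

  assert (addr != NULL || len == 0);

  while (ofs < len)
    {
      unsigned char octet = p[ofs]; /* the read access pages the data in */

      if (modify)
        p[ofs] = octet; /* unlocked write-back dirties the page */

      ++count;

      /* advance to the first octet of the next page */
      ofs += page - (((uintptr_t)addr + ofs) & (page - 1));
    }

  return count;
}
```

The write-back stores the value that was just read, so the contents stay the same unless another thread changes the octet in between, which is exactly the race mentioned above.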
Executes a custom request, i.e., a user-specified callback.
The callback gets the eio_req * as parameter and is expected to read and modify any request-specific members. Specifically, it should set req->result to the result value, just like other requests.
Here is an example that simply calls open, like eio_open, but it uses the data member as filename and uses a hardcoded O_RDONLY. If you want to pass more/other parameters, you either need to pass some struct or so via data or provide your own wrapper using the low-level API.
   static int
   my_open_done (eio_req *req)
   {
     int fd = req->result;

     return 0;
   }

   static void
   my_open (eio_req *req)
   {
     req->result = open (req->data, O_RDONLY);
   }

   eio_custom (my_open, 0, my_open_done, "/etc/passwd");
This is a request that takes delay seconds to execute, but otherwise does nothing - it simply puts one of the worker threads to sleep for this long.
This request can be used to artificially increase load, e.g. for debugging or benchmarking reasons.
This request does nothing, except go through the whole request cycle. This can be used to measure latency or in some cases to simplify code, but is not really of much use.
There is one more rather special request, eio_grp. It is a very special aio request: Instead of doing something, it is a container for other eio requests.
There are two primary use cases for this: a) bundle many requests into a single, composite, request with a definite callback and the ability to cancel the whole request with its subrequests and b) limiting the number of "active" requests.
Further below you will find more discussion of these topics - first follows the reference section detailing the request generator and other methods.
Creates, submits and returns a group request. Note that it doesn't have a priority, unlike all other requests.
Adds a request to the request group.
Cancels all requests in the group, but not the group request itself. You can cancel the group request and all subrequests via a normal eio_cancel call.
Left alone, a group request will instantly move to the pending state and will be finished at the next call of eio_poll.
The usefulness stems from the fact that, if a subrequest is added to a group before a call to eio_poll, via eio_grp_add, then the group will not finish until all the subrequests have finished.
So the usage cycle of a group request is like this: after it is created, you normally instantly add a subrequest. If none is added, the group request will finish on its own. As long as subrequests are added before the group request is finished it will be kept from finishing, that is the callbacks of any subrequests can, in turn, add more requests to the group, and as long as any requests are active, the group request itself will not finish.
Imagine you wanted to create an eio_load request that opens a file, reads it and closes it. This means it has to execute at least three eio requests, but for various reasons it might be nice if that request looked like any other eio request.
This can be done with groups:
Create a group that contains all further requests. This is the request you can return as "the load request".
Next, open the file with eio_open and add the request to the group request and you are finished setting up the request.
If, for some reason, you cannot eio_open (path is a null ptr?) you can set grp->result to -1 to signal an error and let the group request finish on its own.
In the open callback, if the open was not successful, copy req->errorno to grp->errorno and set grp->result to -1 to signal an error.
Otherwise, malloc some memory or so and issue a read request, adding the read request to the group.
In the read callback, check for errors and possibly continue with eio_close or any other eio request in the same way.
As soon as no new requests are added the group request will finish. Make sure you always set grp->result to some sensible value.
#TODO
void eio_grp_limit (eio_req *grp, int limit);
A request is represented by a structure of type eio_req. To initialise it, clear it to all zero bytes:
eio_req req; memset (&req, 0, sizeof (req));
A more common way to initialise a new eio_req is to use calloc:
eio_req *req = calloc (1, sizeof (*req));
In either case, libeio neither allocates, initialises nor frees the eio_req structure for you - it merely uses it.
The functions in this section can sometimes be useful, but the default configuration will do in most cases, so you should skip this section on first reading.
This causes eio_poll () to return after it has detected that it was running for nsecond seconds or longer (this number can be fractional).
This can be used to limit the amount of time spent handling eio requests, for example, in interactive programs, you might want to limit this time to 0.01 seconds or so.
Note that:
gettimeofday
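The general idea of bounding a polling loop by elapsed time can be sketched as follows. This is a generic illustration and not libeio's code: drain_with_time_limit, monotonic_now and noop_handler are hypothetical names, and the sketch uses clock_gettime with CLOCK_MONOTONIC as its clock, which is a choice made for this example only.

```c
#define _POSIX_C_SOURCE 200809L /* ask for clock_gettime under strict -std= modes */

#include <assert.h>
#include <stddef.h>
#include <time.h>

/* current monotonic time as fractional seconds */
static double
monotonic_now (void)
{
  struct timespec ts;

  clock_gettime (CLOCK_MONOTONIC, &ts);

  return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* handles up to npending items, checking the clock after each one; */
/* returns the number of items that were handled */
int
drain_with_time_limit (int npending, double max_seconds, void (*handle)(int))
{
  double start = monotonic_now ();
  int handled = 0;

  assert (handle != NULL);

  while (handled < npending)
    {
      handle (handled);
      ++handled;

      /* stop once the time budget has been used up */
      if (monotonic_now () - start >= max_seconds)
        break;
    }

  return handled;
}

/* trivial handler, used for demonstration only */
void
noop_handler (int index)
{
  (void)index;
}
```

Because the clock is consulted only after an item has been handled, this sketch always handles at least one pending item and can overrun its budget by the duration of one handler call. It also pays for one clock query per item.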
When nreqs is non-zero, then eio_poll will not handle more than nreqs requests per invocation. This is a less costly way to limit the amount of work done by eio_poll than setting a time limit.
If you know your callbacks are generally fast, you could use this to encourage interactiveness in your programs by setting it to 10, 100 or even 1000.
Make sure libeio can handle at least this many requests in parallel. It might be able to handle more.
Set the maximum number of threads that libeio will spawn.
Libeio uses threads internally to handle most requests, and will start and stop threads on demand.
This call can be used to limit the number of idle threads (threads without work to do): libeio will keep some threads idle in preparation for more requests, but never more than nthreads threads.
In addition to this, libeio will also stop threads when they are idle for a few seconds, regardless of this setting.
Return the number of worker threads currently running.
Return the number of requests currently handled by libeio. This is the total number of requests that have been submitted to libeio, but not yet destroyed.
Returns the number of ready requests, i.e. requests that have been submitted but have not yet entered the execution phase.
Returns the number of pending requests, i.e. requests that have been executed and have results, but have not been finished yet by a call to eio_poll.
Libeio can be embedded directly into programs. This functionality is not documented and not (yet) officially supported.
Note that, when including libeio.m4, you are responsible for defining the compilation environment (_LARGEFILE_SOURCE, _GNU_SOURCE etc.).
If you need to know how, check the IO::AIO perl module, which does exactly that.
These symbols, if used, must be defined when compiling eio.c.
This symbol governs the stack size for each eio thread. Libeio itself was written to use very little stackspace, but when using EIO_CUSTOM requests, you might want to increase this.
If this symbol is undefined (the default) then libeio will use its default stack size (sizeof (void *) * 4096 currently). If it is defined, but 0, then the default operating system stack size will be used. In all other cases, the value must be an expression that evaluates to the desired stack size.
In addition to a working ISO-C implementation, libeio relies on a few additional extensions:
To be portable, this module uses threads, specifically, the POSIX threads library must be available (and working, which partially excludes many xBSD systems, where fork () is buggy).
This is actually a harder portability requirement: The libeio API is quite demanding regarding POSIX API calls (symlinks, user/group management etc.).
The type double is used to represent timestamps. It is required to have at least 51 bits of mantissa (and 9 bits of exponent), which is good enough until at least the year 4000. This requirement is fulfilled by implementations implementing IEEE 754 (basically all existing ones).
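If you want to verify the mantissa part of this requirement on a given platform, a check along the following lines might do. It is only a sketch: both function names are hypothetical, and it relies on the standard float.h macros, with DBL_MANT_DIG counting mantissa digits in base FLT_RADIX.

```c
#include <assert.h>
#include <float.h>

/* returns 1 if double has a binary mantissa of at least 51 bits, else 0 */
int
double_fits_timestamps (void)
{
  return FLT_RADIX == 2 && DBL_MANT_DIG >= 51;
}

/* aborts early on platforms that do not meet the requirement */
void
require_timestamp_double (void)
{
  assert (double_fits_timestamps ());
}
```

On IEEE 754 platforms DBL_MANT_DIG is 53, so the check passes there.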
If you know of other additional requirements drop me a note.
Marc Lehmann <libeio@schmorp.de>.