Dependencies and Invalidation

Keeping elasticsearch in sync.

The /_indexer wsgi app (es_index_listener.py) drives the incremental indexing process. When a new transaction is notified by postgres (or after 60 seconds) it calls the /index view (indexer.py) which works out what needs to be reindexed. The actual reindexing happens in parallel in multiprocessing subprocesses (mpindexer.py.)

When rendering a response, we record the set of embedded_uuids and linked_uuids used.

  • embedded_uuids are those objects embedded into the response or whose properties have been consulted in rendering of the response. Any change to one of these objects should cause an invalidation. (See Item.__json__.)
  • linked_uuids are the objects linked to in the response. Only changes to their url need trigger an invalidation. (See Item.__resource_url__.)

When modifying objects, event subscribers keep track of which objects where updated and their resource paths before and after the modification. This is used to record the set of updated_uuids and renamed_uuids in the transaction log. (See indexing.py.)

The indexer process listens for notifications of new transactions. With the union of updated_uuids and union of renamed_uuids across each transaction in the log since its last indexing run, it performs a search for all objects where embedded_uuids intersect with the updated_uuids or linked_uuids intersect with the renamed_uuids. The result is the set of invalidated objects which must be reindexed in addition to those that were modified (recorded in updated_uuids.)

Where an object’s url depends on other objects – Page whose url includes its ancestors in its path, or Target whose url includes a property from its referenced organism – we must ensure that linked_uuid dependencies to those other objects are recorded in addition the object itself when linked. (See Page.__resource_url__ and Target.__resource_url__.)

Total Reindexing

Cases can arise where a total reindexing needs to be triggered. >curl -XDELETE ‘localhost:9200/encoded/meta/indexing’ will specifically force it.

localhost:9200/encoded/meta/indexing stores the document that keeps track of incremental indexing. The indexer script checks for that document when deciding between full index and indexing only the recently invalidated documents. It has the benefit of keeping the old-yet-to-be-indexed data online, especially if it’s a production instance.

Alternatively, >curl -XDELETE ‘http://localhost:9200/encoded/’ will delete the entire index along with the mapping information for schema objects. Although it does trigger indexing, missing mapping information makes the documets unsearcheable. Mapping in elasticsearch describes how each field of each object should be tokenized/analyzed/indexed for searching.

Isolation level considerations

Postgres defaults to its lowest isolation level, READ COMMITTED: http://www.postgresql.org/docs/9.3/static/transaction-iso.html

For invalidation of back references of new child objects only READ COMMITTED isolation is necessary as invalidated back references are calculated from the updated objects properties.

However, writes must be at least REPEATABLE READ in order for overalapping PATCHes to apply safely.

During recovery indexing uses READ COMMITTED isolation. Indexed objects may be internally inconsistent if there are concurrent updates to embedded objects. But indexing is still eventually consistent as any concurrent update will invalidate the object and it will be reindexed later.

To avoid internal possible internal inconsistancies of indexed objects, SERIALIZABLE isolation is required. It is used once it becomes available when recovery is complete.