Skip to content

Race caused by static cache in DefaultSessionFactory (QFJ-971) #240

@AndreyNudko

Description

@AndreyNudko

I think there's a race condition caused by static cache in DefaultSessionFactory, which sometimes effectively corrupts the message.

This can happen when:

  • there are 2 sessions each with it's own IO thread
  • both session use same dictionary file
  • both sessions receive first message of a particular type around the same time

Long story:
We use QuickFIX to stub FIX API-s in acceptance tests. Stub is a separate process, and acceptors are dynamically created and destroyed during tests execution. Each acceptor runs it's own IO thread.
Recently we added support for an API which requires sending same message to 2 endpoints for redundancy purposes. So the test starts 2 sessions with the same dictionary file, and verifies that both received expected message.
However now we occasionally see very strange FieldNotFound exceptions when trying to access a field of repeating group, although raw fix logs clearly shows the tag is present.

After some poking around I think there is a race condition caused by interplay of static cache in DefaultSessionFactory:

private static final SimpleCache<String, DataDictionary> dictionaryCache = new SimpleCache<>(path -> {
        try {
            return new DataDictionary(path);
        } catch (ConfigError e) {
            throw new QFJException(e);
        }
    });

and DataDictionary::getOrderedFields:

    public int[] getOrderedFields() {
        if (orderedFieldsArray == null) {
            orderedFieldsArray = new int[fields.size()];
            int i = 0;
            for (Integer field : fields) {
                orderedFieldsArray[i++] = field;
            }
        }

        return orderedFieldsArray;
    }

My reconstruction of events is this:

  • 2 sessions receive same message; because of the static cache they inadvertently share same instance of DataDictionary
  • when parsing message both threads descend into repeating groups and call DataDictionary::getOrderedFields
  • first thread starts lazy initialisation and updates orderedFieldsArray
  • now second thread may see partially constructed array before first thread finished building it
  • second thread passes this partially constructed array to FieldOrderComparator and now we have effectively mutable comparator; which gives one result before first thread finished building array, and different resut later
  • so if message is parsed before array is built but accessed after, internal structure of a field map is inconsistent with what comparator tells - hence FieldNotFound

The simple fix would be to make getOrderedFields idempotent like that:

    public int[] getOrderedFields() {
        if (orderedFieldsArray == null) {
            orderedFieldsArray = toIntArray(fields); 
        }

        return orderedFieldsArray;
    }
    
    private static int[] toIntArray(Collection<Integer> intCollection)
    {
        int result[] = new int[intCollection.size()];
        int i = 0;
        for (Integer field : intCollection) {
            result[i++] = field;
        }
        return result;
    }

However I wonder if there are other possible dangers of sharing dictionary? We certainly saw cases when dictionary validation properties were accidentally modified because of this static cache (although it was bad session configuration on our part).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions