The XML encoder used for configuration serialization is not very robust (e.g. in the face of changes to the class hierarchy or removal of configuration options) and has some quirks (#2002). We should consider using something else (YAML/JSON?).
Also, this serialization is used not only for configuration but also elsewhere (IndexAnalysisSettings).
Also, the configuration should be treated as data, not serialized objects, to avoid security vulnerabilities that might happen when de-serializing XML into Java objects.
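To make the security concern concrete, here is a minimal sketch of why decoding XML into Java objects is risky: the `XMLDecoder` format can describe method calls, not just field values. The class name `XmlDecoderSideEffectDemo` and the property name `demo.flag` are made up for illustration and have nothing to do with OpenGrok; the call chosen here is deliberately harmless.

```java
import java.beans.XMLDecoder;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of OpenGrok.
class XmlDecoderSideEffectDemo {
    // A document in the XMLDecoder format that, instead of describing a
    // configuration bean, asks the decoder to call System.setProperty().
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<java version=\"11\" class=\"java.beans.XMLDecoder\">"
        + "<object class=\"java.lang.System\" method=\"setProperty\">"
        + "<string>demo.flag</string>"
        + "<string>set-by-xml</string>"
        + "</object>"
        + "</java>";

    // Decodes the document; the method call happens as a side effect of decoding.
    static void decode(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(bytes))) {
            decoder.readObject();
        }
    }

    public static void main(String[] args) {
        System.out.println("before: " + System.getProperty("demo.flag"));
        decode(XML);
        System.out.println("after:  " + System.getProperty("demo.flag"));
    }
}
```

The point is that the property gets set merely by reading the file. A format that is parsed into plain maps, lists and scalars cannot express such calls in the first place.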
The other reason for using something else is performance. Lately, I realized that XMLEncoder does not scale when retrieving configuration using the RESTful API. When running a multithreaded program where each thread just retrieves the configuration in a loop, with the number of threads matching the number of CPUs, the retrieval times shoot up to almost 2 seconds, compared to 0.4 seconds for a single-threaded program. The XML file with the configuration is about 1.38 MB. When I took a jstack snapshot, it revealed that lots of the XMLEncoder processing threads (like 25 out of the 32 threads I was using) were waiting on an internal synchronization object, with the top of the stack looking like this:
"http-nio-8080-exec-1427" #29360 daemon prio=5 os_prio=64 cpu=38052.59ms elapsed=2859934.54s tid=0x000000000531c000 nid=0x7981 waiting for monitor entry [0x00007fff808fa000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.sun.beans.util.Cache.get(java.desktop@11.0.7-internal/Cache.java:119)
        - waiting to lock <0x00007ff387d7f320> (a java.lang.ref.ReferenceQueue)
        at com.sun.beans.finder.MethodFinder.findMethod(java.desktop@11.0.7-internal/MethodFinder.java:81)
        at java.beans.Statement.getMethod(java.desktop@11.0.7-internal/Statement.java:369)
        at java.beans.Statement.invokeInternal(java.desktop@11.0.7-internal/Statement.java:273)
        at java.beans.Statement$2.run(java.desktop@11.0.7-internal/Statement.java:187)
        at java.security.AccessController.doPrivileged(java.base@11.0.7-internal/Native Method)
        at java.beans.Statement.invoke(java.desktop@11.0.7-internal/Statement.java:184)
        at java.beans.Expression.getValue(java.desktop@11.0.7-internal/Expression.java:155)
        at java.beans.Encoder.getValue(java.desktop@11.0.7-internal/Encoder.java:105)
        at java.beans.Encoder.get(java.desktop@11.0.7-internal/Encoder.java:252)
        at java.beans.PersistenceDelegate.writeObject(java.desktop@11.0.7-internal/PersistenceDelegate.java:112)
        at java.beans.Encoder.writeObject(java.desktop@11.0.7-internal/Encoder.java:74)
        at java.beans.XMLEncoder.writeObject(java.desktop@11.0.7-internal/XMLEncoder.java:326)
Now, I did this exercise in order to simulate read timeout problems that occur right after running an all-project sync using the sync.py command. This command runs a number of reindex_project.py programs in parallel and each reindex_project.py retrieves the configuration from the web app at the start. Using --api_timeout with an increased value for the Python tools works as a workaround; however, my expectation is that this should scale.
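A rough way to look at this without the web app in the picture is to call `XMLEncoder` directly from several threads and compare the average time per encode. The sketch below is only that: it assumes a plain `ArrayList` of strings as a stand-in for the real configuration object (so the numbers will not match the ones above), and the class and method names are hypothetical.

```java
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical micro-benchmark; the payload is a stand-in, not the OpenGrok configuration.
class XmlEncoderScalingSketch {
    // Builds a list of strings that plays the role of a large configuration object.
    static ArrayList<String> buildPayload(int entries) {
        ArrayList<String> list = new ArrayList<>();
        for (int i = 0; i < entries; i++) {
            list.add("project-" + i);
        }
        return list;
    }

    // Serializes the payload once with XMLEncoder and returns the XML size in bytes.
    static int encodeOnce(Object payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(payload);
        }
        return out.size();
    }

    // Runs `threads` workers, each encoding `iterations` times,
    // and returns the average wall time per encode in milliseconds.
    static double averageMillis(Object payload, int threads, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    long start = System.nanoTime();
                    for (int i = 0; i < iterations; i++) {
                        encodeOnce(payload);
                    }
                    return System.nanoTime() - start;
                }));
            }
            long totalNanos = 0;
            for (Future<Long> result : results) {
                totalNanos += result.get();
            }
            return totalNanos / 1e6 / ((double) threads * iterations);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        ArrayList<String> payload = buildPayload(5000);
        int cpus = Runtime.getRuntime().availableProcessors();
        averageMillis(payload, 1, 3); // warm-up so JIT compilation does not dominate
        System.out.printf("1 thread:   %.1f ms per encode%n", averageMillis(payload, 1, 5));
        System.out.printf("%d threads: %.1f ms per encode%n", cpus, averageMillis(payload, cpus, 5));
    }
}
```

If the per-encode time grows noticeably with the thread count, a jstack snapshot taken during the multithreaded run can be compared against the trace above.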
Another feature that could be brought with a new serialization scheme is wildcards. For instance, I'd like to be able to set project properties for a set of projects specified with wildcards (regexps, even), similarly to what is done in the opengrok-mirror configuration:
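To illustrate the kind of lookup this implies on the Java side (this is not the opengrok-mirror format, just a hypothetical sketch), properties could be keyed by a regular expression and merged for every pattern that matches a project name. The class `ProjectPropertyResolver`, the property names, and the "later rules override earlier ones" policy are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical resolver for pattern-keyed project properties.
class ProjectPropertyResolver {
    // One rule: a compiled pattern plus the properties it contributes.
    private static final class Rule {
        final Pattern pattern;
        final Map<String, String> properties;

        Rule(Pattern pattern, Map<String, String> properties) {
            this.pattern = pattern;
            this.properties = properties;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Registers properties for all projects whose full name matches the regex.
    void addRule(String regex, Map<String, String> properties) {
        rules.add(new Rule(Pattern.compile(regex), properties));
    }

    // Merges the properties of all matching rules in declaration order,
    // so a later (typically more specific) rule overrides an earlier one.
    Map<String, String> resolve(String projectName) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(projectName).matches()) {
                merged.putAll(rule.properties);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ProjectPropertyResolver resolver = new ProjectPropertyResolver();
        resolver.addRule("jdk.*", Map.of("tabSize", "8"));
        resolver.addRule("jdk11", Map.of("tabSize", "4", "historyEnabled", "false"));
        System.out.println(resolver.resolve("jdk8"));
        System.out.println(resolver.resolve("jdk11"));
    }
}
```

Whether overrides should follow declaration order or some notion of specificity is a design decision the new format would have to pin down.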
YAML is probably not so great, so perhaps something like TOML might be a better idea; however, we still need to address the need to serialize objects like Project and RepositoryInfo. It seems that some TOML Java implementations support serialization.
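Independent of which format wins, one way to handle objects like Project is an explicit mapping to and from plain maps, which is the shape most TOML/YAML/JSON libraries read and write. The sketch below assumes a simplified, hypothetical `ProjectData` class with made-up fields; it is not the actual Project class and does not use any particular TOML library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a Project-like object; fields are illustrative only.
final class ProjectData {
    final String name;
    final String path;
    final int tabSize;

    ProjectData(String name, String path, int tabSize) {
        this.name = name;
        this.path = path;
        this.tabSize = tabSize;
    }

    // Converts the object into plain data (strings and numbers only).
    Map<String, Object> toMap() {
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("name", name);
        data.put("path", path);
        data.put("tabSize", tabSize);
        return data;
    }

    // Rebuilds the object from plain data. Unknown keys are ignored and a
    // missing tabSize falls back to 0, so removed or added options do not break loading.
    static ProjectData fromMap(Map<String, Object> data) {
        Object tab = data.get("tabSize");
        // Accept any Number, since parsers may hand back Integer or Long.
        int tabSize = (tab instanceof Number) ? ((Number) tab).intValue() : 0;
        return new ProjectData((String) data.get("name"), (String) data.get("path"), tabSize);
    }

    public static void main(String[] args) {
        Map<String, Object> data = new ProjectData("demo", "/demo", 4).toMap();
        data.put("obsoleteOption", true); // an option that no longer exists
        ProjectData restored = fromMap(data);
        System.out.println(restored.name + " " + restored.path + " " + restored.tabSize);
    }
}
```

Compared to bean serialization, the mapping is written by hand, but that is also what makes it tolerant of class hierarchy changes and dropped options.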
Activity
tulinkry commented on Aug 31, 2018
Yes, finally.
tulinkry commented on Feb 4, 2019
Looks like yaml would be the way to go.
vladak commented on Apr 12, 2019
Also, the configuration should be treated as data, not serialized objects, to avoid security vulnerabilities that might happen when de-serializing XML into Java objects.
vladak commented on Mar 28, 2022
The other reason for using something else is performance. Lately, I realized that XMLEncoder does not scale when retrieving configuration using the RESTful API. When running a multithreaded program where each thread just retrieves the configuration in a loop, with the number of threads matching the number of CPUs, the retrieval times shoot up to almost 2 seconds, compared to 0.4 seconds for a single-threaded program. The XML file with the configuration is about 1.38 MB. When I took a jstack snapshot, it revealed that lots of the XMLEncoder processing threads (like 25 out of the 32 threads I was using) were waiting on an internal synchronization object, with the top of the stack looking like this:
Now, I did this exercise in order to simulate read timeout problems that occur right after running an all-project sync using the sync.py command. This command runs a number of reindex_project.py programs in parallel and each reindex_project.py retrieves the configuration from the web app at the start. Using --api_timeout with an increased value for the Python tools works as a workaround; however, my expectation is that this should scale.
vladak commented on Oct 19, 2022
Another feature that could be brought with a new serialization scheme is wildcards. For instance, I'd like to be able to set project properties for a set of projects specified with wildcards (regexps, even), similarly to what is done in the opengrok-mirror configuration:
vladak commented on Dec 1, 2022
YAML is probably not so great, so perhaps something like TOML might be a better idea; however, we still need to address the need to serialize objects like Project and RepositoryInfo. It seems that some TOML Java implementations support serialization.