Monitoring

Lobster produces monitoring plots, which are saved to the directory specified in the Lobster configuration or, alternatively, to the one given when issuing the following command:

lobster plot --outdir <monitoring directory> <configuration>

The monitoring is split into a Lobster overview page and per-category pages displaying progress and task status.
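
For scripted use, the `lobster plot` invocation shown above can be wrapped in a small helper. The following is a minimal sketch, not part of Lobster itself: the function names `build_plot_command` and `run_plot` are hypothetical, and only the command-line form given above is assumed.

```python
import shlex
import subprocess


def build_plot_command(outdir, configuration):
    """Return the argument list for `lobster plot --outdir <outdir> <configuration>`."""
    return ["lobster", "plot", "--outdir", str(outdir), str(configuration)]


def run_plot(outdir, configuration, dry_run=True):
    """Print the command line; execute it only when dry_run is False."""
    command = build_plot_command(outdir, configuration)
    print(shlex.join(command))
    if not dry_run:
        # check=True raises CalledProcessError if lobster exits non-zero.
        subprocess.run(command, check=True)
    return command
```

Calling `run_plot("plots", "config.py", dry_run=False)` would then run the command; both path names here are placeholders.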

ELK Commands

Lobster has a few commands to help manage ELK monitoring:

lobster elkdownload <configuration>
lobster elkupdate <configuration>
lobster elkcleanup <configuration>

elkdownload downloads templates of all dashboards that are listed in the configuration and carry the user/project prefix specified there, together with all visualizations on those dashboards and all index patterns matching the user/run prefix.

elkupdate generates dashboards, visualizations, and index patterns from the saved templates according to the dashboards specified in the configuration.

elkcleanup deletes all Kibana objects and Elasticsearch indices that match the user/run prefix in the configuration.
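
Because elkcleanup is destructive, a wrapper script may want to guard it behind an explicit confirmation. The sketch below illustrates one way to do that; the helper names and the `confirm_cleanup` parameter are hypothetical, and only the three subcommands listed above are assumed to exist.

```python
import subprocess

# Subcommands named in the text above.
ELK_ACTIONS = ("elkdownload", "elkupdate", "elkcleanup")


def build_elk_command(action, configuration):
    """Return the argument list for `lobster <action> <configuration>`."""
    if action not in ELK_ACTIONS:
        raise ValueError(f"unknown ELK action: {action!r}")
    return ["lobster", action, str(configuration)]


def run_elk(action, configuration, confirm_cleanup=False):
    """Run one ELK subcommand; refuse elkcleanup unless explicitly confirmed."""
    command = build_elk_command(action, configuration)
    if action == "elkcleanup" and not confirm_cleanup:
        raise RuntimeError(
            "elkcleanup deletes Kibana objects and Elasticsearch indices; "
            "pass confirm_cleanup=True to proceed"
        )
    # check=True raises CalledProcessError if lobster exits non-zero.
    subprocess.run(command, check=True)
    return command
```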

Task Exit Codes

Lobster uses the following error codes, which are referred to in the Failed Tasks section of the category monitoring pages:

Code   Reason
169    Unable to run parrot
170    Sandbox unpacking failure
171    Failed to determine base release
172    Failed to find old releasetop
173    Failed to create new release area
174    cmsenv failure
175    Failed to source the environment (may be parrot related)
179    Stagein failure
180    Prologue failure
185    Failed to run command
190    Failed to parse report.xml
191    Failed to parse wrapper timing information
199    Epilogue failure
200    Generic parrot failure
210    Stageout failure during transfer
211    Stageout failure cross-checking transfer
500    Publish failure
10001  Generic task failure reported by WorkQueue
10010  Task timed out
10020  Task exceeded maximum number of retries
10030  Task exceeded maximum runtime
10040  Task exceeded maximum memory
10050  Task exceeded maximum disk
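
When post-processing task logs, it can be convenient to have the table above available as a lookup. The following sketch simply transcribes it into a dictionary; the names `EXIT_CODES`, `describe_exit_code`, and `summarize` are hypothetical and are not part of the Lobster API.

```python
from collections import Counter

# Transcribed from the table above.
EXIT_CODES = {
    169: "Unable to run parrot",
    170: "Sandbox unpacking failure",
    171: "Failed to determine base release",
    172: "Failed to find old releasetop",
    173: "Failed to create new release area",
    174: "cmsenv failure",
    175: "Failed to source the environment (may be parrot related)",
    179: "Stagein failure",
    180: "Prologue failure",
    185: "Failed to run command",
    190: "Failed to parse report.xml",
    191: "Failed to parse wrapper timing information",
    199: "Epilogue failure",
    200: "Generic parrot failure",
    210: "Stageout failure during transfer",
    211: "Stageout failure cross-checking transfer",
    500: "Publish failure",
    10001: "Generic task failure reported by WorkQueue",
    10010: "Task timed out",
    10020: "Task exceeded maximum number of retries",
    10030: "Task exceeded maximum runtime",
    10040: "Task exceeded maximum memory",
    10050: "Task exceeded maximum disk",
}


def describe_exit_code(code):
    """Return the documented reason for a code, or a fallback string."""
    return EXIT_CODES.get(code, "Unknown exit code")


def summarize(codes):
    """Count how often each documented reason occurs in a list of exit codes."""
    return Counter(describe_exit_code(code) for code in codes)
```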

Error codes lower than 170 may indicate a cmsRun problem, while codes O(1k) may hint at a CMS configuration or runtime problem. Codes O(10k) are internal Work Queue error codes and may be bitmasked together; e.g., 100514 is a combination of errors 100512 and 100002.
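
The bitmask example can be unpacked mechanically. The sketch below assumes, based solely on the example in the preceding paragraph, that a combined code is a fixed offset of 100000 plus a bitwise OR of individual flags; the offset, the function name `split_bitmasked_code`, and its `base` parameter are assumptions and are not taken from Work Queue or Lobster documentation.

```python
def split_bitmasked_code(code, base=100000):
    """Split a combined code into its single-flag components.

    Assumes code == base + (bitwise OR of power-of-two flags); the default
    base is inferred only from the example 100514 = 100512 | 100002.
    """
    flags = code - base
    if flags <= 0:
        raise ValueError(f"{code} is not above the assumed base {base}")
    parts = []
    bit = 1
    while bit <= flags:
        if flags & bit:
            # Each set bit corresponds to one individual error code.
            parts.append(base + bit)
        bit <<= 1
    return parts


print(split_bitmasked_code(100514))  # [100002, 100512]
```

Here 514 is 512 + 2, so exactly two bits are set and the combined code splits into the two errors named in the example.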