Troubleshooting

Missing Jeballto token

Symptom:

  • prepare fails immediately with a token-related error

Checks:

  1. confirm JEBALLTO_TOKEN is exported, or
  2. confirm JEBALLTO_TOKEN_FILE points at a readable Jeballto config file
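
A quick shell check for both conditions, run as the user that launches the runner (the variable names are the ones from the checks above):

    # At least one of these should print a confirmation; printenv only sees exported variables.
    printenv JEBALLTO_TOKEN >/dev/null && echo "JEBALLTO_TOKEN is exported"
    [ -r "${JEBALLTO_TOKEN_FILE:-}" ] && echo "JEBALLTO_TOKEN_FILE is readable: $JEBALLTO_TOKEN_FILE"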

Prepare times out while waiting for capacity

Symptom:

  • the runner stays in prepare for a long time and then fails
  • more than two jobs from the same host show as picked up, with the extra jobs waiting for VM capacity

Likely cause:

  • the host already has two VM slots occupied by RUNNING or PAUSED VMs, and no slot was freed before JEBALLTO_PREPARE_TIMEOUT expired
  • GitLab Runner accepted more jobs than the host can run

What to do:

  • set top-level concurrent = 2 in the active GitLab Runner config.toml (all three settings appear in the sketch after this list)
  • set limit = 2 directly under the Jeballto [[runners]] entry
  • set request_concurrency = 2 directly under the Jeballto [[runners]] entry
  • confirm there is only one active runner process for the host, or that all runner entries on that host add up to at most two slots
  • shorten job runtime
  • add another host
  • confirm old VMs are actually being deleted at the end of jobs
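
A minimal config.toml sketch combining the three settings above; the runner name and path here are hypothetical, only concurrent, limit, and request_concurrency are the point:

    # /etc/gitlab-runner/config.toml (typical path for a system install)
    concurrent = 2            # top level: total jobs across all runner entries

    [[runners]]
      name = "jeballto-host"  # hypothetical entry name
      executor = "custom"
      limit = 2               # max concurrent jobs for this entry
      request_concurrency = 2 # concurrent job requests to GitLab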

SSH never becomes ready

Symptom:

  • VM reaches RUNNING but prepare later fails on SSH readiness

Checks:

  1. verify the VM image enables SSH access for the configured user
  2. verify JEBALLTO_SSH_USER and JEBALLTO_VM_PASSWORD match the image
  3. verify the host can reach the forwarded SSH endpoint
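
A minimal readiness probe, assuming the image uses password auth and sshpass is available on the host; the forwarded endpoint 127.0.0.1:2222 is a placeholder, substitute the real one:

    # Exits 0 once the VM accepts an SSH login for the configured user.
    sshpass -p "$JEBALLTO_VM_PASSWORD" \
      ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
          -p 2222 "$JEBALLTO_SSH_USER@127.0.0.1" true \
      && echo "SSH ready"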

Toolchain probe fails

Symptom:

  • SSH works, but prepare still fails

Likely cause:

  • one of the required tools is missing in the image (see the probe sketch after the list below)

Required tools:

  • bash
  • git
  • gitlab-runner
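
Once SSH works, a short loop run inside the VM (or over the same SSH channel) reports which of the three tools is missing:

    # Prints nothing when the toolchain is complete.
    for tool in bash git gitlab-runner; do
      command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
    done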

Cleanup leaves state files behind

Symptom:

  • new jobs for the same runner and job context behave strangely

Checks:

  1. inspect JEBALLTO_STATE_ROOT
  2. look for stale JSON or lock files
  3. confirm the runner process can write to and remove files under that directory
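
A sketch of those checks in shell; the runner account name (gitlab-runner here) is an assumption, substitute the user that actually runs the executor:

    # Checks 1-2: look for leftovers under the state root.
    find "$JEBALLTO_STATE_ROOT" -name '*.json' -o -name '*.lock'
    # Check 3: prove the runner user can create and remove files there.
    sudo -u gitlab-runner sh -c \
      "touch '$JEBALLTO_STATE_ROOT/.write-test' && rm '$JEBALLTO_STATE_ROOT/.write-test'"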

Image pull contention looks stuck

Symptom:

  • several jobs appear to pause on the same image

Explanation:

  • one job is holding the image pull lock while it pulls the image
  • the other jobs are waiting intentionally

This is usually healthy behavior unless the pull itself is stalled.
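
The waiting pattern is what a plain advisory file lock produces. A purely illustrative shell sketch of the same idea (flock and the helper name are not necessarily what Jeballto uses):

    # One job pulls; the others block on the lock, then find the image already cached.
    (
      flock 9                          # blocks until the current holder releases fd 9
      pull_image_if_missing "$IMAGE"   # hypothetical helper; no-op when already pulled
    ) 9>"/var/lock/image-pull.lock"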

Job failed, but GitLab shows a system failure

This usually means the job script did not simply exit non-zero; instead, the failure happened in transport, SSH execution, API access, timeout handling, or cleanup orchestration.

Check the executor logs around the run stage to see whether the script actually started.
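
If the runner is managed by systemd (an assumption; adjust for your install), the journal shows how far the stages got:

    # Look for the run stage starting after a successful prepare.
    journalctl -u gitlab-runner --since "10 minutes ago" --no-pager | grep -iE 'prepare|run|cleanup'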