Workflow Data Schema
====================

The Workflow schema captures high-level metadata about a Flowcept-enabled run and its structure, including user-defined information, system context, and artifact lineage.

Workflow Fields
----------------

- **workflow_id** (str):
  Unique identifier for this workflow execution.

- **used** (dict):
  Dictionary of data or resources used by the workflow. Typically includes model and dataset file paths and script-level global configuration.

- **generated** (dict):
  Dictionary of outputs generated by the workflow (e.g., model paths, key summarized results, artifacts).


- **parent_workflow_id** (str):
  Identifier for the parent workflow, if applicable (e.g., for nested workflows or pipelines).

- **machine_info** (dict):
  Information about the machine or compute resource where the workflow was executed.

- **conf** (dict):
  Configuration for the run; typically contains the path to the Flowcept settings file that was used.

- **flowcept_settings** (dict):
  Snapshot of the resolved Flowcept-specific configuration loaded from the ``settings.yaml`` file.

- **flowcept_version** (str):
  Version of Flowcept used during execution.

- **utc_timestamp** (float):
  Timestamp (UTC, seconds since epoch) indicating when this workflow metadata was recorded.

- **user** (str):
  Username of the person or agent who initiated the workflow.
  Derived from ``sys_metadata.login_name`` if available; otherwise falls back to ``getpass.getuser()``, ``os.getlogin()``, or remains ``None``.

- **campaign_id** (str):
  Optional campaign identifier grouping related workflows.

- **adapter_id** (str):
  The adapter or source component that launched or instrumented this workflow.

- **interceptor_ids** (list of str):
  List of Flowcept interceptor instance identifiers used during instrumentation.

- **name** (str):
  Human-readable name for the workflow (e.g., "training-run-001").

- **subtype** (str):
  Optional workflow subtype/category (e.g., ``ml_workflow``).

- **custom_metadata** (dict):
  User-defined metadata for extended tagging or traceability.

- **environment_id** (str):
  Identifier for the execution environment (e.g., a cluster name such as Frontier or Summit).

- **sys_name** (str):
  Name of the operating system.
  Derived from ``os.uname()[0]``.

- **node_name** (str):
  Hostname of the compute node.
  Derived from ``os.uname()[1]``.

- **hostname** (str):
  Fully qualified domain name of the host, resolved via ``socket.getfqdn()`` with multiple fallbacks.

- **public_ip** (str):
  Public IP address of the machine. Derived from ``sys_metadata.public_ip``, if available.

- **private_ip** (str):
  Private (intranet) IP address of the machine. Derived from ``sys_metadata.private_ip``, if available.

- **extra_metadata** (str):
  Serialized extra metadata captured from external config sources.
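
To make the field list concrete, the sketch below builds two workflow records as plain Python dictionaries and links them through ``parent_workflow_id`` and a shared artifact path. Only the field names come from the schema above; every value, the ``shared_artifacts`` helper, and the choice of plain dictionaries are illustrative assumptions, not Flowcept's actual API.

.. code-block:: python

   import time
   import uuid

   # Hypothetical parent record: field names follow the schema, values are made up.
   training = {
       "workflow_id": str(uuid.uuid4()),
       "parent_workflow_id": None,
       "campaign_id": "campaign-demo",
       "name": "training-run-001",
       "subtype": "ml_workflow",
       "utc_timestamp": time.time(),  # seconds since the epoch
       "used": {"dataset_path": "data/train.csv", "config": {"epochs": 10}},
       "generated": {"model_path": "models/model.pt", "accuracy": 0.93},
       "custom_metadata": {"team": "demo"},
   }

   # Hypothetical child record that consumes the model produced by the parent.
   evaluation = {
       "workflow_id": str(uuid.uuid4()),
       "parent_workflow_id": training["workflow_id"],
       "campaign_id": training["campaign_id"],
       "name": "evaluation-run-001",
       "subtype": "ml_workflow",
       "utc_timestamp": time.time(),
       "used": {"model_path": training["generated"]["model_path"]},
       "generated": {},
       "custom_metadata": {},
   }


   def shared_artifacts(producer, consumer):
       """Return string values that `producer` generated and `consumer` used."""
       produced = {v for v in producer["generated"].values() if isinstance(v, str)}
       consumed = {v for v in consumer["used"].values() if isinstance(v, str)}
       return sorted(produced & consumed)


   lineage = shared_artifacts(training, evaluation)

Here ``lineage`` contains the model path that connects the two runs, which is the kind of relationship the ``used`` and ``generated`` fields are meant to record.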

Notes
-----

- The schema prioritizes system-derived metadata from a ``sys_metadata`` block inside the Flowcept configuration.
- System identification combines environment variables, standard-library calls, and fallback mechanisms so that it stays portable across platforms.
- The ``used`` and ``generated`` fields support artifact lineage and can store references to any structured or semi-structured resource.
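
The fallback behavior described for the ``user`` and ``hostname`` fields can be sketched roughly as follows. This is a minimal illustration and not Flowcept's implementation: the function names ``resolve_user`` and ``resolve_hostname`` are hypothetical, ``sys_metadata`` is assumed to be a plain dictionary, the order of the ``user`` fallbacks is assumed to match the order in which they are listed, and ``socket.gethostname()`` is an assumed stand-in for the unspecified hostname fallbacks.

.. code-block:: python

   import getpass
   import os
   import socket


   def resolve_user(sys_metadata=None):
       """Prefer the configured login name, then standard-library lookups."""
       login_name = (sys_metadata or {}).get("login_name")
       if login_name:
           return login_name
       # Both calls can fail, e.g. when there is no controlling terminal.
       for getter in (getpass.getuser, os.getlogin):
           try:
               return getter()
           except Exception:
               continue
       return None


   def resolve_hostname():
       """Try the fully qualified name first, then the plain host name."""
       for getter in (socket.getfqdn, socket.gethostname):
           try:
               name = getter()
           except Exception:
               continue
           if name:
               return name
       return None


   configured_user = resolve_user({"login_name": "alice"})
   detected_user = resolve_user({})
   detected_host = resolve_hostname()

Catching a broad ``Exception`` keeps the sketch short; a production version would likely narrow this to the specific errors each call can raise.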