gatohep.utils

class gatohep.utils.LearningRateScheduler(optimizer, lr_initial=0.5, lr_final=0.001, *, total_epochs=100, mode='cosine', verbose=False)

Cosine or exponential annealing for an optimizer’s learning rate.

update(epoch)

Update optimizer.learning_rate based on the current epoch.

Return type:

float

class gatohep.utils.SteepnessScheduler(model, t_initial=1.0, t_final=0.01, *, total_epochs=100, mode='exponential', verbose=False)

Anneal every cfg["k"] in a gato_sigmoid_model.

Inherits all arguments from TemperatureScheduler but updates the steepness parameters stored in model.var_cfg[j]["k"].

Notes

  • Call update(epoch) once per epoch, exactly like the TemperatureScheduler.

  • Works whether each k is a tf.Variable or a plain float.

update(epoch)

Update every steepness parameter model.var_cfg[j]["k"] for the given epoch index.

class gatohep.utils.TemperatureScheduler(model, t_initial=1.0, t_final=0.01, *, total_epochs=100, mode='exponential', verbose=False)

Anneal a GATO model’s temperature variable during training.

Parameters:
  • model (gato_gmm_model) – The model whose temperature (tf.Variable) is updated in-place.

  • t_initial (float) – Temperature at epoch 0.

  • t_final (float) – Temperature at total_epochs.

  • total_epochs (int) – Number of epochs that constitute one full annealing cycle.

  • mode ({"exponential", "cosine"}, optional) –

    • “exponential” - geometric decay \(T_e = T_0 (T_f/T_0)^{e/E}\)

    • “cosine” - half-cosine schedule \(T_e = T_f + 0.5\,(T_0 - T_f)\,[1+\cos(\pi e/E)]\)

  • verbose (bool, optional) – If True, prints the new temperature each epoch.

Notes

Call update(epoch) once per epoch (or more often, if desired).

update(epoch)

Update model.temperature for the given epoch index.
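The two annealing modes listed above can be sketched in plain Python. This is a minimal illustration of the documented formulas, not the gatohep implementation: the helper name annealed_value is hypothetical, and clamping epochs outside [0, total_epochs] is an assumption.

```python
import math


def annealed_value(epoch, t_initial=1.0, t_final=0.01, total_epochs=100, mode="exponential"):
    """Return the scheduled value at `epoch` (hypothetical helper, not gatohep API)."""
    # Assumption: progress is clamped to [0, 1], so later epochs stay at t_final.
    frac = min(max(epoch / total_epochs, 0.0), 1.0)
    if mode == "exponential":
        # Geometric decay: T_e = T_0 * (T_f / T_0) ** (e / E)
        return t_initial * (t_final / t_initial) ** frac
    if mode == "cosine":
        # Half-cosine: T_e = T_f + 0.5 * (T_0 - T_f) * (1 + cos(pi * e / E))
        return t_final + 0.5 * (t_initial - t_final) * (1.0 + math.cos(math.pi * frac))
    raise ValueError(f"unknown mode: {mode!r}")


# Compare both modes at the start, midpoint and end of the cycle.
for epoch in (0, 50, 100):
    print(epoch, annealed_value(epoch), annealed_value(epoch, mode="cosine"))
```

Both modes start at t_initial and end at t_final. The exponential mode drops fastest in the early epochs, while the half-cosine mode changes slowly at both ends of the cycle.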

gatohep.utils.align_boundary_tracks(history, dist_tol=0.02, gap_max=20)

Align boundary tracks across epochs.

Parameters:
  • history (list of lists) – Each inner list contains boundary values at a specific epoch.

  • dist_tol (float, optional) – Maximum distance tolerance for matching boundaries. Default is 0.02.

  • gap_max (int, optional) – Maximum gap in epochs for considering a track inactive. Default is 20.

Returns:

A 2D array of shape (n_epochs, n_tracks) with NaNs where no boundary exists.

Return type:

ndarray

gatohep.utils.asymptotic_significance(S, B, eps=1e-09)

Compute the asymptotic significance using the Asimov formula.

Parameters:
  • S (tf.Tensor) – Signal counts.

  • B (tf.Tensor) – Background counts.

  • eps (float, optional) – Small value to avoid division by zero. Default is 1e-9.

Returns:

Asymptotic significance values.

Return type:

tf.Tensor
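For reference, the standard Asimov significance is \(Z = \sqrt{2\,[(S+B)\ln(1+S/B) - S]}\). The scalar sketch below evaluates it with the standard library only. The function name asimov_significance is hypothetical, and adding eps to B is an assumption about how the guard is applied; the library version operates on tf.Tensor inputs and may differ.

```python
import math


def asimov_significance(s, b, eps=1e-9):
    """Z = sqrt(2 * ((s + b) * ln(1 + s / b) - s)) for scalar counts."""
    # Assumption: eps is added to b so that b = 0 does not divide by zero.
    b_safe = b + eps
    z_squared = 2.0 * ((s + b_safe) * math.log1p(s / b_safe) - s)
    # Rounding can push z_squared slightly below zero when s is tiny.
    return math.sqrt(max(z_squared, 0.0))


# For s << b the result sits slightly below the s / sqrt(b) approximation.
print(asimov_significance(10.0, 100.0))
```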

gatohep.utils.build_category_mass_maps(assignments, data_dict, n_cats, *, bins=40, mass_range=(100.0, 180.0), axis_name='mass')

Build per-category diphoton-mass histograms for each process.

Parameters:
  • assignments (dict[str, np.ndarray]) – Hard bin assignments produced by gato_gmm_model.get_bin_indices().

  • data_dict (dict[str, pandas.DataFrame]) – Mapping from process name to a dataframe containing "mass" and "weight" columns.

  • n_cats (int) – Total number of GMM categories.

  • bins – Passed through to create_hist().

  • mass_range – Passed through to create_hist().

  • axis_name – Passed through to create_hist().

Returns:

One entry per category with per-process histograms.

Return type:

list[dict[str, hist.Hist]]

gatohep.utils.build_mass_histograms(data_dict, *, bins=60, mass_range=(100.0, 180.0), axis_name='mass')

Create diphoton-mass histograms for every process dataframe.

Parameters:
  • data_dict (dict[str, pandas.DataFrame]) – Mapping from process name to a dataframe with "mass" and "weight" columns.

  • bins (int, optional) – Number of uniform bins in the specified mass range.

  • mass_range (tuple[float, float], optional) – Inclusive histogram range in GeV.

  • axis_name (str, optional) – Name assigned to the histogram axis (for plotting labels).

Returns:

One histogram per process.

Return type:

dict[str, hist.Hist]

gatohep.utils.compute_mass_reweight_factors(model, data_dict, *, signal_labels=None, feature_key='NN_output', mass_column='mass', weight_column='weight', mass_sb_low=100.0, mass_sb_high=180.0, mass_sig_low=123.5, mass_sig_high=126.5, nbins=10)

Fit an exponential to each category’s diphoton-mass spectrum and return per-bin factors that map the continuum yield in the full sideband range (mass_sb_low to mass_sb_high, 100-180 GeV by default) to the yield expected in the signal window (mass_sig_low to mass_sig_high, 123.5-126.5 GeV by default, i.e. 125 GeV +/- 1 sigma).
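As a rough illustration of such a factor: for a continuum shaped like exp(-slope · m), the fraction of the 100-180 GeV yield that lands in the signal window is a ratio of two integrals. The sketch below is written under that assumption only. The helper names exp_integral and window_fraction are hypothetical, and how gatohep fits the slope or treats the signal window during the fit is not specified here.

```python
import math


def exp_integral(slope, low, high):
    """Integral of exp(-slope * m) over [low, high]; slope = 0 is the flat limit."""
    if abs(slope) < 1e-12:
        return high - low
    return (math.exp(-slope * low) - math.exp(-slope * high)) / slope


def window_fraction(slope, sb_low=100.0, sb_high=180.0, sig_low=123.5, sig_high=126.5):
    """Fraction of the continuum in [sb_low, sb_high] that falls in the signal window."""
    return exp_integral(slope, sig_low, sig_high) / exp_integral(slope, sb_low, sb_high)


print(window_fraction(0.0))   # flat spectrum: window width over range width, 3 / 80
print(window_fraction(0.03))  # falling spectrum with an assumed slope of 0.03 per GeV
```

Multiplying a category's continuum yield over the full range by this fraction gives the yield expected inside the window, which is the role the returned factors play in the description above.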

gatohep.utils.compute_significance_from_hists(h_signal, h_bkg_list)

Compute the significance from signal and background histograms.

Parameters:
  • h_signal (hist.Hist) – Histogram of signal events.

  • h_bkg_list (list of hist.Hist) – List of histograms for background events.

Returns:

Combined significance value.

Return type:

float

gatohep.utils.convert_mass_data_to_tensors(data_dict)

Convert the dataframe-based storage into TensorFlow tensors.

Parameters:

data_dict (dict[str, pandas.DataFrame]) – Mapping whose dataframes contain NN_output, weight and mass.

Returns:

Dictionary mirroring the input keys with tensor-valued payload.

Return type:

dict[str, dict[str, tf.Tensor]]

gatohep.utils.create_hist(data, weights=None, bins=50, low=0.0, high=1.0, name='NN_output')

Create a histogram from data and weights.

Parameters:
  • data (array_like) – Data to be binned.

  • weights (array_like, optional) – Weights for the data. Default is None.

  • bins (int or array_like, optional) – Number of bins or bin edges. Default is 50.

  • low (float, optional) – Lower bound of the histogram range. Default is 0.0.

  • high (float, optional) – Upper bound of the histogram range. Default is 1.0.

  • name (str, optional) – Name of the histogram axis. Default is “NN_output”.

Returns:

A histogram object.

Return type:

hist.Hist

gatohep.utils.df_dict_to_tensors(data_dict)

Convert a dictionary of DataFrames to a dictionary of tensors.

Parameters:

data_dict (dict) – A dictionary where keys are process names and values are pandas.DataFrames with columns “NN_output” and “weight”.

Returns:

A dictionary where keys are process names and values are dictionaries containing tensors with keys “x” and “w”.

Return type:

dict

gatohep.utils.generate_resonance_toy_data(n_signal1=60000, n_signal2=60000, n_bkg=400000, *, noise_scale=0.2, mass_sigma=1.5, seed=7, background_slopes=None)

Extend the 3-class toy dataset with Higgs-like diphoton masses.

Parameters:
  • n_signal1 (int) – Number of events in the first signal class, passed to generate_toy_data_3class_3D().

  • n_signal2 (int) – Number of events in the second signal class, passed to generate_toy_data_3class_3D().

  • n_bkg (int) – Number of background events, passed to generate_toy_data_3class_3D().

  • noise_scale (float, optional) – Multiplicative feature noise forwarded to the base generator.

  • mass_sigma (float, optional) – Gaussian width of the resonant signal peak.

  • seed (int, optional) – Seed for deterministic feature and mass sampling.

  • background_slopes (sequence of float, optional) – Exponential slopes for the continuum components. If omitted, a default tuple is used and cycled over all background processes.

Returns:

The original dataframes augmented with a "mass" column.

Return type:

dict[str, pandas.DataFrame]

gatohep.utils.safe_sigmoid(z, steepness)

Compute a numerically stable sigmoid function.

Parameters:
  • z (tf.Tensor) – Input tensor.

  • steepness (float) – Steepness of the sigmoid function.

Returns:

Output tensor after applying the sigmoid function.

Return type:

tf.Tensor
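One common way to keep a sigmoid numerically stable is to branch on the sign of the argument so that exp() never sees a large positive value. The scalar sketch below shows that technique with the standard library. The name stable_sigmoid is hypothetical, and this is not necessarily how safe_sigmoid is implemented (a tensor version might clip the argument instead).

```python
import math


def stable_sigmoid(z, steepness=1.0):
    """Evaluate 1 / (1 + exp(-steepness * z)) without overflowing exp()."""
    x = steepness * z
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is at most 1, so this branch cannot overflow either.
    e = math.exp(x)
    return e / (1.0 + e)


# A naive 1 / (1 + math.exp(-x)) raises OverflowError for x = -10000.
print(stable_sigmoid(-1000.0, steepness=10.0))
```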

gatohep.utils.sample_truncated_exponential(rng, slope, size, *, low, high)

Draw samples from a truncated exponential distribution.

Parameters:
  • rng (np.random.Generator) – Random-number generator used for sampling.

  • slope (float) – Positive exponential slope λ in exp(-λ·x).

  • size (int) – Number of samples to draw.

  • low (float) – Lower bound of the truncation interval.

  • high (float) – Upper bound of the truncation interval (must exceed low).

Returns:

Array of shape (size,) with samples in [low, high].

Return type:

np.ndarray
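Sampling from a density proportional to exp(-λ·x) on [low, high] can be done by inverting the truncated CDF. The sketch below shows that approach under stated assumptions. It swaps in the standard-library random.Random for the documented np.random.Generator and returns a list rather than an array. The name truncated_exponential is hypothetical, and the library may use a different method.

```python
import math
import random


def truncated_exponential(rng, slope, size, *, low, high):
    """Draw `size` values with density proportional to exp(-slope * x) on [low, high]."""
    if not (slope > 0 and high > low):
        raise ValueError("need slope > 0 and high > low")
    # Probability mass of the shifted, untruncated exponential inside the interval.
    mass = -math.expm1(-slope * (high - low))
    # Invert the truncated CDF F(x) = (1 - exp(-slope * (x - low))) / mass.
    return [low - math.log1p(-rng.random() * mass) / slope for _ in range(size)]


rng = random.Random(7)  # fixed seed for reproducible draws
samples = truncated_exponential(rng, 0.03, 5, low=100.0, high=180.0)
print(samples)
```

Inversion maps each uniform draw to exactly one sample, so unlike rejection sampling it discards nothing, even when the interval is narrow.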

gatohep.utils.slice_to_2d_features(data_dict)

Drop the background node of the pseudo-softmax feature vector.

Parameters:

data_dict (dict[str, pandas.DataFrame]) – Input dictionary produced by the toy generator.

Returns:

Shallow copies where "NN_output" only retains the first two components per event.

Return type:

dict[str, pandas.DataFrame]