Estimation and Solvers
If you have ever felt lost choosing between FOCE, LAPLACE, IMP, or SAEM, this page is for you. We will walk through the available estimation methods, when to pick each one, and the little tweaks that make runs faster and more robust.
Estimation methods are the algorithms used to identify the best-fit model population parameters (THETAs, OMEGAs, and SIGMAs) given the specified model and the analysis dataset. The model, together with the dataset, is what we call a “problem” to be solved by our estimation methods.
The aim of this page is to:
- Give a brief overview of the available estimation methods and sensible tweaks/options
- Give an intuitive explanation of the estimation methods on a conceptual level
- Give a sense of when to choose which estimation method
In essence, there are two families of estimation methods available in NONMEM:
- Maximum-Likelihood (“Classical” or “MLE”1) methods
- Expectation-Maximization (“EM”) methods
Finding the individual ETAs is called “Maximum a Posteriori (MAP) estimation”2.
The optimal ETA values themselves are called either Empirical Bayes Estimates (EBEs), Conditional Parametric ETAs (CPEs), MAP estimates, or conditional mode estimates, symbolically denoted η̂.
NONMEM performs MAP estimation using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) optimizer by default. This can be changed using the `OPTMAP` option, but it should generally be left at its default setting.
Subroutines
Before diving into the estimation methods, let's quickly talk about `$SUBROUTINES`.
NONMEM comes with a library of ADVAN routines (Table 1). Their function is to "advance" the kinetic system from one time point to the next (hence "ADVAN"). Another possible name for ADVAN would have been SOLVER, because most ADVAN routines solve a set of differential equations, either analytically or by integration. We differentiate the solvers by their computational nature, as this will impact how we choose and optimize our estimation algorithm. In general, "numerical integration problems" are problems using `$DES`, and "analytical problems" are those that do not (see column "Solver" in Table 1).
Table 1: ADVAN routines

| Routine | Solver | Control records | Description |
|---|---|---|---|
| ADVAN1 | Analytic | | One Compartment Linear Model |
| ADVAN2 | Analytic | | One Compartment Linear Model with First Order Absorption |
| ADVAN3 | Analytic | | Two Compartment Linear Model |
| ADVAN4 | Analytic | | Two Compartment Linear Model with First Order Absorption |
| ADVAN5 | Analytic | `$MODEL` | General Linear Model |
| ADVAN6 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model |
| ADVAN7 | Analytic | `$MODEL` | General Linear Model with Real Eigenvalues |
| ADVAN8 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model with Stiff Differential Equations |
| ADVAN9 | Both | `$MODEL`, `$DES`, `$AES`, `$AESINITIAL` | General Nonlinear Model with Equilibrium Compartments |
| ADVAN10 | Analytic | | One Compartment Model with Michaelis-Menten Elimination |
| ADVAN11 | Analytic | | Three Compartment Linear Model |
| ADVAN12 | Analytic | | Three Compartment Linear Model with First Order Absorption |
| ADVAN13 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model using LSODA |
| ADVAN14 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model using CVODES |
| ADVAN15 | Both | `$MODEL`, `$DES` | General Nonlinear Model with Equilibrium Compartments using IDA |
| ADVAN16 | Numeric | `$MODEL`, `$DES` | Delay Differential Equation Model using RADAR5 |
| ADVAN17 | Both | `$MODEL`, `$DES`, `$AES`, `$AESINITIAL` | Delay Differential Equation Model with Equilibrium Compartments using RADAR5 |
| ADVAN18 | Numeric | `$MODEL`, `$DES` | Delay Differential Equation Model using DDE_SOLVER |
For each ADVAN, there exists a group of TRANS subroutines (numbered 1–6). Think of TRANS as a "translator" (hence "TRANS") that swaps one parameterization for another so you can work in the language that feels most natural to you.

- `TRANS1` is the default for all ADVANs and means that you are using rate constants (k's) as your main parameters.
- `TRANS2`, `TRANS3`, and `TRANS4` are sometimes seen with analytic solvers, meaning you are using CL, V, and potentially Q and KA as parameters instead of rate constants.
- `TRANS5` and `TRANS6` are rarely used, so they can be disregarded for all practical purposes.

For every run, you specify one ADVAN routine and one TRANS routine in the `$SUBROUTINES` record. If you are using `TRANS1`, you don't need to specify it, since it's the default.
Fine-tuning tolerance
Tolerance options may be set at `$SUBROUTINES`, which will be in effect throughout the entire problem3. These options are used for the numerical `ADVAN` routines (see column "Solver" in Table 1).

- `TOL=n`: The relative significant digits precision to which differential equations are to be integrated, so the precision is 10^-TOL. Used with `ADVAN6`, `ADVAN8`, `ADVAN9`, and `ADVAN13` to specify the number of digits that are required to be accurate in the computation of the drug amount in each compartment; the precise meaning depends on the particular ADVAN routine that uses it [1].
- `ATOL=n`: The absolute significant digits precision to which differential equations are to be integrated, so the accuracy is 10^-ATOL. The default is 12. Used with `ADVAN9`, `ADVAN13`, `ADVAN14`, `ADVAN15`, `ADVAN16`, `ADVAN17`, and `ADVAN18`. On occasion, you may want to reduce `ATOL` (usually set equal to `TOL`) to improve speed.
- For the numerical ADVAN routines, internal precision is based on `TOL`, as opposed to the analytical routines, where internal precision is based on `SIGL`.
Classical MLE methods
MLE (or “classical”) methods want to be “statistically fair” to all possibilities. These are also the types of methods used in classical statistical tests like t-test, AN(C)OVA, and even the simple mean of a set of values.
There are essentially three MLE methods available in NONMEM:
- FO (First Order)
- FOCE (First Order with Conditional Estimation)
- LAPLACE (Second order)
The "order" refers to the derivative evaluated in the Taylor expansion [2]. FO and FOCE therefore evaluate the first-order derivative (Jacobian matrix), or the "slope" of the objective function value (OFV). LAPLACE evaluates the second-order derivative (Hessian matrix), or the "curvature" of the OFV, requiring more computational effort than the first-order methods.
FO stands for First Order. It is the perfect method when no ETAs are computed4, as in, e.g., simple TTE models that do not contain a frailty component. It may also be helpful as an exploratory method when fast, approximate results are desirable.
To get the ETAs, specify the option `POSTHOC` (Latin for "after this").
FO can be specified as either `METHOD=ZERO` or `METHOD=0`. Either one is fine, although `METHOD=0` is more common.
FOCE stands for First Order Conditional Estimation. FOCE simultaneously estimates both fixed and random effects, unlike the FO method, which estimates fixed effects first, then random effects.
There is no need to specify the `POSTHOC` option with FOCE; it only applies to the FO method.

It is very common to also use the `INTERACTION` option, and that is then called the FOCEI (FOCE with Interaction) method. The `INTERACTION` option allows for interaction between the two types of random effects: ETAs (between-subject variability) and EPSILONs (residual variability).
FOCE can be specified as either `METHOD=CONDITIONAL` or `METHOD=1`. For explicitness, `METHOD=CONDITIONAL` is preferred.
Options
Sensible options to add from the start:

```
$ESTIMATION METHOD=CONDITIONAL
            INTERACTION     ; (1)
            MAXEVAL=99999
            FAST            ; (2)
            CTYPE=4
            NONINFETA=1     ; (3)
            ETASTYPE=1      ; (3)
            NSIG=2          ; (4)
            SIGL            ; (5)
            SIGLO           ; (6)
            MSFO=root.msf
            PRINT=1
```
1. `INTERACTION` should always be used when residual error is proportional, and does not impact the analysis when the residual error is additive. Thus, always use it.
2. Always use the `FAST` option; there are very few cases where `FAST` is suboptimal5. The `FAST` option benefits from MU-modeling to be more effective.
3. `NONINFETA=1` and `ETASTYPE=1` should be added when not all ETAs are used for all subjects, for example, an ETA on KA during a fit of a subject with only IV dosing.
4. `NSIG <= SIGL/3`. The default is `NSIG=3`, which is OK for analytical problems, but for numerical integration problems `NSIG=2` is the largest one should insist on6.
5. `SIGL <= TOL` (for numerical integration problems). `SIGL <= 14` (for analytical problems, about the limit of double precision).
6. `SIGLO <= TOL` and `SIGLO >= SIGL`. By default, `SIGLO=SIGL`.
Troubleshooting options to add as needed:

```
NSIG=2             ; (1)
NOABORT            ; (2)
NOHABORT           ; (3)
NOTHETABOUNDTEST   ; (4)
NOOMEGABOUNDTEST   ; (4)
NOSIGMABOUNDTEST   ; (4)
MCETA=1000         ; (5)
SEED=14455         ; (6)
RANMETHOD=4P       ; (6)
SADDLE_RESET=1
REPEAT
```
1. Some trial and error for `NSIG`, `SIGL`, and `SIGLO` may still be required given the data.
2. If "Hessian not positive definite".
3. If there is trouble even at the beginning of an optimization with `NOABORT`.
4. If "PARAMETER ESTIMATE IS NEAR ITS DEFAULT BOUNDARY.".
5. If issues with MAP (ETA) estimation.
6. If using `MCETA`.
The relevant default options for FOCE are:
$ESTIMATION METHOD=CONDITIONAL
ABORT
ATOL=12
BOOTDATA=0
CONDITIONAL
CTYPE=0
ETADER=0
ETASTYPE=0
FILE=root.ext
FNLETA=1
FORMAT=s1PE12.5
LEVWT=0 ; If $LEVEL is used
MASSRESET=-1
MAXEVAL=224
MCETA=0
NOLABEL=0
NONINFETA=0
NOPRIOR=0
NOSLOW
NOSUB=0
NOTITLE=0
NSIG=3 ; Also "SIGDIGITS" or "NSIGDIGITS"
NUMDER=0
NUTS_EPARAM
NUTS_MASS
NUTS_OPARAM
NUTS_REG
NUTS_SPARAM
OMITTED
ORDER
PARAFILE
PARAFPRINT
PREDICTION
PRINT=9999
PRIORC
SADDLE_HESS=0
SADDLE_RESET=0
NOSORT
NOREPEAT
NONUMERICAL
NOCENTERING
NOINTERACTION
NOETABARCHECK
THETABOUNDTEST
OMEGABOUNDTEST
SIGMABOUNDTEST
Similar to FOCE, but evaluates the second-order derivative (Hessian) instead of the first-order (Jacobian). It should really be called SOCE (second order conditional estimation). LAPLACE is specified as either `LAPLACE` (most common) or `LAPLACIAN`, as an option to FOCE.
Code for numerical stability
Instead of parameter bounds, re-parameterize. For example, if a parameter `A` should have a lower bound of 0, code it in logarithmic space: `A = EXP(THETA)`. This way, A has a lower bound of 0, but the associated THETA is unbounded.
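A sketch contrasting the two approaches (THETA numbering and values are illustrative):

```
; Bounded parameterization: the optimizer can hit the boundary
$THETA (0, 5)        ; A

; Unbounded log-space parameterization
$THETA 1.6           ; LOG(A), i.e., LOG(5)
$PK
A = EXP(THETA(1))    ; A > 0 by construction, THETA(1) unbounded
```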
To guard against floating-point exceptions, simply write `$ABBREVIATED PROTECT` at the beginning of your control stream. NONMEM will then use "protected" versions of `EXP()`, `LOG()`, `SQRT()`, etc.
Small number of subjects (< 10)?
If you are analyzing very few subjects (fewer than 10), it is most accurate to correct the degrees of freedom when assessing OMEGAs. This can be done using `LDF`, which stands for "loss of degrees of freedom".

If you can MU-model all THETAs, then set `LDF=-1` in `$PK` to let NONMEM do the correction automatically.

If you cannot MU-model all THETAs, the most accurate approach is to set `LDF(n)` to the number of to-be-estimated THETAs involved in `MU_n`.
Precision in practice
`NSIG`: The significant digits for THETAs, OMEGAs, and SIGMAs. Estimation stops when each changes by less than `NSIG` digits between iterations.

`SIGL`: The significant digits for the individual OFV (iOFV). The internal NONMEM `SIGL` is set to 10. Rule of thumb: the iOFV is evaluated to a precision of the smaller of `TOL` and the internal `SIGL`. If all iOFVs are of the same sign, then the total OFV could have a resulting precision of `SIGL` + log10(N), where N is the number of subjects, up to a maximum of 15, the limiting precision of double precision. Thus, with 100 subjects, the actual precision to which the total OFV is evaluated could be 12. One should not necessarily rely on this, so it is safest to assume the more conservative precision of 10, for which a suitable `NSIG` would be 3.
For numerical integration problems:

- Internal precision is based on the user-specified `TOL`, not `SIGL` (see Section 1)
- Often `TOL=6` or less, not 10
- The largest `NSIG` one should insist on is 2
- `SIGL` ≤ `TOL`
- `NSIG` ≤ `SIGL` / 3
`SIGLO`: The significant digits for ETAs. By default, `SIGLO=SIGL`.
Rule-of-thumb for setting NSIG, SIGL, and SIGLO
Most problems will benefit from some trial-and-error testing. See Table 1 for info on solvers.
- Analytical solvers:
  - `SIGL` ≤ 14 (about the limit of double precision)
  - `NSIG` ≤ `SIGL` / 3
- Numerical solvers:
  - `SIGL` ≤ `TOL`
  - `NSIG` ≤ `SIGL` / 3
  - So, if `TOL=6`, then `SIGL` should be 6, and `NSIG` should be 2
- `SIGLO` ≤ `TOL`
CTYPE
Sometimes you may see several iterations with almost no change in OFV. What's happening? One or more parameters are probably bouncing around in the second significant digit, with `NSIG=3`. If a parameter's influence on the objective is this tiny, the traditional convergence rule "every parameter must change by less than `NSIG` significant digits" may never be met, even though the objective is already at its minimum.

To handle this, add `CTYPE=4`. If the OFV doesn't change beyond `NSIG` decimal places for 10 straight iterations, the minimization is deemed successful and the run terminates.
SADDLE_RESET
The `SADDLE_RESET` option was developed to enable more robust parameter estimation through a check for local practical identifiability [3]. In the original article, the authors recommend setting `SADDLE_RESET=1` when using MLE methods (FO, FOCE, LAPLACE).
MCETA
During ETA optimization, NONMEM must pick a starting set of individual ETAs. By default it uses all zeros, but that may trap the algorithm in a local minimum. The `MCETA` option tells NONMEM to test alternative starting points and keep the one that gives the lowest objective function:

- `MCETA=0` (default): Only ETA = 0 is tried.
- `MCETA=1`: Compares ETA = 0 with the ETAs from the previous iteration and starts MAP with whichever produces the lower OFV.
- `MCETA>1`: Generates `MCETA - 1` random ETA vectors sampled from N(0, Ω) and tests them together with ETA = 0 and the previous ETAs; the vector with the lowest OFV becomes the MAP starting point.
If the algorithm ends up in local minima despite `MCETA`, try increasing the value (e.g., from 100 to 1000). Since `MCETA` is most impactful during the early iterations7, another option, to still benefit from the robustness of `MCETA` without impacting the estimation speed too much, is to only use `MCETA` for the first 200 iterations or so.
As mentioned above, MLE methods are inherently reproducible since they provide only one correct answer. However, if one uses the `MCETA` option, a stochastic Monte-Carlo element is introduced that benefits from additional options for reproducibility:

- `RANMETHOD=4P`: The `4` sets the randomization method to the same as the default one in `$SIMULATION`, and the `P` is necessary when running NONMEM in parallel.
- `SEED=123456789`: This enables the Monte-Carlo sequence to be reproduced. Choose any integer, but be consistent within your analysis.
Using the `FAST` option together with MU-modeling will most often improve estimation speeds. Reducing the required significant digits might also increase estimation speeds.

Additionally, the classical NONMEM methods can be facilitated by an EM method: first run a rapid EM method, such as iterative two-stage, and pass the resulting parameters on to FOCE to speed up the analysis.
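Such an ITS-then-FOCE chain could be sketched as (option values are illustrative):

```
$ESTIMATION METHOD=ITS INTERACTION NITER=50 PRINT=1       ; rapid EM pre-fit
$ESTIMATION METHOD=CONDITIONAL INTERACTION MAXEVAL=99999  ; final FOCE fit
```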
EM-methods
EM stands for Expectation-Maximization, and was first published in 1977. EM derives its name from the two iterative steps that define its approach:
E (Expectation) step: THETAs, OMEGAs, and SIGMAs are fixed. ETAs and their variances are calculated. Sometimes, gradients of the likelihood for THETAs and SIGMAs are also calculated.
M (Maximization) step: THETAs, OMEGAs, and SIGMAs are updated based on the ETAs and/or likelihood gradients. The M-step typically requires very little computation time.
Multiple parameter combinations could potentially explain the observed data. Rather than seeking a single “best” solution like MLE methods, EM explores the distribution of likely parameter values and identifies the most likely combination from this distribution.
It is important to understand that parameters can be estimated/refined ad infinitum, since they are continuous. Thus, a stopping (convergence) criterion is needed. Each time you run your model, you might get a slightly different result depending on the stopping criteria.

Also, to avoid searching the infinite space of all parameters (which would literally take forever), the different EM methods restrict the search area in different ways. In general, this also determines the speed of the algorithm. For example, IMP is generally faster than SAEM because IMP searches a smaller space around the initial estimates than SAEM does.
EM methods are most efficient when the model follows a specific statistical structure: fixed effects parameters (THETAs) only define the mean (MU) of the normal population distribution of individual parameters (phi = THETA + ETA). This requires a special coding technique called MU modeling for optimal performance.
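The MU-modeling structure can be sketched like this (log-normal clearance shown; names are illustrative):

```
$PK
MU_1 = THETA(1)            ; THETA only defines the mean of PHI(1)
CL   = EXP(MU_1 + ETA(1))  ; PHI(1) = MU_1 + ETA(1)
```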
There are essentially three EM methods available in NONMEM:
- ITS (Iterative Two-Stage)
- IMP (Importance sampling, of which IMPMAP is a special case)
- SAEM (Stochastic Approximation EM)
ITS stands for Iterative Two Stage. It is considered deterministic, as opposed to the other EM methods.

ITS is specified by:
```
$ESTIMATION METHOD=ITS INTERACTION
            NITER=50   ; (1)
```

1. `NITER` (default 50) sets the maximum number of iterations.
Normally, it is used as a rapid exploratory pre-analysis method to facilitate parameter estimation with IMP or SAEM.
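As a sketch, ITS as a pre-analysis step feeding its results to IMP could look like this (values are illustrative):

```
$ESTIMATION METHOD=ITS INTERACTION NITER=50              ; fast, deterministic pre-fit
$ESTIMATION METHOD=IMP INTERACTION NITER=200 ISAMPLE=300
```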
NONMEM has a direct Monte Carlo sampling method, specified with `METHOD=DIRECT`. This method is the purest method for performing E-M, in that it creates completely independent samples, and there is no chance of causing bias if the sampling density is not similar enough to the conditional density (unlike IMP). However, it is very inefficient, requiring `ISAMPLE` values of 10000–300000 to properly estimate the problem.
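A minimal sketch of a direct-sampling record (values are illustrative):

```
$ESTIMATION METHOD=DIRECT INTERACTION ISAMPLE=10000 NITER=10
```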
IMP stands for IMPortance sampling. The general IMP algorithm goes like this:

1. A MAP estimation (if `MAPITER=1`) is performed to estimate the parameters of the proposal density (aka "sampling distribution", q(x)). This can be replaced by Monte Carlo direct sampling, `METHOD=DIRECT` (see Section 3.0.2.2).
2. E-step: For every individual, ETAs are sampled `ISAMPLE` times from the proposal density. This gives the mean and variance of the ETAs that are most likely given the data.
3. M-step: Population parameters are updated.

The algorithm goes through this E-M loop (steps 2–3) `NITER` times, or until the convergence criterion is met.
IMP options:
```
$ESTIMATION METHOD=IMP INTERACTION
            NITER=1000                 ; (1)
            ISAMPLE=300                ; (2)
            ISAMPEND=10000 STDOBJ=10   ; (3)
            MAPITER=1                  ; (4)
            MAPINTER=-1                ; (5)
            IACCEPT=0.0 IACCEPTL=0.01  ; (6)
            DF=0                       ; (7)
            ISCALE_MIN=0.1 ISCALE_MAX=10              ; (8)
            CTYPE=1 CINTERVAL=10 CITER=10 CALPHA=0.05 ; (9)
```
1. `NITER`: Integer. Sets the maximum number of iterations (default 50). Usually about 50–200 iterations are required to approach the maximum likelihood population parameters [4]. Synonymous with `NSAMPLE`.
2. `ISAMPLE`: Integer. Number of random samples per subject used for the E-step (default 300). Usually 300 is sufficient, but 1000–3000 may be required for very sparse data, or when you want an OFV with low Monte Carlo noise.
3. `ISAMPEND` and `STDOBJ`: If `ISAMPEND` is greater than `ISAMPLE`, and `STDOBJ` > 0.0, then samples per subject will adjust between `ISAMPLE`–`ISAMPEND`, until the stochastic standard deviation of the OFV falls below `STDOBJ`.
4. `MAPITER=n`: Integer. For the first n iterations, MAP is used to assess parameters for the proposal density. After this, the Monte Carlo conditional means and variances of the previous iteration are used for the proposal density parameters of the present iteration.
5. `MAPINTER=n`: Integer. Every nth iteration, MAP is used to provide parameters to the proposal density. If `MAPINTER=-1`, then MAP will be turned on only if the OFV increases consistently over several iterations.
6. `IACCEPT` and `IACCEPTL`: Real. Lower `IACCEPT` to expand the proposal density (default 0.4). For very sparse data or highly nonlinear posterior densities (such as with categorical data), you may want to decrease `IACCEPT` to 0.1–0.3. `IACCEPTL` scales a second multivariate normal distribution, to cover long tails in the posterior density (hence "L" for "long tails"). If `IACCEPT=0.0`, the `DF` and `IACCEPT` level will be automatically determined for each subject (output in `root.imp`). If `IACCEPT=0.0` and `IACCEPTL` > 0, then a search for the best `IACCEPTL` for each subject is made, starting the testing at the `IACCEPTL` value given by the user, while `IACCEPT` is fixed to 1.
7. `DF`: Integer. Degrees of freedom for the proposal density. The default `DF=0` is a normal distribution; otherwise it is the d.f. for a t-distribution. For very sparse data or highly non-linear posterior densities (such as with categorical data), you may want to set `DF` to somewhere between 2–10.
8. `ISCALE_MIN` and `ISCALE_MAX`: Real. Boundaries for the scale factor used to vary the size of the variance of the proposal density in order to meet the `IACCEPT` condition. On very rare occasions, the OFV varies widely, and these options may need to be narrowed (e.g., `ISCALE_MIN=0.3 ISCALE_MAX=3`). Remember to revert these parameters to default operation on the next `$EST` step: `ISCALE_MIN=-100 ISCALE_MAX=-100`. The default is `ISCALE_MIN=0.1 ISCALE_MAX=10` for importance sampling.
9. See Section 3.0.2.4.
AUTO=1

```
$ESTIMATION METHOD=IMP AUTO=1
```

is equivalent to:

```
$ESTIMATION METHOD=IMP INTERACTION
            CTYPE=3 NITER=500 ISAMPLE=300 STDOBJ=10 ISAMPEND=10000
            NOPRIOR=1                        ; (1)
            CITER=10 CINTERVAL=1 CALPHA=0.05
            IACCEPT=0.0 ISCALE_MIN=0.1 ISCALE_MAX=10 DF=0
            MCETA=3                          ; (2)
            EONLY=0 MAPITER=1 MAPINTER=-1
```
1. `NOPRIOR` determines whether you want to use (`0`) prior information for this estimation record or not (`1`). Normally, if you specify a prior you want to use it, so set `NOPRIOR=0`.
2. `MCETA` works the same during MAP as it does for the MLE methods.
AUTO=2

`AUTO=2` is the same as `AUTO=1`, with these additional options:

1. `GRDQ`: The gradient quick option. Used for THETAs that are not MU-modeled, for SIGMAs, and for THETAs that model RUV.
2. `DERCONT`: Derivative continuity. `DERCONT=1` might be helpful in cases of extreme discontinuity, when varying certain THETAs by even just a small amount results in a large change in OFV.
Monte Carlo noise
Remember, EM algorithms don’t find a single set of parameters that are the best fit. They find a distribution of parameter sets that maximize the expectation. Often those parameter sets are very similar and only differ in the 6th or 7th decimal place. But sometimes in complex8 models with poorly estimated parameters they can differ by larger numbers.
The iterations for the initial `ITS` step seem, in this case, to be quite stable, with some artifacts:
$ESTIMATION METHOD=ITS
INTERACTION
LAPLACE
NITER=200
SIG=3
SIGL=6
NOHABORT
CTYPE=3
NUMERICAL
SLOW
$ESTIMATION METHOD=IMP
INTERACTION
LAPLACE
ISAMPLE=1000
NITER=1000
SIG=3
SIGL=6
NOHABORT
CTYPE=3
IACCEPT=0.4
MAPITER=0
RANMETHOD=3S2P
$COVARIANCE UNCONDITIONAL MATRIX=S TOL=12 SIGL=12 SLOW
...
iteration 175 OBJ= 4693.4674554341409
iteration 176 OBJ= 4694.2296104065535
iteration 177 OBJ= 4693.7753507970829
iteration 178 OBJ= 4693.9600270372885
iteration 179 OBJ= 4693.5732455834705
iteration 180 OBJ= 4693.6386423202493
iteration 181 OBJ= 4693.6215390721527
iteration 182 OBJ= 4693.6006496138452
iteration 183 OBJ= 4693.7877620448235
iteration 184 OBJ= 4694.1591757809929
iteration 185 OBJ= 4693.2614956897451
iteration 186 OBJ= 4693.5641640401127
iteration 187 OBJ= 4693.5575289919379
iteration 188 OBJ= 4495.6489907149398
iteration 189 OBJ= 4693.7711764252363
iteration 190 OBJ= 4693.6281175153035
iteration 191 OBJ= 4694.1171774559862
iteration 192 OBJ= 4693.7908707845536
iteration 193 OBJ= 4693.7709264605819
iteration 194 OBJ= 4495.9262902940209
iteration 195 OBJ= 4693.3321354894242
iteration 196 OBJ= 4694.3177205227348
iteration 197 OBJ= 4694.1301486616576
iteration 198 OBJ= 4694.2898587322170
iteration 199 OBJ= 4693.8304358341920
iteration 200 OBJ= 4691.6818293505230
#TERM:
OPTIMIZATION WAS NOT COMPLETED
Note the small "spikes" in the OFV (e.g., at iterations 188 and 194).
The IMP step can be less stable:
iteration 120 OBJ= 4314.8310660241377 eff.= 446. Smpl.= 1000. Fit.= 0.96389
iteration 121 OBJ= 4326.9079856676717 eff.= 448. Smpl.= 1000. Fit.= 0.96409
iteration 122 OBJ= 4164.6649529423103 eff.= 479. Smpl.= 1000. Fit.= 0.96392
iteration 123 OBJ= 4299.9887619753636 eff.= 432. Smpl.= 1000. Fit.= 0.96395
iteration 124 OBJ= 4303.9571213327054 eff.= 399. Smpl.= 1000. Fit.= 0.96349
iteration 125 OBJ= 4328.9835950930074 eff.= 417. Smpl.= 1000. Fit.= 0.96423
iteration 126 OBJ= 4304.3861595488252 eff.= 550. Smpl.= 1000. Fit.= 0.96392
iteration 127 OBJ= 4291.0862736663648 eff.= 422. Smpl.= 1000. Fit.= 0.96430
iteration 128 OBJ= 4326.2378678645500 eff.= 407. Smpl.= 1000. Fit.= 0.96409
iteration 129 OBJ= 4157.5352046539456 eff.= 406. Smpl.= 1000. Fit.= 0.96404
iteration 130 OBJ= 4332.6894073732456 eff.= 399. Smpl.= 1000. Fit.= 0.96399
iteration 131 OBJ= 4357.5343346793761 eff.= 493. Smpl.= 1000. Fit.= 0.96414
Convergence achieved
iteration 131 OBJ= 4336.1893012015007 eff.= 417. Smpl.= 1000. Fit.= 0.96369
#TERM:
OPTIMIZATION WAS COMPLETED
It is not unusual to see this large Monte Carlo noise in the OFV estimate (e.g., at iterations 122 and 129). It indicates that the number of samples (`ISAMPLE`) may not be high enough. Consider restricting `ISCALE_MIN` and `ISCALE_MAX` as well.
During estimation with IMP, the iteration output contains:

- `OBJ`: The objective function value
- `eff.`: Average effective number of samples
- `Smpl.`: Actual number of samples (usually `ISAMPLE`)
- `Fit.`: Average fitness

`Fit.` tells you how normal the posterior is. If `Fit.` is approximately 1, you might as well use FOCE. If `Fit.` is more like 0.6, then IMP is a good choice. It is the average absolute difference between the posterior density and the proposal density among the samples, and indicates to what extent the proposal density matches the posterior density.

The scaling of the variance of the proposal density is adjusted (limited by `ISCALE_MIN` and `ISCALE_MAX`) so that `eff.` divided by `Smpl.` is approximately equal to `IACCEPT`. If you find that this is not the case, adjust `IACCEPT` or the `ISCALE` boundaries.
`METHOD=IMPMAP` is equivalent to `METHOD=IMP MAPITER=1 MAPINTER=1`. It is thus like IMP, with the added assistance of MAP estimation at every E-step.
SAEM stands for Stochastic Approximation Expectation-Maximization. Like IMP, random samples are generated from normal proposal densities. However, instead of always being centered at the mean or mode of the posterior density, the proposal density is centered at the previous sample position. The general SAEM algorithm goes like this:

1. "Stochastic", "exploratory", or "burn-in" phase (iterations set by `NBURN`):
   - Sampling of the ETA proposal density
   - Population parameters are updated based on the averaging of ETAs from the previous step in this phase.
2. "Accumulation" or "smoothing" phase (iterations set by `NITER`):
   - Sampling of the ETA proposal density (as in the burn-in phase)
   - Population parameters are updated based on the averaging of ETAs from all previous steps in this phase.
This method is useful for highly non-normal posterior densities9.
During the exploratory phase, SAEM acts like a searchlight that eventually oscillates around the maximum likelihood. Thus, SAEM does not need as large a starting density ("umbrella") as IMP.
The size of the explored space depends on the standard deviations of the random effects and the residual error. Exploring a large space increases the chances of finding the global maximum. Thus, it is recommended to choose large initial values for the OMEGAs and the SIGMAs.
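A common pattern (sketched here with illustrative values) is SAEM with generous initial variability, followed by an IMP evaluation step (`EONLY=1`) to obtain an OFV suitable for model comparison:

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=1000 PRINT=10
$ESTIMATION METHOD=IMP INTERACTION EONLY=1 NITER=5 ISAMPLE=1000 MAPITER=0
```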
SAEM options:

```
$ESTIMATION METHOD=SAEM AUTO=1
```

or, fully specified:

```
$ESTIMATION METHOD=SAEM INTERACTION
            NBURN=500                ; (1)
            NITER=1000               ; (2)
            MCETA=0                  ; (3)
            ISAMPLE=2                ; (4)
            ISAMPEND=10              ; (5)
            ISAMPLE_M1=2             ; (6)
            ISAMPLE_M1A=0            ; (7)
            ISAMPLE_M1B=2            ; (8)
            ISAMPLE_M2=2             ; (9)
            ISAMPLE_M3=2             ; (10)
            IACCEPT=0.4              ; (11)
            ISCALE_MIN=1.0E-06 ISCALE_MAX=1.0E+06  ; (12)
            NOCOV=0                  ; (13)
            DERCONT=0                ; (14)
            CONSTRAIN=1              ; (15)
```
1. `NBURN`: Maximum iterations for the exploratory phase. Should be high enough for convergence to occur. Default 1000.
2. `NITER`: Maximum iterations for the smoothing phase. Default 1000.
3. `MCETA`: If greater than `ISAMPLE_M1`, then for the first iteration of SAEM, `ISAMPLE_M1`, `ISAMPLE_M2`, and `ISAMPLE_M3` will be set to `MCETA`, to facilitate a robust initial search for reasonable parameter values.
4. `ISAMPLE`: Integer. Number of samples per individual. More samples means less stochastic noise. There should be at least 50 total samples per iteration. For datasets with more than 50 individuals, `ISAMPLE=1` is usually fine.
5. `ISAMPEND`: Integer. NONMEM will perform 200 iterations to determine the best `ISAMPLE` value, with `ISAMPEND` as an upper bound.
6. `ISAMPLE_M1`: Number of "mode 1" iterations per `ISAMPLE`. Should usually be left at the default (2).
7. `ISAMPLE_M1A`: Number of "mode 1A" iterations per `ISAMPLE`. Should usually be left at the default (0).
8. `ISAMPLE_M1B`: Number of "mode 1B" iterations per `ISAMPLE`. Should usually be left at the default (2).
9. `ISAMPLE_M2`: Number of "mode 2" iterations per `ISAMPLE`. Should usually be left at the default (2).
10. `ISAMPLE_M3`: Number of "mode 3" iterations per `ISAMPLE`. Should usually be left at the default (2).
11. `IACCEPT`: Goal fraction of samples to be accepted. Should usually be left at the default (0.4).
12. `ISCALE_MIN` and `ISCALE_MAX`: Boundaries for the scale factor used to vary the size of the variance of the proposal density in order to meet the `IACCEPT` condition. On very rare occasions, the OFV varies widely, and these options may need to be narrowed (e.g., `ISCALE_MIN=0.001 ISCALE_MAX=1000`). Remember to revert these parameters to default operation on the next `$EST` step: `ISCALE_MIN=-100 ISCALE_MAX=-100`. The default is `ISCALE_MIN=1.0E-06 ISCALE_MAX=1.0E+06` for MCMC sampling.
13. `NOCOV`: If covariance estimation is not desired for a particular estimation step, set `NOCOV=1`. Normally, after SAEM you perform a refinement step (Section 3.0.2.5), so `NOCOV=1` can be set during SAEM and changed back to `NOCOV=0` for the IMP refinement.
14. `DERCONT`: Derivative continuity. Default 0. `DERCONT=1` might be helpful in cases of extreme discontinuity, when varying certain THETAs by even just a small amount results in a large change in OFV.
15. `CONSTRAIN`: A built-in simulated annealing algorithm. It starts the OMEGAs at 1.5 times the initial values, and then controls the rate at which the OMEGAs shrink during each iteration. `CONSTRAIN=7` performs simulated annealing on both OMEGA (zero and non-zero) and SIGMA parameters. To mimic Monolix, set OMEGA and additive SIGMA initial values to 0.66, and proportional SIGMA to 0.22. Default 1.
AUTO=1

```
$ESTIMATION METHOD=SAEM AUTO=1
```

is equivalent to:

```
$ESTIMATION METHOD=SAEM INTERACTION
            CTYPE=3 NITER=1000 NBURN=4000 ISAMPEND=10
            NOPRIOR=1 CITER=10 CINTERVAL=0 CALPHA=0.05
            IACCEPT=0.4 ISCALE_MIN=1.0E-06 ISCALE_MAX=1.0E+06
            ISAMPLE_M1=2 ISAMPLE_M1A=0 ISAMPLE_M1B=2
            ISAMPLE_M2=2 ISAMPLE_M3=2
            CONSTRAIN=1 EONLY=0 ISAMPLE=2 MCETA=0
```
AUTO=2

`AUTO=2` is the same as `AUTO=1`, with additional options.
Initial estimates
It is recommended to set initial variability estimates high in order for SAEM to have the possibility to explore the domain.
Without annealing:

```
$OMEGA 1.0
$SIGMA 1.0 ; additive
$SIGMA 0.3 ; proportional
```

With annealing:

```
$OMEGA 0.66
$SIGMA 0.66 ; additive
$SIGMA 0.22 ; proportional
```
By default, no MAP estimation is performed with SAEM (`MAPITERS=0`). To get good individual parameter values near the mode of the posterior density for the first iteration of SAEM, you can set `MAPITERS=1`.
The EM-methods are already quite robust in that they can handle many different estimation problems.
Use full OMEGA blocks
Full OMEGA blocks are generally preferred in EM-based estimation, as EM handles them more reliably than FOCE [4]. Including off-diagonal elements does not compromise stability; on the contrary, off-diagonals allow the algorithms to search the parameter space more efficiently. Off-diagonals should therefore not be fixed to 0 just because of high RSEs (>100%) (as you would with FOCE). If they were fixed in a previous FOCE run due to rounding errors or failed `$COVARIANCE` steps, they should be estimated in EM. Only consider fixing off-diagonals back to 0 if the EM OFV is highly variable, or if the covariance matrix is non-positive-definite.
For uncertain initial estimates
- The PsN program parallel_retries can be useful.
- METHOD=CHAIN: creates random initial values of THETAs and OMEGAs. Applies only to the estimation step.
- $CHAIN: introduces initial THETAs, OMEGAs, and SIGMAs that cover the entire scope of a given problem.
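A minimal sketch of the chain approach; the file name and option values below are illustrative assumptions:

```
; Run 1: write 10 randomized initial-estimate sets to a file
$ESTIMATION METHOD=CHAIN FILE=run1chain.txt NSAMPLE=10 ISAMPLE=0

; Run 2: read set number 3 from that file, then estimate from it
$ESTIMATION METHOD=CHAIN FILE=run1chain.txt NSAMPLE=0 ISAMPLE=3
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=500
```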
Monte Carlo Direct Sampling can help IMP if the first iteration fails due to MAP issues. Here a few iterations of DIRECT define the sampling distribution for IMP:
$ESTIMATION METHOD=DIRECT INTERACTION ISAMPLE=10000 NITER=3
$ESTIMATION METHOD=IMP INTERACTION ISAMPLE=1000 NITER=50 MAPITER=0

MAPITER=0 tells IMP to skip its own MAP search and instead use the sampling distribution that DIRECT just estimated.
If using SAEM: MCETA
With SAEM: If the option MCETA
is set to greater than ISAMPLE_M1
, then for the first iteration of SAEM, ISAMPLE_M1
, ISAMPLE_M2
, and ISAMPLE_M3
will be set to MCETA
, to facilitate a robust initial search for reasonable parameter values.
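With that in mind, a robust SAEM start using MCETA could be sketched as follows (the value 100 and the NBURN/NITER values are illustrative assumptions):

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=500 MCETA=100
```

Since MCETA=100 exceeds the default ISAMPLE_M1=2, the first SAEM iteration will use 100 samples in each of its M1–M3 sampling steps.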
If using SAEM: $ANNEAL
If you run SAEM and suspect that your initial values might be far from the solution, you could consider simulated annealing. By default, NONMEM’s gradient-based methods behave nicely when you start close to the solution. When you don’t, annealing is more suitable. Here’s the intuition:
- Pick a generous starting value for OMEGAs (e.g., 1).
- Each iteration, those values shrink.
- Eventually the OMEGAs will reach 0 (as intended).
Use $ANNEAL
to set the initial diagonal OMEGA values, and $ESTIMATION CONSTRAIN=n
to say what gets annealed.
For example, $ANNEAL 1-3,5:0.3 6,7:1.0 will set initial values of OMEGA(1,1), OMEGA(2,2), OMEGA(3,3), and OMEGA(5,5) to 0.3, while initial OMEGA(6,6) and OMEGA(7,7) are set to 1.0.
Values for CONSTRAIN=n
are:
- 0 or 4: No annealing
- 1 or 5: Anneal OMEGA. Default is 1.
- 2 or 6: Anneal SIGMA
- 3 or 7: Anneal OMEGA & SIGMA
When CONSTRAIN
≥ 4, annealing is also applied to diagonal OMEGAs fixed to 0.
You can even code a custom annealing algorithm (see below)! For example, one use would be to slow the rate of reduction of the diagonal elements of the OMEGA values during the burn-in phase of the SAEM method (as is done in Monolix). This algorithm was derived from a template of CONSTRAINT.f90
in the ..\source
directory.
Show FORTRAN code for custom algorithm
File custom_anneal.f90
SUBROUTINE CONSTRAINT(THETAS, NTHETAS, SIGMA2, NSIGMAS, OMEGA, NOMEGAS, ITER_NO)
  USE SIZES, ONLY: ISIZE, DPSIZE
  IMPLICIT NONE
  INCLUDE '../nm/TOTAL.INC'
  INTEGER(KIND = ISIZE) NTHETAS, NSIGMAS, NOMEGAS, ITER_NO
  INTEGER :: I, J, ITER_OLD
  DATA ITER_OLD /-1/
  REAL(KIND = DPSIZE) :: OMEGA(MAXOMEG,MAXOMEG), THETAS(MAXPTHETA), SIGMA2(MAXPTHETA)
  REAL(KIND = DPSIZE) :: OMEGO(MAXOMEG)
  SAVE
  !---------------------------------------------------------------------------
  IF (SAEM_MODE == 1 .AND. IMP_MODE == 0 .AND. ITS_MODE == 0 .AND. ITER_NO < 200) THEN
    IF (ITER_NO /= ITER_OLD .OR. ITER_NO == 0) THEN
      ! During the burn-in phase of SAEM, when a new iteration occurs
      ! (ITER_OLD /= ITER_NO), store the present diagonals of OMEGA
      ITER_OLD = ITER_NO
      DO I = 1, NOMEGAS
        OMEGO(I) = OMEGA(I,I)
      ENDDO
    ENDIF
    IF (ITER_NO /= 0) THEN
      DO I = 1, NOMEGAS
        ! Monolix-style algorithm to "slow down" the reduction of OMEGA:
        ! each diagonal may shrink by at most 5% per iteration
        OMEGA(I,I) = MAX(OMEGA(I,I), 0.95D0 * OMEGO(I))
      ENDDO
    ENDIF
  ENDIF
  RETURN
  !---------------------------------------------------------------------------
END SUBROUTINE CONSTRAINT
$SUBROUTINES OTHER=custom_anneal.f90 ; custom annealing implemented via OTHER

$PK
MU_1 = THETA(1)
EMAX = EXP(MU_1 + ETA(1))
MU_2 = THETA(2)
ED50 = EXP(MU_2 + ETA(2))
IPRED = EMAX * DOSE / (ED50 + DOSE)

$THETA 4.1 ; 1. EMAX
$THETA 6.9 ; 2. ED50
$OMEGA 0 FIX
$OMEGA 0.1
$ANNEAL 1:0.3
$ESTIMATION METHOD=SAEM INTERACTION CONSTRAIN=5
$ESTIMATION METHOD=IMP INTERACTION EONLY=1 CONSTRAIN=0 ; turn annealing off for the refinement step
THETAs with no ETAs
One alternative is to use ETAs anyway but fix the OMEGA to a low value such as 0.0025, which corresponds to roughly 5% CV for log-normally distributed parameters. With this method, the EM algorithm can be used, and the variability is kept small.
Another option would be to implement $ANNEAL
with the CONSTRAIN
option set to ≥4 in $ESTIMATION
.
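A minimal sketch of the first alternative, assuming THETA(3) is a parameter without true IIV:

```
$PK
MU_3 = THETA(3)
KA   = EXP(MU_3 + ETA(3)) ; ETA(3) exists only so that THETA(3) can be MU-modeled

$OMEGA 0.0025 FIX ; ~5% CV for a log-normal parameter; variability kept small
```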
Convergence test
As stated before, you need to set up a criterion for when the EM method should stop (aka “converge” or “terminate”). The convergence test sets up a linear regression over recent iterations and tests whether the slope is not statistically different from zero.
Show convergence test options
- CTYPE: Once the OFV is stationary, the analysis has often converged statistically, meaning CTYPE=1 is enough.
- CINTERVAL=n: Integer. Every n-th iteration is saved for the convergence test (see CITER). If CINTERVAL=0, then NONMEM will decide10.
- CITER=n: Integer. Number of latest saved iterations (see CINTERVAL) on which to perform the regression. CITER=10 is the default. Also coded as CNSAMP.
- CALPHA=x: Real, 0.01–0.05. Alpha error rate for the linear regression test of statistical significance. The default value is 0.05.
IMP: IMP iterations tend to be statistically uncorrelated, and so it is reasonable to have CITER=10
consecutive iterations (CINTERVAL=1
) tested at CALPHA=0.05
.
SAEM: Iterations can be highly correlated. To properly detect a lack of change in parameters, you may want to test every 10–100th iteration (CINTERVAL
=10 to 100), so that the linear regression on parameter change is spread out over a larger segment of iterations. To mimic Monolix, set CTYPE=3
, CINTERVAL=15
, and CITER=10
.
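Put together, a Monolix-like SAEM convergence test could look like this (the NBURN/NITER values are illustrative placeholders):

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=4000 NITER=1000
            CTYPE=3 CINTERVAL=15 CITER=10 CALPHA=0.05
```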
Give a rather high value to NBURN
(1000–10000 for SAEM/BAYES) or NITER
(200–1000 for ITS/IMP/IMPMAP) so you don’t run out of iterations before the method converges.
An alternative method to convergence testing is to set NBURN
to a very high number (10000), monitor the change in SAEMOBJ
, and enter ctrl-K (see section I.15 Interactive Control of a NONMEM batch Program) when you feel that the variations are stationary, which will end the burn-in mode and continue on to the statistical/accumulation mode.
It is better to provide a large NBURN
number, and end it at will with ctrl-K, or allow the convergence tester to end it, rather than to have a small NBURN
number and have the burn-in phase end prematurely.
Refinement step
When using EM methods, it is advised to include a “refinement step” in the form of an additional subsequent $ESTIMATION METHOD=IMP EONLY=1
record to obtain more precise final estimates. This will take the population parameter estimates and calculate a more precise objective function value (just the E-step).
Show refinement step example
1. The higher number of samples (3000–10000) should give a more stable result (although the run time of each iteration will increase significantly).
2. The number of iterations should be low, 5–10. IMP EONLY=1 needs a few iterations to allow the proposal density to become stationary. This is observed as the OFV stabilizing (oscillating) around a particular value. Assessing the mean and variance of the iterated OFV will give a more accurate estimate of the final OFV11.
3. Perform only the E-step (do not update population parameters).
4. To maintain continuity of the estimation from the previous step.
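A sketch consistent with the annotations above (option values are illustrative, and using MAPITER=0 as the continuity option is an assumption):

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=500 ; main estimation
$ESTIMATION METHOD=IMP INTERACTION                       ; refinement step
            ISAMPLE=3000 ; (1) more samples for a stable OFV
            NITER=10     ; (2) few iterations, until the proposal density is stationary
            EONLY=1      ; (3) E-step only; population parameters are not updated
            MAPITER=0    ; (4) assumed: reuse the previous step's results for continuity
```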
Random number generator (RNG)
NONMEM’s internal RNG is controlled with RANMETHOD=[n|M|S|m|P]
.
The “Knuth” RNG is the default in $ESTIMATION
, specified as RANMETHOD=3
(n=3
). To reduce sampling noise, use the Sobol method (S
, RANMETHOD=3S
) with Faure-Tezuka type scrambling (m=2
, RANMETHOD=3S2
). That added stability often lets you reach reliable results with IMP using a smaller ISAMPLE.
When RANMETHOD=S
is combined with a t-distribution (DF>0), DF
should be limited to 1, 2, 4, 5, 8, or 10. However, discretizing the d.f. does not work if you use IACCEPT=0.0
to automatically determine the DF
.
In this case, use IACCEPTL
as described for IMP instead of DF
.
Running NONMEM in parallel? Append P
so every subject keeps its own seed path, independent of the number of processes: RANMETHOD=P
.
So, the final recommended RANMETHOD
for IMP becomes: RANMETHOD=3S2P
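Combining the pieces, an IMP record with the recommended RNG settings might be sketched as follows (the ISAMPLE/NITER values are illustrative):

```
$ESTIMATION METHOD=IMP INTERACTION ISAMPLE=1000 NITER=50 RANMETHOD=3S2P
```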
Seed for RNG
Using the SEED
option is a requirement for reproducible results. You can use any SEED number; the default is SEED=14455
. It’s also advisable to keep the same SEED throughout your analysis, i.e., in all the models, so that differences in model fit reflect the models rather than stochastic happenstance.
Parallel NONMEM
For stochastic methods (MCETA, DIRECT, IMP, IMPMAP, and SAEM), the distribution of subjects to different nodes for computation depends on the number of processors. This leads to differences in the sequences of random numbers generated during computation, which can make the results dependent on the number of processors.
Moreover, with the parallel processing option PARSE_TYPE=4
, the results of the stochastic methods may not be exactly reproducible even when repeated on the same number of processors, because the distribution of the subjects to each processor may depend on the computer load. The option PARSE_TYPE=1
(equal number of subjects at each node) guarantees the results to be independent of the computer load. Still, the results will depend on the number of used processors.
Copy this file: /opt/local64/nonmem/nm_7.4.4_g510/run/mpilinux8.pnm
into my_own_parafile.pnm
and put it somewhere reasonable. Then change PARSE_TYPE
to 1:
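The relevant record in the copied parafile might then look roughly like this (a sketch based on the $GENERAL record found in NONMEM’s template .pnm files; the surrounding options are left as in the template):

```
$GENERAL
NODES=[nodes] PARSE_TYPE=1 PARSE_NUM=200 TIMEOUTI=600 TIMEOUT=10000 PARAPRINT=0 TRANSFER_TYPE=1
```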
When running models, use your own parafile: -parafile=my_own_parafile.pnm
- MAP estimation is always parallelized properly.
- Always use RANMETHOD=P.
- Reducing ATOL in $SUBROUTINES (usually set equal to TOL) may improve speed.
- Keeping ISAMPLE low makes for faster iterations; however, it will result in higher stochastic noise.
- MU-model as many THETAs as possible. Even if a parameter doesn’t have IIV, MU-model it anyway. Sometimes it is impossible to MU-model a THETA; if you can, reparameterize so that it can be MU-modeled, or keep a maximum of three non-MU-modeled THETAs.
When to use which
If you are unsure which method to use, start with MLE methods like FOCE or LAPLACE, then turn to EM.
The reason is simple: with MLE there is one correct answer, a single best set of parameters for a dataset given a statistical model, whereas EM methods can yield several answers (i.e., parameter estimates). This makes for better reproducibility of the model fit and less ambiguity about the “best fit”.
With that said, Table 2 provides some rules of thumb for when to consider which estimation method.
| Scenario | FOCE | LAPLACE | IMP | IMPMAP | SAEM |
|---|---|---|---|---|---|
| Very sparse data | | | | | x |
| Sparse data | | x | x | x | x |
| Rich data | x | x | x | x | x |
| Very rich data | x | x | x | | |
| Categorical data | | x | | | x |
| Simple models | x | x | | | |
| Complex models | | | x | x | x |
For simple one- and two-compartment PK models, FOCE is the preferred method, whereas more complex models and/or very sparse datasets typically benefit from EM methods [5–7]. In general, EM tends to perform better than MLE in the following situations:
Relatively complex models, such as:
- Parent-metabolite
- Non-linear clearance
- Target-mediated drug disposition (TMDD) [1]
- Mixture models ($MIX)
- Many covariates (like during FFEM)
- Multiple PD endpoints or sequential PD biomarkers
- Categorical endpoints
- Count data
- Binary data (consider SAEM)
Data limitations, such as:
- Sparse or incomplete datasets
- Lack of observations across the full dose–response range
- Insufficient PK data to identify all compartments
Think about “sparse data” as missing important parts of the picture. Perhaps we miss the peak concentration or don’t have high enough doses to see the top of the dose-response curve. In that situation, classic MLE can have a tough time because the likelihood surface is poorly defined.
That’s where EM shines ✨. EM fills in the gaps of the missing data, so the parameter estimates stay much more stable than with MLE.
However, EM is not a magic fix. If you’re forcing the wrong model onto the data, say, fitting a two-compartment PK model when the observations clearly behave as one compartment, the problem is your model, not the algorithm.
That’s different from using an EM algorithm on a complex model where you know that you have PKPD response, but your dose or exposure range is limited.
OFVs between MLE and EM methods are incomparable.
Issuing Multiple Estimations within a Single Problem
A sequence of two or more $ESTIMATION
statements within a given problem will result in the sequential execution of separate estimations. For example:
$THETA 0.3 0.5 6.0
$OMEGA 0.2 0.2 0.2
$SIGMA 0.2
$ESTIMATION METHOD=0 MAXEVAL=9999 NSIG=3 ; first estimation step
$ESTIMATION METHOD=CONDITIONAL NSIG=4    ; second estimation step
NONMEM will first estimate the problem by the first-order method, using as initial parameters those defined by the $THETA, $OMEGA, and $SIGMA statements. Next, the first-order conditional estimation method will be run, using as initial parameters the final estimates of THETA, OMEGA, and SIGMA from the previous analysis.
Up to 20 estimations may be performed within a problem.
References
Footnotes
1. “ML” methods (Machine Learning) are something else entirely.
2. In pharmacometrics, also called “Mode a Posteriori”. Essentially the same, since the mode is the maximum of a unimodal distribution.
3. An alternative method is to use the $TOL record, where options for each compartment may be entered.
4. Credit: Adrian Dunne
5. Credit: Robert Bauer
6. Credit: Robert Bauer
7. Credit: Robert Bauer
8. Complex PK/PD problems are those that require ODEs, or problems with more than six model parameters to be estimated.
9. Very sparse or categorical data.
10. I frequently use this.
11. Credit: Jon Moss