A Julia interface to Stan's cmdstan executable

Stan.jl v9

Stan is a system for statistical modeling, data analysis, and prediction. It is extensively used in social, biological, and physical sciences, engineering, and business. The Stan language and the interfaces to execute a Stan language program are documented here.

Stan.jl illustrates how the packages available in StanJulia's ecosystem wrap the methods available in Stan's cmdstan executable.

Stan.jl v9 uses StanSample.jl v6, StanOptimize.jl v4, StanQuap.jl v4, StanDiagnose.jl v4 and StanVariational v4 and supports multithreading on C++ level. Stan.jl v9 also uses JSON.jl to generate data and init input files for cmdstan.

The StanJulia ecosystem includes 2 additional packages, StanQuap.jl (to compute MAP estimates) and DiffEqBayesStan.jl.

Options for multi-threading and multi-chaining

Stan.jl v9 is intended to use Stan's cmdstan v2.35.0+ and StanSample.jl v6.

StanSample.jl v6 enables the use of c++ multithreading in the cmdstan binary. To activate multithreading in cmdstan this needs to be specified during the build process of cmdstan. I typically create a path_to_cmdstan_directory/make/local file (before running make -j9 build) containing STAN_THREADS=true. For an example, see the .github/CI.yml script

This means StanSample supports 2 mechanisms for in parallel drawing samples for chains, i.e. on C++ level (using C++ threads) and on Julia level (by spawing a Julia process for each chain).

The use_cpp_chains keyword argument for stan_sample() determines if chains are executed on C++ level or on Julia level. By default, use_cpp_chains=false.

By default in ether case num_chains=4. See ??stan_sample. Internally, num_chains will be copied to either num_cpp_chains or num_julia_chains'.

Note: Currently I do not suggest to use both C++ level chains and Julia level chains. Based on use_cpp_chains the stan_sample() method will set either num_cpp_chains=num_chains; num_julia_chains=1 or num_julia_chains=num_chains;num_cpp_chain=1 (the default of use_cpp_chains is false).

Set the check_num_chains keyword argument in the call to stan_sample() to false to prevent above default behavior. See the example in the Examples/RedCardsStudy directory for more details and an example.

Threads on C++ level can be used in multiple ways, e.g. to run separate chains and to speed up certain operations.

StanSample.jl's SampleModel sets the C++ num_threads to 4 but for compatibility with previous versions of StanJulia this is by default (use_cpp_chains=false) not included in the generated command line, e.g. see sm.cmds where sm is your SampleModel.

An example of the possible performance trade-offs between use_cpp_threads, num_cpp_chains and num_julia_chains can be found in the this directory.

StanJulia overview

Stan.jl is part of the StanJulia Github organization set of packages.

The use of the underlying method packages in StanJulia, i.e. StanSample.jl (the primary workhorse package), StanOptimize.jl, StanVariational.jl, StanQuap.jl and DiffEqBayesStan.jl are demonstrated in Stan.jl and in a much broader context in StatisticalRethinking.jl.

Stan.jl is not the only Stan mcmc option in Julia. Other options are PyCall.jl/PyStan and StanRun.jl. In addition, Julia provides other, pure Julia, mcmc options such as DynamicHMC.jl, Turing.jl and Mamba.jl.

On a high level, a typical workflow to use Stan.jl looks like:

using Stan

# Define a Stan language program.
bernoulli = "..."

# Create and compile a SampleModel, an OptimizeModel, etc.:
sm = SampleModel(...)

# Run the compiled Stan languauge program and collect draws:
rc = stan_sample(...)

if success(rc)

  # Retrieve Stan's `stansummary` executable result:
  sdf = read_summary(sm)

  # Display the summary as a DataFrame:
  sdf |> display

  # Extract the draws from the SampleModel and show the schema:
  tbl = read_samples(sm, :table)
  tbl |> display

  # Or converted to a DataFrame

  DataFrame(tbl) |> display

  # See below for reading in the draws directly into a DataFrame.

end

Above workflow returns a StanTable.Table object with all chains appended.

If e.g. a DataFrame (with all chains appended) is preferred:

df = read_samples(sm, :dataframe)

Other options are :namedtuple, :particles, :keyedarray, :dimdata, :mcmcchains, etc. See ?read_samples for more details. Walkthrough and Walkthrough2 show StanSample.jl in action.

References

There is no shortage of good books on Bayesian statistics. A few of my favorites are:

  1. Bolstad: Introduction to Bayesian statistics

  2. Bolstad: Understanding Computational Bayesian Statistics

  3. Gelman, Hill: Data Analysis using regression and multileve,/hierachical models

  4. McElreath: Statistical Rethinking

  5. Kruschke: Doing Bayesian Data Analysis

  6. Lee, Wagenmakers: Bayesian Cognitive Modeling

  7. Betancourt: A Conceptual Introduction to Hamiltonian Monte Carlo

  8. Gelman, Carlin, and others: Bayesian Data Analysis

  9. Causal Inference in Statistics - A Primer

  10. Pearl, Judea and MacKenzie, Dana: The Book of Why

  11. Scott Cunningham: Causal Inference - the mixtapes

Special mention is appropriate for the new book:

  1. Gelman, Hill, Vehtari: Regression and other stories

which in a sense is a major update to item 3. above.