Tags: lanl/vpic

1.2

document v1.2 release in readme

1.1

WIP: Release v1.1 (#45)

* Rc 1pt1 pipeline v2 (#34)

Summary: Internal LANL changes to add support for v16, amongst other details (below). Basis for v1.1 release

* Adding avx2 intrinsic support for store_8x2_tr.

* Fix bug in store_8x2_tr.

* Add intrinsic support for store_8x2_tr and store_8x4_tr.

* Switch to using load_8x2_tr.

* Increase MAX_PIPELINES to 272, the max for Trinity KNL.

* Add support for conditional compilation of support for using Intel VTune.

* Add a compile time option to allow printing more significant digits
in the VPIC timing data.  To select this option, define the
VPIC_PRINT_MORE_DIGITS CPP variable.
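
A compile-time switch of this kind can be sketched as follows. This is only an illustration, not the VPIC timer code: the helper name `format_seconds` and both digit counts are assumptions; only the macro name `VPIC_PRINT_MORE_DIGITS` comes from the note above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the VPIC sources): formats one timing value.
// Building with -DVPIC_PRINT_MORE_DIGITS selects the wider format.
inline std::string format_seconds(double seconds) {
  char buf[64];
#if defined(VPIC_PRINT_MORE_DIGITS)
  // Assumed digit count for the "more digits" build.
  std::snprintf(buf, sizeof(buf), "%.9e", seconds);
#else
  // Assumed digit count for the default build.
  std::snprintf(buf, sizeof(buf), "%.2e", seconds);
#endif
  return std::string(buf);
}
```

Because the choice is made by the preprocessor, the default build pays no run-time cost for the option.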

* Change USE_INTEL_VTUNE to VPIC_USE_VTUNE.  Seems like it would be good
to add a namespace feature to VPIC CPP macros to avoid possible collisions
with other packages that VPIC may potentially use in the future.

* Comment out CELL specific code for now in anticipation that these
remaining references to CELL will be removed like the rest of the CELL
stuff previously.

Add the ability to reverse the order of boot sequence for the communication
layer and the pipelines.  This is accomplished by defining the CPP variable
VPIC_SWAP_MPI_PTHREAD_INIT at compile time.  The default is the original
order where the pthreads are created before MPI is initialized.  The new
reversed order is untested at this time.

* Add build support for printing more digits in output of VPIC timer info.

* Use my special V4 and V8 implementations with manual inlining.

* Make redundant copies of some inlined functions to explore performance issues.

* Separate the load and store transpose functions into load, store
and transpose calls and duplicate for performance study.

* Fix compilation bug.

* Changes to try and isolate a run time bug introduced in last commit.

* Separate out the advance_p v4 and v8 pipeline implementations into
separate files that get conditionally included into advance_p.cc.

Add some more performance diagnostics in an effort to better understand the
performance of advance_p on Haswell and KNL.

* Fix a problem with my include statements.

* Initial start on adding AoSoA support for the particles.

* Adding support for AVX-512 for running on KNL.

* Latest additions to v16 support for advance_p_pipeline_v16 function.

* Fix bugs in V16_PORTABLE implementation that prevented a proper build.

* Fix compilation bug with undefined variables.

* Fix typo in v16_portable header file.

* Fix a problem where the accumulate macros were not expanding properly
by not using them.
Seems pointless to use them if they are only used once.

* Fix a bug where I forgot to load data for particle block b.

* Initial work on the v16 unit test program.  Need this to try debugging
the current problems with v16 support.

* More unit tests for v16 support.

* Attempt to fix a bug in the v16 support.  Must test to see if the fix works.

* Integrate the OpenMP support that was added by Evan Peters into my
development version of VPIC.

* Change DISTRIBUTE macro argument to use 32 particles for the block
size instead of 16.

* Fix grammar in a comment.

* Changes to advance_p_pipeline_v16 so that the particles get processed
in the proper order.

* Properly declare ii_aa and ii_bb in advance_p_pipeline_v16.

* Latest in a variety of changes to add support for v16 and tune the
support for v8.

* Fix some compile problems with v16_avx512.h.

* Fix more compile errors with v16_avx512.h.

* Clean up files some. Add for loop style to V4 implementation.

* Add two different implementations for the V16 portable support.  One uses
a for loop implementation and the other unrolls the for loop implementation.

* Added two versions of the V8 portable implementation to be consistent with V16.

* Also, update the actual v8_portable.h file to be the same as the v0 implementation.

* Make v0 and v1 versions of v4_portable.h consistent with the V8 and V16 support.

* Add some experimental versions of the V8 support to do some performance testing with.

* Add a bare bones avx512 implementation.

* Add some different options for the V4 avx2 support.

* Work on fixing compilation errors for Altivec support using Clang.

* Add some hacks to fix Clang build problems.  Fix correctly later.

* Make splat and shuffle templated functions like other intrinsic versions.

* Add a method to the V16 support that loads and stores data in the
proper format on the first pass for the order of particle processing.

* Add a new V16 method that uses load_16x16_tr_a and store_16x16_tr_a
to load and store particles in the same order as the reference
implementation in a single step.

* Add the new load and store functions to the V16 AVX512 implementation.

* Add AVX512 implementations for several V16 wrapper functions.

* Fix bugs in AVX512 implementation.

* Debug commit to discover what is causing the run time failure with the
AVX512 implementation.

* Format tweak.

* First try at implementing a 16x16 transpose with AVX-512 intrinsics.

* First try at implementing load_16x16_tr and store_16x16_tr with AVX-512 intrinsics.

* Fix a declaration bug.

* Switch back to using Method 1 in advance_p_pipeline_v16.  Method 3 has
problems of some sort.

* Use Method 2 in advance_p_pipeline_v16 so I can performance test it.

* Try an optimization of load_8x8_tr and store_8x8_tr that should use fewer
registers.  If this improves performance, it is an indication that the Intel
compiler cannot make this sort of optimization.

* Add tests for load_16x16_tr_a and store_16x16_tr_a.  Add AVX-512 implementation
of load_16x16_tr_a.

* Fix compilation bug in v16_avx512_v99.h.

* Turn on the AVX-512 version of load_16x16_tr_a for testing.

* Add debug print output.

* Add more debug print output.

* Fix the input to _mm512_permutexvar_ps.

* Turn off debug print statements for performance testing.

* Switch back to using V16 Method 1 to test load_16x16_tr_a.

* Initial implementation of store_16x16_tr_a.

* Fix compilation bug.

* First step to reversing the swizzle action of load_16x16_tr_a.

* Second step to reversing the swizzle action of load_16x16_tr_a.

* Fix bug in debug print statements.

* Fix a bug in second step to reversing the swizzle action of load_16x16_tr_a.

* Third step to reversing the swizzle action of load_16x16_tr_a.

* Fourth step to reversing the swizzle action of load_16x16_tr_a.

* Turn off debug print for performance testing.

* Change the names of load_16x16_tr_a to load_16x16_tr_p and store_16x16_tr_a
to store_16x16_tr_p to reflect the fact that they are used to load particle
data only.

Switch methods for V4 and V8 to use the manually inlined versions for an
experiment on KNL.

* Make the v8 member data union public so I can compile the manually inlined
version of advance_p_pipeline_v8.

* Fix a typo.

* Revert back to the standard methods for advance_p_pipeline_v4 and advance_p_pipeline_v8.

Add portable implementation for load_16x8_tr_p and store_16x8_tr_p including unit test
support.

* Switch to using Method 4 in advance_p_pipeline_v16 for testing.

* Fix a declaration bug.

* Add the AVX-512 implementations for load_16x8_tr_p and store_16x8_tr_p.

* Implement a strategy to minimize use of temporary work vectors in load_16x8_tr_p.

* Implement a strategy to minimize use of temporary work vectors in load_16x16_tr.

* Sync v16_avx512.h with v16_avx512_v99.h.

* Add support in v16_portable.h and v16_portable_v0.h for all the methods contained
in v16_avx512.h.

* Try using #pragma omp simd and #pragma forceinline recursive in v4_portable_v1.h.

* Remove #pragma forceinline recursive.  I was not using it correctly.

* Define and use ALWAYS_VECTORIZE macro.

* Fix my ALWAYS_VECTORIZE macro.

* Add and use ALWAYS_VECTORIZE macro to v4_avx2_v11.h.

* Try using #pragma omp simd in v16_avx512_v10 and v16_avx512_v11 implementations.

* Sync portable versions with AVX-512 version.  Use #pragma omp simd in
v16_portable_v1.h.

* Use #pragma omp simd to force vectorization in v8_avx2_v10.h, v8_avx2_v11.h
and v8_portable_v1.h.

* Change particle block size for AoSoA from 64 to 16.

* Fix some bugs in the portable version. Initial work on finishing up the
v16 vector support.

* Format tweaks.

* Clean up and finish the work that was done on the master_wdn_v8 branch
to prepare for the Trinity KNL Open Science Campaign.

* Turn off v16 support in center_p.cc until I get some other testing finished.

* Turn on the transpose timing test for v4, v8 and v16.

* Add more transpose operations for v4, v8 and v16.

* Try experiment to replace load_16x2_tr with load_16x16_tr.

* Add and experiment with load_16x2_bc and load_16x16_bc.

* Add some debug print statements to see what is going on.

* Revert back to not using simd broadcast.

* Add some debug print statements to see what is going on.

* Lots of work to try and clean up the v8 and v16 work to get ready for
the VPIC Open Science Campaign on Trinity KNL.

* Finished separating most of the explicit intrinsics implementations into
separate files to help with understanding VPIC better.

* Modify the transpose experiment for v4 and v8 so that it performs
the same amount of work as v16.

Prepare for making the knl_open_science_xxx branches.

* Format tweaks.

* Format tweak.

* Added first pass of integrating explicit auto-vec into the v16 code base. The current version is not stage-ready and does not represent best performance, but serves as a placeholder for me to drop code into at a later date.

* Fix a bug in the v16 portable implementation.  Modify v8_avx2.h so that it
will compile with the Intel compiler.

* changed the way that the autovec is included so that it replaces the serial kernel entirely. Also snuck in some updates to a newer version of the auto-vec from Intel

* merged command line args changes

* fixed error macro line break bug

* fixed up error message lines to avoid macro continuations or 80-char lines

* Switch from int to size_t to allow for a larger number of particles in an MPI domain.

* Save the original github .gitmodules file as .gitmodules_github_relative and
then change to use absolute URLs for the git submodules.

* Add a relative path version of .gitmodules for use with git repos in the
Dave Nystrom git repo organization scheme.

* Change .gitmodules back to using the original github relative paths.  Add a .gitmodules_github_absolute
to document how to use the absolute URL.

* Switch back to using an absolute URL for cinch in the .gitmodules file.

* Update cinch to be in sync with that of the master branch.

* Format tweaks for deck/main.cc.

* Fixed bug in input parser where restart file was not allowed to start with 'restart*'

Also tidied up some white space, removed some debug code, and changed the formatting of the input deck parsing

* removed coloring of unit tests from generic input decks

* tidied arg checking code to remove repetitions, and changed equals sign detection from an error to a warning

* switched to a slightly safer current accumulation for the autovec

* Add some additional flexibility in how VPIC can interact with VTune.

* updated auto-vec to a slightly safer (correctness) version

* added hsw build file for autovec

* updated simd move queue to remove a small bug which could cause a sort hang on HSW. Now fully working on HSW and KNL

* First attempt at adding intrinsic support for AVX instructions.

* Change the implementation of the fma, fms and fnms functions to allow for the
fact that AVX does not support those instructions natively.
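
When fused multiply-add instructions are unavailable, one common fallback is a separate multiply followed by an add or subtract. The sketch below shows that idea in portable C++ rather than intrinsics. The `vec8` type, the `splat` helper, and the `fma8`/`fms8`/`fnms8` names are hypothetical, and the sign convention `fnms(a,b,c) = c - a*b` is an assumption rather than something stated in the notes above.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 8-lane float vector standing in for a V8 type.
struct vec8 {
  std::array<float, 8> v;
};

// Fill all lanes with the same value.
inline vec8 splat(float x) {
  vec8 r;
  r.v.fill(x);
  return r;
}

// a*b + c, written as an unfused multiply then add.
inline vec8 fma8(const vec8& a, const vec8& b, const vec8& c) {
  vec8 r;
  for (std::size_t i = 0; i < r.v.size(); ++i) r.v[i] = a.v[i] * b.v[i] + c.v[i];
  return r;
}

// a*b - c, written as an unfused multiply then subtract.
inline vec8 fms8(const vec8& a, const vec8& b, const vec8& c) {
  vec8 r;
  for (std::size_t i = 0; i < r.v.size(); ++i) r.v[i] = a.v[i] * b.v[i] - c.v[i];
  return r;
}

// c - a*b (assumed convention for the negated form).
inline vec8 fnms8(const vec8& a, const vec8& b, const vec8& c) {
  vec8 r;
  for (std::size_t i = 0; i < r.v.size(); ++i) r.v[i] = c.v[i] - a.v[i] * b.v[i];
  return r;
}
```

An unfused multiply and add rounds twice, so results may differ in the last bit from a true fused operation.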

* Initial work on sort_p.  Need to fix segfault for threaded case and
also work on scaling and optimization issues.  First, need to understand
the details of the sort.

* Format tweak.

* removed -g from autovec hsw build script

* Remove the explicit call to dump_energies that got added to the end of the
advance function.  This function should be called from the user_diagnostics
function that gets defined in the input deck of the user.

* added v16 arch file

* Exclude .cc files that get conditionally included into other .cc files from VPIC_SRC
so that they do not get compiled as separate translation units.

* Customize padding of structs to optimize memory footprint based on
the vector length model in use.

* Remove the config directory which was removed as part of the cinch removal.

* Remove a bunch of commented out code.

* Fix an incorrectly named template argument.

* Use more padding in the interpolator struct for V8 until I have debugged the problem
with using the correct amount of padding.

* Changes to optimize memory space usage in grid sized data structures based on
the required amount of padding in structs to achieve required data alignment. The
previous method did not work.

* Add some format tweaks. Add more implementations for load_8x8_tr.

* Add a v16 implementation for uncenter_p. Add a v8 implementation that
uses the load_8x8_tr function. Add some format tweaks to facilitate use
of xxdiff. Sort the particles before calling uncenter_p so the timing
data can be usefully compared to advance_p. The sort will need to be
removed for production code.

* Change back to the default version of load_8x8_tr. Tell cmake to exclude
uncenter_p_pipeline_v16.cc from the list of valid VPIC source files.

* Fix some literal floating point constants to be single precision.

* Fix a problem with building on Power 9 with the GNU compilers.

* Fix a problem where -1 was not properly interpreted within CPP macros.

* First complete implementation of v16 support. Not tested yet.

* Try vectorizing the particle loop in uncenter_p_pipeline by using an OpenMP simd pragma.

* Fix some bugs in v_16 field advance support. Fix problem with vectorization experiment
for uncenter_p_pipeline.

* Update arguments to NEXT_STENCIL macro.

* First try at factoring VPIC implementation to isolate and package use of the
pipeline abstraction. The objective here is to make it easier to support more
programming models in the same source code base such as the autovec work by
Bob and Doug or the Kokkos work. Hopefully, this will also make it easier to
test the different programming models with the same physics problems and test
for accuracy and performance. I wonder if it will compile and run the first time.

* Fix a build problem caused by the pipeline refactor of collision operator
source files.

* Fix more compilation problems with collision operator files after pipeline model refactor.

* Fix a problem in the EXEC_PIPELINES macro where _scalar was missing.

* Fix a missing semi-colon problem in langevin.c.

* Fix problem with undefined symbols in field_advance source files resulting
from refactor to separate pipeline model implementation files.

* Fix problem with undefined symbols in sf_interface source files resulting
from refactor to separate pipeline model implementation files.

* What a wonderful day it will be when all implementation files have a .cc file extension.

* Fix another file extension problem.

* Fix problem with undefined symbols in species_advance source files resulting
from refactor to separate pipeline model implementation files.

* Fix two bugs that resulted in infinite loops that showed up at run time.

* Format tweak.

* Fix an issue with the pipeline refactor of VPIC by creating and using pipelines_exec.h which will
guarantee that the EXEC_PIPELINES macro is processed with the appropriate macro definitions for each
function that uses it.

* Add Altivec intrinsics optimizations from Bob Walkup. Add more Altivec intrinsics implementations for
load and transpose operations. Add more AVX-512 intrinsics implementations for load and transpose
operations. Tweak AVX-2 load and transpose operations.

* Code cleanup.

* Sync rc_1pt1 branch with some changes that were made on the rc_1pt1_pipeline branch.

* Update the ALIGNED CPP macro to reflect actual data alignment that is used.

* Remove tabs from some of the intrinsics pipeline files.

* Code cleanup.

* Format tweaks.

* Make some peer reviewed cleanups.

* First pass at fixing .cc inclusion problem. More release related cleanup.

* More work to fix the issue of including .cc files i.e. we do not want to include
.cc files.

* Fix a bug in the rework of sort_p.

* Finish fixing the issue of inclusion of .cc and .c files.

* Add and use some header files so that the pipeline_args_t typedef can get
properly declared for each file set where it is used. Also, use the new header
files to provide declarations for the functions used in its respective file set.

* Hopefully the final set of changes required to complete resolution of the issue related to inclusion of .cc files.

* Put files specific to pipeline abstraction in their own directory for organizational purposes.

* Remove some commented out code.

* Make pipeline directories for holding files specific to the use of the pipeline abstraction.

* Format tweaks.

* More work to isolate implementation details of pipeline abstraction.

* added note on mailing list

* reformatting readme code

* making code snippets in readme conform to same style

* changing pthread to threads, to reflect existence of openmp

* removed autovec from current rc

* removed references to cinch

* cleaning up uninitialized value in poynting flux

* Readme rewrite (#35)

* added first outline of possible compile time options

* added cmake error if multiple threading models are selected

* added documentation of compile time options and added note on git workflow

* first pass at trying to document the main file using doxygen as an example for other files, and also to increase code coverage

* Add dash to "restart"

Fixes bug when trying to restart from a file/folder called restart (still won't work if the file name has a dash in it)

* Particle arrays now resized dynamically. (#39)

* Particle arrays now resized dynamically.

* Added option for disabling dynamic resize at compile time using Cmake

* Changed default min_np to be 4kb worth

* Added cmake option to allow user specified min_np

* moved declaration of min_np to no longer be in the middle of an if-else block, as it could technically be defined globally anyway
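
A growth policy for a dynamically resized particle array might look like the sketch below. It is not the VPIC implementation: the `particle` layout, the 1.5x growth factor, and the `next_capacity` helper are assumptions for illustration, and "4kb worth" is read here as the number of particles that fit in 4096 bytes.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical particle layout (eight 4-byte fields on typical platforms).
struct particle {
  float dx, dy, dz;
  int i;
  float ux, uy, uz, w;
};

// Assumed reading of "4kb worth": particles that fit in 4096 bytes.
constexpr std::size_t min_np = 4096 / sizeof(particle);

// Returns the capacity to use when `needed` particles must fit.
// The 1.5x growth factor is an assumption, chosen to amortize reallocations.
inline std::size_t next_capacity(std::size_t capacity, std::size_t needed) {
  if (needed <= capacity) return capacity;  // no resize required
  const std::size_t grown = capacity + capacity / 2;
  return std::max({grown, needed, min_np});
}
```

Clamping to `min_np` avoids repeated tiny reallocations for species that start nearly empty.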

* Devel (#46)

* Add configurable and documented build scripts for building VPIC on LANL ATS-1 and CTS-1 machines.
Document how to use these two scripts.

* Additional updates to documentation.

* Update compiler option documentation to make more accurate.

* Reorder options a bit.

* For the lanl-ats1 script, make sure that the Cray programming environment
starts out as the Cray default of PrgEnv-intel. This change checks for the
case where the user has modified their module environment and swaps it back
to the case assumed by the build script.

* fixed readme typo where vector length was listed incorrectly

* document options for dynamic particle resizing

* fixed typo in Readme

* Adding Unit tests (#44)

This commit adds a significant amount of unit tests, and also:

1. demonstrates how to write a unit test without doing a full input deck read/build cycle
2. Significantly increases code coverage from the tests
3. Takes a first step in adding a test for good "physics correctness" by trying to match against the energies from GY's IVPIC Weibel deck

* Moved old legacy tests to ./test/integrated/legacy to allow for cleanly adding new tests

* Added a particle_push integrated test that demonstrates that array syntax gives the same answer as the main kernel

* Updated gitignore to ignore tags files

* converted integrated style vpic to directly built binary

* updated code to match new pipeline file structure

* Added code to protect the conditional building of code paths which assume no V_ intrinsics

* added simple short running deck that will detect runtime seg faults

* Added a unit test to both dump a checkpoint, and one to restart from it

* removed un-used code from unit test for array syntax

* added parallel test

* Added hwloc as dependency on travis, which suddenly started failing because it was missing

* Added a threaded test

* tidied up test cmake file

* Fixed typo in end of for loop for tests

* added cleaning to simple example

* increased time steps in simple to encourage particles to cross boundaries during parallel

* Added note to mark the duplication of stringify macro

* Split wrapper into 2 files so the macro section can be reused elsewhere

* reverted spacing change in dump

* Fixed test looking for file in bin not src

* Added first draft of weibel test that detects numbers within a relative tolerance

* changed skip to default to 0

* Updated energy comparison to make a file displaying relative errors

* split function up into a couple of methods, and added code to log errors to file for further analysis

* added code to let energy error detection be aggregated as a sum based on a mask

* removed accidentally added prints

* converted to custom main to demonstrate how to avoid duplicate init/tear down in mpi and services

* 1) removed unneeded if check. 2) made it clearer which file has an issue when opening

* made error bounds more permissive for unit tests

* added v8 builds to tests

* added v4 to v8 builds

* fixed typo in cmake script

* updated travis builds to reflect Dave's comments

* removing legacy python config file

* removed unneeded include

* added v16 portable test

* added sample deck from patrick

* Removed duplicate lines from boundary_p

Fixes #14

* Removed ambiguous doubles and unneeded int that may cause unneeded casts
on some platforms

Fixes #20

* added draft gitlab_ci file for custom LANL runner

* re-wrote pcomm to no longer do int to float literal comparison
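
The Weibel energy check described above compares numbers within a relative tolerance. A minimal sketch of such a comparison follows, assuming a symmetric definition of relative error; the function name `within_relative_tolerance` and the choice of scale are illustrative and not taken from the VPIC tests.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true when `value` matches `reference` to within `rel_tol`,
// measured relative to the larger magnitude of the two.
inline bool within_relative_tolerance(double value, double reference,
                                      double rel_tol) {
  assert(rel_tol >= 0.0);
  const double scale = std::max(std::fabs(value), std::fabs(reference));
  if (scale == 0.0) return true;  // both exactly zero: treat as equal
  return std::fabs(value - reference) <= rel_tol * scale;
}
```

Scaling by the larger magnitude keeps the check symmetric in its two arguments and avoids dividing by a reference value that may be zero.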

* Devel (#49)

Fixed some issues with floating point literals in move_p.

1. Some usage of floating point literals in the affected region were not modified.
2. The axis variable is an int.
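
The two points above can be illustrated with a small sketch. The helper names are hypothetical and the code is not taken from move_p; it only shows why an `f` suffix matters in single-precision arithmetic and why an int such as axis is best compared with integer literals.

```cpp
#include <type_traits>

// An unsuffixed literal is a double, so it promotes the whole expression.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal stays in single precision");
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal is promoted to double");

// Hypothetical helper: stays in single precision thanks to the f suffix.
inline float half_of(float x) { return 0.5f * x; }

// Hypothetical helper: axis is an int, so compare it with an int literal
// rather than a floating-point one.
inline bool is_x_axis(int axis) { return axis == 0; }
```

Keeping expressions in single precision avoids silent conversions to double and back, which can matter for both performance and reproducibility across platforms.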

* Restructure Arch files (#47)

* first pass refactoring intel files

* Tidied up files and add gcc folders

* updated src path in arch files

* added v8 avx2 gcc

* updated readme to reflect new arch structure

* Remove global macro, this does the following things: (#48)

* Remove global macro, this does the following things:

1. Removes an instance where we violate ANSI aliasing rules by casting a char*
to a struct
2. Allows for file paths that contain the word global (the previously
defined macro messed them up)
3. Is somewhat cleaner

Largely this is an improvement, but comes with one downside:

The global type cannot be used before initialization, and initialization
must appear near the start of the file (previously only begin_globals
had to be near the top)

* added comments to explain the semantics of the global macro
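
For context on the aliasing point, one common way to read a struct out of a char buffer without casting the pointer is to copy the bytes with `std::memcpy`. The sketch below shows that general technique only; it is not necessarily how this change was implemented, and `user_globals` and both helpers are hypothetical.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical user-defined globals block.
struct user_globals {
  int step;
  double energy;
};

static_assert(std::is_trivially_copyable<user_globals>::value,
              "memcpy-based access requires a trivially copyable type");

// Copy the struct into a raw byte buffer (buffer must hold sizeof(user_globals)).
inline void write_globals(char* buffer, const user_globals& g) {
  std::memcpy(buffer, &g, sizeof g);
}

// Copy the bytes back out instead of casting char* to user_globals*.
inline user_globals read_globals(const char* buffer) {
  user_globals g;
  std::memcpy(&g, buffer, sizeof g);
  return g;
}
```

Compilers typically optimize small fixed-size `memcpy` calls into plain loads and stores, so this form usually costs nothing compared with the cast.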

* LANL specific arch build scripts (#50)

* Add configurable and documented build scripts for building VPIC on LANL ATS-1 and CTS-1 machines.
Document how to use these two scripts.

* Additional updates to documentation.

* Update compiler option documentation to make more accurate.

* Reorder options a bit.

* For the lanl-ats1 script, make sure that the Cray programming environment
starts out as the Cray default of PrgEnv-intel. This change checks for the
case where the user has modified their module environment and swaps it back
to the case assumed by the build script.

* Fix issues and errors in use of float literals introduced in a previous commit.

* Add CMake support for configuring a VPIC build with the legacy particle sort
implementation. Add build script support for a few more CMake variables that
were missing and should be available to users of the build scripts.

* Separate lanl-ats1 script into two separate scripts, one for Haswell nodes and one for KNL nodes.

* Reorder build configuration options to something that seems more sensible.

* fixed readme typo

* Documentation of VPIC CMake build variables (#52)

* Add configurable and documented build scripts for building VPIC on LANL ATS-1 and CTS-1 machines.
Document how to use these two scripts.

* Additional updates to documentation.

* Update compiler option documentation to make more accurate.

* Reorder options a bit.

* For the lanl-ats1 script, make sure that the Cray programming environment
starts out as the Cray default of PrgEnv-intel. This change checks for the
case where the user has modified their module environment and swaps it back
to the case assumed by the build script.

* Fix issues and errors in use of float literals introduced in a previous commit.

* Add CMake support for configuring a VPIC build with the legacy particle sort
implementation. Add build script support for a few more CMake variables that
were missing and should be available to users of the build scripts.

* Separate lanl-ats1 script into two separate scripts, one for Haswell nodes and one for KNL nodes.

* Reorder build configuration options to something that seems more sensible.

* Add more documentation about various available CMake configuration variables.

* added note in readme to reflect release versioning

* added note in readme to reflect release versioning formatting

1.0

Change to Cinch build system