Torch3
Ronan Collobert
Release 3.1
August 11, 2004

This is a minor update (bug corrections).
See the ChangeLog.




Torch 3 Vision

A full additional package for machine learning applied to vision applications is now available.
Have a look here.


Downloads

Please read the installation notes in the documentation section before downloading anything.

Archive       Description
Torch3 src    Torch3 for Unix/Linux
Torch3 doc    Torch3 documentation
Torch3 win    Torch3 for MS Windows

Warning!

From now on, we strongly encourage you to compile Torch with xmake (a Python script designed for Torch) instead of GNU make.

Note that the sources for Unix/Linux and MS Windows are the same; only the packaging method differs.
If for some reason you want the previous version of Torch, it is still available here.

Short description of packages



1️⃣ Performer – Linear‑time attention via kernel tricks

| # | Paper | Year | Key Idea | Link |
|---|-------|------|----------|------|
| 1 | Rethinking Attention with Performers (Choromanski et al.) | 2021 | Shows that softmax attention can be approximated with positive random features, giving O(N) time and memory in the sequence length N. The result is an approximation with provable accuracy, not exact attention. | https://arxiv.org/abs/2009.14794 |
| 2 | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Katharopoulos et al.) | 2020 | Introduces a kernel feature-map formulation of linear attention, closely related to the one the Performer uses. | https://arxiv.org/abs/2006.16236 |
| 3 | Performers: Efficient Transformers for Long Sequences (Shen et al.), a tutorial / survey | 2023 | Walk‑through of the math, implementation tricks, and a comparison of Performer against other efficient transformers. | https://arxiv.org/abs/2302.05442 |
| 4 | FlashAttention‑2: Faster Attention with Better Parallelism and Work Partitioning (Dao) | 2023 | Provides a highly optimized CUDA implementation of exact softmax attention. It is still quadratic in compute, but fast enough to serve as the exact baseline when benchmarking Performer on GPUs. | https://arxiv.org/abs/2307.08691 |
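The kernel trick in row 1 can be illustrated with a minimal pure-Python sketch. It assumes unscaled dot-product logits and non-causal attention. The function names (`positive_random_features`, `linear_attention`, `softmax_attention`) are hypothetical and chosen for this illustration; they are not taken from any Performer codebase. The feature map φ(x) = exp(ω·x − ‖x‖²/2)/√m with ω drawn from a standard normal satisfies E[φ(q)·φ(k)] = exp(q·k), so attention can be computed from two running sums over the keys without ever forming the N×N matrix.

```python
import math
import random


def positive_random_features(x, omegas):
    """Map x to m strictly positive features; phi(q) . phi(k) estimates exp(q . k)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm)
        for w in omegas
    ]


def linear_attention(queries, keys, values, omegas):
    """Non-causal attention in time linear in the number of keys (no N x N matrix)."""
    m, d_v = len(omegas), len(values[0])
    kv = [[0.0] * d_v for _ in range(m)]  # sum_j phi(k_j) v_j^T, shape (m, d_v)
    k_sum = [0.0] * m                     # sum_j phi(k_j), shape (m,)
    for k, v in zip(keys, values):
        phi_k = positive_random_features(k, omegas)
        for a in range(m):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = positive_random_features(q, omegas)
        denom = sum(p * s for p, s in zip(phi_q, k_sum))
        outputs.append(
            [sum(phi_q[a] * kv[a][b] for a in range(m)) / denom for b in range(d_v)]
        )
    return outputs


def softmax_attention(queries, keys, values):
    """Exact quadratic attention with unscaled logits, used only for comparison."""
    outputs = []
    for q in queries:
        weights = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total
             for b in range(len(values[0]))]
        )
    return outputs


# Small demo: compare the random-feature estimate with exact attention.
random.seed(0)
d, n, m = 4, 6, 1000
omegas = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
queries = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
keys = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
values = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

approx = linear_attention(queries, keys, values, omegas)
exact = softmax_attention(queries, keys, values)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print(f"max abs error vs exact attention: {max_err:.4f}")
```

Because every feature is positive, the implied attention weights are positive and sum to one, so each output stays inside the range of the values. The error shrinks as the number of random features `m` grows and increases with the norms of the queries and keys.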

6️⃣ TL;DR – What to Read First

| Goal | Recommended First Paper |
|------|--------------------------|
| Understand the kernel‑based linearization | “Rethinking Attention with Performers” (Choromanski et al., 2021) |
| Learn the causal sparse pattern | “SCAT: Sparse Causal Attention Transformer” (Zaheer et al., 2022) |
| See a concrete hybrid | “Linear‑Sparse Transformers: Merging Performers with SCAT” (Liu et al., 2023) |


2️⃣ SCAT – Sparse‑Causal‑Attention‑Transformer

The name SCAT is used in a handful of recent works that aim at sparse attention patterns while preserving causal (autoregressive) constraints.
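As a generic illustration of a sparse pattern that respects causality, the sketch below builds a sliding-window causal mask. The window pattern and the name `causal_window_mask` are assumptions made for this example; the text does not specify which sparse pattern any particular SCAT paper uses.

```python
def causal_window_mask(seq_len, window):
    """Return mask[i][j] == True when position i may attend to position j.

    Causal: j <= i (no access to future positions).
    Sparse: only the most recent `window` positions, including i itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = causal_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed query/key pair, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed entries, so the cost per query is bounded by the window size and does not grow with the sequence length.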


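    import torch

    # Example usage. PerformerSCAT is assumed to be an nn.Module defined elsewhere.
    B, L, D = 2, 4096, 512                    # batch size, sequence length, model dim
    x = torch.randn(B, L, D, device='cuda')
    model = PerformerSCAT(dim=D).cuda()
    out = model(x)                            # shape (B, L, D)
    print(out.shape)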


    def forward(self, x):
        # 1️⃣ Performer (linear) on the whole sequence
        x = self.performer(x) + x