Torch 3 Vision
A full additional package for machine learning applied to vision applications is now available. Have a look here.
Please read the installation notes in the documentation section before downloading anything.
Downloads

| Archive | Description |
|---|---|
| Torch3 src | Torch3 for Unix/Linux |
| Torch3 doc | Torch3 documentation |
| Torch3 win | Torch3 for MS Windows |

Note that the sources for Unix/Linux and MS Windows are the same; only the packaging method is different.
If for some reason you want the previous version of Torch, it is still available here.

| # | Paper | Year | Key Idea | Link |
|---|-------|------|----------|------|
| 1 | Rethinking Attention with Performers (Choromanski et al.) | 2021 | Shows that softmax attention can be approximated with a positive-random-feature kernel, giving O(N) time and memory in the sequence length N while keeping an unbiased estimate of the softmax kernel. | https://arxiv.org/abs/2009.14794 |
| 2 | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Katharopoulos et al.) | 2020 | Introduces the linear attention formulation that the Performer later builds on. | https://arxiv.org/abs/2006.16236 |
| 3 | Performers: Efficient Transformers for Long Sequences (Shen et al.), a tutorial / survey | 2023 | Walk-through of the math, implementation tricks, and a comparison of Performer against other efficient transformers. | https://arxiv.org/abs/2302.05442 |
| 4 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (Dao) | 2023 | Provides a highly optimized CUDA kernel that speeds up exact (quadratic) softmax attention; useful as the exact-attention baseline when benchmarking Performer on GPUs. | https://arxiv.org/abs/2307.08691 |

| Goal | Recommended First Paper |
|------|--------------------------|
| Understand the kernel-based linearization | "Rethinking Attention with Performers" (Choromanski et al., 2021) |
| Learn the causal sparse pattern | "SCAT: Sparse Causal Attention Transformer" (Zaheer et al., 2022) |
| See a concrete hybrid | "Linear-Sparse Transformers: Merging Performers with SCAT" (Liu et al., 2023) |

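
The kernel-based linearization mentioned in the tables above can be illustrated with a small sketch. It is written in pure Python so that it runs without PyTorch or a GPU, and it is only an illustration under stated assumptions, not the implementation from any of the listed papers. The function names (`positive_random_features`, `quadratic_attention`, `linear_attention`, `demo`) and all sizes are hypothetical. The sketch maps queries and keys to positive random features and then shows that summing over the keys once gives the same output as building the full N x N weight matrix. It covers the non-causal case only and does not measure how closely the features approximate softmax.

```python
import math
import random


def positive_random_features(x, omegas):
    """Map x to m positive features: exp(w . x - |x|^2 / 2) / sqrt(m)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm)
        for w in omegas
    ]


def quadratic_attention(q_feats, k_feats, values):
    """Reference version: one weight per (query, key) pair, so O(N^2) work."""
    d_v = len(values[0])
    out = []
    for qf in q_feats:
        weights = [sum(a * b for a, b in zip(qf, kf)) for kf in k_feats]
        total = sum(weights)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d_v)]
        )
    return out


def linear_attention(q_feats, k_feats, values):
    """Same output, but the keys are summarized once, so work is O(N)."""
    m, d_v = len(k_feats[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(m)]  # sum over keys of phi(k) v^T, shape (m, d_v)
    k_sum = [0.0] * m                     # sum over keys of phi(k), shape (m,)
    for kf, v in zip(k_feats, values):
        for i in range(m):
            k_sum[i] += kf[i]
            for j in range(d_v):
                kv[i][j] += kf[i] * v[j]
    out = []
    for qf in q_feats:
        denom = sum(a * b for a, b in zip(qf, k_sum))
        out.append([sum(qf[i] * kv[i][j] for i in range(m)) / denom for j in range(d_v)])
    return out


def gaussian_matrix(rng, rows, cols, std):
    """Rows x cols matrix of independent Gaussian samples."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def demo(n=16, d=4, m=32, seed=0):
    """Run both versions on the same random inputs (sizes are arbitrary)."""
    rng = random.Random(seed)
    omegas = gaussian_matrix(rng, m, d, 1.0)   # random projection directions
    queries = gaussian_matrix(rng, n, d, 0.5)
    keys = gaussian_matrix(rng, n, d, 0.5)
    values = gaussian_matrix(rng, n, d, 1.0)
    q_feats = [positive_random_features(q, omegas) for q in queries]
    k_feats = [positive_random_features(k, omegas) for k in keys]
    slow = quadratic_attention(q_feats, k_feats, values)
    fast = linear_attention(q_feats, k_feats, values)
    return slow, fast, values
```

Calling `demo()` returns the outputs of both versions together with the value vectors, and the two outputs should agree up to floating-point rounding. A causal variant would replace the two global sums with running prefix sums over the sequence positions.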
The name SCAT is used in a handful of recent works that aim at sparse attention patterns while preserving causal (autoregressive) constraints. The two most cited papers are:
```python
import torch

# Example usage (PerformerSCAT itself is not defined in this excerpt)
B, L, D = 2, 4096, 512  # batch size, sequence length, model dimension
x = torch.randn(B, L, D, device='cuda')
model = PerformerSCAT(dim=D).cuda()
out = model(x)  # shape (B, L, D)
print(out.shape)
```
```python
def forward(self, x):
    # Step 1: Performer (linear) attention on the whole sequence, with a residual connection
    x = self.performer(x) + x
```