
crossvit vs vision transformer #8598

Open
Navoditamathur opened this issue Aug 17, 2024 · 2 comments

Comments

@Navoditamathur

🚀 The feature

Implement the CrossViT model for fine-grained classification.

Motivation, pitch

CrossViT integrates multi-scale feature representations, enabling it to efficiently process images of varying resolutions. By implementing CrossViT in PyTorch, you can harness the strength of multi-scale feature fusion to improve performance in image classification, object detection, and other computer vision tasks.

Key Points:

Multi-Scale Representation:
CrossViT uses two separate branches with different image patch sizes, allowing the model to capture both fine and coarse-grained features. This dual-branch architecture significantly enhances the model's ability to understand complex image structures.

Cross-Attention Mechanism:
The core innovation of CrossViT lies in its cross-attention mechanism, where features from one branch are fused with features from another. This interaction facilitates information exchange between scales, improving the model's capability to detect patterns across different granularities.

Real-World Applications:
CrossViT has shown promise in tasks ranging from image classification to object detection, making it a versatile choice for real-world applications such as medical imaging, remote sensing, and autonomous vehicles. PyTorch's support for deployment on different platforms (e.g., mobile and embedded systems) means CrossViT could be used in diverse environments. It performs strongly where multi-scale feature extraction is crucial, such as fine-grained image classification or tasks requiring both global context and local detail.
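To make the cross-attention idea above concrete, here is a minimal PyTorch sketch of the fusion step: the CLS token of one branch attends over the patch tokens of the other branch, exchanging information across scales. This is a simplified illustration, not the full CrossViT implementation; the class and variable names are hypothetical, and details such as the token projection between branch dimensions are omitted.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch of CrossViT-style fusion: the CLS token from
    branch A queries the patch tokens of branch B (names are hypothetical)."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # cls_a:    (B, 1, D) CLS token from branch A
        # tokens_b: (B, N, D) patch tokens from branch B
        q = self.norm(cls_a)
        kv = torch.cat([cls_a, tokens_b], dim=1)   # attend over both
        fused, _ = self.attn(q, kv, kv)
        return cls_a + fused                       # residual connection

# Usage: fuse the small-patch branch's CLS token with the
# large-patch branch's tokens (shapes are illustrative).
fusion = CrossAttentionFusion(dim=64)
cls_small = torch.randn(2, 1, 64)    # CLS from the small-patch branch
tok_large = torch.randn(2, 49, 64)   # patch tokens from the large-patch branch
out = fusion(cls_small, tok_large)
print(out.shape)  # torch.Size([2, 1, 64])
```

The fused CLS token is then passed back to its own branch, so each branch cheaply absorbs information from the other scale without full pairwise attention over all tokens.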

Alternatives

No response

Additional context

No response

@abhi-glitchhg
Contributor

There are so many versions of vision transformer papers that I feel it's better to use the timm library; it has implementations of many vision models.

@NicolasHug
Member

Hi @Navoditamathur

Thank you for opening this issue. We're not planning on adding new models to torchvision at this point. I agree with @abhi-glitchhg that other repos like timm might be a better venue for that.
