Description
Background
As more and more large Gitea instances appear, the current implementation shows three drawbacks.
- Scalability
All git repositories are stored on disk under a single root directory. This is hard to scale for large Gitea instances, because the absolute repository path is already used everywhere in the code.
- Fork disk optimization
Git itself supports shared repositories, but Gitea doesn't use this feature yet to reduce the disk usage of forked repositories. Some design questions need to be answered first. Which repository should be the root of the base and forked repositories? Should we have a hidden repository as the root? This is also related to the abstract layer.
- Risk of failure when renaming
When a repository or a user is renamed, some folders need to be renamed as well, and these disk operations are mixed with database transactions. There is a high risk that the names on disk and the database records become inconsistent.
Purpose
So I propose an abstract layer for managed repositories.
What are managed repositories? Right now the git package handles all git repositories. Some repositories are created temporarily for pushing, editing and various other reasons. Others, like the code repository, wiki repository, profile repository and package repositories, are long-lived. We call the latter managed repositories: they are not created and destroyed for a single operation.
All operations on managed repositories will depend on a new package named gitrepo rather than depending directly on the git package.
I think this has some benefits.
- It will be easier to introduce distributed git storage based on the gitrepo package. Once all the abstractions are complete, we can have a proxy mode inside the gitrepo package, e.g.
OriginalGitStorageService could keep the original logic with a root repositories path.
HTTPGitStorageService could store the managed git repositories on a server separate from the Gitea server and provide an HTTP service to read/write them.
- Convert to a different storage directory structure. Currently, renaming a user or repository requires renaming directories on disk, which makes it difficult to stay consistent when an operation fails. The best approach is to use fixed repository information as directory names, such as the user/repository ID, so that renaming a user or repository needs no disk operation.
Concepts
I previously sent some PRs that tried to introduce a layer in modules/git, but I found that it's not the right direction. modules/git should be a base package that always focuses on disk operations, whether the repository is a managed one, a wiki one, a temporary one or a hidden one. So I think some concepts need to be introduced for clarity.
- Managed Git Repositories: All repositories recorded in Gitea's database, including wiki repositories and any future repository types, can be considered managed git repositories. Only these git repositories should be managed by the distributed system.
- Temporary Git Repositories: Repositories that are created and deleted internally while Gitea performs some operation. They are stored on the system's temporary file system and cleaned up after the related operation finishes.
- modules/git: This package should be a low-level package that can handle any git repository on disk. For managed git repositories, a new package should be introduced.
- modules/gitrepo: This is the new package, introduced as an abstract layer to handle managed git repositories. It may include different storage strategies, but the interface it exposes to other packages stays almost the same as before, to hide the implementation details. This package will depend on modules/git and should not depend on any models packages. Other modules and services layer packages can depend on it.
Refactoring
To achieve this, we need to do some refactoring.
Move managed git operations and setting.RepoRootPath to the modules/gitrepo package.
All operations related to managed git repositories should be moved to the gitrepo package and should not depend on modules/git directly. modules/git is still useful: it can handle temporary repositories, and modules/gitrepo depends on it.
An abstract storage repository interface could look like:
type Repository interface {
RelativePath() string
}
So we need CodeStorageRepository, WikiStorageRepository, ProfileStorageRepository and PackageRepository, which implement this interface.
The interface should focus only on the storage of managed git repositories.
All functions under modules/gitrepo should take this interface as the second parameter; the first one is context.Context.
Storage strategies
The relative path is currently generated dynamically from the owner name and repository name. It should instead be stored in the database; we can add a new column to the repository table, e.g.
type Repository struct {
...
StorageRelativePath string `xorm:"VARCHAR(2048)"`
...
}
For generating the storage path, we can introduce different storage strategies, e.g.
func GenerateTraditionalRelativePath(repo *repo_model.Repository) string {
return repo.OwnerName + "/" + repo.Name
}
// hashfunc is a placeholder for whichever hash function is chosen.
func GenerateHashedRelativePath(repo *repo_model.Repository) string {
return hashfunc(repo.ID)
}
A strategy should be applied only to newly created repositories; existing repositories keep using the database column as their storage relative path.
Some strategies require disk operations when renaming; those operations should be part of the strategy.
We can provide a conversion tool to migrate from the traditional relative path strategy to the hashed one. The hashed relative path will use the repository’s ID which is a 64-bit
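The rule "stored path wins, strategy only applies to new repositories" can be sketched as follows. The trimmed `Repository` struct, the `PathStrategy` type, the `RelativePath` helper, and the choice of SHA-256 with two fan-out directories are all assumptions for illustration; the proposal leaves the hash function open.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// Repository is a trimmed stand-in for repo_model.Repository.
type Repository struct {
	ID                  int64
	OwnerName           string
	Name                string
	StorageRelativePath string // persisted once, then the source of truth
}

// PathStrategy generates a relative path for a newly created repository.
type PathStrategy func(repo *Repository) string

func GenerateTraditionalRelativePath(repo *Repository) string {
	return repo.OwnerName + "/" + repo.Name
}

// GenerateHashedRelativePath hashes the decimal ID with SHA-256 and uses
// the first four hex digits as two fan-out directories (assumed layout).
func GenerateHashedRelativePath(repo *Repository) string {
	sum := sha256.Sum256([]byte(strconv.FormatInt(repo.ID, 10)))
	h := hex.EncodeToString(sum[:])
	return h[0:2] + "/" + h[2:4] + "/" + h
}

// RelativePath returns the stored path if there is one; otherwise it
// applies the strategy and records the result on the repository.
func RelativePath(repo *Repository, strategy PathStrategy) string {
	if repo.StorageRelativePath == "" {
		repo.StorageRelativePath = strategy(repo)
	}
	return repo.StorageRelativePath
}

func main() {
	// An existing repository keeps its traditional location.
	old := &Repository{ID: 1, OwnerName: "alice", Name: "demo", StorageRelativePath: "alice/demo"}
	fmt.Println(RelativePath(old, GenerateHashedRelativePath))

	// A new repository gets a hashed path that does not depend on its name.
	fresh := &Repository{ID: 2, OwnerName: "alice", Name: "new"}
	before := GenerateHashedRelativePath(fresh)
	fresh.Name = "renamed"
	fmt.Println(before == GenerateHashedRelativePath(fresh))
}
```

Because the hashed path depends only on the ID, a rename under this strategy touches the database alone; the traditional strategy would still need a directory move.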
Multiple storage services
After the first two steps, we have enough abstraction to introduce GitStorageService. A GitStorageService could have an interface like this:
type GitStorageService interface {
Init(ctx context.Context) error
OpenRepository(ctx context.Context) (GitRepository, error)
RunCommand(ctx context.Context, repo Repository, c *git.Command, opts *git.RunOpts) error
}
A repository interface
Since for difference
type GitRepository interface {
GetCommit(ctx context.Context, commitID string) (*git.Commit, error)
}
Git Objects rewrite
Many git objects contain a reference to git.Repository, which prevents the abstraction above. So a preparatory step is to remove that reference from git objects such as git.Commit, git.Tag, etc.