Proposal: An abstract layer for managed git repositories #29033

@lunny

Background

As more and more large Gitea instances are deployed, the current implementation shows the following drawbacks.

  • Scalability

Git repositories are stored on disk under a single directory. This makes it hard to scale large Gitea instances, because the absolute repository path is already used everywhere in the code.

  • Fork disk optimization

Git itself supports shared repositories, but Gitea does not yet use this feature to reduce the disk usage of forked repositories. Some design questions need to be considered. Which repository should be the root of the base and forked repositories? Should we have a hidden repository as the root? This is also related to the abstraction layer.

  • Risk of failure when renaming

When renaming a repository or a user, some folders need to be renamed, and these disk operations are mixed with database transactions. There is a high risk that the names on disk and the database records become inconsistent.

Purpose

Therefore I propose an abstraction layer for managed repositories.

What are managed repositories? Today the git package handles all git repositories. Some repositories are created temporarily for pushing, editing and various other reasons. Others, such as the code repository, wiki repository, profile repository and package repositories, are long-lived. We call the latter managed repositories: they are not created and destroyed for a single operation.

All operations on managed repositories will depend on a new package named gitrepo rather than directly on the git package.

I think this has several benefits.

  • It will be easier to introduce distributed git storage based on the gitrepo package. Once the abstraction is complete, we can have a proxy mode inside the gitrepo package, e.g.

OriginalGitStorageService could keep the original logic with a single repository root path.

HTTPGitStorageService could store the managed git repositories on a server separate from the Gitea server and provide an HTTP service to read and write them.

  • Convert to a different storage directory structure. Currently, renaming a user or repository requires renaming directories on disk, which makes it difficult to stay consistent when an operation fails. A better approach is to use immutable repository information for directory names, such as the user or repository ID, so that renaming a user or repository requires no disk operation.

Concepts

I previously sent some PRs that tried to introduce a layer in modules/git, but I found that was not the right direction. modules/git should be a base package that always focuses on handling on-disk operations, whether the repository is a managed one, a wiki one, a temporary one or a hidden one. So I think some concepts need to be introduced for clarity.

  • Managed Git Repositories: All repositories recorded in Gitea's database, including wiki repositories and any future repository types, are considered managed git repositories. Only these git repositories should be managed by the distributed system.
  • Temporary Git Repositories: Repositories that are created and deleted internally while Gitea performs certain operations. They are stored on the system's temporary file system and are cleaned up after the related operation finishes.
  • modules/git: This should be a low-level package that can handle any on-disk git repository. For managed git repositories, a new package should be introduced.
  • modules/gitrepo: This is the new package, introduced as an abstraction layer to handle managed git repositories. It may include different storage strategies, but the interface it exposes to other packages stays almost the same as before in order to hide the implementation details. This package will depend on modules/git and should not depend on any models packages. Other modules and services layer packages can depend on it.

Refactoring

To achieve this, we need to do some refactoring.

Move managed git operations and setting.RepoRootPath to the modules/gitrepo package.

All operations related to managed git repositories should be moved to the gitrepo package and should not depend on modules/git directly. modules/git is still useful: it handles temporary repositories and is a dependency of modules/gitrepo.

An abstract storage repository interface could look like this:

type Repository interface {
    RelativePath() string
}

We then need CodeStorageRepository, WikiStorageRepository, ProfileStorageRepository and PackageRepository, each of which implements this interface.

The interface should only focus on the storage of managed git repositories.

All functions under modules/gitrepo should take this interface as the second parameter; the first one is context.Context.

Storage strategies

The relative path is currently generated dynamically from the owner name and repository name. It should instead be stored in the database, so we can add a new column to the repository table, e.g.

type Repository struct {
    ...
    StorageRelativePath string `xorm:"VARCHAR(2048)"`
    ...
}

For generating the storage path, we can introduce different storage strategies, e.g.

func GenerateTraditionalRelativePath(repo *repo_model.Repository) string {
    return repo.OwnerName + "/" + repo.Name
}

func GenerateHashedRelativePath(repo *repo_model.Repository) string {
    return hashfunc(repo.ID)
}
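
The proposal leaves hashfunc unspecified. A minimal sketch of one possible choice follows, assuming a SHA-256 digest of the decimal ID with a two-level directory fan-out. The hash, the fan-out layout and the local Repository struct (a stand-in for repo_model.Repository) are assumptions for illustration only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strconv"
)

// Repository is a local stand-in for repo_model.Repository with only the
// fields this example needs.
type Repository struct {
	ID        int64
	OwnerName string
	Name      string
}

// GenerateTraditionalRelativePath depends on names, so it changes on rename.
func GenerateTraditionalRelativePath(repo *Repository) string {
	return repo.OwnerName + "/" + repo.Name
}

// GenerateHashedRelativePath depends only on the immutable ID. The first
// four hex characters become two directory levels so that no single
// directory collects every repository.
func GenerateHashedRelativePath(repo *Repository) string {
	sum := sha256.Sum256([]byte(strconv.FormatInt(repo.ID, 10)))
	h := hex.EncodeToString(sum[:])
	return h[0:2] + "/" + h[2:4] + "/" + h
}
```

Because the hashed path is derived from the ID alone, renaming the owner or the repository leaves it unchanged, while the traditional path changes with every rename.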

A strategy should be applied only to newly created repositories; existing repositories will keep using the path stored in the database column as their storage relative path.

Some strategies require disk operations when renaming, and these should be part of the strategy.

We can provide a conversion tool to migrate from the traditional relative path strategy to the hashed one. The hashed relative path will use the repository's ID, which is a 64-bit integer.
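
One way these two rules could fit together is sketched below. The PathStrategy interface, its method names, the `by-id/` layout and the ResolveRelativePath and Rename helpers are hypothetical names invented for this example; the proposal does not define them.

```go
package main

import "strconv"

// Repository is a local stand-in for the database model.
type Repository struct {
	ID                  int64
	OwnerName           string
	Name                string
	StorageRelativePath string // empty until a path has been persisted
}

// PathStrategy is a hypothetical interface: it generates a path and says
// whether a rename forces a directory move on disk.
type PathStrategy interface {
	Generate(repo *Repository) string
	RenameNeedsDiskMove() bool
}

type TraditionalStrategy struct{}

func (TraditionalStrategy) Generate(repo *Repository) string {
	return repo.OwnerName + "/" + repo.Name
}
func (TraditionalStrategy) RenameNeedsDiskMove() bool { return true }

type IDStrategy struct{}

func (IDStrategy) Generate(repo *Repository) string {
	return "by-id/" + strconv.FormatInt(repo.ID, 10)
}
func (IDStrategy) RenameNeedsDiskMove() bool { return false }

// ResolveRelativePath prefers the stored column, so existing repositories
// keep their path; only repositories without one use the strategy.
func ResolveRelativePath(repo *Repository, s PathStrategy) string {
	if repo.StorageRelativePath == "" {
		repo.StorageRelativePath = s.Generate(repo)
	}
	return repo.StorageRelativePath
}

// Rename updates the name and returns the old and new relative paths.
// They differ only when the strategy requires a directory move.
func Rename(repo *Repository, newName string, s PathStrategy) (oldPath, newPath string) {
	oldPath = ResolveRelativePath(repo, s)
	repo.Name = newName
	if s.RenameNeedsDiskMove() {
		repo.StorageRelativePath = s.Generate(repo)
	}
	return oldPath, repo.StorageRelativePath
}
```

A caller would move the directory only when the two returned paths differ. A real implementation would also need to remember which strategy produced each stored path, which this sketch leaves out.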

Multiple storage services

After the first two steps, we have enough abstraction to introduce GitStorageService. A GitStorageService could have an interface like this:

type GitStorageService interface {
    Init(ctx context.Context) error
    OpenRepository(ctx context.Context) (GitRepository, error)
    RunCommand(ctx context.Context, repo Repository, c *git.Command, opts *git.RunOpts) error
}

A repository interface

Since for difference

type GitRepository interface {
    GetCommit(ctx context.Context, commitID string) (*git.Commit, error)
}

Git Objects rewrite

Many git objects contain a reference to git.Repository, which prevents the abstraction above. A preparatory step is therefore to remove that reference from git objects such as git.Commit, git.Tag, etc.

Related PRs

#28937
#28940
#28966

Labels: type/proposal (The new feature has not been accepted yet but needs to be discussed first.)
