# GitLab Elasticsearch Indexer

![Pipeline Status](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer/badges/main/pipeline.svg)

This project indexes Git repositories into Elasticsearch for GitLab. The indexed data lets GitLab search code, wikis, and commits in GitLab repositories through Elasticsearch.

The indexer is designed with a modular architecture that supports different indexing modes to optimize for various deployment scenarios. It uses structured logging to help with troubleshooting and debugging.

## Dependencies

This project relies on the following dependencies:

- [ICU](http://site.icu-project.org/) for text encoding
- Go 1.21 or later for building from source (required by the standard library `log/slog` package)
- [Gitaly](https://gitlab.com/gitlab-org/gitaly) for accessing Git repositories
- [Elasticsearch](https://www.elastic.co/elasticsearch/) v7.x or compatible OpenSearch instance

Ensure the development packages for your platform are installed before running `make`:

### Debian / Ubuntu

```bash
sudo apt install libicu-dev
```

### macOS

```bash
brew install icu4c
export PKG_CONFIG_PATH="$(brew --prefix)/opt/icu4c/lib/pkgconfig:$PKG_CONFIG_PATH"
```

## Modes Architecture

The GitLab Elasticsearch Indexer supports multiple operating modes that can be configured using the `GITLAB_INDEXER_MODE` environment variable. Each mode is optimized for different use cases:

### Advanced Mode (Default)

The Advanced Mode is the default mode for the indexer. It provides full-featured indexing with support for:

- Indexing code (blobs), commits, and wikis
- Project permission handling
- Namespace traversal IDs
- Schema versioning

This mode is recommended for most standard GitLab deployments.

### Chunk Mode

The Chunk Mode is an alternative indexing approach designed for large repositories or specialized deployment scenarios. This mode is currently under development and will provide enhanced features for handling very large codebases more efficiently.

To select a specific mode, set the `GITLAB_INDEXER_MODE` environment variable:

```bash
# For Advanced Mode (default if not specified)
export GITLAB_INDEXER_MODE=advanced

# For Chunk Mode
export GITLAB_INDEXER_MODE=chunk
```
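
The selection logic described above can be sketched in Go as follows. This is only an illustration: `resolveMode` is a hypothetical helper, not the indexer's actual code, and rejecting unknown values is a choice made for this sketch. Only the mode names `advanced` and `chunk` and the default come from this README.

```go
package main

import (
	"fmt"
	"os"
)

// Mode names taken from the README; everything else here is illustrative.
const (
	modeAdvanced = "advanced"
	modeChunk    = "chunk"
)

// resolveMode is a hypothetical helper: it maps the raw value of
// GITLAB_INDEXER_MODE to a known mode and falls back to "advanced"
// when the variable is empty or unset.
func resolveMode(raw string) (string, error) {
	switch raw {
	case "", modeAdvanced:
		return modeAdvanced, nil
	case modeChunk:
		return modeChunk, nil
	default:
		// Assumption of this sketch: unknown values are rejected.
		return "", fmt.Errorf("unknown indexer mode %q", raw)
	}
}

func main() {
	mode, err := resolveMode(os.Getenv("GITLAB_INDEXER_MODE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("indexer mode:", mode)
}
```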

## Building & Installing

### Local Build

To build and install the indexer locally:

```bash
make
sudo make install
```

`gitlab-elasticsearch-indexer` will be installed to `/usr/local/bin`.

You can change the installation path with the `PREFIX` environment variable. Please remember to pass the `-E` flag to sudo if you do so.

Example:

```bash
PREFIX=/usr sudo -E make install
```

### Development Helpers

The project includes several helpful Makefile targets to assist with development:

```bash
# View all available Makefile targets with descriptions
make help

# Run tests in watch mode (automatically re-run on file changes)
make watch-test
```

### Using Docker

You can also build and use the indexer as a Docker image:

```bash
docker build . -t gitlab-elasticsearch-indexer
```

You can edit your shell profile (like `~/.zshrc`) to use the image as a binary:

```bash
function gitlab-elasticsearch-indexer() {
  docker run --rm -it gitlab-elasticsearch-indexer "$@"
}
```

## Lefthook Static Analysis

[Lefthook](https://github.com/evilmartians/lefthook) is a Git hooks manager that allows
custom logic to be executed prior to Git committing or pushing. `gitlab-elasticsearch-indexer`
comes with Lefthook configuration (`lefthook.yml`), which helps ensure code quality by running
linters and static analysis tools automatically.

The configuration file is checked in but ignored until Lefthook is installed.

### Install Lefthook

1. [Install `lefthook`](https://lefthook.dev/installation/)
2. Install Lefthook Git hooks:

   ```shell
   lefthook install
   ```

3. Test Lefthook is working by running the Lefthook `pre-push` Git hook:

   ```shell
   lefthook run pre-push
   ```

Lefthook will now automatically run configured checks before commits and pushes.

## Testing

The project includes a test suite that runs against local Gitaly and Elasticsearch instances.

### Test Requirements

The test suite expects Gitaly and Elasticsearch to be running on the following ports:

- Gitaly: 8075
- Elasticsearch v7.14.2: 9201

Make sure you have `docker` and `docker-compose` installed. On macOS, you can use [colima](https://gitlab.com/-/snippets/2259133) to run Docker since [Docker Desktop cannot be used due to licensing](https://about.gitlab.com/handbook/tools-and-tips/mac/#docker-desktop).

```bash
brew install docker docker-compose colima
colima start
```

### Quick Tests

```bash
# Start the test infrastructure (only needed once)
make test-infra

# Source the default connection settings
source .env.test

# Run the test suite
make test

# Run tests in watch mode (auto-rerun on file changes)
make watch-test

# Run a specific test
go test -v gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -run TestIndexingGitlabTest
```

If you want to re-create the test infrastructure, you can run `make test-infra` again.

### Custom Test Configuration

For testing with custom configurations:

1. Start only the services you need:

   ```bash
   # Start Gitaly
   docker-compose up -d gitaly

   # Start Elasticsearch
   docker-compose up -d elasticsearch
   ```

2. Configure the test environment:

   ```bash
   # These are the defaults from .env.test
   export GITALY_CONNECTION_INFO='{"address": "tcp://localhost:8075", "storage": "default"}'
   export ELASTIC_CONNECTION_INFO='{"url":["http://localhost:9201"], "index_name":"gitlab-test", "index_name_commits":"gitlab-test-commits"}'
   ```

   **Note:** When using a Unix socket, use the format `unix://FULL_PATH_WITH_LEADING_SLASH`.

   Example with custom Gitaly connection:

   ```bash
   # Source default connections
   source .env.test

   # Override Gitaly connection for GDK
   export GITALY_CONNECTION_INFO='{"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}'

   # Run tests
   make test
   ```
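
Before running the tests, it can help to check that a connection string is well formed. The sketch below parses the Gitaly JSON and checks the address format described in the note above. `parseGitalyConn` and `gitalyConn` are hypothetical names, and the check accepts only the two schemes shown in this README (`tcp://` and `unix://`); Gitaly itself may accept others.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// gitalyConn mirrors only the two JSON keys shown in the README example.
type gitalyConn struct {
	Address string `json:"address"`
	Storage string `json:"storage"`
}

// parseGitalyConn is a hypothetical pre-flight check, not the indexer's own parser.
func parseGitalyConn(raw string) (gitalyConn, error) {
	var c gitalyConn
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return c, fmt.Errorf("invalid JSON: %w", err)
	}
	switch {
	case strings.HasPrefix(c.Address, "tcp://"):
		// For example tcp://localhost:8075.
	case strings.HasPrefix(c.Address, "unix://"):
		// The path after unix:// must start with a slash, giving three slashes in total.
		if !strings.HasPrefix(strings.TrimPrefix(c.Address, "unix://"), "/") {
			return c, fmt.Errorf("unix socket path must be absolute: %q", c.Address)
		}
	default:
		return c, fmt.Errorf("unsupported address scheme: %q", c.Address)
	}
	if c.Storage == "" {
		return c, fmt.Errorf("storage must not be empty")
	}
	return c, nil
}

func main() {
	// The first two values appear in this README; the third is deliberately malformed.
	samples := []string{
		`{"address": "tcp://localhost:8075", "storage": "default"}`,
		`{"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}`,
		`{"address": "unix://relative/gitaly.socket", "storage": "default"}`,
	}
	for _, s := range samples {
		c, err := parseGitalyConn(s)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("ok: address=%s storage=%s\n", c.Address, c.Storage)
	}
}
```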

## Testing in GDK

You can test changes to the indexer in the GitLab Development Kit (GDK) in multiple ways.

### Using the `GITLAB_ELASTICSEARCH_INDEXER_VERSION` File

**Warning:** Do not create tags to test code. Tags are created for released versions only.

The [GITLAB_ELASTICSEARCH_INDEXER_VERSION file](https://gitlab.com/gitlab-org/gitlab/-/blob/master/GITLAB_ELASTICSEARCH_INDEXER_VERSION) accepts commit SHAs and branch names. This method works for both local development and spec execution.

To test a branch or specific commit:

1. Update the `GITLAB_ELASTICSEARCH_INDEXER_VERSION` file with your branch name or commit SHA.
2. Run `gdk reconfigure` to apply the changes.

### Building a Binary for GDK

To test changes in your GDK, build the indexer with the `PREFIX` environment variable set to the indexer directory inside your GDK. This installs the binary directly into the GDK, where it is available for immediate testing:

```bash
# Build and install directly to GDK
PREFIX=<gdk_install_directory>/gitlab-elasticsearch-indexer make install
```

**Note:** Running `gdk update` will reset the indexer back to the version specified in the GITLAB_ELASTICSEARCH_INDEXER_VERSION file. The specs use this file to build the indexer to `<gdk_install_directory>/gitlab/tmp/tests/gitlab-elasticsearch-indexer`.

### Debugging with Delve

[Delve](https://github.com/go-delve/delve/) is a debugger for Go that you can use to step through the indexer's tests.

Start a debugging session with:

```bash
dlv test <path-to-package> -- -test.run <regex-matching-test-name>
```

Example:

```bash
dlv test gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -- -test.run '^TestIndexingWikiBlobs$'
```

Common debugging commands:

- Set a breakpoint: `break <path-to-file>:<line-number>`
- Continue execution until next breakpoint: `continue`
- Print variable value: `print <variable-name>`
- Step to next source line: `next`
- Exit debugger: `exit`

For more details, see the [Delve documentation](https://github.com/go-delve/delve/blob/master/Documentation/cli/getting_started.md#debugging-tests).

## Obtaining a package or Docker image for testing an MR

GitLab team members can use the `build-package-and-qa` job in their MR pipeline
to trigger a pipeline in the `omnibus-gitlab-mirror` project. This pipeline
produces:

- An `omnibus-gitlab` package for Ubuntu (as an artifact of the `Trigger:package` job)
- A Docker image (in the `Trigger:gitlab-docker` job)

These artifacts include the changes from the MR and can be used to deploy a GitLab instance locally for testing.

The job is automatically started if the MR includes changes to any of the dependencies of the project,
which could potentially break builds in any of the operating systems GitLab provides packages for.
For other types of MRs, this is available as a manual job for developers to run when needed.

## Configuration Options

The GitLab Elasticsearch Indexer can be configured using both environment variables and command-line flags.

### Environment Variables

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `GITLAB_INDEXER_MODE` | The indexing mode to use | `advanced` | `advanced`, `chunk` |
| `GITLAB_INDEXER_DEBUG_LOGGING` | Enable debug logging | `false` | `true`, `1` |
| `CORRELATION_ID` | ID for tracking operations across components | Auto-generated | `abc123` |
| `GITALY_CONNECTION_INFO` | Gitaly connection details (JSON) | | `{"address": "unix:///path/to/gitaly.socket", "storage": "default"}` |
| `ELASTIC_CONNECTION_INFO` | Elasticsearch connection details (JSON) | | `{"url":["http://localhost:9200"], "index_name":"gitlab-production", "index_name_commits":"gitlab-production-commits"}` |
| `DEBUG` | Legacy debug mode (deprecated) | | `true` |

### Command-line Flags

The indexer supports numerous command-line flags, particularly in advanced mode:

| Flag | Description | Example |
|------|-------------|---------|
| `--version` | Print version information and exit | |
| `--blob-type` | Type of blobs to index | `blob` (default), `wiki_blob` |
| `--skip-commits` | Skip indexing commits for the repo | |
| `--search-curation` | Enable deleting documents from rolled over indices | |
| `--visibility-level` | Project/Group visibility access level | `0`, `10`, `20` |
| `--repository-access-level` | Project repository access level | `0`, `10`, `20` |
| `--wiki-access-level` | Wiki repository access level | `0`, `10`, `20` |
| `--project-id` | Project ID | `42` |
| `--group-id` | Group ID | `24` |
| `--full-path` | Project or group full path | `group/project` |
| `--timeout` | Process timeout duration | `5m`, `1h` |
| `--traversal-ids` | Namespace traversal IDs for indexed documents | `5-1-6-` |
| `--hashed-root-namespace-id` | Hashed root namespace ID | `42` |
| `--schema-version-blob` | Schema version for blob documents (YYMM format) | `2305` |
| `--schema-version-commit` | Schema version for commit documents (YYMM format) | `2305` |
| `--schema-version-wiki` | Schema version for wiki documents (YYMM format) | `2305` |
| `--from-sha` | Starting commit SHA for indexing | `abc123...` |
| `--to-sha` | Ending commit SHA for indexing | `def456...` |
| `--archived` | Whether the project is archived | `true`, `false` |

## Logging

The GitLab Elasticsearch Indexer uses structured JSON logging with the Go standard library's `log/slog` package. This provides:

- Consistent log format with key-value pairs
- Configurable log levels
- Easy integration with log management systems

### Debug Logging

Debug logging can be enabled by setting the `GITLAB_INDEXER_DEBUG_LOGGING` environment variable:

```bash
# Enable debug logging
export GITLAB_INDEXER_DEBUG_LOGGING=true
# or
export GITLAB_INDEXER_DEBUG_LOGGING=1

# Run the indexer with debug logging enabled
gitlab-elasticsearch-indexer [options] /path/to/repo
```

When debug logging is enabled, you'll see additional information about:

- Mode selection and initialization
- Elasticsearch queries and responses
- Git operations
- Performance metrics

Debug logs are automatically formatted as structured JSON for easy filtering and analysis.
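
A minimal sketch of how such a logger can be set up with `log/slog` is shown below. The helper `levelFor` and the log messages and attribute names are hypothetical. Treating only `true` and `1` as enabling values mirrors the examples in this README and is an assumption about the indexer's actual parsing.

```go
package main

import (
	"log/slog"
	"os"
)

// levelFor maps the raw value of GITLAB_INDEXER_DEBUG_LOGGING to a log level.
// Assumption: only "true" and "1" enable debug logging.
func levelFor(raw string) slog.Level {
	if raw == "true" || raw == "1" {
		return slog.LevelDebug
	}
	return slog.LevelInfo
}

func main() {
	level := levelFor(os.Getenv("GITLAB_INDEXER_DEBUG_LOGGING"))

	// The JSON handler writes one JSON object per log record to stderr.
	handler := slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: level})
	logger := slog.New(handler)

	// Emitted only when the level is Debug.
	logger.Debug("mode selected", "mode", "advanced")
	// Emitted at both Info and Debug levels.
	logger.Info("indexing started", "project_id", 42)
}
```

Each record carries `time`, `level`, and `msg` keys plus the key-value pairs passed to the call, which is what makes the output easy to filter in a log management system.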

## CI/CD Configuration

### Automatic Tag Creation

The project contains a CI job that automatically creates version tags based on the content of the `VERSION` file. When changes are merged to the main branch, the system checks if a tag for the current version exists and creates one if needed.
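
The check-then-create decision can be sketched as below. This is not the CI job's implementation: `tagToCreate` is a hypothetical function, the tag list is made up for the example, and the `v` prefix on tag names is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// tagToCreate returns the tag name to create for the given VERSION file
// content, or "" when the file is empty or the tag already exists.
// Assumption: tags are named "v" followed by the version string.
func tagToCreate(versionFile string, existingTags []string) string {
	version := strings.TrimSpace(versionFile)
	if version == "" {
		return ""
	}
	tag := "v" + version
	for _, t := range existingTags {
		if t == tag {
			return ""
		}
	}
	return tag
}

func main() {
	existing := []string{"v1.0.0", "v1.1.0"} // hypothetical tag list
	for _, version := range []string{"1.1.0\n", "1.2.0\n"} {
		if tag := tagToCreate(version, existing); tag != "" {
			fmt.Println("would create tag", tag)
		} else {
			fmt.Println("nothing to do for version", strings.TrimSpace(version))
		}
	}
}
```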

#### TAG_CREATOR_TOKEN Requirements

To enable automatic tag creation, you need to set up a GitLab CI/CD variable:

- **Variable Name**: `TAG_CREATOR_TOKEN`
- **Type**: Masked and Protected variable
- **Requirements**:
  - Must be a project access token with Developer role
  - Scope required: `api`
  - The bot user created with the token must have permission to create protected tags

To set up this token:

1. Create a project access token with the Developer role and the `api` scope.
2. Add the token as a Masked and Protected CI/CD variable in your project settings.
3. Go to your project's **Settings > Repository > Protected tags**.
4. Add the project bot user (appears as "Project bot: [project-name]") to the list of users allowed to create protected tags.

## Contributing

Please see the [contribution guidelines](CONTRIBUTING.md) and the [development process documentation](doc/process.md) for information about the release process, maintainership, and versioning.
