Improving NuGet package extraction performance

A few weeks ago I received the following bug report for Octopus from a customer:

We have recently rolled out Octopus Deploy [..] and we are having severe memory leak problems on the tentacles. After only 1 deploy, one of the tentacles had used 1.2GB of RAM (yes, gigabytes). The package was quite large (circa 180MB), but there must be a serious leak somewhere. Other projects are having similar issues - up to about 600MB of memory usage.

I put together a test harness and created a NuGet package containing a pair of 90MB files. The harness simply used the NuGet.Core library's PackageManager class to install the package to a local folder. 55 seconds later, having used 1.17GB of memory (as measured by GC.GetTotalMemory(false)), NuGet had finished extracting the package.
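
For reference, the harness was roughly the following (a minimal sketch, not the exact code; the package ID, version, and paths are hypothetical placeholders, and it assumes NuGet.Core 2.x's PackageManager API):

// Sketch of the test harness: install a local .nupkg with NuGet.Core's
// PackageManager, then report elapsed time and managed memory.
// "MyLargePackage", "1.0.0" and both paths are placeholders.
var repository = PackageRepositoryFactory.Default.CreateRepository(@"C:\PackageSource");
var manager = new PackageManager(repository, @"C:\Target");

var watch = System.Diagnostics.Stopwatch.StartNew();
manager.InstallPackage("MyLargePackage", new SemanticVersion("1.0.0"), ignoreDependencies: true, allowPrereleaseVersions: false);
watch.Stop();

Console.WriteLine("Took {0:n0}ms, GC.GetTotalMemory(false) = {1:n0} bytes",
    watch.ElapsedMilliseconds, GC.GetTotalMemory(false));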

The good news is that given time, the memory usage reduced to normal, so the GC was able to free the memory (though much of it stayed allocated to the process just in case). The memory wasn't being leaked, it was just being wasted.

Weirdly, the NuGet API is designed around streams, and the System.IO.Packaging classes which NuGet depends on are also designed around streams. Looking into the implementation, the problem seemed to come down to NuGet.Core's ZipPackage class.

When you ask a NuGet.Core ZipPackage to list the files in the package (GetFiles()), the implementation looks like this:

private List<IPackageFile> GetFilesNoCache()
{
    using (Stream stream = _streamFactory())
    {
        Package package = Package.Open(stream);

        return (from part in package.GetParts()
                where IsPackageFile(part)
                select (IPackageFile)new ZipPackageFile(part)).ToList();
    }
}

The ZipPackageFile constructor is being passed a PackagePart, which exposes a stream. But what happens with that stream?

public ZipPackageFile(IPackageFile file)
    : this(file.Path, file.GetStream().ToStreamFactory())
{
}

The ToStreamFactory call looks innocuous, but here's the implementation:

public static Func<Stream> ToStreamFactory(this Stream stream)
{
    byte[] buffer;

    using (var ms = new MemoryStream())
    {
        try
        {
            stream.CopyTo(ms);
            buffer = ms.ToArray();
        }
        finally 
        {
            stream.Close();
        }
    }

    return () => new MemoryStream(buffer);
}

That's right - it reads the entire stream into an array, then returns a new MemoryStream populated from that array any time someone requests the contents of the file.

The reason for this appears to be that while System.IO.Packaging is designed to use streams which need to be disposed, the NuGet API and classes like ZipPackage are intended to be passed around without needing to be disposed. So instead of opening and closing the .nupkg file to read file contents when required, it copies everything to memory.

This isn't a problem when your packages are less than a few MB, but it's pretty harmful when you're distributing applications.
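
If I were keeping NuGet's disposal-free design, one option would be to spill to disk instead of memory. Here's a rough sketch of a disk-backed alternative to ToStreamFactory - my speculation, not NuGet's code, and temp file cleanup is left out for brevity:

public static Func<Stream> ToTempFileStreamFactory(this Stream stream)
{
    // Copy the part to a temp file once, instead of into a byte[] in memory
    var tempFile = Path.GetTempFileName();

    using (var temp = File.Create(tempFile))
    {
        try
        {
            stream.CopyTo(temp);
        }
        finally
        {
            stream.Close();
        }
    }

    // Hand out read-only streams over the temp file; memory stays flat at
    // the size of the copy buffer rather than the size of the package file
    return () => new FileStream(tempFile, FileMode.Open, FileAccess.Read, FileShare.Read);
}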

After spending half a day trying to patch NuGet.Core to avoid reading the files into memory, in the hope of sending a pull request, I found that other people had already tried and had their pull requests rejected - it seems this is a problem the NuGet team plans to solve in an upcoming release.

Instead, I gave up and decided to write a package extraction function to suit my needs. The gist below extracts the same package in 10 seconds using only 6MB of memory:

public class LightweightPackageInstaller
{
    private static readonly string[] ExcludePaths = new[] { "_rels", "package\\services\\metadata" };

    public void Install(string packageFile, string directory)
    {
        using (var package = Package.Open(packageFile, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            var files = (from part in package.GetParts()
                         where IsPackageFile(part)
                         select part).ToList();

            foreach (var part in files)
            {
                Console.WriteLine(" " + part.Uri);

                var path = UriUtility.GetPath(part.Uri);
                path = Path.Combine(directory, path);

                var parent = Path.GetDirectoryName(path);
                if (parent != null && !Directory.Exists(parent))
                {
                    Directory.CreateDirectory(parent);
                }

            using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Read))
            using (var stream = part.GetStream())
            {
                stream.CopyTo(fileStream);
                fileStream.Flush();
            }
            }
        }
    }

    internal static bool IsPackageFile(PackagePart part)
    {
        var path = UriUtility.GetPath(part.Uri);
        return !ExcludePaths.Any(p => path.StartsWith(p, StringComparison.OrdinalIgnoreCase)) &&
               !PackageUtility.IsManifest(path);
    }
}
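
Note that the gist leans on NuGet.Core's internal UriUtility.GetPath and PackageUtility.IsManifest helpers. If you want the class to stand alone, a rough approximation of the path helper (not NuGet's exact code) looks like this:

internal static class UriUtility
{
    // Converts a package part URI like /lib/net40/MyApp.dll into a
    // relative file path like lib\net40\MyApp.dll
    public static string GetPath(Uri uri)
    {
        var path = uri.OriginalString.TrimStart('/');
        return Uri.UnescapeDataString(path).Replace('/', Path.DirectorySeparatorChar);
    }
}

Installing a package is then just a matter of calling new LightweightPackageInstaller().Install(packageFile, targetDirectory).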

There are other parts of NuGet.Core that break when large files are used. For example, some NuGet repositories use SHA hashes so that consumers can verify a package download. This is implemented in NuGet.Core's CryptoHashProvider. The method signatures?

public interface IHashProvider
{
    byte[] CalculateHash(Stream stream);

    byte[] CalculateHash(byte[] data);

    bool VerifyHash(byte[] data, byte[] hash);
}

Again, instead of passing a stream (even though the underlying crypto classes accept streams), it will read the entire file (180MB in this case) into an array just to hash it.
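
Hashing the stream directly keeps memory flat no matter how big the package is. A minimal sketch (assuming SHA512, which I believe is what the official feed uses for package hashes):

public static byte[] CalculateHash(string packageFile)
{
    // HashAlgorithm.ComputeHash(Stream) reads the stream in small chunks
    // internally, so a 180MB .nupkg costs no more memory than a tiny one
    using (var sha = SHA512.Create())
    using (var stream = File.OpenRead(packageFile))
    {
        return sha.ComputeHash(stream);
    }
}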

Here's hoping these problems are fixed soon. For now, at least, Octopus has a workaround that's only a few lines of code and performs much faster. As of Octopus 1.3.5, you'll be able to install large packages without dedicating most of your memory to it.


Henning Kilset
19 Feb 2013

Great detective work, Paul. And here's hoping the NuGet core team will get this patched into their implementation sooner rather than later...

20 Feb 2013

That's nice to know in advance.

Alex Falkowski
28 Feb 2013

Thanks for the mention. It looks like they have introduced a class called OptimizedZipPackage. Here are the doc comments for the class:

Represents a NuGet package backed by a .nupkg file on disk. Unlike ZipPackage, OptimizedZipPackage doesn't store content files in memory. Instead, it unzips the .nupkg file to a temp folder on disk, which helps reduce overall memory usage.

So it looks like they are just storing it on disk, sigh :)
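
Usage looks like it would be a drop-in swap for ZipPackage, assuming the constructor takes a package path the same way (the path here is just a placeholder):

// Contents are unzipped to a temp folder rather than buffered in memory
var package = new OptimizedZipPackage(@"C:\Packages\MyApp.1.0.0.nupkg");

foreach (var file in package.GetFiles())
{
    Console.WriteLine(file.Path);
}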

Mick
04 Mar 2013

Looks like yesterday they released a new version of NuGet.Core. I think they've fixed all the issues you found; the implementation is now a lot simpler and more straightforward.

var pathToZip = @"";
var zp = new ZipPackage(pathToZip);
var files = zp.GetFiles();

var dir = @"";

foreach (var file in files)
{
    var path = Path.Combine(dir, file.Path);

    // directory creation as per existing gist example

    using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Read))
    using (var stream = file.GetStream())
    {
        stream.CopyTo(fileStream);
        fileStream.Flush();
    }
}

There's also a PackageDownloader class which you can point at a NuGet feed and then handle as per the above.

06 Mar 2013

Hi Paul,

I'm from the NuGet team. Great investigation. Yes, we're well aware of the problem with ZipPackage, and we've fixed it in our 2.3 release with the OptimizedZipPackage.

We also fixed the same issue in some other areas, but we did miss the CryptoHashProvider. Thanks for pointing it out. I'll make sure we get it fixed before 2.3 is released.

@AlexFalkowski: why the sigh? :)