Improving NuGet package extraction performance
A few weeks ago I received the following bug report for Octopus from a customer:
We have recently rolled out Octopus deploy [..] and we are having severe memory leak problems on the tentacles. After only 1 deploy one of the tentacles had used 1.2GB ram (Yes gigabytes). The package was quite large - circa - 180MB - but there must be a serious leak somewhere. Other projects are having similar issues - up-to about 600 MB memory usage.
I put together a test harness and created a NuGet package containing a pair of 90MB files. The harness simply used the NuGet.Core library's PackageManager class to install the package to a local folder. 55 seconds later, and having used 1.17GB of memory (as measured by GC.GetTotalMemory(false)), NuGet had finished extracting the package.
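A minimal harness along these lines is enough to reproduce the behaviour (the paths, package id and version below are placeholders, and the NuGet.Core overloads shown are indicative rather than exact):

// Placeholder paths, package id and version; the NuGet.Core calls shown
// are indicative of the API rather than an exact transcript of the harness.
using System;
using NuGet;

class Harness
{
    static void Main()
    {
        var repository = PackageRepositoryFactory.Default.CreateRepository(@"C:\PackageSource");
        var packageManager = new PackageManager(repository, @"C:\Temp\Install");

        packageManager.InstallPackage("Acme.Web", new SemanticVersion("1.0.0"));

        Console.WriteLine("Managed memory after install: {0:n0} bytes", GC.GetTotalMemory(false));
    }
}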
The good news is that, given time, the memory usage returned to normal, so the GC was able to free the memory (though much of it stayed allocated to the process just in case). The memory wasn't being leaked; it was just being wasted.
Weirdly, the NuGet API is designed around streams, and the System.IO.Packaging classes which NuGet depends on are also designed around streams. Looking into the implementation, the problem seemed to be down to NuGet.Core's ZipPackage class.
When you ask a NuGet.Core ZipPackage to list the files in the package (GetFiles()), the implementation looks like this:
private List<IPackageFile> GetFilesNoCache()
{
    using (Stream stream = _streamFactory())
    {
        Package package = Package.Open(stream);

        return (from part in package.GetParts()
                where IsPackageFile(part)
                select (IPackageFile)new ZipPackageFile(part)).ToList();
    }
}
The ZipPackageFile constructor is being passed a PackagePart, which exposes a stream. But what happens with that stream?
public ZipPackageFile(IPackageFile file)
    : this(file.Path, file.GetStream().ToStreamFactory())
{
}
The ToStreamFactory call looks innocuous, but here's the implementation:
public static Func<Stream> ToStreamFactory(this Stream stream)
{
    byte[] buffer;

    using (var ms = new MemoryStream())
    {
        try
        {
            stream.CopyTo(ms);
            buffer = ms.ToArray();
        }
        finally
        {
            stream.Close();
        }
    }

    return () => new MemoryStream(buffer);
}
That's right - it's reading the entire stream into an array, and then returning a new MemoryStream populated from that array any time someone requests the contents of the file.
The reason for this appears to be that while System.IO.Packaging is designed to use streams which need to be disposed, the NuGet API and classes like ZipPackage are intended to be passed around without needing to be disposed. So instead of opening and closing the .nupkg file to read file contents when required, it copies the contents to memory.
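In concrete terms, by the time a caller has enumerated the package, every file's contents are already sitting in managed memory (the package path below is a placeholder):

// Placeholder path. By the time GetFiles() returns, every file in the package
// has been buffered into a byte[] via ToStreamFactory.
var package = new ZipPackage(@"C:\Packages\Acme.Web.1.0.0.nupkg");

foreach (var file in package.GetFiles())
{
    // Each call hands back a new MemoryStream wrapping that same buffer.
    using (var stream = file.GetStream())
    {
        // ... copy to disk, hash, etc.
    }
}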
This isn't a problem when your packages are only a few MB, but it's pretty harmful when you're using packages to distribute entire applications.
After spending half a day trying to patch NuGet.Core to avoid reading the files into memory, in the hope of sending a pull request, I found that other people had already tried and had their changes rejected - it seems this is a problem the NuGet team plans to solve in an upcoming release.
Instead, I gave up and wrote a package extraction function to suit my needs. The following code extracts the same package in 10 seconds using only 6MB of memory:
using System;
using System.IO;
using System.IO.Packaging;   // requires a reference to WindowsBase
using System.Linq;

public class LightweightPackageInstaller
{
    // Parts of the package that are NuGet metadata rather than content.
    private static readonly string[] ExcludePaths = new[] { "_rels", "package\\services\\metadata" };

    public void Install(string packageFile, string directory)
    {
        using (var package = Package.Open(packageFile, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            var files = (from part in package.GetParts()
                         where IsPackageFile(part)
                         select part).ToList();

            foreach (var part in files)
            {
                Console.WriteLine(" " + part.Uri);

                var path = UriUtility.GetPath(part.Uri);
                path = Path.Combine(directory, path);

                var parent = Path.GetDirectoryName(path);
                if (parent != null && !Directory.Exists(parent))
                {
                    Directory.CreateDirectory(parent);
                }

                using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Read))
                using (var stream = part.GetStream())
                {
                    // Stream straight from the zip part to disk; nothing is buffered in memory.
                    stream.CopyTo(fileStream);
                }
            }
        }
    }

    internal static bool IsPackageFile(PackagePart part)
    {
        // UriUtility and PackageUtility are small path helpers borrowed from NuGet.Core.
        var path = UriUtility.GetPath(part.Uri);
        return !ExcludePaths.Any(p => path.StartsWith(p, StringComparison.OrdinalIgnoreCase)) &&
               !PackageUtility.IsManifest(path);
    }
}
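Using it is a one-liner (the paths here are placeholders):

var installer = new LightweightPackageInstaller();
installer.Install(@"C:\Packages\Acme.Web.1.0.0.nupkg", @"C:\Applications\Acme.Web");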
There are other parts of NuGet.Core that break down when large files are used. For example, some NuGet repositories use SHA hashes so that consumers can verify a package download. This is implemented in NuGet.Core's CryptoHashProvider. The interface it implements?
public interface IHashProvider
{
    byte[] CalculateHash(Stream stream);

    byte[] CalculateHash(byte[] data);

    bool VerifyHash(byte[] data, byte[] hash);
}
Again, instead of streaming the data through the hash algorithm (even though the underlying crypto classes accept streams), the implementation reads the entire file (180MB in this case) into an array just to hash it.
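For comparison, hashing a stream without buffering it is straightforward with the framework's HashAlgorithm.ComputeHash(Stream) overload - a minimal sketch, using SHA-512 purely for illustration:

using System.IO;
using System.Security.Cryptography;

static class StreamingHash
{
    // Hash a file without ever loading it fully into memory.
    public static byte[] CalculateHash(string path)
    {
        using (var algorithm = SHA512.Create())
        using (var stream = File.OpenRead(path))
        {
            // ComputeHash(Stream) reads the input in small chunks, so memory
            // usage stays flat no matter how large the package is.
            return algorithm.ComputeHash(stream);
        }
    }
}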
Here's hoping these problems are fixed soon. For now, at least, Octopus has a workaround that's only a few lines of code and performs much faster. As of Octopus 1.3.5, you'll be able to install large packages without dedicating most of your memory to it.