Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement on-disk chunked array #2096

Merged
merged 14 commits into from Jul 24, 2020

Conversation

@davidbrochart
Copy link
Member

@davidbrochart davidbrochart commented Jul 22, 2020

Checklist

  • The title and commit message(s) are descriptive.
  • Small commits made to fix your PR have been squashed to avoid history pollution.
  • Tests have been added for new features or bug fixes.
  • API of new functions and classes are documented.

Description

}

template <class T1, class T2, class S1, class S2>
void remap(T1** file_array, T2& array, S1& in_stream, S2& out_stream)

This comment has been minimized.

@JohanMabille

JohanMabille Jul 22, 2020
Member

Pointers and pointers to pointers should be avoided if not in very low-level code. What prevents from using references?

Besides this function seems to be called in access methods, having a lot of memory deletion and allocation will be a performance issue.

This comment has been minimized.

@davidbrochart

davidbrochart Jul 22, 2020
Author Member

Yes, I can use a reference.
You're right, the current implementation is really not production-ready, it is more like an attempt to have something working and to discuss choices. I will explain some of them.

return m_shape;
}

xfiles_array(): m_file_array(NULL)

This comment has been minimized.

@JohanMabille

JohanMabille Jul 22, 2020
Member

Why is m_file_array a pointer instead of a plain object? If you keep the pointer, you need to carefully implement the copy and move semantics, otherwise you'll have a lot of crashes when using the xfiles_array:

xfiles_array<xdisk_array>> a1, a2;
// ....
a1 = a2;
// => a1.m_file_array is never destroyed before beeing assigned a2.m_array.
// => a2.m_array will be destroyed twice.

This comment has been minimized.

@davidbrochart

davidbrochart Jul 22, 2020
Author Member

Thanks, having a plain object simplifies a lot indeed!

@emmenlau
Copy link
Contributor

@emmenlau emmenlau commented Jul 22, 2020

Guys this is really cool! I'm following the code only on the surface but I understand that file-backed caching will soon be supported? If this is interesting I can also take a look at HDF5 based caching in the future, because the file format supports basically everything that an N-dimensional data structure could need.

On the other hand, xtensor already supports HDF5 as a backend so maybe its overkill to add dedicated support here (again)?

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 22, 2020

file-backed caching will soon be supported?

xdisk_array indeed acts as a file-backed cached array.

On the other hand, xtensor already supports HDF5 as a backend so maybe its overkill to add dedicated support here (again)?

This PR gets us closer to Zarr, I'm not sure about HDF5.

@JohanMabille
Copy link
Member

@JohanMabille JohanMabille commented Jul 22, 2020

There are two orthogonal things here:

  • "chunk array" and "file array": these features are not bound to any file format or spec implementation, they are generic enough to live in xtensor core.
  • the different file formats (HDF5, zarr), which should be implemented either in xtensor-io, or in a dedicated package.

EDIT: replace "disk array" with "file array" according to the comment below.

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 22, 2020

@JohanMabille I think xfiles_array is the one which is generic: it should manage a pool of file-backed arrays (currently the pool only consists of one file array). It doesn't know what these file arrays are, they can live on disk or in the cloud.
xdisk_array is an on-disk file array. In the current PR, I used a CSV file format just to make debugging easy.The xdisk_array is passed a file and an in-memory container for its content (an xarray) from xfiles_array.
I'm not sure this is the right architecture though, any feedback is very welcome.

@JohanMabille
Copy link
Member

@JohanMabille JohanMabille commented Jul 22, 2020

Alright, so maybe the disk array will be replaced by an in-memory container + a serializer passed directly to the file array class. Let's see what's required in the disk array while you complete itsimplementation, and if the class is really required or not.

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 22, 2020

The new test fails with the following error:

unknown file: Failure
C++ exception with description "stod" thrown in the test body.

which happens as soon as a file has to be written. It works fine outside the test environment. Thoughts?


static std::ifstream* m_in_file;
static std::ofstream* m_out_file;
const static xarray<T>* m_array;

This comment has been minimized.

@davidbrochart

davidbrochart Jul 22, 2020
Author Member

I am really not sure about having these static members. The xdisk_reference class is here to catch assignments to elements of m_array through its definition of the operator=(), by making xdisk_reference the type of xdisk_array's reference. But it needs to access m_array and the output file in order to dump to disk, and they cannot be passed through e.g. the constructor since (as I understand it) the copy constructor is used.

This comment has been minimized.

@JohanMabille

JohanMabille Jul 22, 2020
Member

You can explicitly invoke the constructor of xdisk_reference and pass it whatever you want. For instance:

template <class It>
inline reference element(It first, It last)
{
    return reference(m_array->element(first, last), &m_array, &m_in_file, &m_out_file);
}

This comment has been minimized.

@davidbrochart

davidbrochart Jul 22, 2020
Author Member

Awesome!

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 22, 2020

Alright, so maybe the disk array will be replaced by an in-memory container + a serializer passed directly to the file array class.

The issue I see with that is that xfiles_array doesn't know if the array is being assigned, the assignment takes place in xdisk_reference's operator=(). At the same time, this is not great either because there is no "chunk view", only an "element view", which means that as soon as a single element is assigned the whole chunk is saved.

@JohanMabille
Copy link
Member

@JohanMabille JohanMabille commented Jul 22, 2020

Another possibility is to flag the xdisk_array as "dirty" when an element is assigned, and save to disk only upon destruction / unload.

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 22, 2020

Not sure about the failing test:

/home/travis/build/xtensor-stack/xtensor/test/test_xmanipulation.cpp:121: Failure
Expected equality of these values:
  fl
    Which is: { 1, 2, 1, 11, 12, 13, 21, 22, 23, 51, 52, 53, 61, 62, 63, 71, 72, 73 }
  expected
    Which is: { 1, 2, 3, 11, 12, 13, 21, 22, 23, 51, 52, 53, 61, 62, 63, 71, 72, 73 }
@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 22, 2020

Just thinking, currently a xdisk_array doesn't work on its own because it expects an xarray and a file created in xfiles_array, which manages a pool of arrays. But it might be interesting to have an xdisk_array that embeds its own xarray and file, so that it can be used as a persistent array. It would have to implement all the semantic stuff though.

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 23, 2020

Now that xdisk_array has nothing to do with an on-disk array, it should be renamed. It is basically a wrapper around an array notifying when the array is changed. Any idea?
xfiles_array does all the file loading/dumping. It really makes sense only in the context of a chunked array. We should also find a better name.

@JohanMabille
Copy link
Member

@JohanMabille JohanMabille commented Jul 23, 2020

What about the following:

  • xfile_array: a N-D array backed by a file on the disk. It is a full expression (so it can be assigned other expressions) and can be used outside of the context of chunk arrayst. It embeds an in-memory container (the cache), a flag to say if the cache is dirty, and another flag to say if the file is already loaded in memory.. It also provides explicit load / unload methods to open the file and put its content in the in-memory container / write the in-memory container to the file and free its memory. Ideally, it is templated by a serializer / deserializer so it is independent form the file format.

  • xchunk_loading_manager: a container holding xfile_array objects (or any expression providing explicit load / unload methods) managing the load / unload of chunks depending on their size and the maximal amount of memory that a chunk array can use, and which chunk is accessed.

Notice that xfile_array assumes that the file can be entirely loaded into memory, we may need another mechanism for large files (maybe a new xmaped_file_array class that will rely on memory mapping).

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 23, 2020

Thanks for the suggestions. I think it is a better abstraction than what we have right now, where we have the file loading/dumping in xfiles_array (that you now call xchunk_loading_manager), but should really be in xdisk_array (that you now call xfile_array).
A couple of remarks:

  • maybe we should introduce another abstraction in xfile_array, which doesn't have to be an on-disk array, but could be e.g. a cloud-based array. It could be through a template file type, providing that this type accepts common methods like open and close.
  • maybe change xchunk_loading_manager to xchunk_store_manager?
@JohanMabille
Copy link
Member

@JohanMabille JohanMabille commented Jul 23, 2020

For the first point, that could be handled by a io_type template parameter (or io_handler), which implements open / close methods. I agree that this will avoid to reinvent the wheel when we want to support distant files. Therefore, the declaration of xfile_array would look like:

template <class T, class io_handler, class serializer>
class xfile_array;

Totally agree on the second point, xchunk_store_manager is a better name.

@davidbrochart
Copy link
Member Author

@davidbrochart davidbrochart commented Jul 23, 2020

@JohanMabille This last commit renames xfiles_array to xchunk_store_manager, and xdisk_array to xfile_array. It also moves the file-related logic to the new xfile_array.
For now the file abstraction is handled by the SIN and SOUT template parameters (e.g. ifstream and ofstream for disk files):

template <class EC, class SIN, class SOUT>
class xfile_array;

I will look at your suggestion later.
Also, xfile_array is not a full expression yet. And xchunk_store_manager only manages one xfile_object for now, I need to implement a pool of objects.

@JohanMabille
Copy link
Member

@JohanMabille JohanMabille commented Jul 23, 2020

@davidbrochart awesome! I think further improvements (such as making xfile_array a valid tensor expression, or making xchunk_store_manager handling a pool) can be done in dedicated PRs.

I would also gather SIN and SOUT in a single io_handler tempate paremeter that handles both reading and writing.

@JohanMabille JohanMabille changed the title [WIP] Implement on-disk chunked array Implement on-disk chunked array Jul 24, 2020
@JohanMabille JohanMabille merged commit d44d8ae into xtensor-stack:master Jul 24, 2020
12 checks passed
12 checks passed
Travis CI - Pull Request Build Passed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
xtensor-stack.xtensor Build #20200723.6 succeeded
Details
xtensor-stack.xtensor (Linux clang_4) Linux clang_4 succeeded
Details
xtensor-stack.xtensor (Linux clang_5) Linux clang_5 succeeded
Details
xtensor-stack.xtensor (Linux clang_6) Linux clang_6 succeeded
Details
xtensor-stack.xtensor (Linux clang_7) Linux clang_7 succeeded
Details
xtensor-stack.xtensor (Linux clang_8) Linux clang_8 succeeded
Details
xtensor-stack.xtensor (Linux clang_9) Linux clang_9 succeeded
Details
xtensor-stack.xtensor (OSX macOS_10_14) OSX macOS_10_14 succeeded
Details
xtensor-stack.xtensor (OSX macOS_10_15) OSX macOS_10_15 succeeded
Details
xtensor-stack.xtensor (Windows_clangcl) Windows_clangcl succeeded
Details
@davidbrochart davidbrochart deleted the davidbrochart:xchunked_array branch Jul 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.