Discussion:
[lxc-devel] device namespaces
Eric W. Biederman
2014-09-24 05:10:01 UTC
(Please pardon multiple emails, artifact of merging all separate conversations)
Thanks for your feedback!
Letting the kernel know about what devices a container could access (based on
device cgroups) and having devtmpfs in the kernel create device nodes for a
container that map to corresponding CUSE nodes is what I thought of. For
example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual framebuffer
(based on real fb0 SCREENINFO properties) for this process provided permissions
allow this operation. To view the framebuffer, the CUSE based virtual device
would talk to the actual hardware. Since namespaces would have a different view of
the underlying devices, "sysfs" has to be made aware of this as well.
Please let me know your inputs. Thanks again!
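As a rough illustration of the "29:0" example above: a "major:minor" string can be turned into a device number with the Python standard library (character major 29 is the Linux framebuffer major). Note that /proc/<pid>/devices is the interface proposed in this message, not an existing kernel file, and parse_devnum is a hypothetical helper name used only for this sketch.

```python
import os


def parse_devnum(spec):
    """Parse a 'major:minor' string such as '29:0' into a pair of ints."""
    major_text, minor_text = spec.split(":")
    major, minor = int(major_text), int(minor_text)
    if major < 0 or minor < 0:
        raise ValueError("device numbers must be non-negative")
    return major, minor


# 29:0 is the number conventionally used by /dev/fb0.
major, minor = parse_devnum("29:0")
dev = os.makedev(major, minor)

# Round-trip through the combined dev_t value.
print(os.major(dev), os.minor(dev))
```

Any kernel-side handling of such a request (permission checks, creating the virtual framebuffer) is outside what this sketch shows.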
The solution hugely depends on what you are trying to do with it.

The situation today is that device nodes are slowly fading out. In
another 20 years Linux may not have any device nodes at all.

Therefore the question becomes what are you trying to support.

If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
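A minimal sketch of that bind-mount filtering, assuming a container root at the made-up path /var/lib/ctr/rootfs and hypothetical helper names: only the device nodes that are explicitly listed get bind-mounted into the container's /dev. The mount(2) call itself needs CAP_SYS_ADMIN and an existing target file, so the example only computes the plan and leaves the actual mount to the caller.

```python
import ctypes
import ctypes.util
import os

MS_BIND = 4096  # value of MS_BIND in <sys/mount.h>


def plan_device_binds(host_nodes, container_dev):
    """Pair each allowed host device node with its path under the container /dev."""
    return [
        (node, os.path.join(container_dev, os.path.basename(node)))
        for node in host_nodes
    ]


def bind_mount(source, target):
    """Bind-mount source onto target via mount(2); requires privileges."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mount.argtypes = (
        ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
        ctypes.c_ulong, ctypes.c_void_p,
    )
    if libc.mount(source.encode(), target.encode(), None, MS_BIND, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)


# Only these two nodes would be visible inside the container.
plan = plan_device_binds(["/dev/null", "/dev/fb0"], "/var/lib/ctr/rootfs/dev")
for source, target in plan:
    print(source, "->", target)
```

A privileged manager would create each empty target file and then call bind_mount(source, target) for every pair in the plan.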

If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.

There are a few cases where it is desirable to emulate what devpts
does to allow arbitrary users to create virtual devices in the
kernel. Loop devices in particular.

Ultimately given the existence of device hotplug I don't see any call
for being able to create device nodes with well known device numbers
(fundamentally what a device namespace would be about).

The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire, I think it is entirely reasonable to add support for multiplexing
to the kernel, device type by device type, or potentially just to use
fuse or cuse to implement your multiplexer in userspace, but that has
the potential to be unusably slow.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Serge Hallyn
2014-09-24 16:40:03 UTC
Isolation is provided by the devices cgroup. You want something more
than isolation.
My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)? Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility. I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.
Eric W. Biederman
2014-09-24 17:50:02 UTC
Post by Serge Hallyn
Isolation is provided by the devices cgroup. You want something more
than isolation.
My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)?
Namespaces fundamentally provide for using the same ``global'' name
in different contexts. This allows them to be used for isolation
and process migration (because you can take the same name from
machine to machine).

Unless someone cares about device numbers at a namespace level
the work is done.

The mount namespace exists to deal with file names.
The devices cgroup will limit which devices you can access (although
I can't ever imagine a case where the mount namespace would be
insufficient).
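For reference, a small sketch of how a devices cgroup rule is usually expressed, assuming the cgroup v1 devices controller, where a line of the form "type major:minor access" is written to a group's devices.allow file. The helper names and the temporary directory standing in for a real cgroup directory are assumptions made for illustration.

```python
import os
import tempfile


def device_rule(kind, major, minor, access):
    """Format a devices-cgroup rule; None for major or minor means '*'."""
    if kind not in ("a", "b", "c"):
        raise ValueError("kind must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a non-empty subset of 'rwm'")

    def fmt(number):
        return "*" if number is None else str(int(number))

    return f"{kind} {fmt(major)}:{fmt(minor)} {access}"


def allow(cgroup_dir, rule):
    """Write a rule to devices.allow in the given cgroup directory."""
    with open(os.path.join(cgroup_dir, "devices.allow"), "w") as handle:
        handle.write(rule)


# Read/write access to the first framebuffer, read-only for all input nodes.
rule = device_rule("c", 29, 0, "rw")
wildcard = device_rule("c", 13, None, "r")

# A temporary directory stands in for a real cgroup directory here.
with tempfile.TemporaryDirectory() as fake_cgroup:
    allow(fake_cgroup, rule)
    with open(os.path.join(fake_cgroup, "devices.allow")) as handle:
        written = handle.read()
```

On a real system the directory would be the container's group under the devices hierarchy, and writing there requires appropriate privileges.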
Post by Serge Hallyn
Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility.
I don't see how. If they are mounts that propagate into the container
and are controlled from outside you can do whatever you want. (I am
imagining device by device bind mounts here). It should be trivial
to have a directory tree that propagates into a container and works.
Post by Serge Hallyn
I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.
Is the problem you actually care about multiplexing devices?

I think there is quite a bit of room to talk about how to safely
and effectively use devices in containers. So let's make that the
discussion. No one actually wants device number namespaces and talking
about them only muddies the waters.

Eric
Riya Khanna
2014-09-24 19:40:02 UTC
Post by Eric W. Biederman
I don't see how. If they are mounts that propagate into the container
and are controlled from outside you can do whatever you want. (I am
imagining device by device bind mounts here). It should be trivial
to have a directory tree that propagates into a container and works.
Device-by-device bind mounts can grant/revoke access to real individual devices as and when needed. However, revoking the access to real devices could break the applications if there’s no transparent mechanism to back up the propagated (but now revoked) device bind mounts that could fool the apps into believing that they are working with real devices. Frame buffer is one such example, where safe multiplexing could be applied.
Post by Eric W. Biederman
Is the problem you actually care about multiplexing devices?
The problem I care about is access to real devices, such as input, fb, loop, etc. as and when needed, thereby having native I/O performance - either through secure multiplexing or exclusive ownership, whatever makes sense according to the device type.
Post by Eric W. Biederman
I think there is quite a bit of room to talk about how to safely
and effectively use devices in containers. So let's make that the
discussion. No one actually wants device number namespaces and talking
about them only muddies the waters.
I cannot agree more. Let’s restrict the discussion to it.

Thanks,
Riya
Eric W. Biederman
2014-09-24 22:40:01 UTC
Post by Riya Khanna
I guess policy-based multiplexing (or exclusive ownership) is the
usage. What kind of devices (loop, fb, etc.) this is needed for
depends on the usage. If there are multiple FBs, then each container
could potentially own one. One may want to provide exclusive ownership
of input devices to one container at a time to avoid information
leakage. Like we saw at LPC last year, this applies to sensors (gps,
accelerometer, etc.) on mobile devices as well.
Allowing multiplexing of those devices seems reasonable.

Where the discussion ran into problems last time was that people did not
want to use any of the existing Linux solutions for multiplexing those
kinds of things and wanted to invent something new.

Inventing something new is fine if the extra code maintenance can be
justified, or if the invention is just a better solution for all users and
new code can just start using that in general.

The old solution to your problem of multiplexing devices is
allocating a virtual terminal and sending signals to coordinate
cooperatively sharing those resources.
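The cooperative virtual terminal handshake described above can be sketched roughly as follows, assuming the Linux VT ioctls from <linux/vt.h> (VT_SETMODE with VT_PROCESS, and VT_RELDISP) and SIGUSR1/SIGUSR2 as the release and acquire signals. The function names are made up for illustration, and the code that saves and restores device state is left as comments.

```python
import fcntl
import signal
import struct

# Constants from <linux/vt.h>.
VT_SETMODE = 0x5602
VT_RELDISP = 0x5605
VT_PROCESS = 0x01
VT_ACKACQ = 0x02


def pack_vt_mode(relsig, acqsig):
    """Pack struct vt_mode {char mode, waitv; short relsig, acqsig, frsig;}."""
    return struct.pack("bbhhh", VT_PROCESS, 0, relsig, acqsig, 0)


def take_over_switching(tty_fd):
    """Ask the kernel to signal this process instead of switching VTs itself."""

    def on_release(signum, frame):
        # Put the device back into a known state here, then permit the switch.
        fcntl.ioctl(tty_fd, VT_RELDISP, 1)

    def on_acquire(signum, frame):
        # Acknowledge that this VT is active again, then reprogram the device.
        fcntl.ioctl(tty_fd, VT_RELDISP, VT_ACKACQ)

    signal.signal(signal.SIGUSR1, on_release)
    signal.signal(signal.SIGUSR2, on_acquire)
    mode = pack_vt_mode(signal.SIGUSR1, signal.SIGUSR2)
    fcntl.ioctl(tty_fd, VT_SETMODE, mode)
```

take_over_switching expects a file descriptor for a virtual terminal (for example an opened /dev/ttyN) that the calling process owns.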

If you want some sort of preemptive multitasking, that requires
a bit more effort, and work in the device abstractions.
You may be able to share concepts and library code but I don't believe
there is something you can just paint on top of devices and make it
happen. Certainly in the bad old days of X terminal switching the
cooperation was necessary so that when a video card was yanked from an
application writing directly to that video card the application would
restore the video card to a known state so the next application
would have a chance of making sense of it. Furthermore most devices
are not safe to let unprivileged users access their control registers
directly.

All of which boils down to the simple fact that for each type of device you
would like to share it is necessary to update the subsystem to support
arbitrary numbers of virtual devices that you can talk to.

The macvlan driver in the networking stack is a rough example of what I
expect you would like. Something that takes one real physical device
and turns it into N virtual devices each of which runs at effectively
full speed. Along with some kind of new master interface for
controlling when the multiplexing takes place.
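To make the macvlan comparison concrete, here is a small sketch that builds the iproute2 commands for creating N macvlan interfaces on top of one physical interface. The parent name eth0 and the mvN naming scheme are placeholders, and the commands are only printed, since running them requires root.

```python
def macvlan_commands(parent, count, mode="bridge"):
    """Build one `ip link add` argv per virtual interface on top of parent."""
    return [
        ["ip", "link", "add", "link", parent, "name", f"mv{index}",
         "type", "macvlan", "mode", mode]
        for index in range(count)
    ]


# Two virtual interfaces sharing the same physical NIC.
for argv in macvlan_commands("eth0", 2):
    print(" ".join(argv))
```

Each resulting interface could then be moved into a container's network namespace, which is the part that has no direct equivalent yet for the other device classes discussed here.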

I think we do most of this in software today and arguably for a lot of
devices the overhead is small enough that a software solution is fine.
So perhaps all you need is a fuse interface to the existing software
multiplexers so that weird legacy code can be made to run.

Now I suspect part of doing this right will be getting proper video
drivers on Android. I assume that Android is the platform you care
about.

Eric


riya khanna
2014-09-25 15:50:02 UTC
Is there a plan or work-in-progress to add namespace tags to other
classes in sysfs similar to net? Does it make sense to add namespace
tags to kobjects?

-Riya
Post by Eric W. Biederman
I think we do most of this in software today and arguably for a lot of
devices the overhead is small enough that a software solution is fine.
So perhaps all you need is a fuse interface to the existing software
multiplexers so that weird legacy code can be made to run.
What kind of existing multiplexers could be used? Is there one for fb? We
have evdev abstractions for input in place already.
Eric W. Biederman
2014-09-25 18:20:01 UTC
Post by riya khanna
Is there a plan or work-in-progress to add namespace tags to other
classes in sysfs similar to net? Does it make sense to add namespace
tags to kobjects?
Currently there is a general nack from gregkh on such work.

Given that sysfs is almost never a fast path I suspect it makes most
sense to filter sysfs in some way (aka bind mounts or fuse) and present
the results to the container.
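One way to picture that kind of userspace filtering, as a sketch only: given a directory laid out like /sys/dev/char (entries named "major:minor") and an allow-list for the container, compute which entries would be presented. The helper name and the allow-list are assumptions, and a temporary directory stands in for sysfs so the example runs anywhere.

```python
import os
import tempfile


def visible_devices(sys_dev_char, allowed):
    """Return the 'major:minor' entries a container should be shown."""
    return sorted(name for name in os.listdir(sys_dev_char) if name in allowed)


# Fake a tiny /sys/dev/char with three devices, then filter it.
with tempfile.TemporaryDirectory() as fake_sysfs:
    for name in ("1:3", "13:64", "29:0"):
        open(os.path.join(fake_sysfs, name), "w").close()
    shown = visible_devices(fake_sysfs, {"1:3", "29:0"})

print(shown)
```

A bind-mount or fuse based presenter would then expose only the entries in the returned list inside the container.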

Once this is something that we are using a lot, have
demonstrated the usefulness of, and a kernel level
solution appears better, it would be worth reopening the discussion.

Eric
Eric W. Biederman
2014-09-25 18:30:02 UTC
What kind of existing multiplexers could be used? Is there one for fb? We have
evdev abstractions for input in place already.
We have X and Wayland/Weston and pulse audio and doubtless more that I
am not aware of.

For video a lot of work is going into compositing and handling
multiple contexts in the hardware so there may already be support in the
kernel.

Fundamentally these are all pieces of hardware whose information we
allow multiple userspace applications to access or modify.
Therefore there is existing multiplexing somewhere.

I won't claim all of the existing multiplexing methods are good and
should be used as is, but they definitely should be used as a starting
point.


From another perspective there is how kvm tackles this today. If you
really want to emulate the hardware and make it appear that your
instance of userspace has direct hardware access, building upon the
infrastructure that is used for kvm may be worth exploring.

Eric


Riya Khanna
2014-09-24 19:10:02 UTC
I guess policy-based multiplexing (or exclusive ownership) is the usage. What kind of devices (loop, fb, etc.) this is needed for depends on the usage. If there are multiple FBs, then each container could potentially own one. One may want to provide exclusive ownership of input devices to one container at a time to avoid information leakage. Like we saw at LPC last year, this applies to sensors (gps, accelerometer, etc.) on mobile devices as well.
Serge Hallyn
2014-09-24 16:40:03 UTC
Post by Eric W. Biederman
The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire, I think it is entirely reasonable to add support for multiplexing
to the kernel, device type by device type, or potentially just to use
fuse or cuse to implement your multiplexer in userspace, but that has
the potential to be unusably slow.
It would be helpful to have a list of devices that may want that
multiplexing. Is it really just loop and graphics drivers?