An integrated photonics computing system implements a residue number system (RNS) to achieve orders of magnitude improvements in computational speed per watt over the current state-of-the-art. RNS and nanophotonics have a natural affinity where most operations can be achieved as spatial routing using electrically controlled directional coupler switches, thereby giving rise to an innovative processing-in-network (PIN) paradigm. The system provides a path for attojoule-per-bit efficient and fast electro-optic switching devices, and uses them to develop optical compute engines based on residue arithmetic leading to multi-purpose nanophotonic computing.
Drawings: Figures 1-12 (Figure 1: PRIOR ART). Recoverable drawing labels include "one hot encoding", "Photonic", and "Electronic", together with per-wavelength examples \( \lambda_1: |1+4|_5 = 0 \) (green), \( \lambda_2: |1+4|_5 = 0 \) (blue), and \( \lambda_n: |0+4|_5 = 4 \) (purple).

RESIDUE ARITHMETIC NANOPHOTONIC SYSTEM

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/634,658, filed Feb. 23, 2018, the entire contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to an integrated photonics computing system (from device to architecture) based on the residue number system (RNS).

Background of the Related Art

Due to the end of Moore's law and Dennard scaling, feature-size reduction and higher clock speeds are ceasing to be the source of higher computer performance. Therefore, it is of paramount interest to explore alternative technologies and architectures for this post-Moore's-law era of computing, in order to maintain the U.S. competitive edge and U.S. Air Force superiority in all tasks that require computing. The annual R&D priorities memorandum issued by the administration in July 2018 identifies strategic computing as one of the priorities for U.S. national security.

SUMMARY OF THE INVENTION

The goal of the present invention is to develop an integrated photonics computing system (from device to architecture) based on the residue number system (RNS) to achieve orders-of-magnitude improvements in computational speed per watt over the current state of the art. Residue arithmetic is of particular interest because it represents a large number as a set of smaller numbers, which can be processed individually and in parallel. Furthermore, RNS and nanophotonics have a natural affinity: most operations can be achieved as spatial routing using electrically controlled directional couplers ('switches'), giving rise to an innovative processing-in-network (PIN) paradigm. The invention provides a path to attojoule-per-bit efficient, fast electro-optic switching devices, and uses them to develop optical compute engines based on residue arithmetic, leading to multi-purpose nanophotonic computing.

The invention takes a vertical approach that leverages a proven record in heterogeneous integrated photonics and light-matter-interaction enhancement techniques, combined with novel electro-optic hybrid circuits, computer architecture, and high-performance architectures, to enable synergistic device-to-architecture co-design. The resulting compute engines feature reduced complexity and processing-in-network (PIN) computing schemes, which minimize overheads. Figure-of-merit (Speed/Energy-Footprint) estimates surpass those of electronic counterparts by orders of magnitude.

These and other objects of the invention, as well as many of the intended advantages thereof, will become more readily apparent when reference is made to the following description, taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows an example of a conventional 2x2 photonic switch that can be utilized with the invention;
FIG. 2 shows an illustrative non-limiting example of how the system scales to a modulo M (M is odd and larger than 3) system;
FIG. 3 is an example of the invention having a modulo-5 system with 2x2 switches for addition;
FIG. 4 is an example of the invention having a modulo-5 system with 2x2 switches for multiplication;
FIG. 5 shows a basic modulo adder;
FIG. 6 shows a modulo-7 adder for the calculation of 5+2;
FIG. 7 shows a multi-core programmable RNS computing/switching array;
FIG. 8 shows the mapping of a reduction operation SUM onto the RNS array;
FIG. 9 is a system to implement RNS using wavelength division multiplexing;
FIG. 10 shows a processing device for use with the switching system of the present invention;
FIG. 11 shows a barrier operation;
FIG. 12 shows a discrete-time FIR filter of order N;
FIG. 13 shows the FIR filter for use with the present invention; and
FIG. 14 shows a tunable delay line made up of ring filters.

DETAILED DESCRIPTION OF THE INVENTION

In describing the illustrative, non-limiting embodiments of the invention illustrated in the drawings, specific terminology will be resorted to for the sake of clarity. However, the invention is not intended to be limited to the specific terms so selected, and it is to be understood that each specific term includes all technical equivalents that operate in similar manner to accomplish a similar purpose. Several embodiments of the invention are described for illustrative purposes, it being understood that the invention may be embodied in other forms not specifically shown in the drawings.

Following the established technology roadmap for electronic devices in interconnected manycore chips, the power consumed by communication relative to computation continues to grow dramatically, and the available bandwidth per compute operation will continue to drop. There are a variety of disruptive routes to transformational computing concepts: i) at the device and technology level, by switching from electronics to optics, the bosonic character of photons can be used for massively parallel data routing; and ii) at the chip level, a departure from the memory-centric, data-movement-hungry standard von Neumann model is inevitable.

With classical (i.e., diffraction-limited) integrated photonics plagued by sizable footprints and power inefficiencies, optical logic appears to be at a disadvantage relative to electronics. However, synergistic opportunities arise from intertwining computing and routing such that many operations can be executed without fetching and storing large amounts of intermediate data, leading to new processing-in-network (PIN) concepts for post-Moore's-law processors. That is, mapping mathematical arithmetic onto a network (here, an optical network) of controllable optical switches enables a new class of multi-purpose computer that harvests the extreme parallelism and low power of integrated photonics. The key is to execute this vision at compact length scales, high speed, and attojoule-per-bit (aJ/bit) power budgets.

The invention utilizes a nanophotonic 2x2 switching device as a basic building block (for both processing and routing) in constructing a network of optical processors on the chip, thereby achieving unprecedentedly high operations per watt. The 2x2 switch is based on a voltage-controlled directional coupler embodying the following key design insights: i) heterogeneous integration of materials with unity-strong optical index modulation (e.g., ITO), ii) allowing for micrometer-compact and attojoule-efficient switching, while iii) utilizing silicon photonics as a platform. The invention can utilize any suitable switching device, such as, for example, those shown in U.S. Patent No. 9,529,158 and U.S. Published Patent Application No. 2018/0246391, the entire contents of which are hereby incorporated by reference. One such switching device is shown, for example, in FIG. 1.

By integrating the nanophotonic switch into a crossbar fabric, processors can be built as extended residue arithmetic engines. This further enables intelligent crossbar architectures that interconnect multiple residue processors, allowing many parallel computations using residue arithmetic and easily leveraged mathematical operations. For instance, collective operations such as reduction, as well as barrier synchronization, can be performed as part of the basic switching functionality, which allows for energy and buffering savings at the network level. This tight coupling of routing and computation leads to a) high performance/cost functions, b) new paradigms for mapping an algorithm onto hardware, and c) novel computer designs deviating from von Neumann.

The integrated photonic switches are used as a basic building block to provide a general-purpose processor using residue arithmetic, complemented by other principles as necessary, and to provide a chip-wide intelligent nanophotonic crossbar and various networks that connect all processors and enable the mapping of some collective operations onto the crossbar. The invention can also include architectural-level prototyping using FPGAs.

As such, the present invention offers transformative insights by exploring transparent conductive oxides (TCOs) for strong refractive index modulation via strongly enhanced light-matter interactions. It also provides attojoule-per-bit efficient and GHz-fast optical switching devices. The compact 2x2 switches form the basic building blocks for optical residue arithmetic functions. The devices are cascadable and yield a compute-performance figure of merit, (1/Latency)/(Energy × Footprint) in GHz/(µJ·cm²), that is significantly higher than that of electronic counterparts. A mode-eliminating switch based on three SOI waveguides can use nanometer-scale metal heaters for switching control, fabricated with a self-aligned process. Alternatively, a TCO with high index tunability (e.g., indium tin oxide) can provide switch tuning.

The invention enables a novel approach to the design and evaluation of an entire class of optical compute engines based on residue arithmetic, leading to multi-purpose computing. It also enables massively parallel and in-network computing designs, creating a path away from the problematic von Neumann architecture. It further provides co-design principles that relate the device technology to the switch, the network architecture, and the routing algorithm and methodology. The performance and accuracy of the invention can be emulated and evaluated using well-accepted community benchmarks, and implementation insights can be gained through FPGA prototyping. The invention provides rapid and agile prototyping, with insights that enable advanced manufacturing on a silicon photonics platform. It also draws on the collective, synergistic experience of the PIs, who are well established in their fields, to explore innovative nanophotonic computing paradigms.

Atto-Joule Nanophotonics and Electro-Optic Switching

where \( f_{pe} = 1/(2\pi\nu_{pe}) \) and \( f_{pc} = 1/(2\pi(R_c + R_p)C) \). Here \( R_c \) is the modulator series resistance, \( R_p \) is the driver impedance, and \( C \) is the junction capacitance, \( C = \epsilon_0 \epsilon_r (w/h) \), where \( h \) is the thickness of the device volume, \( \epsilon_0 \) is the vacuum permittivity, and \( \epsilon_r \) is the relative permittivity of the photonic material. Eqn. 3 indicates that the modulation bandwidth is limited by the Q factor. To compare electro-optic modulator (EOM) energy efficiency and modulation speed, we configure EOM cavities enhanced by a ring resonator, a Fabry-Perot cavity, and a plasmonic particle cavity, respectively (FIG. 2) [LIU16]. This comparison shows that it is possible to design (sub)micrometer-short electro-optic switching and modulation devices operating at 10-100's of aJ/bit and approaching ps timescales. Efficient 2x2 switching elements serve as building blocks for residue arithmetic in optical general-purpose computing.
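
For a feel for the RC term alone, the short Python sketch below evaluates \( f_{pc} = 1/(2\pi(R_c + R_p)C) \) for two operating points. The total resistances reuse the 50-500 Ω range quoted for the switch, while the attofarad-scale capacitances are hypothetical values chosen for illustration; the photon-lifetime term \( f_{pe} \) and the full Eqn. 3 are not modeled here.

```python
import math

def rc_cutoff_hz(r_c_ohm: float, r_p_ohm: float, c_farad: float) -> float:
    """RC-limited cutoff f_pc = 1 / (2*pi*(R_c + R_p)*C), in hertz."""
    return 1.0 / (2.0 * math.pi * (r_c_ohm + r_p_ohm) * c_farad)

# Hypothetical operating points (R_c, R_p, C); the capacitances are assumed values.
for r_c, r_p, c in [(25.0, 25.0, 20e-18), (450.0, 50.0, 100e-18)]:
    f = rc_cutoff_hz(r_c, r_p, c)
    print(f"R = {r_c + r_p:.0f} ohm, C = {c * 1e18:.0f} aF -> f_pc = {f / 1e12:.2f} THz")
```

For capacitances this small, the RC limit lands in the THz range, so under these assumptions it would not be the limiting factor.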

Switchable Materials: Transparent Conductive Oxides (TCO)

[0032] A key design choice for EO active devices is the material whose refractive index is actively modulated. A promising mechanism-material combination is free-carrier dispersive index tuning in indium tin oxide (ITO) or aluminum zinc oxide (AZO). Both belong to the family of transparent conducting oxides (TCOs), which are traditionally deployed in the solar industry as low-absorption electrical contacts. ITO can alter its refractive index significantly upon charge accumulation in MOS-like structures in the near-IR frequency range. Here we incorporate ITO into silicon photonic 2x2 switches and exploit its dispersive index tuning. Note that the resistive characteristics of ITO depend on both the oxygen concentration during deposition and the activation of the tin doping (e.g., via temperature).

Nanophotonic 2x2 Switch Operating Principles

[0033] The elemental building block of the general-purpose computing engine based on residue arithmetic pursued in this effort is a nonlinear active switching element, which relies on altering the refractive index of a nanometer-thin TCO layer sandwiched in a hybrid plasmon polariton mode. The device enables electrostatic control of the index and picosecond-fast operation due to a small capacitance and a short cavity lifetime (Table 1). One design is a 3-waveguide directional coupler in which an optical signal is switched from the input BAR waveguide (silicon-on-insulator, SOI) to the CROSS port. The coupling length \( L_c \) of the coupler is

\[
L_c = \left( \frac{\lambda}{(TM_1 + TM_2) - 2TM_2} \right) = \frac{\lambda}{2\Delta n_{eff}}. \tag{4}
\]

where \( \Delta n_{eff} \) is the bias-dependent index difference between the TM waveguide modes inside the island section of the device. The dramatic ITO index shift, along with the strong light-matter interaction of the plasmonic hybrid mode, enables very efficient modulation of the supermode; thus the single 2x2 switching element is only ~5 µm long, resulting in capacitances of 10-100's of aF and hence operation in the deep sub-fJ/bit range. The extinction ratio (ER) and insertion loss (IL) are both measured as power ratios between the two output states, showing higher performance for the CROSS state (Table 1).
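
To make Eq. 4 concrete, the sketch below evaluates \( L_c = \lambda/(2\Delta n_{eff}) \) at the 1.55 µm operating wavelength for a few index differences. The \( \Delta n_{eff} \) values are illustrative assumptions, not measured data; under this formula an index difference of about 0.155 corresponds to a coupling length of about 5 µm, the scale of the single switching element described above.

```python
def coupling_length_um(wavelength_um: float, delta_n_eff: float) -> float:
    """Coupling length L_c = lambda / (2 * delta_n_eff), in the units of wavelength_um."""
    if delta_n_eff <= 0:
        raise ValueError("delta_n_eff must be positive")
    return wavelength_um / (2.0 * delta_n_eff)

# Hypothetical index differences at the 1.55 um operating wavelength.
for dn in (0.05, 0.155, 0.3):
    print(f"delta_n_eff = {dn:.3f} -> L_c = {coupling_length_um(1.55, dn):.2f} um")
```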

[0034] The BAR state is lossier because the ITO middle waveguide ('island') is biased to become quasi-metallic, making the island reflective. However, a small portion of the light still interacts with the island and suffers an optical attenuation of about 2 dB per switch, compared with about 1 dB for the CROSS state. Table 1 shows quantitative performance estimates for the compact plasmonic EO switch operated at a wavelength of 1.55 µm. The gate oxide thickness varies from 5 to 25 nm. The energy per bit is calculated as \( E/\text{bit} = \tfrac{1}{2} C V^2 \), where \( C \) is the device capacitance and \( V \) is the driving voltage, with \( \Delta V_{bias} \) = 1-2 V for ITO and a resistance of 50-500 Ω. The response time of the switch is expected to be fast, mainly due to the low electrical capacitance and the low-quality-factor cavity (i.e., no cavity is deployed). While the mobility of ITO is usually low, this carrier-based switching effect is not limited by the mobility: the formation of the accumulation layer is governed by a time of flight, which is sub-ps, at about one-third of the Fermi velocity. The device can be biased by a metallic via from the top to the plasmonic metal, which performs double duty here.

### TABLE 1

<table>
<thead>
<tr>
<th>Footprint (µm²)</th>
<th>IL, CROSS (dB)</th>
<th>IL, BAR (dB)</th>
<th>ER, CROSS (dB)</th>
<th>ER, BAR (dB)</th>
<th>E/bit</th>
<th>Response time</th>
</tr>
</thead>
<tbody>
<tr>
<td>5-8</td>
<td>1.3</td>
<td>2.4</td>
<td>17.6</td>
<td>7.2</td>
<td>0-32</td>
<td>1-10</td>
</tr>
</tbody>
</table>
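
The energy-per-bit estimate can be sketched in a few lines. This assumes the common \( E/\text{bit} = \tfrac{1}{2} C V^2 \) form (conventions in the literature differ by a constant factor) and uses hypothetical capacitances of 20 and 100 aF together with the 1-2 V ITO bias range; it is a back-of-the-envelope check, not a device model.

```python
def energy_per_bit_aj(capacitance_af: float, voltage_v: float) -> float:
    """Switching energy per bit, assuming E/bit = (1/2) * C * V**2.

    With C in attofarads and V in volts, the result is in attojoules.
    """
    return 0.5 * capacitance_af * voltage_v ** 2

# Hypothetical capacitances; drive voltages span the 1-2 V ITO bias range.
for c_af in (20.0, 100.0):
    for v in (1.0, 2.0):
        print(f"C = {c_af:.0f} aF, V = {v:.1f} V -> E/bit = {energy_per_bit_aj(c_af, v):.0f} aJ")
```

Under these assumptions the estimates fall between 10 and 200 aJ per bit.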

[0035] A silicon-based device can be fabricated with a similar design and the same switching concept as the switch 50. Instead of using the electro-optic index tuning of ITO, the middle island of the 3-waveguide silicon coupler can be thermally tuned using a metal heater strip on top of it, with oxide cladding sandwiched in between to avoid a high-loss plasmonic mode. To verify the design, we can quantitatively evaluate the tunability of the metal heater by measuring the light output from the two outer waveguides and the light intensity coming from the middle island. For example, if light injected into one of the outer bus waveguides is detected at the other bus waveguide, we can assume that the switch is still in the critical coupling state (the middle part of the three supermodes). On the other hand, if light injected into the middle island stays only in the middle island, this shows that the switch is in the mode-eliminating state, in which the middle waveguide is isolated from the system (the right part of the three supermodes).

[0036] Moreover, due to the complexity of this switch design (over 4 critical variables and 5 more related variables, all affecting the tuning ability and the final performance), we integrated a script-based solver with an automatic performance evaluation system into Lumerical Mode and Interconnect software to increase our simulation speed and efficiency. With this automatic solver, we are able to map out the entire relationship between the effective index changes and the variables. This complete mapping reveals the trade-off between every pair of variables and helps to enhance the switch performance in both the critical coupling state and the mode-eliminating state.

[0037] Without the metal heater in the device 50, the two states (OFF and ON) are achieved by varying the width of the middle waveguide and the corresponding gaps to the outer waveguides. When light is injected into either the outer waveguides or the middle waveguide, it couples into the adjacent waveguides in the critical coupling case. In the mode-eliminating case, however, the light propagates only within the same waveguide without coupling, since the center supermode is isolated from the other two supermodes, which support the light coupling between the two outer waveguides. Based on our preliminary measurement results, the average loss of a 750 µm long switch is 34 dB (with 5 mW input power and 2.3 µW output power). However, the average loss of a 750 µm long waveguide on the same chip with the same fabrication process is 30 dB (with 5 mW input power and 5 µW output power). Therefore, most of the measured loss is due to optical probe scattering and reflection, and the loss caused by the AE structure is about 4 dB, which yields a 0.005 dB/µm propagation loss that matches our simulation result.
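
The loss figures in this paragraph follow from \( \text{IL} = 10 \log_{10}(P_{in}/P_{out}) \). The sketch below redoes that arithmetic from the quoted powers. It gives about 33.4 dB for the switch and 30 dB for the reference waveguide, so the 34 dB and 4 dB figures in the text appear to be rounded, and the per-length excess comes out near 0.0045 dB/µm, in line with the quoted 0.005 dB/µm.

```python
import math

def loss_db(p_in_w: float, p_out_w: float) -> float:
    """Insertion loss in dB from input and output optical power (watts)."""
    return 10.0 * math.log10(p_in_w / p_out_w)

length_um = 750.0
switch_loss = loss_db(5e-3, 2.3e-6)      # 750 um switch
waveguide_loss = loss_db(5e-3, 5e-6)     # 750 um reference waveguide, same process
excess = switch_loss - waveguide_loss    # loss attributed to the switch structure

print(f"switch: {switch_loss:.1f} dB, waveguide: {waveguide_loss:.1f} dB")
print(f"excess: {excess:.1f} dB -> {excess / length_um:.4f} dB/um")
```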

[0038] To enable active tuning, a metal heater strip can be fabricated on top of the center waveguide after depositing a layer of oxide cladding. Because the refractive index of silicon is temperature dependent, the switch can be turned into or out of the AE state by changing the refractive index of the middle silicon waveguide. The key to this thermal tuning design is to create a sufficient temperature difference between the outer and middle waveguides. Thus, we proposed a 3D heater-sink design that directs more heat toward the middle waveguide, while the heat sinks are placed closer to the outer waveguides to absorb heat propagating to the left and right. Based on thermal simulation, the heater in the middle can create a temperature difference of over 200 K and partially shift the switch from its original state. In addition, a narrower heater and closer heat sinks are two possible options for achieving complete tuning. A material with a higher melting temperature (e.g., tungsten) could also replace gold, allowing a higher voltage to be applied and a larger temperature difference to be created for better tuning. The small features of the metal heater and heat-sink gaps require precise fabrication alignment; we therefore developed a new self-aligned fabrication process that requires only a single alignment step and has high yield.
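
As a first-order illustration of the thermal tuning, the index change of the middle waveguide can be estimated as \( \Delta n \approx (dn/dT)\,\Delta T \). The sketch below assumes a silicon thermo-optic coefficient of about \( 1.8 \times 10^{-4}\,\text{K}^{-1} \) near 1550 nm, a commonly cited value, and ignores the temperature dependence of the coefficient itself; under these assumptions the 200 K difference mentioned above corresponds to an index shift of roughly 0.036.

```python
# Assumed thermo-optic coefficient of silicon near 1550 nm (per kelvin).
DN_DT_SI = 1.8e-4

def index_shift(delta_t_k: float, dn_dt: float = DN_DT_SI) -> float:
    """First-order thermo-optic index change: dn = (dn/dT) * dT."""
    return dn_dt * delta_t_k

for dt in (50.0, 100.0, 200.0):
    print(f"dT = {dt:.0f} K -> dn = {index_shift(dt):.4f}")
```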

[0039] Due to the intrinsically slow response of thermal tuning, the three-waveguide switch with a metal heater might achieve only kHz-level switching speeds. Other index tuning materials (e.g., ITO) could replace this thermal tuning design to boost the switching speed to 100 GHz. Transferring ITO on top of the middle switching island and sandwiching it between two thin oxide layers requires very precise alignment and deposition control. Moreover, all electrical biasing circuits should be carefully designed using vertical interconnect accesses (vias).

Processing In Network (PIN)

[0040] Next, we discuss how the photonic 2x2 switches 50 can be utilized to create functional networks such as crossbars. When combined with an algorithm such as residue arithmetic, multi-purpose compute engines with a high figure of merit, FOM = \( (\text{Latency} \times \text{Energy Consumption} \times \text{Footprint})^{-1} \), can be created (see Table 2 below). Here, the biasing scheme of the 2x2 switch is important for improving the FOM: the device's default CROSS state occurs at zero applied voltage (i.e., \( V_{bias} = 0 \) V), so the device draws power only when a bias is applied. Furthermore, the switch operates over a spectrum more than 200 nm wide (broadband). This allows simultaneous use of multiple wavelengths for massively parallel computing architectures such as those explored here.
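
To make the units of the FOM explicit, the sketch below evaluates \( (\text{Latency} \times \text{Energy} \times \text{Footprint})^{-1} \) and reports it in GHz/(µJ·cm²), i.e., speed divided by energy and footprint. The function name and the device numbers (10 ps, 100 aJ, 8 µm²) are hypothetical and chosen only to show the unit conversions; they are not values reported for the invention.

```python
def fom_ghz_per_uj_cm2(latency_ps: float, energy_aj: float, footprint_um2: float) -> float:
    """FOM = (latency * energy * footprint)^-1, reported in GHz/(uJ*cm^2)."""
    speed_ghz = 1.0 / (latency_ps * 1e-12) / 1e9   # 1/latency, converted to GHz
    energy_uj = energy_aj * 1e-12                   # 1 aJ = 1e-12 uJ
    footprint_cm2 = footprint_um2 * 1e-8            # 1 um^2 = 1e-8 cm^2
    return speed_ghz / (energy_uj * footprint_cm2)

# Hypothetical device: 10 ps latency, 100 aJ per operation, 8 um^2 footprint.
print(f"{fom_ghz_per_uj_cm2(10.0, 100.0, 8.0):.3e} GHz/(uJ cm^2)")
```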

### TABLE 2

<table>
<thead>
<tr>
<th>Building Block</th>
<th>Electronic NoC (22 nm)</th>
<th>Proposed RNS Array</th>
</tr>
</thead>
<tbody>
<tr>
<td>Performance</td>
<td>Area (µm²)</td>
<td>Energy (fJ/operation)</td>
</tr>
<tr>
<td>Computation</td>
<td>181</td>
<td>171</td>
</tr>
<tr>
<td>Adder (16-bit binary, 46-bit RNS)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Communication</td>
<td>199</td>
<td>400</td>
</tr>
<tr>
<td>4 x 4 Crossbar (16-bit binary, 46-bit RNS)</td>
<td>787</td>
<td>6920</td>
</tr>
<tr>
<td>4-port router (16-bit binary)</td>
<td>986</td>
<td>7320</td>
</tr>
<tr>
<td>Route + cross</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

With 2x2 switches as building blocks, it is possible to create structures for many of the computational primitives required for scientific computing. One of the primary approaches proposed here is the use of residue arithmetic. A brief overview of the residue number system follows: an integer X is represented by its residue r with respect to a modulus M, written \( r = |X|_M \). For instance, consider the number 96. The residue of X = 96 with modulus M = 11 is 8, which can be written as \( |96|_{11} = 8 \). That is, when 96 is divided by 11, the remainder is 8 (11 goes into 96 eight times, 8 × 11 = 88, leaving a remainder of 96 − 88 = 8). Thus, with respect to the modulus 11, the number 96 can be represented as the number 8, which is much simpler and reduces computational processing and storage requirements.

[0041] However, since the residue is always an integer from 0 to M−1, a single residue does not represent a number uniquely. If multiple moduli are used, a given number can be uniquely represented, as captured by the Chinese Remainder Theorem. In our case, we could use the moduli M = {11, 16, 19} to obtain the residue number system representation X = {8, 0, 1}, since \( |96|_{11} = 8 \), \( |96|_{16} = 0 \), and \( |96|_{19} = 1 \). We write this as \(X = [8, 0, 1]_{[11, 16, 19]}\), using the subscript for the moduli. The only requirement is that the moduli \(M_i\) be relatively prime; in other words, no pair of moduli \(M_i\) and \(M_j\) (for \(i \neq j\)) shares a prime factor. The number of distinct values that can be represented using this number system is equal to the product of the \(m\) moduli, \(M_1 \times M_2 \times \ldots \times M_m\). In our example, it is \(11 \times 16 \times 19 = 3344\), so unsigned integers from 0 to 3343 can be represented uniquely.
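
The representation and its reconstruction can be sketched in Python (3.9 or later). Forward conversion is a per-modulus remainder. For the way back, this sketch uses the standard Chinese Remainder Theorem formula \( X = \left|\sum_i r_i \hat{M}_i \,|\hat{M}_i^{-1}|_{M_i}\right|_{M} \) with \( \hat{M}_i = M/M_i \); this is an assumption about how decoding would be done, since the text does not specify a reverse converter. The helper names are illustrative.

```python
from math import gcd, prod

def to_rns(x: int, moduli: list[int]) -> list[int]:
    """Forward conversion: one residue |x|_Mi per modulus."""
    return [x % m for m in moduli]

def from_rns(residues: list[int], moduli: list[int]) -> int:
    """Reverse conversion via the Chinese Remainder Theorem."""
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError(f"moduli {a} and {b} are not relatively prime")
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = big_m // m
        total += r * m_hat * pow(m_hat, -1, m)  # pow(a, -1, m): modular inverse
    return total % big_m

MODULI = [11, 16, 19]
print(prod(MODULI))                  # 3344 distinct values: 0..3343
print(to_rns(96, MODULI))            # [8, 0, 1]
print(from_rns([8, 0, 1], MODULI))   # 96
```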

[0043] The use of residue number systems (RNS) potentially offers substantial improvements in performance and power consumption by enabling carry-free arithmetic. As an example, consider the addition of \(X = 96\) and \(Y = 205\). Using the moduli from our previous example, these numbers can be represented as \([8, 0, 1]_{[11, 16, 19]}\) and \([7, 13, 15]_{[11, 16, 19]}\), respectively. Addition in RNS is simply the modular addition of the respective residues, \([|8+7|_{11}, |0+13|_{16}, |1+15|_{19}]_{[11, 16, 19]} = [4, 13, 16]_{[11, 16, 19]}\). We can verify that the result, \(X + Y = 301\), is in fact \([4, 13, 16]_{[11, 16, 19]}\). For long integers, this represents substantial parallelization due to the removal of carry propagation. Similarly, multiplication also benefits by yielding smaller partial products [GAR59]. Note that addition of the individual residues is cyclic and remains within the range \(0\) to \(M_i - 1\) for modulus \(M_i\).
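The carry-free property is easy to check numerically. The sketch below, again assuming the moduli {11, 16, 19} and hypothetical helper names, adds and multiplies channel by channel, with no information passing between channels.

```python
MODULI = (11, 16, 19)  # example moduli; dynamic range 0 .. 3343

def to_rns(x, moduli=MODULI):
    return [x % m for m in moduli]

def rns_add(a, b, moduli=MODULI):
    """Carry-free addition: each residue channel is reduced mod its own M_i."""
    return [(ai + bi) % m for ai, bi, m in zip(a, b, moduli)]

def rns_mul(a, b, moduli=MODULI):
    """Multiplication is likewise independent per channel."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]

x, y = to_rns(96), to_rns(205)   # [8, 0, 1] and [7, 13, 15]
s = rns_add(x, y)                # [4, 13, 16]
print(s, s == to_rns(301))       # residues of 96 + 205 = 301
```

Results are correct as long as the true sum or product stays below 3344; for example, `rns_mul(to_rns(12), to_rns(205))` equals `to_rns(2460)`.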

[0044] FIG. 5 captures the required routing to realize a basic adder with modulus \(M = 7\). Here we use one-hot encoding to represent the residue, which means \(M\) bits are used to represent the number, of which only one bit is set to ‘1’, at the position corresponding to the number it represents. FIG. 5 shows how the switch 50 can implement addition operations. For example, for a modulus of 7, there are seven waveguides. Using one-hot encoding, the number \(X\) is represented by a “1” in position \(x\) and “0” everywhere else.
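A one-hot residue maps directly onto waveguides: bit i corresponds to waveguide i, and exactly one waveguide carries light. The helpers below are a hypothetical illustration of that encoding, with positions counted from 0 and written left to right, as in the bit patterns used in this description.

```python
def one_hot(x, m):
    """Encode residue x (0 <= x < m) as m bits with a single 1 at position x."""
    if not 0 <= x < m:
        raise ValueError(f"residue {x} outside 0..{m - 1}")
    return [1 if i == x else 0 for i in range(m)]

def decode_one_hot(bits):
    """Return the index of the single lit waveguide."""
    if sum(bits) != 1:
        raise ValueError("exactly one bit must be set")
    return bits.index(1)

print(one_hot(5, 7))  # [0, 0, 0, 0, 0, 1, 0]
```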

[0045] As shown in FIG. 6, addition of a fixed integer \(N\) is obtained by rotating the bits to the right by \(N\) positions. For example, an input of 5 is represented by the one-hot bit pattern \(<0000010>\), i.e., the signal on the 6th waveguide (position 5, counting from 0), as in the example shown in FIG. 6. On adding the number 2, the result is 0, because \(5 + 2 = 7\) and the modulus is 7. The corresponding bit pattern is \(<1000000>\), a rotation of the input by 2 positions. The addition operation can thus be achieved by a programmable rotate operation. As reflected in FIG. 6, the electronic components provide state control only (bar/cross), while the photonic paths are the actual data channels.
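The rotate-to-add behavior can be modeled functionally. This sketch only mimics the input-to-output mapping of the switch network, not its physical layout. It assumes left-to-right positions starting at 0, with "right" meaning toward higher positions and wrapping around.

```python
def one_hot(x, m):
    return [1 if i == x else 0 for i in range(m)]

def rotate_right(bits, n):
    """Move the bit at position i to position (i + n) mod len(bits)."""
    n %= len(bits)
    return bits[-n:] + bits[:-n] if n else list(bits)

def fmt(bits):
    return "<" + "".join(map(str, bits)) + ">"

x = one_hot(5, 7)
y = rotate_right(x, 2)
print(fmt(x), "->", fmt(y))  # <0000010> -> <1000000>, i.e. |5 + 2|_7 = 0
```

Because the rotation amount is the only parameter, one programmable rotator covers every summand 0 to M−1.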

[0046] FIG. 6 shows a modulo-7 RNS adder system, or residue arithmetic nanophotonic system 100, which is also referred to here as a router 100, using photonic devices 110 such as the 2x2 switches 110. This can be realized, for example, using the design shown in FIG. 3, shown for modulus 7. The addition operation is specified electronically by the control lines, which are at ‘0’ by default. The modulus can be selected depending on how large the data is, with a larger modulus being used for larger data. For example, a combined modulo-3, -4, -5 RNS system (the moduli must be pairwise coprime) can be used to represent any number below \(3 \times 4 \times 5 = 60\). To assign B/C states to the switches, a bias voltage (or no bias voltage) is applied to each switch individually, following the look-up table calculated for each \(M\) and summand \(+N\). The number of switches scales as \((M-1)^2/2 + 2\), where \(M\) is the modulus. Table 3 shows a comparison of different modulus architectures [PENG18].
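As a quick check on the scaling rule, the sketch below evaluates \((M-1)^2/2 + 2\) for a few odd moduli. The function name is hypothetical. The formula as stated yields an integer only for odd M, so the sketch rejects even M rather than guess a rounding rule. For M = 5 it gives 10, matching switches S1 to S10 in Table 4.

```python
from math import prod

def adder_switch_count(m):
    """Switch count (M - 1)^2 / 2 + 2 for a modulo-M adder (odd M only here)."""
    if m % 2 == 0:
        raise ValueError("the stated formula is integral only for odd M")
    return (m - 1) ** 2 // 2 + 2

for m in (3, 5, 7, 11):
    print(m, adder_switch_count(m))   # 3 -> 4, 5 -> 10, 7 -> 20, 11 -> 52

# Dynamic range of the combined modulo-3, -4, -5 example from the text
print(prod((3, 4, 5)))                # 60, i.e. numbers 0 .. 59
```

The quadratic growth in M is why the text later turns to WDM, TDM, and nested RNS for large moduli.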

### TABLE 3

<table>
<thead>
<tr>
<th>Mesh RNS Model</th>
<th>ASD RNS Model</th>
</tr>
</thead>
<tbody>
<tr>
<td>Parameters</td>
<td>MRR</td>
</tr>
<tr>
<td># of Optical Components</td>
<td>20</td>
</tr>
<tr>
<td># of Control Circuit</td>
<td>4</td>
</tr>
<tr>
<td># of Look-up Table (LUT)</td>
<td>-</td>
</tr>
<tr>
<td>Energy/op.</td>
<td>Thermal</td>
</tr>
<tr>
<td>Component</td>
<td></td>
</tr>
<tr>
<td>Area</td>
<td></td>
</tr>
<tr>
<td>Component</td>
<td>3200 µm²</td>
</tr>
<tr>
<td>Control Circuit</td>
<td>&lt;1 µm²</td>
</tr>
<tr>
<td>Response Time</td>
<td>40 ps</td>
</tr>
<tr>
<td>Propagation Time/Device</td>
<td>0.8 ps</td>
</tr>
</tbody>
</table>

[0047] Using the look-up table, the control signal of each switch 110 is set to the corresponding state. An example of a look-up table is shown below in Table 4 for the modulo-5 addition system 100 of FIG. 3. The look-up table provides the optimal states for the system 100 with the lowest power consumption, where \(C\) represents a cross (switched) state and \(B\) represents a bar (not switched) state. The B/C values can be determined based on various factors, for example: i) selecting the path that provides the minimum switching loss (the B and C states have different losses), while ii) ensuring that all-to-all connectivity still holds after the switches on a path have been set. Here, the cross state does not require power to be applied to the switch, so it consumes substantially less power than the bar state, which requires an applied bias. Accordingly, the B/C states in the look-up table are chosen to include as many cross states as possible. In addition, the all-to-all connectivity requirement ensures that the B/C states in the look-up table account for all possible permutations of the inputs and outputs. Each summand has an optimal setting with the lowest loss, and for a single summand, one setting serves all inputs.

### TABLE 4

<table>
<thead>
<tr>
<th>Summand</th>
<th>S1</th>
<th>S2</th>
<th>S3</th>
<th>S4</th>
<th>S5</th>
<th>S6</th>
<th>S7</th>
<th>S8</th>
<th>S9</th>
<th>S10</th>
</tr>
</thead>
<tbody>
<tr>
<td>+0</td>
<td>C</td>
<td>B</td>
<td>C</td>
<td>C</td>
<td>B</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
</tr>
<tr>
<td>+1</td>
<td>B</td>
<td>C</td>
<td>C</td>
<td>B</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
</tr>
<tr>
<td>+2</td>
<td>C</td>
<td>B</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>+3</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>B</td>
<td>C</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>+4</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>B</td>
<td>B</td>
<td>C</td>
<td>C</td>
<td>C</td>
<td>C</td>
</tr>
</tbody>
</table>
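Turning a row of the look-up table into control voltages is a direct mapping: B (bar) means apply bias, and C (cross) means no bias. The sketch below assumes that convention from the text. It encodes only the nine states per row that appear in Table 4 as reproduced here, which the +4 walkthrough (S1 = C, S2 = C, S5 = B, S9 = C) identifies as S1 to S9. The S10 column is not shown, so it is omitted rather than guessed. `LUT_MOD5` and `bias_vector` are hypothetical names.

```python
# Table 4 rows as reproduced (states for S1..S9; S10 is not listed).
LUT_MOD5 = {
    0: "CBCCBCCCC",
    1: "BCCBCCCCC",
    2: "CBCCCCCBC",
    3: "CCCCCBCBC",
    4: "CCCBBCCCC",
}

def bias_vector(summand, lut=LUT_MOD5):
    """1 = apply bias (bar state), 0 = no bias (default cross state)."""
    return [1 if state == "B" else 0 for state in lut[summand]]

for n in sorted(LUT_MOD5):
    bits = bias_vector(n)
    print(f"+{n}: bias={bits} biased switches={sum(bits)}")
```

Counting the biased switches gives a rough proxy for static control power. Each listed row biases only two of the nine switches, which is consistent with the stated goal of maximizing cross states.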
For example, to scale up the modulus, FIG. 2 shows a schematic for building an RNS adder of size M using the design strategy of the present invention. The design scales because it requires only \((M-1)^2/2 + 2\) switches for modulus M. Input light thus propagates to the expected output; the light lines show the light path for a specific example.

FIG. 2 thus shows the general configuration of the switches, in which consecutively numbered switches are arranged diagonally with respect to one another. That is, for a modulo-5 system, as shown in FIG. 3, switches S1 and S2 are arranged diagonally upward by one line, whereby the bottom input to S2 is coupled with the top output of S1. The next switch, S3, is arranged diagonally downward from the first switch S1 by one line, whereby the top input to S3 is connected to the lower output of S1. Then S4 and S5 are arranged diagonally upward from S3. Next, S6 is arranged diagonally downward from the lowest switch, S3, and S7, S8, and S9 are arranged diagonally upward from S6. Switch S10 is then placed on the bottom two output lines, as in FIG. 2. This configuration allows all possible permutations of inputs and outputs to be achieved with a suitable B/C state look-up table.

This B/C sequence enables all the light paths for adding 4 (e.g., 0→4, 1→0, 2→1, 3→2, 4→3). For instance, light from input 2 passes to the second input of the first switch S1. The first switch S1 has the state “C” and outputs the signal on line 3 to the second input of the second switch S2. The second switch S2 has the state “C”, so the light crosses to line 0. Thus, the second switch S2 outputs the light to the first input of the fifth switch S5. The fifth switch S5 has the state “B”, so the light passes straight through switch S5 uncoupled, to the first input of the ninth switch S9. The ninth switch S9 has the state “C”, so the light couples to line 1 and reaches the output 1 port. Accordingly, because the final output is on line 1, the RNS result is 1 for \(|2+4|_5\). This example shows that any number can be represented as a modulo-5 number, or under any other suitable modulus. The system 100 of FIG. 3 applies a modulus of 5 by having 5 waveguides, namely five inputs 0-4 and five outputs 0-4.
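A switch setting is correct for summand N only if the input-to-output map it produces is a permutation that sends every input i to \(|i + N|_M\). The hypothetical validator below states that requirement as a function. It could be applied to routings traced through a switch model or measured on a device. This is an illustration of the acceptance criterion, not a description of how the look-up table was computed.

```python
def implements_add(routing, m, n):
    """True if routing (input -> output) is a permutation realizing |i + n|_m."""
    return (sorted(routing) == list(range(m))
            and sorted(routing.values()) == list(range(m))
            and all(out == (i + n) % m for i, out in routing.items()))

expected_plus4 = {i: (i + 4) % 5 for i in range(5)}
print(expected_plus4)                        # {0: 4, 1: 0, 2: 1, 3: 2, 4: 3}
print(implements_add(expected_plus4, 5, 4))  # True
```

The walkthrough above is one entry of this map: input 2 exits on line 1.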

Recall that without an applied control voltage, the switches are in their ‘Cross’ state. To add ‘N’, the control line ‘+N’ is asserted to a ‘1’ state. This directs the switches in the corresponding row to operate in the ‘Bar’ state and transmit the light directly without coupling. This circuit automatically achieves the required bit rotation. As part of this invention, circuits can be provided for other computational primitives, including subtraction, multiplication, and division [TAI79]. Division is known to be difficult in RNS, but division operations that yield a quotient and remainder are still possible [TAI79]. Scaling and fixed-point arithmetic are also addressed as part of this work (Section 3.3). Very recently, residue arithmetic using ultrafast optical switches [BAKI15] and ring resonators [BAKI16] has been explored.
... the conversions factored into the measurements [CHOK09]. This is significant considering that the implementation was carried out in software on a DSP ARM core. Custom hardware as well as attojoule nanophotonics can naturally bring substantial improvements (see Table 2).

Fixed-point and floating-point arithmetic: While integer arithmetic covers a wide range of applications, even wider applicability demands the use of fractional numbers through fixed-point as well as floating-point arithmetic. Number representations and circuit designs can be provided to handle fixed-point numbers [ANDR96] as well as floating-point numbers. All such number representations must address scaling and rounding.

Designs for multiple moduli: The invention can provide an adder for a modulus of 7. A unified design can also be provided for several different moduli. In addition, the ASD residue computing engine design of FIGS. 3 and 4 provides an adder and a multiplier for a modulus of 5.

Large moduli: circuits can be provided with small moduli, and also with larger moduli using one-hot encoding. Here, the term “large” has no specific threshold; rather, by integrating WDM or another mechanism, a modulo-M circuit could serve a system larger than M. For instance, if the modulus were 357, we would need 357 separate waveguides at the input, in addition to 357×356 switches. To limit the number of elements, the invention can utilize, for instance, wavelength division multiplexing (WDM) to accommodate a group of bits within a single waveguide. The 2x2 switch design can be changed accordingly to support multiple wavelengths, which can be achieved via nanoscale waveguide-in-line cavities. Alternatively, the system can adopt time-division multiplexing (TDM), using the circuit for a smaller modulus but deploying buffers at the input to feed data in multiple parts.
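To see how WDM reduces the physical port count, assume each waveguide carries W one-hot positions on W distinct wavelengths. The sketch below estimates the number of waveguides needed for a modulo-M input under that assumption. The function name and the wavelengths-per-waveguide parameter are hypothetical, and the sketch ignores any extra overhead in the wavelength-selective switches.

```python
from math import ceil

def waveguides_needed(m, wavelengths_per_waveguide=1):
    """Waveguides for an M-wide one-hot input, W one-hot positions per waveguide."""
    if wavelengths_per_waveguide < 1:
        raise ValueError("need at least one wavelength per waveguide")
    return ceil(m / wavelengths_per_waveguide)

for w in (1, 8, 16):
    print(w, waveguides_needed(357, w))   # 1 -> 357, 8 -> 45, 16 -> 23
```

The trade-off is that each 2x2 switch must then act on several wavelengths independently, which is why the text calls for modified, wavelength-selective switch designs.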

Referring to FIG. 7, a multi-core programmable RNS computing/switching array is shown, using the RNS core explained here and the switching crossbar above. Each of the cores has photonic inputs, while also incorporating the electronic inputs required by the RNS core (see, for example, the adder in FIG. 3). The crossbars distributed throughout the array are controlled electronically, with their control lines originating from storage elements/buffers (not shown). The crossbars enable a multitude of connections that allow the RNS cores to communicate according to the requirements of the operation being executed. Furthermore, a nested residue number system can be provided by decomposing a large modulus into a smaller residue number system [NAKA15]. Small-moduli systems are practical and allow the same hardware to be reused. Thus, optical resources could be saved, and larger moduli might be implemented in a feasible way: small-modulus residue modules can be combined to form a large-modulus residue system.

As shown in FIG. 7, the Collective Device has twelve (12) RNS switching systems X1-X12 (e.g., as shown in FIG. 2 or 3). Each switching system X1-X12 performs an additive computation. For example, the input can be 0, as shown, and X1 can be used to add a number to that input. The sum of those two numbers passes on to the second system X2, where a third number is added. The sum of those three numbers then passes to the third system X3, and so on down the line, until the final sum \( \Sigma X_i \) of all the numbers exits from the final system X12. The electronic input to each system X1-X12 controls the operation of its switches, setting them to the bar or cross state.

Execution Model and Supporting Infrastructure

As an example application, a Collective Operations (Reduction) device is provided in FIG. 8. Here, FIG. 8 is a generic representation of the example of FIG. 7, to show that the system can be configured in any number of ways with different modulus switching devices, for a computer application. Here, the systems X1-X12 are nanophotonic residue arithmetic cores, and the summation passes to a nanophotonic crossbar that controls which system X1-X12 receives that summation value. Collective operations are an integral part of parallel computing paradigms, and involve the participation of all the nodes in the parallel program. For instance, the reduction operation in MPI can compute an associative operation (such as addition, multiplication, min, max) across all cores [HEM94]. Due to the participation of all nodes, reduction operations can be expensive and are often optimized in the software library as well as in hardware [ALM05]. FIG. 8 shows the mapping of a reduction operation SUM onto the RNS array (FIG. 7). Each core is configured as an adder, and the input operand at each core is provided through an electronic input port.

At the top left, a ‘0’ is provided as the input in RNS format. The addition operation is performed entirely in photonics, and the total sum appears at the output of the last core in RNS format, in the optical domain. No intermediate electrical-optical-electrical conversions are required before or after the addition at each core; each addition is carried out on the fly along with the data routing. Once the inputs are set up, the time to completion is determined solely by the optical propagation delay. This example demonstrates the synergistic benefits of incorporating computing within the switching/routing operation.
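The reduction mapping can be mirrored in software to check that the value leaving the last core is the sum of all operands. This sketch assumes the example moduli {11, 16, 19} and twelve hypothetical operands. It models each core as a channel-wise modular adder and does not model crossbar routing or timing.

```python
MODULI = (11, 16, 19)  # illustrative moduli; dynamic range 0 .. 3343

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

# Hypothetical electronic operands X1..X12, one per core.
operands = [17, 250, 3, 99, 1, 42, 300, 8, 64, 5, 120, 77]

acc = to_rns(0)            # the '0' injected optically at the first core
for x in operands:         # each core adds its operand as the light passes
    acc = rns_add(acc, to_rns(x))

print(acc, acc == to_rns(sum(operands)))
```

The running value stays in residue form for the whole chain, and it is decoded only once, at the end.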


Parallel Operations Using WDM

Turning to FIG. 9, another example of the system of the invention uses wavelength division multiplexing (WDM) to allow multiple inputs to be processed in parallel. This requires extending the design of the 2x2 switches so that different wavelengths are switched independently in the RNS core. Multiply-accumulate (MAC) operations are well suited to this model, especially in convolutional neural networks, where more than 90% of the operations are MACs [NAKA15]. Because a weight matrix is applied millions of times as a multiplicand or summand, the photonic RNS engine can be configured once with the weight as its constant input, then reused across those millions of calculations while operating rapidly thanks to the short optical propagation delay.
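The "configure once, reuse many times" pattern can be sketched as a dot product. The weights are converted to residues a single time, and each input vector then streams through channel-wise multiply-accumulate steps. The moduli, weights, and function names are hypothetical. Results are exact only while the true dot product stays below 11 × 16 × 19 = 3344.

```python
MODULI = (11, 16, 19)  # illustrative moduli; exact results below 3344

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mac(acc, w, x):
    """acc + w * x, evaluated independently in each residue channel."""
    return tuple((a + wi * xi) % m for a, wi, xi, m in zip(acc, w, x, MODULI))

weights = [3, 1, 4, 1, 5]                  # hypothetical fixed kernel
weights_rns = [to_rns(w) for w in weights]  # set once, like the constant input

def dot_rns(xs):
    acc = to_rns(0)
    for w, x in zip(weights_rns, xs):
        acc = rns_mac(acc, w, to_rns(x))
    return acc

print(dot_rns([9, 2, 6, 5, 3]) == to_rns(73))  # 27 + 2 + 24 + 5 + 15 = 73
```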

Spectral selectivity can be provided by back-end ring drop filters, as shown in FIG. 9. Thus, for example, the system can process light of different wavelengths (e.g., different colors) in parallel, using passive rings as filters. Accordingly, one or more filters, such as rings, are provided following the switching network or array. The filters determine the wavelength of the light that was processed by the system, and photodetectors identify the corresponding results. A set of filters is provided for each output 0-4, and each filter detects its own wavelength in the light transmitted at that output. In the example shown in FIG. 9, a green wavelength and a blue wavelength are both received at input 1 and pass through to residue output line 0. The first filter in the set of n filters detects the green light at a first wavelength \(\lambda_1\), and the second filter detects the blue light at a second wavelength \(\lambda_2\). As further shown, a red light signal is transmitted from input line 0 to output line 4, where the corresponding filter detects the light at an nth wavelength \(\lambda_n\).

This module allows multiple operations to run simultaneously by allocating one modulus to one wavelength, thus increasing system efficiency. For example, suppose one of the summands is 4. The other summands may be (1) the same input on different wavelengths \(\lambda_1\) and \(\lambda_2\), in which case the MRR with photodetector recognizes that the results of both operation 1 (green) and operation 2 (blue) are 0, since \(|1+4|_5 = 0\); or (2) different input summands on wavelengths \(\lambda_3\) and \(\lambda_4\), in which case operation 3, “0+4” (purple), obtains a result of 4. The number of operations that can be executed at the same time is given by the number of available wavelengths [PENG18].
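The parallel behavior in FIG. 9 can be modeled functionally. All wavelengths pass through the same switch setting, so each is rotated by the same summand, and the back-end ring filters sort the wavelengths present at each output port. The sketch assumes a single shared modulus of 5 and a shared summand of 4, as in the example. The wavelength labels and function names are hypothetical.

```python
M = 5  # modulus of the FIG. 9 example

def route(summand, lit_inputs):
    """Every wavelength sees the same switch setting, so each is shifted by the
    same summand. lit_inputs maps wavelength label -> input port."""
    return {wl: (port + summand) % M for wl, port in lit_inputs.items()}

def detect(outputs):
    """Ring drop filters plus photodetectors: wavelengths present at each port."""
    seen = {port: [] for port in range(M)}
    for wl, port in outputs.items():
        seen[port].append(wl)
    return seen

inputs = {"lambda_1": 1, "lambda_2": 1, "lambda_n": 0}
outputs = route(4, inputs)
print(outputs)          # {'lambda_1': 0, 'lambda_2': 0, 'lambda_n': 4}
print(detect(outputs))  # port 0 sees lambda_1 and lambda_2; port 4 sees lambda_n
```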

The efficiency of the RNS-based approach is evaluated using a simplified calculation of energy, delay, and area. The operation is a 16-bit reduction over K numbers. Reductions are associative and commutative operations that combine n numbers into one using an operation such as sum, multiply, logical AND, logical OR, min, max, and the like. Here we consider addition. Note that a 16-bit number has a range of 0-65535, so we choose moduli to cover this range in RNS format. In other words, a 16-bit binary number is represented as an approximately 40-bit residue number using our one-hot encoding, summed over all three moduli.

We compare the RNS array against an electronic network-on-chip (NoC) implementation. The architectures compared are similar to FIG. 7, with an array of adders along with an array of routers. The cores are spaced 2 mm apart, which is typical for a NoC [KURI10]. Area estimates for the 16-bit binary adders are from the literature [MOLAA14], scaled down to 22 nm, whereas the corresponding RNS adder estimates are based on our 2x2 switch parameters (Table 1). Energy is in pJ per number operated on. The estimates for the other components (binary crossbar, NoC router, and electronic link) are obtained using DSENT [SUN12]. The NoC router is decomposed into the crossbar plus additional infrastructure such as buffers and routing logic, shown on two rows. Numbers for the photonic links are from our paper [SUN18]. The estimates are in Table 2. The total area, energy, and latency for a reduction of K numbers is K times the sum of the values in the first and last rows. We define the figure of merit (FOM) as the ratio of (1/latency) to the product of energy and area; with latency in ns, energy in pJ, and area in mm², its units are GHz/(pJ·mm²). The FOM values for the adder and router+crossbar are shown separately.
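The figure of merit and the K-fold scaling can be written out explicitly. The sketch below uses made-up per-number values, not the entries of Table 2, and hypothetical function names. It only shows how the quantities combine.

```python
def fom(latency_ns, energy_pj, area_mm2):
    """Figure of merit: (1/latency) / (energy * area), in GHz / (pJ * mm^2)."""
    return (1.0 / latency_ns) / (energy_pj * area_mm2)

def reduction_totals(k, per_number):
    """Totals for a K-number reduction: K times each per-number value."""
    return {name: k * value for name, value in per_number.items()}

# Hypothetical per-number values, for illustration only (not from Table 2).
example = {"latency_ns": 2.0, "energy_pj": 0.5, "area_mm2": 0.1}
print(fom(**example))
print(reduction_totals(12, example))
```

Because latency, energy, and area all enter the denominator, a design can win on FOM despite a larger area if its energy and latency improve enough. This is the situation described for the RNS array.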

As the above estimates show, energy is reduced by a factor of 24 using the RNS array. Latency improves by a factor of 110, because the RNS latency is very small and is dominated by the light propagation delay across cores that are 2 mm apart. In contrast, the latency for the electronic NoC case consists of a 4-clock-cycle overhead for each pass through the router and 1 clock cycle for each 2 mm electronic link traversed. However, the area of the RNS array is significantly larger, due to multiple circuits and crossbars for the three moduli (and because a large modulus requires a large number of 2x2 switches), which underscores the need for optimizations using WDM or TDM. The overall FOM improves by a factor of 4 for addition and 20000 for routing.

Nanophotonic Barriers for Extreme Scale Computing

Synchronization operations in large-scale systems can consume considerable power and incur performance penalties, because all cores need to communicate with each other [LIJ04, ANBA11]. One common synchronization operation is the barrier, which requires all participating cores to stop execution and wait until every core has arrived at the barrier before executing the rest of the program. Nanophotonics provides a viable means of integrating barriers within the communication network at very high performance. Our proposed 2x2 switches are particularly useful for barrier implementation. The invention adopts the following approach for a ‘lean’ barrier implementation [BINK09].

FIGS. 10-14 are examples showing applications of the RNS system in accordance with the present invention. For example, FIG. 10 shows the RNS system for use with any general-purpose manycore system. Each core controls a 2x2 switch that diverts light away from its input waveguide by default (control voltage = “0”). Once the processor arrives at the barrier, it asserts the control voltage for its switch, which allows light to pass through. The waveguide wraps around at the top, as shown, and through appropriate coupling each processor senses the presence of light on the waveguide. When a processor sees light at the receiving end, it can safely infer that all processors have arrived at the barrier.
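Logically, the waveguide loop computes an AND across all cores: light survives to the receiving end only when every switch passes it. The sketch below states that behavior and the resulting release time. It assumes hypothetical arrival times in arbitrary units and a single lumped loop delay. It does not model the coupling or sensing hardware.

```python
def light_at_receiver(arrived):
    """Light reaches the end of the loop only if every core's switch passes it."""
    return all(arrived)

def barrier_release_time(arrival_times, loop_delay=0.0):
    """Light first reaches the receiver once the last core asserts its switch,
    plus the optical delay around the loop."""
    return max(arrival_times) + loop_delay

# Hypothetical arrival times (arbitrary units) for four cores.
arrivals = [3.0, 7.5, 1.2, 5.0]
for t in (2.0, 6.0, 8.0):
    print(t, light_at_receiver([a <= t for a in arrivals]))  # False, False, True
print(barrier_release_time(arrivals, loop_delay=0.1))
```

No core needs to exchange messages with the others. The presence of light alone carries the "everyone has arrived" information.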

[0077] The barrier operation can also be utilized with the RNS compute/switching array, as shown in FIG. 11. The signal Bi indicates the barrier operation at core i. The crossbars are configured to create the waveguide loop in accordance with FIG. 10. To sense the barrier operation based on the presence of light on its reverse path (towards the ‘Done’ output), appropriate waveguide couplers/splitters will need to be incorporated within each core so that the light is sensed non-destructively on its return path.

Extensions for Finite Impulse Response (FIR) Filtering

[0078] Residue number systems have been popular for digital signal processing (DSP) systems [CHOK90], and RNS implementations of digital filters have also been reported [ANDR91]. Finite impulse response (FIR) filters potentially lend themselves well to our proposed optical residue number processing. The constant coefficients can serve as one of the fixed inputs that drive the switches in the RNS cores. Input data samples can be clocked into the optical port, and they remain in the optical domain until the output.

[0079] However, the RNS cores may need some additional components, as explained below. FIG. 12 shows the schematic of an FIR filter: x[n] is the input streamed into the filter, the b values are the filter coefficients, and y[n] is the streaming output. In the figure, z^{-1} represents a unit delay corresponding to one sample period, or one clock cycle. The mapping of the FIR filter structure onto our RNS compute/switching array is straightforward, as depicted in FIG. 13. Each RNS core is programmed to carry out a multiply as well as an add operation. However, the one-clock-cycle delay requires an additional component, such as a tunable delay line made up of ring filters [MORI08] (FIG. 14).
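The filter computes \(y[n] = \sum_{k} b_k\, x[n-k]\). The same sum can be evaluated channel by channel in residue form, which mirrors the mapping of multiply-and-add cores with constant coefficients. The sketch below assumes the moduli {11, 16, 19} and uses hypothetical names and coefficients. A Python list stands in for the \(z^{-1}\) delay line, and the result is checked against a direct integer FIR. Outputs are exact only while \(y[n] < 3344\).

```python
MODULI = (11, 16, 19)  # illustrative moduli; exact while y[n] < 3344

def to_rns(x):
    return tuple(x % m for m in MODULI)

def fir_rns(b, xs):
    """y[n] = sum_k b[k] * x[n - k] in residue form; x[n] = 0 for n < 0."""
    b_rns = [to_rns(bk) for bk in b]       # constant coefficients, set once
    history = [to_rns(0)] * len(b)         # stands in for the z^-1 delays, newest first
    out = []
    for x in xs:
        history = [to_rns(x)] + history[:-1]
        y = to_rns(0)
        for bk, xk in zip(b_rns, history):  # one multiply-and-add per core
            y = tuple((yi + bi * xi) % m
                      for yi, bi, xi, m in zip(y, bk, xk, MODULI))
        out.append(y)
    return out

def fir_direct(b, xs):
    """Reference integer FIR for comparison."""
    return [sum(b[k] * xs[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(xs))]

b, xs = [2, 3, 1], [5, 0, 7, 4, 10]
print(fir_direct(b, xs))  # [10, 15, 19, 29, 39]
print(fir_rns(b, xs) == [to_rns(v) for v in fir_direct(b, xs)])
```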

[0080] Furthermore, although the filter coefficients b are constant and are readily multiplied with the input data using an RNS multiplier, the addition operation must be carried out on two data items that are both in the optical domain. Since our RNS adder cannot handle this case, one of the inputs has to be converted into the electronic domain using a photodetector, as shown. This does not require the storage element typical of a conventional receiver; instead, the photodetector output feeds the adder directly, saving some energy. To compensate for the photodetector delay, a small delay is introduced on the other input of the adder, as shown. This is one example of how the cores can be made widely applicable. In this example, energy is saved because the optical-to-electrical conversion is carried out on only one of the data lines.

[0081] The residue number systems of the invention can be utilized for neural network and deep learning applications, such as those based on convolutional neural networks. In addition, the photonic devices need not be switches, but can be other suitable components such as, for example, spatial light modulators (SLM) and/or digital mirror displays (DMD). DMDs control light amplitude. In general, any device that controls light amplitude and phase can be used. In this sense, RNS is essentially a form of data encoding/modulation. The read-out is always ‘one-hot’, meaning that the port where light comes out (in the amplitude scheme of the N×M router) is the answer to the RNS addition or multiplication.

[0082] It is further noted that the invention is shown and described utilizing 2×2 switches. The 2×2 switch can be any component with two inputs, two outputs, and a switching mechanism (e.g., add-drop rings, MZIs). To support WDM, however, this component must also be broadband. Other suitable devices can also be utilized.



1. A residue photonic system, comprising: an array of a plurality of 2×2 photonic switches, said array having M modulus inputs and M modulus outputs and receiving a light signal at one input of the M modulus inputs, said plurality of photonic switches having a bar state and a cross state and arranged to indicate a residual value of the received light signal as an output at one output of the M modulus outputs, whereby the one of the M modulus outputs reflects a first value and the bar state and cross state reflect a second value, and the one output reflects an arithmetic operation of the first value and the second value.

2. The residue photonic system of claim 1, wherein the arithmetic operation comprises an addition of the first value and the second value.

3. The residue photonic system of claim 1, wherein the one output reflects a residue value of an M modulus.

4. The residue photonic system of claim 1, wherein the arithmetic operation comprises multiplication.

5. The residue photonic system of claim 1, wherein the light signal comprises a plurality of light signals each at different wavelengths and received simultaneously at the plurality of inputs.

6. The residue photonic system of claim 5, further comprising a set of one or more filters at each of the plurality of outputs, said set of filters determining a wavelength of the light signal at that output.

7. The residue photonic system of claim 1, wherein the light signal has a single wavelength.

8. A residue photonic system, comprising: an array of a plurality of photonic devices, said array having M modulus inputs and M modulus outputs and receiving a light signal at one input of the M modulus inputs, said plurality of photonic devices arranged to indicate a residual value of the received light signal as an output at one output of the M modulus outputs, whereby the one of the M modulus outputs reflects a first value and a state of the photonic device reflects a second value, and the one output reflects an arithmetic operation of the first value and the second value.

9. The residue photonic system of claim 8, wherein the arithmetic operation comprises an addition of the first value and the second value.

10. The residue photonic system of claim 8, wherein the one output reflects a residue value of an M modulus.

11. The residue photonic system of claim 8, wherein the arithmetic operation comprises multiplication.

12. The residue photonic system of claim 8, wherein the light signal comprises a plurality of light signals each at different wavelengths and received simultaneously at the plurality of inputs.

13. The residue photonic system of claim 12, further comprising a set of one or more filters at each of the plurality of outputs, said set of filters determining a wavelength of the light signal at that output.

14. The residue photonic system of claim 8, wherein the light signal has a single wavelength.

15. The residue photonic system of claim 8, wherein said photonic device comprises a spatial light modulator (SLM) and/or a digital mirror display (DMD).